* [PATCH v4 00/13] Nested EPT
@ 2013-07-25 10:59 Gleb Natapov
  2013-07-25 10:59 ` [PATCH v4 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Gleb Natapov
                   ` (12 more replies)
  0 siblings, 13 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-25 10:59 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

After changing hands several times, I am proud to present a new version of
the Nested EPT patches. Nothing groundbreaking here compared to v3: all
review comments are addressed, some by Yang Zhang and some by yours truly.

Gleb Natapov (1):
  nEPT: make guest's A/D bits depends on guest's paging mode

Nadav Har'El (10):
  nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  nEPT: Fix cr3 handling in nested exit and entry
  nEPT: Fix wrong test in kvm_set_cr3
  nEPT: Move common code to paging_tmpl.h
  nEPT: Add EPT tables support to paging_tmpl.h
  nEPT: Nested INVEPT
  nEPT: MMU context for nested EPT
  nEPT: Advertise EPT to L1
  nEPT: Some additional comments
  nEPT: Miscelleneous cleanups

Yang Zhang (2):
  nEPT: Redefine EPT-specific link_shadow_page()
  nEPT: Add nEPT violation/misconfigration support

 arch/x86/include/asm/kvm_host.h |    4 +
 arch/x86/include/asm/vmx.h      |    3 +
 arch/x86/include/uapi/asm/vmx.h |    1 +
 arch/x86/kvm/mmu.c              |  134 ++++++++++++++-----------
 arch/x86/kvm/mmu.h              |    2 +
 arch/x86/kvm/paging_tmpl.h      |  175 ++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx.c              |  210 ++++++++++++++++++++++++++++++++++++---
 arch/x86/kvm/x86.c              |   11 --
 8 files changed, 436 insertions(+), 104 deletions(-)

-- 
1.7.10.4



* [PATCH v4 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
@ 2013-07-25 10:59 ` Gleb Natapov
  2013-07-29  8:32   ` Paolo Bonzini
  2013-07-25 10:59 ` [PATCH v4 02/13] nEPT: Fix cr3 handling in nested exit and entry Gleb Natapov
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-25 10:59 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577,
switches the EFER MSR when EPT is used and the host and guest have different
NX bits. So if we add support for nested EPT (an L1 guest using EPT to run L2)
and want to be able to run recent KVM as L1, we need to allow L1 to use this
EFER switching feature.

To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
support for the former (the latter is still unsupported).

Nested entry and exit emulation (prepare_vmcs02() and load_vmcs12_host_state(),
respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
that's left to do in this patch is to properly advertise this feature to L1.

Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
support this feature, regardless of whether the host supports it.
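
For reference, the emulation that already exists in prepare_vmcs02() looks
roughly like the sketch below (a summary of current KVM code, not part of
this patch):

        if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
                vcpu->arch.efer = vmcs12->guest_ia32_efer;
        else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
                vcpu->arch.efer |= (EFER_LMA | EFER_LME);
        else
                vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
        /* vmx_set_efer() then sets the relevant vmcs02 fields
         * (IA32E mode, EFER loading) for the computed EFER value */
        vmx_set_efer(vcpu, vcpu->arch.efer);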

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/vmx.c |   23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e999dc7..1e9437f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2198,7 +2198,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #else
 	nested_vmx_exit_ctls_high = 0;
 #endif
-	nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
+	nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
+				      VM_EXIT_LOAD_IA32_EFER);
 
 	/* entry controls */
 	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
@@ -2207,8 +2208,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 	nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
 	nested_vmx_entry_ctls_high &=
 		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
-	nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
-
+	nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
+				       VM_ENTRY_LOAD_IA32_EFER);
 	/* cpu-based controls */
 	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
 		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
@@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 
-	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
-	vmcs_write32(VM_EXIT_CONTROLS,
-		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
-	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
+	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
+	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
+	 * bits are further modified by vmx_set_efer() below.
+	 */
+	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
+
+	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
+	 * emulated by vmx_set_efer(), below.
+	 */
+	vmcs_write32(VM_ENTRY_CONTROLS,
+		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
+			~VM_ENTRY_IA32E_MODE) |
 		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
 
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
-- 
1.7.10.4



* [PATCH v4 02/13] nEPT: Fix cr3 handling in nested exit and entry
  2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
  2013-07-25 10:59 ` [PATCH v4 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Gleb Natapov
@ 2013-07-25 10:59 ` Gleb Natapov
  2013-07-25 10:59 ` [PATCH v4 03/13] nEPT: Fix wrong test in kvm_set_cr3 Gleb Natapov
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-25 10:59 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/vmx.c |   30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1e9437f..89b15df 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7595,6 +7595,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
 	kvm_mmu_reset_context(vcpu);
 
+	/*
+	 * Additionally, except when L0 is using shadow page tables, L1 or
+	 * L2 control guest_cr3 for L2, so they may also have saved PDPTEs
+	 */
+	if (enable_ept) {
+		vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+		vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+		vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+		vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+	}
+
 	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
 	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
 }
@@ -7917,6 +7928,25 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	vmcs12->guest_pending_dbg_exceptions =
 		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 
+	/*
+	 * In some cases (usually, nested EPT), L2 is allowed to change its
+	 * own CR3 without exiting. If it has changed it, we must keep it.
+	 * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+	 * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+	 */
+	if (enable_ept)
+		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
+	/*
+	 * Additionally, except when L0 is using shadow page tables, L1 or
+	 * L2 control guest_cr3 for L2, so save their PDPTEs
+	 */
+	if (enable_ept) {
+		vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+		vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+		vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+	}
+
 	vmcs12->vm_entry_controls =
 		(vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) |
 		(vmcs_read32(VM_ENTRY_CONTROLS) & VM_ENTRY_IA32E_MODE);
-- 
1.7.10.4



* [PATCH v4 03/13] nEPT: Fix wrong test in kvm_set_cr3
  2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
  2013-07-25 10:59 ` [PATCH v4 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Gleb Natapov
  2013-07-25 10:59 ` [PATCH v4 02/13] nEPT: Fix cr3 handling in nested exit and entry Gleb Natapov
@ 2013-07-25 10:59 ` Gleb Natapov
  2013-07-29  8:36   ` Paolo Bonzini
  2013-07-31  8:02   ` Xiao Guangrong
  2013-07-25 10:59 ` [PATCH v4 04/13] nEPT: Move common code to paging_tmpl.h Gleb Natapov
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-25 10:59 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.

As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch removes it.

Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are also avoided because we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/vmx.c |   12 ++++--------
 arch/x86/kvm/x86.c |   11 -----------
 2 files changed, 4 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 89b15df..56d0066 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7596,8 +7596,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	kvm_mmu_reset_context(vcpu);
 
 	/*
-	 * Additionally, except when L0 is using shadow page tables, L1 or
-	 * L2 control guest_cr3 for L2, so they may also have saved PDPTEs
+	 * L1 may access the L2's PDPTR, so save them to construct vmcs12
 	 */
 	if (enable_ept) {
 		vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
@@ -7933,14 +7932,11 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	 * own CR3 without exiting. If it has changed it, we must keep it.
 	 * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
 	 * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
-	 */
-	if (enable_ept)
-		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
-	/*
-	 * Additionally, except when L0 is using shadow page tables, L1 or
-	 * L2 control guest_cr3 for L2, so save their PDPTEs
+	 *
+	 * Additionally, restore L2's PDPTR to vmcs12.
 	 */
 	if (enable_ept) {
+		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
 		vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
 		vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
 		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d2caeb9..e2fef8b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -682,17 +682,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 		 */
 	}
 
-	/*
-	 * Does the new cr3 value map to physical memory? (Note, we
-	 * catch an invalid cr3 even in real-mode, because it would
-	 * cause trouble later on when we turn on paging anyway.)
-	 *
-	 * A real CPU would silently accept an invalid cr3 and would
-	 * attempt to use it - with largely undefined (and often hard
-	 * to debug) behavior on the guest side.
-	 */
-	if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
-		return 1;
 	vcpu->arch.cr3 = cr3;
 	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
 	vcpu->arch.mmu.new_cr3(vcpu);
-- 
1.7.10.4



* [PATCH v4 04/13] nEPT: Move common code to paging_tmpl.h
  2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
                   ` (2 preceding siblings ...)
  2013-07-25 10:59 ` [PATCH v4 03/13] nEPT: Fix wrong test in kvm_set_cr3 Gleb Natapov
@ 2013-07-25 10:59 ` Gleb Natapov
  2013-07-31  8:02   ` Xiao Guangrong
  2013-07-25 10:59 ` [PATCH v4 05/13] nEPT: make guest's A/D bits depends on guest's paging mode Gleb Natapov
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-25 10:59 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

In preparation, we just move gpte_access(), prefetch_invalid_gpte(),
is_rsvd_bits_set(), protect_clean_gpte() and is_dirty_gpte() from mmu.c
to paging_tmpl.h.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/mmu.c         |   55 ------------------------------
 arch/x86/kvm/paging_tmpl.h |   80 +++++++++++++++++++++++++++++++++++++-------
 2 files changed, 68 insertions(+), 67 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3a9493a..4c4274d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -331,11 +331,6 @@ static int is_large_pte(u64 pte)
 	return pte & PT_PAGE_SIZE_MASK;
 }
 
-static int is_dirty_gpte(unsigned long pte)
-{
-	return pte & PT_DIRTY_MASK;
-}
-
 static int is_rmap_spte(u64 pte)
 {
 	return is_shadow_present_pte(pte);
@@ -2574,14 +2569,6 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
 	mmu_free_roots(vcpu);
 }
 
-static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
-{
-	int bit7;
-
-	bit7 = (gpte >> 7) & 1;
-	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
-}
-
 static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 				     bool no_dirty_log)
 {
@@ -2594,26 +2581,6 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return gfn_to_pfn_memslot_atomic(slot, gfn);
 }
 
-static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu,
-				  struct kvm_mmu_page *sp, u64 *spte,
-				  u64 gpte)
-{
-	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
-		goto no_present;
-
-	if (!is_present_gpte(gpte))
-		goto no_present;
-
-	if (!(gpte & PT_ACCESSED_MASK))
-		goto no_present;
-
-	return false;
-
-no_present:
-	drop_spte(vcpu->kvm, spte);
-	return true;
-}
-
 static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
 				    struct kvm_mmu_page *sp,
 				    u64 *start, u64 *end)
@@ -3501,18 +3468,6 @@ static void paging_free(struct kvm_vcpu *vcpu)
 	nonpaging_free(vcpu);
 }
 
-static inline void protect_clean_gpte(unsigned *access, unsigned gpte)
-{
-	unsigned mask;
-
-	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
-
-	mask = (unsigned)~ACC_WRITE_MASK;
-	/* Allow write access to dirty gptes */
-	mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
-	*access &= mask;
-}
-
 static bool sync_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
 			   unsigned access, int *nr_present)
 {
@@ -3530,16 +3485,6 @@ static bool sync_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
 	return false;
 }
 
-static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte)
-{
-	unsigned access;
-
-	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
-	access &= ~(gpte >> PT64_NX_SHIFT);
-
-	return access;
-}
-
 static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gpte)
 {
 	unsigned index;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7769699..fb26ca9 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -80,6 +80,31 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
 	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
 }
 
+static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
+{
+	unsigned mask;
+
+	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
+
+	mask = (unsigned)~ACC_WRITE_MASK;
+	/* Allow write access to dirty gptes */
+	mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
+	*access &= mask;
+}
+
+static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
+{
+	int bit7;
+
+	bit7 = (gpte >> 7) & 1;
+	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+}
+
+static inline int FNAME(is_present_gpte)(unsigned long pte)
+{
+	return is_present_gpte(pte);
+}
+
 static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			       pt_element_t __user *ptep_user, unsigned index,
 			       pt_element_t orig_pte, pt_element_t new_pte)
@@ -103,6 +128,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	return (ret != orig_pte);
 }
 
+static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
+				  struct kvm_mmu_page *sp, u64 *spte,
+				  u64 gpte)
+{
+	if (FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
+		goto no_present;
+
+	if (!FNAME(is_present_gpte)(gpte))
+		goto no_present;
+
+	if (!(gpte & PT_ACCESSED_MASK))
+		goto no_present;
+
+	return false;
+
+no_present:
+	drop_spte(vcpu->kvm, spte);
+	return true;
+}
+
+static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
+{
+	unsigned access;
+
+	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
+	access &= ~(gpte >> PT64_NX_SHIFT);
+
+	return access;
+}
+
 static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 					     struct kvm_mmu *mmu,
 					     struct guest_walker *walker,
@@ -123,7 +178,8 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 			trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte));
 			pte |= PT_ACCESSED_MASK;
 		}
-		if (level == walker->level && write_fault && !is_dirty_gpte(pte)) {
+		if (level == walker->level && write_fault &&
+				!(pte & PT_DIRTY_MASK)) {
 			trace_kvm_mmu_set_dirty_bit(table_gfn, index, sizeof(pte));
 			pte |= PT_DIRTY_MASK;
 		}
@@ -170,7 +226,7 @@ retry_walk:
 	if (walker->level == PT32E_ROOT_LEVEL) {
 		pte = mmu->get_pdptr(vcpu, (addr >> 30) & 3);
 		trace_kvm_mmu_paging_element(pte, walker->level);
-		if (!is_present_gpte(pte))
+		if (!FNAME(is_present_gpte)(pte))
 			goto error;
 		--walker->level;
 	}
@@ -215,17 +271,17 @@ retry_walk:
 
 		trace_kvm_mmu_paging_element(pte, walker->level);
 
-		if (unlikely(!is_present_gpte(pte)))
+		if (unlikely(!FNAME(is_present_gpte)(pte)))
 			goto error;
 
-		if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
+		if (unlikely(FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, pte,
 					      walker->level))) {
 			errcode |= PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
 			goto error;
 		}
 
 		accessed_dirty &= pte;
-		pte_access = pt_access & gpte_access(vcpu, pte);
+		pte_access = pt_access & FNAME(gpte_access)(vcpu, pte);
 
 		walker->ptes[walker->level - 1] = pte;
 	} while (!is_last_gpte(mmu, walker->level, pte));
@@ -248,7 +304,7 @@ retry_walk:
 	walker->gfn = real_gpa >> PAGE_SHIFT;
 
 	if (!write_fault)
-		protect_clean_gpte(&pte_access, pte);
+		FNAME(protect_clean_gpte)(&pte_access, pte);
 	else
 		/*
 		 * On a write fault, fold the dirty bit into accessed_dirty by
@@ -309,14 +365,14 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	gfn_t gfn;
 	pfn_t pfn;
 
-	if (prefetch_invalid_gpte(vcpu, sp, spte, gpte))
+	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
 		return false;
 
 	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
 
 	gfn = gpte_to_gfn(gpte);
-	pte_access = sp->role.access & gpte_access(vcpu, gpte);
-	protect_clean_gpte(&pte_access, gpte);
+	pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
+	FNAME(protect_clean_gpte)(&pte_access, gpte);
 	pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
 			no_dirty_log && (pte_access & ACC_WRITE_MASK));
 	if (is_error_pfn(pfn))
@@ -785,15 +841,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 					  sizeof(pt_element_t)))
 			return -EINVAL;
 
-		if (prefetch_invalid_gpte(vcpu, sp, &sp->spt[i], gpte)) {
+		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
 			vcpu->kvm->tlbs_dirty++;
 			continue;
 		}
 
 		gfn = gpte_to_gfn(gpte);
 		pte_access = sp->role.access;
-		pte_access &= gpte_access(vcpu, gpte);
-		protect_clean_gpte(&pte_access, gpte);
+		pte_access &= FNAME(gpte_access)(vcpu, gpte);
+		FNAME(protect_clean_gpte)(&pte_access, gpte);
 
 		if (sync_mmio_spte(vcpu->kvm, &sp->spt[i], gfn, pte_access,
 		      &nr_present))
-- 
1.7.10.4



* [PATCH v4 05/13] nEPT: make guest's A/D bits depends on guest's paging mode
  2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
                   ` (3 preceding siblings ...)
  2013-07-25 10:59 ` [PATCH v4 04/13] nEPT: Move common code to paging_tmpl.h Gleb Natapov
@ 2013-07-25 10:59 ` Gleb Natapov
  2013-07-25 10:59 ` [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h Gleb Natapov
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-25 10:59 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

EPT uses different shifts for the A/D bits, and the first version of nEPT
does not support them at all.
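
For reference, the shifts in question are roughly these (illustration only,
per the SDM; the EPT_* names below are made up for this example -- the patch
itself only turns the existing PT_* uses into PT_GUEST_* parameters):

        #define PT_ACCESSED_SHIFT       5       /* x86 page-table entries */
        #define PT_DIRTY_SHIFT          6
        #define EPT_ACCESSED_SHIFT      8       /* EPT entries, when EPT A/D is
                                                 * enabled; not exposed by this
                                                 * first version of nEPT */
        #define EPT_DIRTY_SHIFT         9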

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/paging_tmpl.h |   30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index fb26ca9..7581395 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -32,6 +32,10 @@
 	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
 	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT64_LEVEL_BITS
+	#define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK
+	#define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK
+	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
+	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
 	#ifdef CONFIG_X86_64
 	#define PT_MAX_FULL_LEVELS 4
 	#define CMPXCHG cmpxchg
@@ -49,6 +53,10 @@
 	#define PT_INDEX(addr, level) PT32_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT32_LEVEL_BITS
 	#define PT_MAX_FULL_LEVELS 2
+	#define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK
+	#define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK
+	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
+	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
 	#define CMPXCHG cmpxchg
 #else
 	#error Invalid PTTYPE value
@@ -88,7 +96,8 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
 
 	mask = (unsigned)~ACC_WRITE_MASK;
 	/* Allow write access to dirty gptes */
-	mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
+	mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
+		PT_WRITABLE_MASK;
 	*access &= mask;
 }
 
@@ -138,7 +147,7 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
 	if (!FNAME(is_present_gpte)(gpte))
 		goto no_present;
 
-	if (!(gpte & PT_ACCESSED_MASK))
+	if (!(gpte & PT_GUEST_ACCESSED_MASK))
 		goto no_present;
 
 	return false;
@@ -174,14 +183,14 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 		table_gfn = walker->table_gfn[level - 1];
 		ptep_user = walker->ptep_user[level - 1];
 		index = offset_in_page(ptep_user) / sizeof(pt_element_t);
-		if (!(pte & PT_ACCESSED_MASK)) {
+		if (!(pte & PT_GUEST_ACCESSED_MASK)) {
 			trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte));
-			pte |= PT_ACCESSED_MASK;
+			pte |= PT_GUEST_ACCESSED_MASK;
 		}
 		if (level == walker->level && write_fault &&
-				!(pte & PT_DIRTY_MASK)) {
+				!(pte & PT_GUEST_DIRTY_MASK)) {
 			trace_kvm_mmu_set_dirty_bit(table_gfn, index, sizeof(pte));
-			pte |= PT_DIRTY_MASK;
+			pte |= PT_GUEST_DIRTY_MASK;
 		}
 		if (pte == orig_pte)
 			continue;
@@ -235,7 +244,7 @@ retry_walk:
 	ASSERT((!is_long_mode(vcpu) && is_pae(vcpu)) ||
 	       (mmu->get_cr3(vcpu) & CR3_NONPAE_RESERVED_BITS) == 0);
 
-	accessed_dirty = PT_ACCESSED_MASK;
+	accessed_dirty = PT_GUEST_ACCESSED_MASK;
 	pt_access = pte_access = ACC_ALL;
 	++walker->level;
 
@@ -310,7 +319,8 @@ retry_walk:
 		 * On a write fault, fold the dirty bit into accessed_dirty by
 		 * shifting it one place right.
 		 */
-		accessed_dirty &= pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
+		accessed_dirty &= pte >>
+			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
 
 	if (unlikely(!accessed_dirty)) {
 		ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, write_fault);
@@ -886,3 +896,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 #undef gpte_to_gfn
 #undef gpte_to_gfn_lvl
 #undef CMPXCHG
+#undef PT_GUEST_ACCESSED_MASK
+#undef PT_GUEST_DIRTY_MASK
+#undef PT_GUEST_DIRTY_SHIFT
+#undef PT_GUEST_ACCESSED_SHIFT
-- 
1.7.10.4



* [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
                   ` (4 preceding siblings ...)
  2013-07-25 10:59 ` [PATCH v4 05/13] nEPT: make guest's A/D bits depends on guest's paging mode Gleb Natapov
@ 2013-07-25 10:59 ` Gleb Natapov
  2013-07-29  9:48   ` Paolo Bonzini
  2013-07-30 10:03   ` Paolo Bonzini
  2013-07-25 10:59 ` [PATCH v4 07/13] nEPT: Redefine EPT-specific link_shadow_page() Gleb Natapov
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-25 10:59 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

This is the first patch in a series which adds nested EPT support to KVM's
nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
to set its own cr3 and take its own page faults without either L0 or L1
getting involved. This often significantly improves L2's performance over the
previous two alternatives (shadow page tables over EPT, and shadow page
tables over shadow page tables).

This patch adds EPT support to paging_tmpl.h.

paging_tmpl.h contains the code for reading and writing page tables. The code
for 32-bit and 64-bit tables is very similar, but not identical, so
paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
with PTTYPE=64, and this generates the two sets of similar functions.

There are subtle but important differences between the format of EPT tables
and that of ordinary x86 64-bit page tables, so for nested EPT we need a
third set of functions to read the guest EPT table and to write the shadow
EPT table.

So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
with "ept_") that correctly read and write EPT tables.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/mmu.c         |    5 +++++
 arch/x86/kvm/paging_tmpl.h |   43 +++++++++++++++++++++++++++++++++++++++----
 2 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4c4274d..b5273c3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
 	return mmu->last_pte_bitmap & (1 << index);
 }
 
+#define PTTYPE_EPT 18 /* arbitrary */
+#define PTTYPE PTTYPE_EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7581395..e38b3c0 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -58,6 +58,21 @@
 	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
 	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
 	#define CMPXCHG cmpxchg
+#elif PTTYPE == PTTYPE_EPT
+	#define pt_element_t u64
+	#define guest_walker guest_walkerEPT
+	#define FNAME(name) ept_##name
+	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+	#define PT_LEVEL_BITS PT64_LEVEL_BITS
+	#define PT_GUEST_ACCESSED_MASK 0
+	#define PT_GUEST_DIRTY_MASK 0
+	#define PT_GUEST_DIRTY_SHIFT 0
+	#define PT_GUEST_ACCESSED_SHIFT 0
+	#define CMPXCHG cmpxchg64
+	#define PT_MAX_FULL_LEVELS 4
 #else
 	#error Invalid PTTYPE value
 #endif
@@ -90,6 +105,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
 
 static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
 {
+#if PT_GUEST_DIRTY_MASK == 0
+	/* dirty bit is not supported, so no need to track it */
+	return;
+#else
 	unsigned mask;
 
 	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
@@ -99,6 +118,7 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
 	mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
 		PT_WRITABLE_MASK;
 	*access &= mask;
+#endif
 }
 
 static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
@@ -111,7 +131,11 @@ static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
 
 static inline int FNAME(is_present_gpte)(unsigned long pte)
 {
+#if PTTYPE != PTTYPE_EPT
 	return is_present_gpte(pte);
+#else
+	return pte & 7;
+#endif
 }
 
 static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
@@ -147,7 +171,8 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
 	if (!FNAME(is_present_gpte)(gpte))
 		goto no_present;
 
-	if (!(gpte & PT_GUEST_ACCESSED_MASK))
+	/* if accessed bit is not supported prefetch non accessed gpte */
+	if (PT_GUEST_ACCESSED_MASK && !(gpte & PT_GUEST_ACCESSED_MASK))
 		goto no_present;
 
 	return false;
@@ -160,9 +185,14 @@ no_present:
 static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
 {
 	unsigned access;
-
+#if PTTYPE == PTTYPE_EPT
+	BUILD_BUG_ON(ACC_WRITE_MASK != VMX_EPT_WRITABLE_MASK);
+	access = (gpte & VMX_EPT_WRITABLE_MASK) | ACC_USER_MASK |
+		((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0);
+#else
 	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
 	access &= ~(gpte >> PT64_NX_SHIFT);
+#endif
 
 	return access;
 }
@@ -212,7 +242,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 				    gva_t addr, u32 access)
 {
-	int ret;
 	pt_element_t pte;
 	pt_element_t __user *uninitialized_var(ptep_user);
 	gfn_t table_gfn;
@@ -322,7 +351,9 @@ retry_walk:
 		accessed_dirty &= pte >>
 			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
 
-	if (unlikely(!accessed_dirty)) {
+	if (PT_GUEST_DIRTY_MASK && unlikely(!accessed_dirty)) {
+		int ret;
+
 		ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, write_fault);
 		if (unlikely(ret < 0))
 			goto error;
@@ -359,6 +390,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
 					access);
 }
 
+#if PTTYPE != PTTYPE_EPT
 static int FNAME(walk_addr_nested)(struct guest_walker *walker,
 				   struct kvm_vcpu *vcpu, gva_t addr,
 				   u32 access)
@@ -366,6 +398,7 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker,
 	return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
 					addr, access);
 }
+#endif
 
 static bool
 FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
@@ -793,6 +826,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
 	return gpa;
 }
 
+#if PTTYPE != PTTYPE_EPT
 static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
 				      u32 access,
 				      struct x86_exception *exception)
@@ -811,6 +845,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
 
 	return gpa;
 }
+#endif
 
 /*
  * Using the cached information from sp->gfns is safe because:
-- 
1.7.10.4



* [PATCH v4 07/13] nEPT: Redefine EPT-specific link_shadow_page()
  2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
                   ` (5 preceding siblings ...)
  2013-07-25 10:59 ` [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h Gleb Natapov
@ 2013-07-25 10:59 ` Gleb Natapov
  2013-07-25 10:59 ` [PATCH v4 08/13] nEPT: Nested INVEPT Gleb Natapov
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-25 10:59 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Yang Zhang <yang.z.zhang@Intel.com>

Since nEPT doesn't support A/D bits, we should not set those bits when
building the shadow page table.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/mmu.c         |   12 +++++++++---
 arch/x86/kvm/paging_tmpl.h |    4 ++--
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b5273c3..9e0f467 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2047,12 +2047,18 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	return __shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
+static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool accessed)
 {
 	u64 spte;
 
+	BUILD_BUG_ON(VMX_EPT_READABLE_MASK != PT_PRESENT_MASK ||
+			VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
+
 	spte = __pa(sp->spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
-	       shadow_user_mask | shadow_x_mask | shadow_accessed_mask;
+	       shadow_user_mask | shadow_x_mask;
+
+	if (accessed)
+		spte |= shadow_accessed_mask;
 
 	mmu_spte_set(sptep, spte);
 }
@@ -2677,7 +2683,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 					      iterator.level - 1,
 					      1, ACC_ALL, iterator.sptep);
 
-			link_shadow_page(iterator.sptep, sp);
+			link_shadow_page(iterator.sptep, sp, true);
 		}
 	}
 	return emulate;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index e38b3c0..23a19a5 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -545,7 +545,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			goto out_gpte_changed;
 
 		if (sp)
-			link_shadow_page(it.sptep, sp);
+			link_shadow_page(it.sptep, sp, PTTYPE != PTTYPE_EPT);
 	}
 
 	for (;
@@ -565,7 +565,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 
 		sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
 				      true, direct_access, it.sptep);
-		link_shadow_page(it.sptep, sp);
+		link_shadow_page(it.sptep, sp, PTTYPE != PTTYPE_EPT);
 	}
 
 	clear_sp_write_flooding_count(it.sptep);
-- 
1.7.10.4



* [PATCH v4 08/13] nEPT: Nested INVEPT
  2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
                   ` (6 preceding siblings ...)
  2013-07-25 10:59 ` [PATCH v4 07/13] nEPT: Redefine EPT-specific link_shadow_page() Gleb Natapov
@ 2013-07-25 10:59 ` Gleb Natapov
  2013-07-25 10:59 ` [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support Gleb Natapov
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-25 10:59 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

If we let L1 use EPT, we should probably also support the INVEPT instruction.

In our current nested EPT implementation, when L1 changes its EPT table
for L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in
the course of this modification already calls INVEPT. But if the last level
of a shadow page is unsync, not all of L1's changes to EPT12 are intercepted,
which means roots need to be synced when L1 calls INVEPT. Global INVEPT
should not be different since roots are synced by kvm_mmu_load() each
time EPTP02 changes.
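
For reference, what L1 executes is modeled on KVM's own __invept() helper and
looks roughly like the sketch below (the function name is made up for the
example): the invalidation type goes in a register, and a 128-bit descriptor
whose first quadword is the EPTP is always read from memory:

        static inline void l1_invept_context(u64 eptp)
        {
                struct {
                        u64 eptp, rsvd;
                } operand = { eptp, 0 };

                /* single-context invalidation of the given EPTP */
                asm volatile ("invept %0, %1"
                              : : "m" (operand), "r" ((u64)VMX_EPT_EXTENT_CONTEXT)
                              : "cc", "memory");
        }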

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/vmx.h      |    3 ++
 arch/x86/include/uapi/asm/vmx.h |    1 +
 arch/x86/kvm/mmu.c              |    2 ++
 arch/x86/kvm/vmx.c              |   68 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 74 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f3e01a2..c3d74b9 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -387,6 +387,7 @@ enum vmcs_field {
 #define VMX_EPT_EXTENT_INDIVIDUAL_ADDR		0
 #define VMX_EPT_EXTENT_CONTEXT			1
 #define VMX_EPT_EXTENT_GLOBAL			2
+#define VMX_EPT_EXTENT_SHIFT			24
 
 #define VMX_EPT_EXECUTE_ONLY_BIT		(1ull)
 #define VMX_EPT_PAGE_WALK_4_BIT			(1ull << 6)
@@ -394,7 +395,9 @@ enum vmcs_field {
 #define VMX_EPTP_WB_BIT				(1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT			(1ull << 16)
 #define VMX_EPT_1GB_PAGE_BIT			(1ull << 17)
+#define VMX_EPT_INVEPT_BIT			(1ull << 20)
 #define VMX_EPT_AD_BIT				    (1ull << 21)
+#define VMX_EPT_EXTENT_INDIVIDUAL_BIT		(1ull << 24)
 #define VMX_EPT_EXTENT_CONTEXT_BIT		(1ull << 25)
 #define VMX_EPT_EXTENT_GLOBAL_BIT		(1ull << 26)
 
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index d651082..7a34e8f 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -65,6 +65,7 @@
 #define EXIT_REASON_EOI_INDUCED         45
 #define EXIT_REASON_EPT_VIOLATION       48
 #define EXIT_REASON_EPT_MISCONFIG       49
+#define EXIT_REASON_INVEPT              50
 #define EXIT_REASON_PREEMPTION_TIMER    52
 #define EXIT_REASON_WBINVD              54
 #define EXIT_REASON_XSETBV              55
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9e0f467..3df3ac3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3182,6 +3182,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 	mmu_sync_roots(vcpu);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_sync_roots);
 
 static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr,
 				  u32 access, struct x86_exception *exception)
@@ -3451,6 +3452,7 @@ void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu)
 	++vcpu->stat.tlb_flush;
 	kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_flush_tlb);
 
 static void paging_new_cr3(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 56d0066..fc24370 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2156,6 +2156,7 @@ static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
 static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
 static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
 static u32 nested_vmx_misc_low, nested_vmx_misc_high;
+static u32 nested_vmx_ept_caps;
 static __init void nested_vmx_setup_ctls_msrs(void)
 {
 	/*
@@ -6270,6 +6271,71 @@ static int handle_vmptrst(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+/* Emulate the INVEPT instruction */
+static int handle_invept(struct kvm_vcpu *vcpu)
+{
+	u32 vmx_instruction_info;
+	bool ok;
+	unsigned long type;
+	gva_t gva;
+	struct x86_exception e;
+	struct {
+		u64 eptp, gpa;
+	} operand;
+
+	if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
+	    !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	/* According to the Intel VMX instruction reference, the memory
+	 * operand is read even if it isn't needed (e.g., for type==global)
+	 */
+	vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmx_instruction_info, &gva))
+		return 1;
+	if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
+				sizeof(operand), &e)) {
+		kvm_inject_page_fault(vcpu, &e);
+		return 1;
+	}
+
+	type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
+
+	switch (type) {
+	case VMX_EPT_EXTENT_GLOBAL:
+	case VMX_EPT_EXTENT_CONTEXT:
+	case VMX_EPT_EXTENT_INDIVIDUAL_ADDR:
+		ok = !!(nested_vmx_ept_caps &
+				(1UL << (type + VMX_EPT_EXTENT_SHIFT)));
+		break;
+	default:
+		ok = false;
+	}
+
+	if (ok) {
+		kvm_mmu_sync_roots(vcpu);
+		kvm_mmu_flush_tlb(vcpu);
+		nested_vmx_succeed(vcpu);
+	}
+	else
+		nested_vmx_failValid(vcpu,
+				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -6314,6 +6380,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause,
 	[EXIT_REASON_MWAIT_INSTRUCTION]	      = handle_invalid_op,
 	[EXIT_REASON_MONITOR_INSTRUCTION]     = handle_invalid_op,
+	[EXIT_REASON_INVEPT]                  = handle_invept,
 };
 
 static const int kvm_vmx_max_exit_handlers =
@@ -6540,6 +6607,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
 	case EXIT_REASON_VMPTRST: case EXIT_REASON_VMREAD:
 	case EXIT_REASON_VMRESUME: case EXIT_REASON_VMWRITE:
 	case EXIT_REASON_VMOFF: case EXIT_REASON_VMON:
+	case EXIT_REASON_INVEPT:
 		/*
 		 * VMX instructions trap unconditionally. This allows L1 to
 		 * emulate them for its L2 guest, i.e., allows 3-level nesting!
-- 
1.7.10.4



* [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support
  2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
                   ` (7 preceding siblings ...)
  2013-07-25 10:59 ` [PATCH v4 08/13] nEPT: Nested INVEPT Gleb Natapov
@ 2013-07-25 10:59 ` Gleb Natapov
  2013-07-29  8:59   ` Paolo Bonzini
  2013-07-25 10:59 ` [PATCH v4 10/13] nEPT: MMU context for nested EPT Gleb Natapov
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-25 10:59 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Yang Zhang <yang.z.zhang@Intel.com>

Inject nEPT faults into the L1 guest. This patch is originally from Xinhao.

Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    4 ++++
 arch/x86/kvm/mmu.c              |   34 ++++++++++++++++++++++++++++++++++
 arch/x86/kvm/paging_tmpl.h      |   30 +++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx.c              |   17 +++++++++++++++++
 4 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 531f47c..58a17c0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -286,6 +286,7 @@ struct kvm_mmu {
 	u64 *pae_root;
 	u64 *lm_root;
 	u64 rsvd_bits_mask[2][4];
+	u64 bad_mt_xwr;
 
 	/*
 	 * Bitmap: bit set = last pte in walk
@@ -512,6 +513,9 @@ struct kvm_vcpu_arch {
 	 * instruction.
 	 */
 	bool write_fault_to_shadow_pgtable;
+
+	/* set at EPT violation at this point */
+	unsigned long exit_qualification;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3df3ac3..58ae9db 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3521,6 +3521,8 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
 	int maxphyaddr = cpuid_maxphyaddr(vcpu);
 	u64 exb_bit_rsvd = 0;
 
+	context->bad_mt_xwr = 0;
+
 	if (!context->nx)
 		exb_bit_rsvd = rsvd_bits(63, 63);
 	switch (context->root_level) {
@@ -3576,6 +3578,38 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
 	}
 }
 
+static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
+		struct kvm_mmu *context, bool execonly)
+{
+	int maxphyaddr = cpuid_maxphyaddr(vcpu);
+	int pte;
+
+	context->rsvd_bits_mask[0][3] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
+	context->rsvd_bits_mask[0][2] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
+	context->rsvd_bits_mask[0][1] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
+	context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
+
+	/* large page */
+	context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
+	context->rsvd_bits_mask[1][2] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
+	context->rsvd_bits_mask[1][1] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
+	context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
+	
+	for (pte = 0; pte < 64; pte++) {
+		int rwx_bits = pte & 7;
+		int mt = pte >> 3;
+		if (mt == 0x2 || mt == 0x3 || mt == 0x7 ||
+				rwx_bits == 0x2 || rwx_bits == 0x6 ||
+				(rwx_bits == 0x4 && !execonly))
+			context->bad_mt_xwr |= (1ull << pte);
+	}
+}
+
 static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 {
 	unsigned bit, byte, pfec;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 23a19a5..58d2f87 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -121,14 +121,23 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
 #endif
 }
 
+#if PTTYPE == PTTYPE_EPT
+#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
+#else
+#define CHECK_BAD_MT_XWR(G) 0;
+#endif
+
 static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
 {
 	int bit7;
 
 	bit7 = (gpte >> 7) & 1;
-	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
+		CHECK_BAD_MT_XWR(gpte);
 }
 
+#undef CHECK_BAD_MT_XWR
+
 static inline int FNAME(is_present_gpte)(unsigned long pte)
 {
 #if PTTYPE != PTTYPE_EPT
@@ -376,6 +385,25 @@ error:
 	walker->fault.vector = PF_VECTOR;
 	walker->fault.error_code_valid = true;
 	walker->fault.error_code = errcode;
+
+#if PTTYPE == PTTYPE_EPT
+	/*
+	 * Use PFERR_RSVD_MASK in error_code to tell if an EPT
+	 * misconfiguration needs to be injected. The detection is
+	 * done by is_rsvd_bits_set() above.
+	 *
+	 * We set up the value of exit_qualification to inject:
+	 * [2:0] - Derive from [2:0] of real exit_qualification at EPT violation
+	 * [5:3] - Calculated by the page walk of the guest EPT page tables
+	 * [7:8] - Clear to 0.
+	 *
+	 * The other bits are set to 0.
+	 */
+	if (!(errcode & PFERR_RSVD_MASK)) {
+		vcpu->arch.exit_qualification &= 0x7;
+		vcpu->arch.exit_qualification |= ((pt_access & pte) & 0x7) << 3;
+	}
+#endif
 	walker->fault.address = addr;
 	walker->fault.nested_page_fault = mmu != vcpu->arch.walk_mmu;
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index fc24370..bbfff8d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5321,6 +5321,8 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	/* ept page table is present? */
 	error_code |= (exit_qualification >> 3) & 0x1;
 
+	vcpu->arch.exit_qualification = exit_qualification;
+
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
@@ -7416,6 +7418,21 @@ static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 		entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
+static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
+		struct x86_exception *fault)
+{
+	struct vmcs12 *vmcs12;
+	nested_vmx_vmexit(vcpu);
+	vmcs12 = get_vmcs12(vcpu);
+
+	if (fault->error_code & PFERR_RSVD_MASK)
+		vmcs12->vm_exit_reason = EXIT_REASON_EPT_MISCONFIG;
+	else
+		vmcs12->vm_exit_reason = EXIT_REASON_EPT_VIOLATION;
+	vmcs12->exit_qualification = vcpu->arch.exit_qualification;
+	vmcs12->guest_physical_address = fault->address;
+}
+
 /*
  * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
  * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
-- 
1.7.10.4



* [PATCH v4 10/13] nEPT: MMU context for nested EPT
  2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
                   ` (8 preceding siblings ...)
  2013-07-25 10:59 ` [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support Gleb Natapov
@ 2013-07-25 10:59 ` Gleb Natapov
  2013-07-25 10:59 ` [PATCH v4 11/13] nEPT: Advertise EPT to L1 Gleb Natapov
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-25 10:59 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new "MMU context" for nested EPT, and create a few callbacks
for it (nested_ept_*()). This context should also use the EPT versions of
the page table access functions (defined in the previous patch).
Then, we need to switch back and forth between this nested context and the
regular MMU context when switching between L1 and L2 (when L1 runs this L2
with EPT).
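
The resulting switch points are small; roughly (using the names from the
hunks below):

        /* on nested VM-entry, in prepare_vmcs02(): */
        if (nested_cpu_has_ept(vmcs12)) {
                kvm_mmu_unload(vcpu);
                nested_ept_init_mmu_context(vcpu);      /* shadow EPT12 into EPT02 */
        }

        /* on nested VM-exit, in load_vmcs12_host_state(): */
        if (nested_cpu_has_ept(vmcs12))
                nested_ept_uninit_mmu_context(vcpu);    /* back to the regular MMU */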

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/mmu.c |   26 ++++++++++++++++++++++++++
 arch/x86/kvm/mmu.h |    2 ++
 arch/x86/kvm/vmx.c |   41 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 58ae9db..37fff14 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3792,6 +3792,32 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
 
+int kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
+		bool execonly)
+{
+	ASSERT(vcpu);
+	ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
+
+	context->shadow_root_level = kvm_x86_ops->get_tdp_level();
+
+	context->nx = true;
+	context->new_cr3 = paging_new_cr3;
+	context->page_fault = ept_page_fault;
+	context->gva_to_gpa = ept_gva_to_gpa;
+	context->sync_page = ept_sync_page;
+	context->invlpg = ept_invlpg;
+	context->update_pte = ept_update_pte;
+	context->free = paging_free;
+	context->root_level = context->shadow_root_level;
+	context->root_hpa = INVALID_PAGE;
+	context->direct_map = false;
+
+	reset_rsvds_bits_mask_ept(vcpu, context, execonly);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
+
 static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
 {
 	int r = kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 5b59c57..77e044a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -71,6 +71,8 @@ enum {
 
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
 int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
+int kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
+		bool execonly);
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bbfff8d..6b79db7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1046,6 +1046,11 @@ static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
 	return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
 
+static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
+{
+	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
+}
+
 static inline bool is_exception(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -7433,6 +7438,33 @@ static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
 	vmcs12->guest_physical_address = fault->address;
 }
 
+/* Callbacks for nested_ept_init_mmu_context: */
+
+static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
+{
+	/* return the page table to be shadowed - in our case, EPT12 */
+	return get_vmcs12(vcpu)->ept_pointer;
+}
+
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+	int r = kvm_init_shadow_ept_mmu(vcpu, &vcpu->arch.mmu,
+			nested_vmx_ept_caps & VMX_EPT_EXECUTE_ONLY_BIT);
+
+	vcpu->arch.mmu.set_cr3           = vmx_set_cr3;
+	vcpu->arch.mmu.get_cr3           = nested_ept_get_cr3;
+	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
+
+	vcpu->arch.walk_mmu              = &vcpu->arch.nested_mmu;
+
+	return r;
+}
+
+static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu)
+{
+	vcpu->arch.walk_mmu = &vcpu->arch.mmu;
+}
+
 /*
  * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
  * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
@@ -7653,6 +7685,11 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 		vmx_flush_tlb(vcpu);
 	}
 
+	if (nested_cpu_has_ept(vmcs12)) {
+		kvm_mmu_unload(vcpu);
+		nested_ept_init_mmu_context(vcpu);
+	}
+
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
 		vcpu->arch.efer = vmcs12->guest_ia32_efer;
 	else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
@@ -8125,7 +8162,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
 	kvm_set_cr4(vcpu, vmcs12->host_cr4);
 
-	/* shadow page tables on either EPT or shadow page tables */
+	if (nested_cpu_has_ept(vmcs12))
+		nested_ept_uninit_mmu_context(vcpu);
+
 	kvm_set_cr3(vcpu, vmcs12->host_cr3);
 	kvm_mmu_reset_context(vcpu);
 
-- 
1.7.10.4



* [PATCH v4 11/13] nEPT: Advertise EPT to L1
  2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
                   ` (9 preceding siblings ...)
  2013-07-25 10:59 ` [PATCH v4 10/13] nEPT: MMU context for nested EPT Gleb Natapov
@ 2013-07-25 10:59 ` Gleb Natapov
  2013-07-29  9:21   ` Paolo Bonzini
  2013-07-25 11:00 ` [PATCH v4 12/13] nEPT: Some additional comments Gleb Natapov
  2013-07-25 11:00 ` [PATCH v4 13/13] nEPT: Miscelleneous cleanups Gleb Natapov
  12 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-25 10:59 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

Advertise the support of EPT to the L1 guest, through the appropriate MSR.

This is the last patch of the basic Nested EPT feature, ordered so as to allow
bisection through this patch series: the guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/vmx.c |   16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6b79db7..a77f902 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2249,6 +2249,18 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
 		SECONDARY_EXEC_WBINVD_EXITING;
 
+	if (enable_ept) {
+		/* nested EPT: emulate EPT also to L1 */
+		nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
+		nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+		nested_vmx_ept_caps |=
+			VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
+			VMX_EPT_EXTENT_CONTEXT_BIT |
+			VMX_EPT_EXTENT_INDIVIDUAL_BIT;
+		nested_vmx_ept_caps &= vmx_capability.ept;
+	} else
+		nested_vmx_ept_caps = 0;
+
 	/* miscellaneous data */
 	rdmsr(MSR_IA32_VMX_MISC, nested_vmx_misc_low, nested_vmx_misc_high);
 	nested_vmx_misc_low &= VMX_MISC_PREEMPTION_TIMER_RATE_MASK |
@@ -2357,8 +2369,8 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
 					nested_vmx_secondary_ctls_high);
 		break;
 	case MSR_IA32_VMX_EPT_VPID_CAP:
-		/* Currently, no nested ept or nested vpid */
-		*pdata = 0;
+		/* Currently, no nested vpid support */
+		*pdata = nested_vmx_ept_caps;
 		break;
 	default:
 		return 0;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 12/13] nEPT: Some additional comments
  2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
                   ` (10 preceding siblings ...)
  2013-07-25 10:59 ` [PATCH v4 11/13] nEPT: Advertise EPT to L1 Gleb Natapov
@ 2013-07-25 11:00 ` Gleb Natapov
  2013-07-25 11:00 ` [PATCH v4 13/13] nEPT: Miscelleneous cleanups Gleb Natapov
  12 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-25 11:00 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention "shadow on either EPT or shadow" as the only two options.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/vmx.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a77f902..d513ace 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6659,7 +6659,20 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
 		return nested_cpu_has2(vmcs12,
 			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
 	case EXIT_REASON_EPT_VIOLATION:
+		/*
+		 * L0 always deals with the EPT violation. If nested EPT is
+		 * used, and the nested mmu code discovers that the address is
+		 * missing in the guest EPT table (EPT12), the EPT violation
+		 * will be injected with nested_ept_inject_page_fault()
+		 */
+		return 0;
 	case EXIT_REASON_EPT_MISCONFIG:
+		/*
+		 * L2 never uses directly L1's EPT, but rather L0's own EPT
+		 * table (shadow on EPT) or a merged EPT table that L0 built
+		 * (EPT on EPT). So any problems with the structure of the
+		 * table is L0's fault.
+		 */
 		return 0;
 	case EXIT_REASON_PREEMPTION_TIMER:
 		return vmcs12->pin_based_vm_exec_control &
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4 13/13] nEPT: Miscelleneous cleanups
  2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
                   ` (11 preceding siblings ...)
  2013-07-25 11:00 ` [PATCH v4 12/13] nEPT: Some additional comments Gleb Natapov
@ 2013-07-25 11:00 ` Gleb Natapov
  12 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-25 11:00 UTC (permalink / raw)
  To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini

From: Nadav Har'El <nyh@il.ibm.com>

Some trivial code cleanups not really related to nested EPT.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/vmx.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d513ace..66d9233 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -715,7 +715,6 @@ static void nested_release_page_clean(struct page *page)
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
-static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
 static void vmx_set_segment(struct kvm_vcpu *vcpu,
 			    struct kvm_segment *var, int seg);
@@ -1040,8 +1039,7 @@ static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, u32 bit)
 		(vmcs12->secondary_vm_exec_control & bit);
 }
 
-static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
-	struct kvm_vcpu *vcpu)
+static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
 {
 	return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
@@ -6760,7 +6758,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 
 	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
 	    !(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(
-	                                get_vmcs12(vcpu), vcpu)))) {
+					get_vmcs12(vcpu))))) {
 		if (vmx_interrupt_allowed(vcpu)) {
 			vmx->soft_vnmi_blocked = 0;
 		} else if (vmx->vnmi_blocked_time > 1000000000LL &&
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-07-25 10:59 ` [PATCH v4 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Gleb Natapov
@ 2013-07-29  8:32   ` Paolo Bonzini
  2013-07-29 13:12     ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29  8:32 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 25/07/2013 12:59, Gleb Natapov ha scritto:
> +	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
> +	 * bits are further modified by vmx_set_efer() below.

IA32E_MODE, but the last sentence is not relevant anyway for this
vmcs_write32; in fact, the next comment basically duplicates it.

This can be fixed while applying.

Paolo

> +	 */
> +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
> +
> +	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
> +	 * emulated by vmx_set_efer(), below.
> +	 */
> +	vmcs_write32(VM_ENTRY_CONTROLS,
> +		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
> +			~VM_ENTRY_IA32E_MODE) |
>  		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
>  
>  	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
> -- 1.7.10.4


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 03/13] nEPT: Fix wrong test in kvm_set_cr3
  2013-07-25 10:59 ` [PATCH v4 03/13] nEPT: Fix wrong test in kvm_set_cr3 Gleb Natapov
@ 2013-07-29  8:36   ` Paolo Bonzini
  2013-07-29 10:43     ` Gleb Natapov
  2013-07-31  8:02   ` Xiao Guangrong
  1 sibling, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29  8:36 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 25/07/2013 12:59, Gleb Natapov ha scritto:
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 89b15df..56d0066 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7596,8 +7596,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  	kvm_mmu_reset_context(vcpu);
>  
>  	/*
> -	 * Additionally, except when L0 is using shadow page tables, L1 or
> -	 * L2 control guest_cr3 for L2, so they may also have saved PDPTEs
> +	 * L1 may access the L2's PDPTR, so save them to construct vmcs12
>  	 */
>  	if (enable_ept) {
>  		vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
> @@ -7933,14 +7932,11 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  	 * own CR3 without exiting. If it has changed it, we must keep it.
>  	 * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
>  	 * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
> -	 */
> -	if (enable_ept)
> -		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
> -	/*
> -	 * Additionally, except when L0 is using shadow page tables, L1 or
> -	 * L2 control guest_cr3 for L2, so save their PDPTEs
> +	 *
> +	 * Additionally, restore L2's PDPTR to vmcs12.
>  	 */
>  	if (enable_ept) {
> +		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
>  		vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
>  		vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
>  		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);

This part addresses your review comments for v3 patch 6, and should be
squashed in patch 2 of this series.

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support
  2013-07-25 10:59 ` [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support Gleb Natapov
@ 2013-07-29  8:59   ` Paolo Bonzini
  2013-07-29 10:52     ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29  8:59 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 25/07/2013 12:59, Gleb Natapov ha scritto:
> +#if PTTYPE == PTTYPE_EPT
> +#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
> +#else
> +#define CHECK_BAD_MT_XWR(G) 0;
> +#endif
> +
>  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
>  {
>  	int bit7;
>  
>  	bit7 = (gpte >> 7) & 1;
> -	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> +	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
> +		CHECK_BAD_MT_XWR(gpte);
>  }
>  
> +#undef CHECK_BAD_MT_XWR

Instead of a macro, you can do

	if (...)
		return true;
#if PTTYPE == PTTYPE_EPT
	if (...)
		return true;
#endif
	return false;

The compiler should be smart enough to generate the same code for
non-EPT PTTYPE.
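
For concreteness, the whole function would then look roughly like this
(sketch only, reusing the bad_mt_xwr field that this patch introduces):

static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
{
	int bit7 = (gpte >> 7) & 1;

	if (gpte & mmu->rsvd_bits_mask[bit7][level-1])
		return true;
#if PTTYPE == PTTYPE_EPT
	if (mmu->bad_mt_xwr & (1ull << (gpte & 0x3f)))
		return true;
#endif
	return false;
}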

> 
> +	/*
> +	 * Use PFERR_RSVD_MASK in erorr_code to to tell if EPT
> +	 * misconfiguration requires to be injected. The detection is
> +	 * done by is_rsvd_bits_set() above.

erorr_code -> error_code

This patch has warnings for unused static functions.  You can squash
them, or split them differently according to file boundaries (i.e. mmu.c
first, vmx.c second).

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 11/13] nEPT: Advertise EPT to L1
  2013-07-25 10:59 ` [PATCH v4 11/13] nEPT: Advertise EPT to L1 Gleb Natapov
@ 2013-07-29  9:21   ` Paolo Bonzini
  2013-07-29 11:11     ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29  9:21 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 25/07/2013 12:59, Gleb Natapov ha scritto:
> +	if (enable_ept) {
> +		/* nested EPT: emulate EPT also to L1 */
> +		nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
> +		nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
> +		nested_vmx_ept_caps |=
> +			VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
> +			VMX_EPT_EXTENT_CONTEXT_BIT |
> +			VMX_EPT_EXTENT_INDIVIDUAL_BIT;

What version of the Intel manual defines bit 24?  Mine is January 2013
and only has 20/25/26.

> +		nested_vmx_ept_caps &= vmx_capability.ept;

This is always missing VMX_EPT_EXECUTE_ONLY_BIT; should it be added
before the "&="?

Also, the three extent bits should always be fine for the MSR,
independent of the host support, because the processor will do the
INVEPT vmexit before checking the INVEPT type against the processor
capabilities.  So they can be added after the "&=".
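
In other words, something along these lines (ordering sketch only, shown
here for the global and context invalidation types):

	nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT | VMX_EPT_INVEPT_BIT;
	nested_vmx_ept_caps &= vmx_capability.ept;
	/* the INVEPT types are emulated, so they need no host support */
	nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT |
		VMX_EPT_EXTENT_CONTEXT_BIT;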

Related to this, I suppose enable_ept should depend on
VMX_EPT_INVEPT_BIT too (it currently doesn't) since vmx_flush_tlb does
an unconditional invept under "if (enable_ept)".  This is really just
nitpicking though.
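
(In hardware_setup() that nitpick would amount to roughly the following,
assuming the existing cpu_has_vmx_ept()/cpu_has_vmx_ept_4levels() helpers;
sketch only.)

	if (!cpu_has_vmx_ept() || !cpu_has_vmx_ept_4levels() ||
	    !(vmx_capability.ept & VMX_EPT_INVEPT_BIT))
		enable_ept = 0;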

Paolo

> +	} else
> +		nested_vmx_ept_caps = 0;


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-25 10:59 ` [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h Gleb Natapov
@ 2013-07-29  9:48   ` Paolo Bonzini
  2013-07-29 11:33     ` Gleb Natapov
  2013-07-30 10:03   ` Paolo Bonzini
  1 sibling, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29  9:48 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 25/07/2013 12:59, Gleb Natapov ha scritto:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> This is the first patch in a series which adds nested EPT support to KVM's
> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
> to set its own cr3 and take its own page faults without either of L0 or L1
> getting involved. This often significantly improves L2's performance over the
> previous two alternatives (shadow page tables over EPT, and shadow page
> tables over shadow page tables).
> 
> This patch adds EPT support to paging_tmpl.h.
> 
> paging_tmpl.h contains the code for reading and writing page tables. The code
> for 32-bit and 64-bit tables is very similar, but not identical, so
> paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
> with PTTYPE=64, and this generates the two sets of similar functions.
> 
> There are subtle but important differences between the format of EPT tables
> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> third set of functions to read the guest EPT table and to write the shadow
> EPT table.
> 
> So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
> with "EPT") which correctly read and write EPT tables.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  arch/x86/kvm/mmu.c         |    5 +++++
>  arch/x86/kvm/paging_tmpl.h |   43 +++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 44 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 4c4274d..b5273c3 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
>  	return mmu->last_pte_bitmap & (1 << index);
>  }
>  
> +#define PTTYPE_EPT 18 /* arbitrary */
> +#define PTTYPE PTTYPE_EPT
> +#include "paging_tmpl.h"
> +#undef PTTYPE
> +
>  #define PTTYPE 64
>  #include "paging_tmpl.h"
>  #undef PTTYPE
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 7581395..e38b3c0 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -58,6 +58,21 @@
>  	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
>  	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
>  	#define CMPXCHG cmpxchg
> +#elif PTTYPE == PTTYPE_EPT
> +	#define pt_element_t u64
> +	#define guest_walker guest_walkerEPT
> +	#define FNAME(name) ept_##name
> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> +	#define PT_GUEST_ACCESSED_MASK 0
> +	#define PT_GUEST_DIRTY_MASK 0
> +	#define PT_GUEST_DIRTY_SHIFT 0
> +	#define PT_GUEST_ACCESSED_SHIFT 0
> +	#define CMPXCHG cmpxchg64
> +	#define PT_MAX_FULL_LEVELS 4
>  #else
>  	#error Invalid PTTYPE value
>  #endif
> @@ -90,6 +105,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
>  
>  static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
>  {
> +#if PT_GUEST_DIRTY_MASK == 0
> +	/* dirty bit is not supported, so no need to track it */
> +	return;
> +#else
>  	unsigned mask;
>  
>  	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
> @@ -99,6 +118,7 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
>  	mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
>  		PT_WRITABLE_MASK;
>  	*access &= mask;
> +#endif

Please put this #if/#else/#endif in the previous patch.  (See also
below on leaving out protect_clean_gpte altogether).

You probably should also have a

	BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT < PT_WRITABLE_SHIFT);

in the #else branch.

>  }
>  
>  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> @@ -111,7 +131,11 @@ static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
>  
>  static inline int FNAME(is_present_gpte)(unsigned long pte)
>  {
> +#if PTTYPE != PTTYPE_EPT
>  	return is_present_gpte(pte);
> +#else
> +	return pte & 7;
> +#endif
>  }
>  
>  static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> @@ -147,7 +171,8 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
>  	if (!FNAME(is_present_gpte)(gpte))
>  		goto no_present;
>  
> -	if (!(gpte & PT_GUEST_ACCESSED_MASK))
> +	/* if accessed bit is not supported prefetch non accessed gpte */
> +	if (PT_GUEST_ACCESSED_MASK && !(gpte & PT_GUEST_ACCESSED_MASK))

Same for this hunk.  Please put it in the previous patch.

>  		goto no_present;
>  
>  	return false;
> @@ -160,9 +185,14 @@ no_present:
>  static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
>  {
>  	unsigned access;
> -
> +#if PTTYPE == PTTYPE_EPT
> +	BUILD_BUG_ON(ACC_WRITE_MASK != VMX_EPT_WRITABLE_MASK);
> +	access = (gpte & VMX_EPT_WRITABLE_MASK) | ACC_USER_MASK |
> +		((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0);
> +#else
>  	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
>  	access &= ~(gpte >> PT64_NX_SHIFT);
> +#endif
>  
>  	return access;
>  }
> @@ -212,7 +242,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  				    gva_t addr, u32 access)
>  {
> -	int ret;
>  	pt_element_t pte;
>  	pt_element_t __user *uninitialized_var(ptep_user);
>  	gfn_t table_gfn;
> @@ -322,7 +351,9 @@ retry_walk:
>  		accessed_dirty &= pte >>
>  			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
> -	if (unlikely(!accessed_dirty)) {
> +	if (PT_GUEST_DIRTY_MASK && unlikely(!accessed_dirty)) {
> +		int ret;
> +
>  		ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, write_fault);

I think the whole block of code starting at

        if (!write_fault)
                protect_clean_gpte(&pte_access, pte);
        else
                /*
                 * On a write fault, fold the dirty bit into accessed_dirty by
                 * shifting it one place right.
                 */
 		accessed_dirty &= pte >>
 			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);

should be compiled out (in the previous patch) if dirty bits are not in use.
The "then" branch does nothing in that case, and the "else" branch is dead
code that makes no sense.

Once you do this, you can add a

	BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT < PT_GUEST_ACCESSED_SHIFT);

before the shift.

Please check if, with these changes, you can avoid defining
PT_GUEST_{DIRTY,ACCESSED}_SHIFT altogether in the EPT case.
This is safer because you are sure you left no undefined
behaviors when a bit is being folded onto another.

In principle, with these changes you could leave protect_clean_gpte in mmu.c.
I'm not sure what is the cleanest thing to do there, so I'll leave that to
your judgement.

Paolo

>  		if (unlikely(ret < 0))
>  			goto error;
> @@ -359,6 +390,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
>  					access);
>  }
>  
> +#if PTTYPE != PTTYPE_EPT
>  static int FNAME(walk_addr_nested)(struct guest_walker *walker,
>  				   struct kvm_vcpu *vcpu, gva_t addr,
>  				   u32 access)
> @@ -366,6 +398,7 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker,
>  	return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
>  					addr, access);
>  }
> +#endif
>  
>  static bool
>  FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> @@ -793,6 +826,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
>  	return gpa;
>  }
>  
> +#if PTTYPE != PTTYPE_EPT
>  static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
>  				      u32 access,
>  				      struct x86_exception *exception)
> @@ -811,6 +845,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
>  
>  	return gpa;
>  }
> +#endif
>  
>  /*
>   * Using the cached information from sp->gfns is safe because:
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 03/13] nEPT: Fix wrong test in kvm_set_cr3
  2013-07-29  8:36   ` Paolo Bonzini
@ 2013-07-29 10:43     ` Gleb Natapov
  0 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 10:43 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 10:36:05AM +0200, Paolo Bonzini wrote:
> Il 25/07/2013 12:59, Gleb Natapov ha scritto:
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 89b15df..56d0066 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -7596,8 +7596,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> >  	kvm_mmu_reset_context(vcpu);
> >  
> >  	/*
> > -	 * Additionally, except when L0 is using shadow page tables, L1 or
> > -	 * L2 control guest_cr3 for L2, so they may also have saved PDPTEs
> > +	 * L1 may access the L2's PDPTR, so save them to construct vmcs12
> >  	 */
> >  	if (enable_ept) {
> >  		vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
> > @@ -7933,14 +7932,11 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> >  	 * own CR3 without exiting. If it has changed it, we must keep it.
> >  	 * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
> >  	 * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
> > -	 */
> > -	if (enable_ept)
> > -		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
> > -	/*
> > -	 * Additionally, except when L0 is using shadow page tables, L1 or
> > -	 * L2 control guest_cr3 for L2, so save their PDPTEs
> > +	 *
> > +	 * Additionally, restore L2's PDPTR to vmcs12.
> >  	 */
> >  	if (enable_ept) {
> > +		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
> >  		vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
> >  		vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
> >  		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
> 
> This part addresses your review comments for v3 patch 6, and should be
> squashed in patch 2 of this series.
> 
Yeah, I noticed it in the wrong patch, but forgot to move it.

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support
  2013-07-29  8:59   ` Paolo Bonzini
@ 2013-07-29 10:52     ` Gleb Natapov
  2013-07-29 10:59       ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 10:52 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 10:59:31AM +0200, Paolo Bonzini wrote:
> Il 25/07/2013 12:59, Gleb Natapov ha scritto:
> > +#if PTTYPE == PTTYPE_EPT
> > +#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
> > +#else
> > +#define CHECK_BAD_MT_XWR(G) 0;
> > +#endif
> > +
> >  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> >  {
> >  	int bit7;
> >  
> >  	bit7 = (gpte >> 7) & 1;
> > -	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> > +	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
> > +		CHECK_BAD_MT_XWR(gpte);
> >  }
> >  
> > +#undef CHECK_BAD_MT_XWR
> 
> Instead of a macro, you can do
> 
> 	if (...)
> 		return true;
> #if PTTYPE == PTTYPE_EPT
> 	if (...)
> 		return true;
> #endif
> 	return false;
> 
> The compiler should be smart enough to generate the same code for
> non-EPT PTTYPE.
> 
The idea behind this rsvd_bits_mask trickery is to  produce code that
does not have conditional branches. I don't want to rely on the compiler to do
the right things. On the other hand I am not sure that just dropping these
ifdefs here and checking mmu->bad_mt_xwr for the non-EPT case is not a
good idea. The overhead should not be measurable.
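
Without the ifdefs that boils down to just (sketch; bad_mt_xwr simply
stays zero for the non-EPT MMUs, so the extra check is a no-op there):

	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
		(mmu->bad_mt_xwr & (1ull << (gpte & 0x3f)));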

> > 
> > +	/*
> > +	 * Use PFERR_RSVD_MASK in erorr_code to to tell if EPT
> > +	 * misconfiguration requires to be injected. The detection is
> > +	 * done by is_rsvd_bits_set() above.
> 
> erorr_code -> error_code
> 
> This patch has warnings for unused static functions.  You can squash
> them, or split them differently according to file boundaries (i.e. mmu.c
> first, vmx.c second).
> 
I prefer to have an in between patch with a warning, but do not divide
code that logically belongs together between patches.

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support
  2013-07-29 10:52     ` Gleb Natapov
@ 2013-07-29 10:59       ` Paolo Bonzini
  2013-07-29 11:43         ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29 10:59 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 29/07/2013 12:52, Gleb Natapov ha scritto:
> On Mon, Jul 29, 2013 at 10:59:31AM +0200, Paolo Bonzini wrote:
>> Il 25/07/2013 12:59, Gleb Natapov ha scritto:
>>> +#if PTTYPE == PTTYPE_EPT
>>> +#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
>>> +#else
>>> +#define CHECK_BAD_MT_XWR(G) 0;

No semicolons here, BTW.

>>> +#endif
>>> +
>>>  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
>>>  {
>>>  	int bit7;
>>>  
>>>  	bit7 = (gpte >> 7) & 1;
>>> -	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
>>> +	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
>>> +		CHECK_BAD_MT_XWR(gpte);
>>>  }
>>>  
>>> +#undef CHECK_BAD_MT_XWR
>>
>> Instead of a macro, you can do
>>
>> 	if (...)
>> 		return true;
>> #if PTTYPE == PTTYPE_EPT
>> 	if (...)
>> 		return true;
>> #endif
>> 	return false;
>>
>> The compiler should be smart enough to generate the same code for
>> non-EPT PTTYPE.
>>
> The idea behind this rsvd_bits_mask trickery is to  produce code that
> does not have conditional branches.

If you want to have no conditional branches, you need to use "|" not
"||" (and you also need an "!= 0" in CHECK_BAD_MT_XWR).  As you wrote
it, the compiler is most likely to generate exactly the same code that I
suggested.

But I think what you _really_ want is not avoiding conditional branches.
 What you want is to always inline is_rsvd_bits_set, so that the compiler
can merge these "if"s with the one where the caller calls
is_rsvd_bits_set.  So just mark is_rsvd_bits_set as inline.

> I don't want to rely on the compiler to do
> the right things. On the other hand I am not sure that just dropping these
> ifdefs here and checking mmu->bad_mt_xwr for the non-EPT case is not a
> good idea. The overhead should not be measurable.

That's also a possibility, but I think if you mark is_rsvd_bits_set as
inline it is better to leave the ifs separate.

>>>
>>> +	/*
>>> +	 * Use PFERR_RSVD_MASK in erorr_code to to tell if EPT
>>> +	 * misconfiguration requires to be injected. The detection is
>>> +	 * done by is_rsvd_bits_set() above.
>>
>> erorr_code -> error_code
>>
>> This patch has warnings for unused static functions.  You can squash
>> them, or split them differently according to file boundaries (i.e. mmu.c
>> first, vmx.c second).
>>
> I prefer to have an in between patch with a warning, but do not divide
> code that logically belongs together between patches.

Fine.

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 11/13] nEPT: Advertise EPT to L1
  2013-07-29  9:21   ` Paolo Bonzini
@ 2013-07-29 11:11     ` Gleb Natapov
  2013-07-29 11:33       ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 11:11 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 11:21:38AM +0200, Paolo Bonzini wrote:
> Il 25/07/2013 12:59, Gleb Natapov ha scritto:
> > +	if (enable_ept) {
> > +		/* nested EPT: emulate EPT also to L1 */
> > +		nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
> > +		nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
> > +		nested_vmx_ept_caps |=
> > +			VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
> > +			VMX_EPT_EXTENT_CONTEXT_BIT |
> > +			VMX_EPT_EXTENT_INDIVIDUAL_BIT;
> 
> What version of the Intel manual defines bit 24?  Mine is January 2013
> and only has 20/25/26.
> 
Good question. Looks like this crept in while Jun rebased Nadav's patch
on a more recent master. This bit was defined by the initial EPT implementation,
but was dropped by commit 2b3c5cbc0d814437fe4d70cc11ed60550b95b29f not
long ago.

> > +		nested_vmx_ept_caps &= vmx_capability.ept;
> 
> This is always missing VMX_EPT_EXECUTE_ONLY_BIT, should it be added
> before the "&=".
I am not at all sure our current shadow implementation can support
execute only pages. Best to leave it off for now.

> 
> Also, the three extent bits should always be fine for the MSR,
> independent of the host support, because the processor will do the
> INVEPT vmexit before checking the INVEPT type against the processor
> capabilities.  So they can be added after the "&=".
> 
Good point.

> Related to this, I suppose enable_ept should depend on
> VMX_EPT_INVEPT_BIT too (it currently doesn't) since vmx_flush_tlb does
> an unconditional invept under "if (enable_ept)".  This is really just
> nitpicking though.
> 
> Paolo
> 
> > +	} else
> > +		nested_vmx_ept_caps = 0;

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 11/13] nEPT: Advertise EPT to L1
  2013-07-29 11:11     ` Gleb Natapov
@ 2013-07-29 11:33       ` Paolo Bonzini
  2013-07-29 11:35         ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29 11:33 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 29/07/2013 13:11, Gleb Natapov ha scritto:
> > > +		nested_vmx_ept_caps &= vmx_capability.ept;
> > 
> > This is always missing VMX_EPT_EXECUTE_ONLY_BIT; should it be added
> > before the "&="?
> 
> I am not at all sure our current shadow implementation can support
> execute only pages. Best to leave it off for now.

Ok, I was tricked by this reference to nested_vmx_ept_caps's execonly bit:

+	int r = kvm_init_shadow_ept_mmu(vcpu, &vcpu->arch.mmu,
+			nested_vmx_ept_caps & VMX_EPT_EXECUTE_ONLY_BIT);

It's probably best to add a comment there, saying that the bit will
always be zero for now.
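
Something along these lines, for example (comment sketch only):

	/*
	 * EPT execute-only is not advertised to L1 yet, so this
	 * argument is always zero for now.
	 */
	int r = kvm_init_shadow_ept_mmu(vcpu, &vcpu->arch.mmu,
			nested_vmx_ept_caps & VMX_EPT_EXECUTE_ONLY_BIT);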

>> Also, the three extent bits should always be fine for the MSR,
>> independent of the host support, because the processor will do the
>> INVEPT vmexit before checking the INVEPT type against the processor
>> capabilities.  So they can be added after the "&=".
>>
> Good point.

For v5 you probably should leave out individual-addr invalidation from
this and the EPT patch too, though.

Paolo


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-29  9:48   ` Paolo Bonzini
@ 2013-07-29 11:33     ` Gleb Natapov
  2013-07-29 11:55       ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 11:33 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 11:48:06AM +0200, Paolo Bonzini wrote:
> Il 25/07/2013 12:59, Gleb Natapov ha scritto:
> > From: Nadav Har'El <nyh@il.ibm.com>
> > 
> > This is the first patch in a series which adds nested EPT support to KVM's
> > nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
> > EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
> > to set its own cr3 and take its own page faults without either of L0 or L1
> > getting involved. This often significantly improves L2's performance over the
> > previous two alternatives (shadow page tables over EPT, and shadow page
> > tables over shadow page tables).
> > 
> > This patch adds EPT support to paging_tmpl.h.
> > 
> > paging_tmpl.h contains the code for reading and writing page tables. The code
> > for 32-bit and 64-bit tables is very similar, but not identical, so
> > paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
> > with PTTYPE=64, and this generates the two sets of similar functions.
> > 
> > There are subtle but important differences between the format of EPT tables
> > and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> > third set of functions to read the guest EPT table and to write the shadow
> > EPT table.
> > 
> > So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
> > with "EPT") which correctly read and write EPT tables.
> > 
> > Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> > Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > ---
> >  arch/x86/kvm/mmu.c         |    5 +++++
> >  arch/x86/kvm/paging_tmpl.h |   43 +++++++++++++++++++++++++++++++++++++++----
> >  2 files changed, 44 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 4c4274d..b5273c3 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
> >  	return mmu->last_pte_bitmap & (1 << index);
> >  }
> >  
> > +#define PTTYPE_EPT 18 /* arbitrary */
> > +#define PTTYPE PTTYPE_EPT
> > +#include "paging_tmpl.h"
> > +#undef PTTYPE
> > +
> >  #define PTTYPE 64
> >  #include "paging_tmpl.h"
> >  #undef PTTYPE
> > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > index 7581395..e38b3c0 100644
> > --- a/arch/x86/kvm/paging_tmpl.h
> > +++ b/arch/x86/kvm/paging_tmpl.h
> > @@ -58,6 +58,21 @@
> >  	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
> >  	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
> >  	#define CMPXCHG cmpxchg
> > +#elif PTTYPE == PTTYPE_EPT
> > +	#define pt_element_t u64
> > +	#define guest_walker guest_walkerEPT
> > +	#define FNAME(name) ept_##name
> > +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> > +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> > +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> > +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> > +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> > +	#define PT_GUEST_ACCESSED_MASK 0
> > +	#define PT_GUEST_DIRTY_MASK 0
> > +	#define PT_GUEST_DIRTY_SHIFT 0
> > +	#define PT_GUEST_ACCESSED_SHIFT 0
> > +	#define CMPXCHG cmpxchg64
> > +	#define PT_MAX_FULL_LEVELS 4
> >  #else
> >  	#error Invalid PTTYPE value
> >  #endif
> > @@ -90,6 +105,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
> >  
> >  static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> >  {
> > +#if PT_GUEST_DIRTY_MASK == 0
> > +	/* dirty bit is not supported, so no need to track it */
> > +	return;
> > +#else
> >  	unsigned mask;
> >  
> >  	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
> > @@ -99,6 +118,7 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> >  	mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
> >  		PT_WRITABLE_MASK;
> >  	*access &= mask;
> > +#endif
> 
> Please put this #if/#else/#endif in the previous patch.  (See also
> below on leaving out protect_clean_gpte altogether).
> 
Why? This change does not make much sense before EPT is introduced. The
previous patch is just a rename that should be easily verifiable by any
reviewer to be a NOP.

> You probably should also have a
> 
> 	BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT < PT_WRITABLE_SHIFT);
> 
> in the #else branch.
> 
> >  }
> >  
> >  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> > @@ -111,7 +131,11 @@ static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> >  
> >  static inline int FNAME(is_present_gpte)(unsigned long pte)
> >  {
> > +#if PTTYPE != PTTYPE_EPT
> >  	return is_present_gpte(pte);
> > +#else
> > +	return pte & 7;
> > +#endif
> >  }
> >  
> >  static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > @@ -147,7 +171,8 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
> >  	if (!FNAME(is_present_gpte)(gpte))
> >  		goto no_present;
> >  
> > -	if (!(gpte & PT_GUEST_ACCESSED_MASK))
> > +	/* if accessed bit is not supported prefetch non accessed gpte */
> > +	if (PT_GUEST_ACCESSED_MASK && !(gpte & PT_GUEST_ACCESSED_MASK))
> 
> Same for this hunk.  Please put it in the previous patch.
> 
> >  		goto no_present;
> >  
> >  	return false;
> > @@ -160,9 +185,14 @@ no_present:
> >  static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
> >  {
> >  	unsigned access;
> > -
> > +#if PTTYPE == PTTYPE_EPT
> > +	BUILD_BUG_ON(ACC_WRITE_MASK != VMX_EPT_WRITABLE_MASK);
> > +	access = (gpte & VMX_EPT_WRITABLE_MASK) | ACC_USER_MASK |
> > +		((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0);
> > +#else
> >  	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
> >  	access &= ~(gpte >> PT64_NX_SHIFT);
> > +#endif
> >  
> >  	return access;
> >  }
> > @@ -212,7 +242,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
> >  				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >  				    gva_t addr, u32 access)
> >  {
> > -	int ret;
> >  	pt_element_t pte;
> >  	pt_element_t __user *uninitialized_var(ptep_user);
> >  	gfn_t table_gfn;
> > @@ -322,7 +351,9 @@ retry_walk:
> >  		accessed_dirty &= pte >>
> >  			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
> > -	if (unlikely(!accessed_dirty)) {
> > +	if (PT_GUEST_DIRTY_MASK && unlikely(!accessed_dirty)) {
> > +		int ret;
> > +
> >  		ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, write_fault);
> 
> I think the whole block of code starting at
> 
>         if (!write_fault)
>                 protect_clean_gpte(&pte_access, pte);
>         else
>                 /*
>                  * On a write fault, fold the dirty bit into accessed_dirty by
>                  * shifting it one place right.
>                  */
>  		accessed_dirty &= pte >>
>  			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
> 
> should be compiled out (in the previous patch) if dirty bits are not in use.
> The "then" branch does nothing in that case, and the "else" branch is dead
> code that makes no sense.
> 
I disagree, the ifdef was there and it was ugly. protect_clean_gpte
and update_accessed_dirty_bits had to be ifdefed too. The compiler should be
smart enough to get rid of all of this code when PT_GUEST_DIRTY_MASK is 0.
Doing it like that was Xiao's idea and it looks much nicer.

> Once you do this, you can add a
> 
> 	BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT < PT_GUEST_ACCESSED_SHIFT);
> 
> before the shift.
> 
> Please check if, with these changes, you can avoid defining
> PT_GUEST_{DIRTY,ACCESSED}_SHIFT altogether in the EPT case.
> This is safer because you are sure you left no undefined
> behaviors when a bit is being folded onto another.
You basically ask me to go back to how the patch was before I
addressed Xiao's comment, and to add some more ifdefs, because previously not
all places where A/D bits were used were protected by it. IMO this would
be a step backward especially as the method in this patch series is a
preparation for A/D support for EPT. When those bits are supported with
EPT they are different than in regular page tables.
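
(For illustration only: if EPT A/D support is wired up later, the EPT
block in paging_tmpl.h would presumably grow something like the following,
using the EPT entry bits 8/9 from asm/vmx.h rather than the x86 A/D bit
positions.)

	#define PT_GUEST_ACCESSED_MASK VMX_EPT_ACCESS_BIT
	#define PT_GUEST_DIRTY_MASK VMX_EPT_DIRTY_BIT
	#define PT_GUEST_ACCESSED_SHIFT 8
	#define PT_GUEST_DIRTY_SHIFT 9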

> 
> In principle, with these changes you could leave protect_clean_gpte in mmu.c.
Only if I ifdef all other uses of it in the file.

> I'm not sure what is the cleanest thing to do there, so I'll leave that to
> your judgement.
> 
> Paolo
> 
> >  		if (unlikely(ret < 0))
> >  			goto error;
> > @@ -359,6 +390,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
> >  					access);
> >  }
> >  
> > +#if PTTYPE != PTTYPE_EPT
> >  static int FNAME(walk_addr_nested)(struct guest_walker *walker,
> >  				   struct kvm_vcpu *vcpu, gva_t addr,
> >  				   u32 access)
> > @@ -366,6 +398,7 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker,
> >  	return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
> >  					addr, access);
> >  }
> > +#endif
> >  
> >  static bool
> >  FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> > @@ -793,6 +826,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
> >  	return gpa;
> >  }
> >  
> > +#if PTTYPE != PTTYPE_EPT
> >  static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
> >  				      u32 access,
> >  				      struct x86_exception *exception)
> > @@ -811,6 +845,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
> >  
> >  	return gpa;
> >  }
> > +#endif
> >  
> >  /*
> >   * Using the cached information from sp->gfns is safe because:
> > 

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 11/13] nEPT: Advertise EPT to L1
  2013-07-29 11:33       ` Paolo Bonzini
@ 2013-07-29 11:35         ` Gleb Natapov
  0 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 11:35 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 01:33:26PM +0200, Paolo Bonzini wrote:
> Il 29/07/2013 13:11, Gleb Natapov ha scritto:
> > > > +		nested_vmx_ept_caps &= vmx_capability.ept;
> > > 
> > > This is always missing VMX_EPT_EXECUTE_ONLY_BIT; should it be added
> > > before the "&="?
> > 
> > I am not at all sure our current shadow implementation can support
> > execute only pages. Best to leave it off for now.
> 
> Ok, I was tricked by this reference to nested_vmx_ept_caps's execonly bit:
> 
> +	int r = kvm_init_shadow_ept_mmu(vcpu, &vcpu->arch.mmu,
> +			nested_vmx_ept_caps & VMX_EPT_EXECUTE_ONLY_BIT);
> 
> It's probably best to add a comment there, saying that the bit will
> always be zero for now.
> 
> >> Also, the three extent bits should always be fine for the MSR,
> >> independent of the host support, because the processor will do the
> >> INVEPT vmexit before checking the INVEPT type against the processor
> >> capabilities.  So they can be added after the "&=".
> >>
> > Good point.
> 
> For v5 you probably should leave out individual-addr invalidation from
> this and the EPT patch too, though.
> 
Of course. The define should not be introduced again.

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support
  2013-07-29 10:59       ` Paolo Bonzini
@ 2013-07-29 11:43         ` Gleb Natapov
  2013-07-29 12:05           ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 11:43 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 12:59:54PM +0200, Paolo Bonzini wrote:
> Il 29/07/2013 12:52, Gleb Natapov ha scritto:
> > On Mon, Jul 29, 2013 at 10:59:31AM +0200, Paolo Bonzini wrote:
> >> Il 25/07/2013 12:59, Gleb Natapov ha scritto:
> >>> +#if PTTYPE == PTTYPE_EPT
> >>> +#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
> >>> +#else
> >>> +#define CHECK_BAD_MT_XWR(G) 0;
> 
> No semicolons here, BTW.
> 
> >>> +#endif
> >>> +
> >>>  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> >>>  {
> >>>  	int bit7;
> >>>  
> >>>  	bit7 = (gpte >> 7) & 1;
> >>> -	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> >>> +	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
> >>> +		CHECK_BAD_MT_XWR(gpte);
> >>>  }
> >>>  
> >>> +#undef CHECK_BAD_MT_XWR
> >>
> >> Instead of a macro, you can do
> >>
> >> 	if (...)
> >> 		return true;
> >> #if PTTYPE == PTTYPE_EPT
> >> 	if (...)
> >> 		return true;
> >> #endif
> >> 	return false;
> >>
> >> The compiler should be smart enough to generate the same code for
> >> non-EPT PTTYPE.
> >>
> > The idea behind this rsvd_bits_mask trickery is to  produce code that
> > does not have conditional branches.
> 
> If you want to have no conditional branches, you need to use "|" not
> "||" (and you also need an "!= 0" in CHECK_BAD_MT_XWR).  As you wrote
> it, the compiler is most likely to generate exactly the same code that I
> suggested.
OK. I can add that :) I still prefer to have ifdefs outside the
function, not inside.

> 
> But I think what you _really_ want is not avoiding conditional branches.
The idea is that it is hard for branch prediction to predict the correct
result when the correct result depends on the guest's page table, which can
contain anything, so in some places the shadow paging code uses boolean
logic to avoid branches; in this case it is hard to avoid the if() anyway
since the function invocation is in the if().

>  What you want is to always inline is_rsvd_bits_set, so that the compiler
> can merge these "if"s with the one where the caller calls
> is_rsvd_bits_set.  So just mark is_rsvd_bits_set as inline.
Will do, but it is inlined regardless. It's very small.


> 
> > I don't want to rely on the compiler to do
> > the right things. On the other hand I am not sure that just dropping these
> > ifdefs here and checking mmu->bad_mt_xwr for the non-EPT case is not a
> > good idea. The overhead should not be measurable.
> 
> That's also a possibility, but I think if you mark is_rsvd_bits_set as
> inline it is better to leave the ifs separate.
> 
> >>>
> >>> +	/*
> >>> +	 * Use PFERR_RSVD_MASK in erorr_code to to tell if EPT
> >>> +	 * misconfiguration requires to be injected. The detection is
> >>> +	 * done by is_rsvd_bits_set() above.
> >>
> >> erorr_code -> error_code
> >>
> >> This patch has warnings for unused static functions.  You can squash
> >> them, or split them differently according to file boundaries (i.e. mmu.c
> >> first, vmx.c second).
> >>
> > I prefer to have an in between patch with a warning, but do not divide
> > code that logically belongs together between patches.
> 
> Fine.
> 
> Paolo

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-29 11:33     ` Gleb Natapov
@ 2013-07-29 11:55       ` Paolo Bonzini
  2013-07-29 12:24         ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29 11:55 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 29/07/2013 13:33, Gleb Natapov ha scritto:
> On Mon, Jul 29, 2013 at 11:48:06AM +0200, Paolo Bonzini wrote:
>> Il 25/07/2013 12:59, Gleb Natapov ha scritto:
>>> From: Nadav Har'El <nyh@il.ibm.com>
>>>
>>> This is the first patch in a series which adds nested EPT support to KVM's
>>> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
>>> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
>>> to set its own cr3 and take its own page faults without either of L0 or L1
>>> getting involved. This often significantly improves L2's performance over the
>>> previous two alternatives (shadow page tables over EPT, and shadow page
>>> tables over shadow page tables).
>>>
>>> This patch adds EPT support to paging_tmpl.h.
>>>
>>> paging_tmpl.h contains the code for reading and writing page tables. The code
>>> for 32-bit and 64-bit tables is very similar, but not identical, so
>>> paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
>>> with PTTYPE=64, and this generates the two sets of similar functions.
>>>
>>> There are subtle but important differences between the format of EPT tables
>>> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
>>> third set of functions to read the guest EPT table and to write the shadow
>>> EPT table.
>>>
>>> So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
>>> with "EPT") which correctly read and write EPT tables.
>>>
>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
>>> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
>>> Signed-off-by: Gleb Natapov <gleb@redhat.com>
>>> ---
>>>  arch/x86/kvm/mmu.c         |    5 +++++
>>>  arch/x86/kvm/paging_tmpl.h |   43 +++++++++++++++++++++++++++++++++++++++----
>>>  2 files changed, 44 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>> index 4c4274d..b5273c3 100644
>>> --- a/arch/x86/kvm/mmu.c
>>> +++ b/arch/x86/kvm/mmu.c
>>> @@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
>>>  	return mmu->last_pte_bitmap & (1 << index);
>>>  }
>>>  
>>> +#define PTTYPE_EPT 18 /* arbitrary */
>>> +#define PTTYPE PTTYPE_EPT
>>> +#include "paging_tmpl.h"
>>> +#undef PTTYPE
>>> +
>>>  #define PTTYPE 64
>>>  #include "paging_tmpl.h"
>>>  #undef PTTYPE
>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>>> index 7581395..e38b3c0 100644
>>> --- a/arch/x86/kvm/paging_tmpl.h
>>> +++ b/arch/x86/kvm/paging_tmpl.h
>>> @@ -58,6 +58,21 @@
>>>  	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
>>>  	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
>>>  	#define CMPXCHG cmpxchg
>>> +#elif PTTYPE == PTTYPE_EPT
>>> +	#define pt_element_t u64
>>> +	#define guest_walker guest_walkerEPT
>>> +	#define FNAME(name) ept_##name
>>> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
>>> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
>>> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>>> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>>> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
>>> +	#define PT_GUEST_ACCESSED_MASK 0
>>> +	#define PT_GUEST_DIRTY_MASK 0
>>> +	#define PT_GUEST_DIRTY_SHIFT 0
>>> +	#define PT_GUEST_ACCESSED_SHIFT 0
>>> +	#define CMPXCHG cmpxchg64
>>> +	#define PT_MAX_FULL_LEVELS 4
>>>  #else
>>>  	#error Invalid PTTYPE value
>>>  #endif
>>> @@ -90,6 +105,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
>>>  
>>>  static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
>>>  {
>>> +#if PT_GUEST_DIRTY_MASK == 0
>>> +	/* dirty bit is not supported, so no need to track it */
>>> +	return;
>>> +#else
>>>  	unsigned mask;
>>>  
>>>  	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
>>> @@ -99,6 +118,7 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
>>>  	mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
>>>  		PT_WRITABLE_MASK;
>>>  	*access &= mask;
>>> +#endif
>>
>> Please put this #if/#else/#endif in the previous patch.  (See also
>> below on leaving out protect_clean_gpte altogether).
>>
> Why? This change does not make much sense before EPT is introduced. The
> previous patch is just a rename that should be easily verifiable by any
> reviewer to be a NOP.

My initial impression of this patch was "everything's ready after the
previous patch, you just have to set the mask to 0".  Which is not quite
true.  Maybe you need three patches instead of two.

>>         if (!write_fault)
>>                 protect_clean_gpte(&pte_access, pte);
>>         else
>>                 /*
>>                  * On a write fault, fold the dirty bit into accessed_dirty by
>>                  * shifting it one place right.
>>                  */
>>  		accessed_dirty &= pte >>
>>  			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
>>
>> should be compiled out (in the previous patch) if dirty bits are not in use.
>> The "then" branch does nothing in that case, and the "else" branch is dead
>> code that makes no sense.
>
> I disagree, the ifdef was there and it was ugly. protect_clean_gpte
> and update_accessed_dirty_bits had to be ifdefed too.

protect_clean_gpte is still #ifdef'ed, so we're talking of a net gain of
2 lines.

Something like this:

+       /* if dirty bit is not supported, no need to track it */
+#if PT_GUEST_DIRTY_MASK != 0
        if (!write_fault)
                 protect_clean_gpte(&pte_access, pte);
...
        if (unlikely(!accessed_dirty)) {
...
        }
+#endif

doesn't look bad at all.  With the old check on EPT it looked ugly, but
with the new check on PT_GUEST_DIRTY_MASK it is quite natural.  Also
because you have anyway a reference to PT_GUEST_DIRTY_MASK in the "if".
 If I see

        if (!write_fault)
                protect_clean_gpte(&pte_access, pte);
        else
                /*
                 * On a write fault, fold the dirty bit into accessed_dirty
                 * by shifting it one place right.
                 */
                accessed_dirty &= pte >>
			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);

        if (PT_GUEST_DIRTY_MASK != 0 && unlikely(!accessed_dirty)) {

the obvious reaction is "what, is there a case where I'm using
accessed_dirty if PT_GUEST_DIRTY_MASK == 0?"  Of course it makes sense
to leave the otherwise dead assignments to accessed_dirty in the loop
(the compiler removes them anyway); but here the assignment is too close
to the user to not raise such questions.

Perhaps it's obvious to you because you're more familiar with this code.

> The compiler should be
> smart enough to get rid of all of this code when PT_GUEST_DIRTY_MASK is 0.
> Doing it like that was Xiao's idea and it looks much nicer.
> 
>> Once you do this, you can add a
>>
>> 	BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT < PT_GUEST_ACCESSED_SHIFT);
>>
>> before the shift.
>>
>> Please check if, with these changes, you can avoid defining
>> PT_GUEST_{DIRTY,ACCESSED}_SHIFT altogether in the EPT case.
>> This is safer because you are sure you left no undefined
>> behaviors when a bit is being folded onto another.
> You basically ask me to go back to how the patch was before I
> addressed Xiao's comment, and to add some more ifdefs, because previously not
> all places where A/D bits were used were protected by it. IMO this would
> be a step backward especially as the method in this patch series is a
> preparation for A/D support for EPT. When those bits are supported with
> EPT they are different than in regular page tables.

Yes, and if they put the dirty bit below the accessed bit (rather
than above), you will silently get undefined behavior for a shift by
a negative value.  Adding BUILD_BUG_ONs for this, and ensuring
PT_GUEST_{DIRTY,ACCESSED}_SHIFT are never used in the EPT case, is
important in my opinion.

I very much like what you did in the previous patch, allowing
customization of the masks instead of using EPT ifdefs all over the
place.  On the other hand, once you did that the benefits of Xiao's
proposed change pretty much go away.

>> In principle, with these changes you could leave protect_clean_gpte in mmu.c.
> 
> Only if I ifdef all other uses of it in the file.

Yeah, scrap that.

Paolo


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support
  2013-07-29 11:43         ` Gleb Natapov
@ 2013-07-29 12:05           ` Paolo Bonzini
  2013-07-29 12:34             ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29 12:05 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 29/07/2013 13:43, Gleb Natapov ha scritto:
> On Mon, Jul 29, 2013 at 12:59:54PM +0200, Paolo Bonzini wrote:
>> Il 29/07/2013 12:52, Gleb Natapov ha scritto:
>>> On Mon, Jul 29, 2013 at 10:59:31AM +0200, Paolo Bonzini wrote:
>>>> Il 25/07/2013 12:59, Gleb Natapov ha scritto:
>>>>> +#if PTTYPE == PTTYPE_EPT
>>>>> +#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
>>>>> +#else
>>>>> +#define CHECK_BAD_MT_XWR(G) 0;
>>
>> No semicolons here, BTW.
>>
>>>>> +#endif
>>>>> +
>>>>>  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
>>>>>  {
>>>>>  	int bit7;
>>>>>  
>>>>>  	bit7 = (gpte >> 7) & 1;
>>>>> -	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
>>>>> +	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
>>>>> +		CHECK_BAD_MT_XWR(gpte);
>>>>>  }
>>>>>  
>>>>> +#undef CHECK_BAD_MT_XWR
>>>>
>>>> Instead of a macro, you can do
>>>>
>>>> 	if (...)
>>>> 		return true;
>>>> #if PTTYPE == PTTYPE_EPT
>>>> 	if (...)
>>>> 		return true;
>>>> #endif
>>>> 	return false;
>>>>
>>>> The compiler should be smart enough to generate the same code for
>>>> non-EPT PTTYPE.
>>>>
>>> The idea behind this rsvd_bits_mask trickery is to  produce code that
>>> does not have conditional branches.
>>
>> If you want to have no conditional branches, you need to use "|" not
>> "||" (and you also need an "!= 0" in CHECK_BAD_MT_XWR).  As you wrote
>> it, the compiler is most likely to generate exactly the same code that I
>> suggested.
> OK. I can add that :) I still prefer to have ifdefs outside the
> function, not inside.
> 
>>
>> But I think what you _really_ want is not avoiding conditional branches.
> The idea is that it is hard for branch prediction to predict correct
> result when correct result depends on guest's page table that can
> contain anything, so in some places shadow paging code uses boolean
> logic to avoid branches, in this case it is hard to avoid if() anyway
> since the function invocation is in the if().

Yes, I get the idea, but is_rsvd_bits_set should be predicted unlikely,
no?  If the guest has to run, it must use mostly valid ptes. :)

Especially if you change prefetch_invalid_gpte to do the reserved bits
test after the present test (so that is_rsvd_bits_set is only called on
present pagetables), is_rsvd_bits_set's result should be really
well-predicted.  At this point (and especially since function invocation
is always in "if"s), using boolean logic to avoid branches does not make
much sense anymore for this function.

The compiler may still use setCC and boolean logic, even if you write it
as "if"s; and in simple cases like this, after inlining and seeing an
"if" the compiler may undo all your careful choice of "|" vs. "||".  So
it's better anyway to ensure is_rsvd_bits_set is called only when it's
unlikely.

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-29 11:55       ` Paolo Bonzini
@ 2013-07-29 12:24         ` Gleb Natapov
  2013-07-29 13:19           ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 12:24 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 01:55:46PM +0200, Paolo Bonzini wrote:
> Il 29/07/2013 13:33, Gleb Natapov ha scritto:
> > On Mon, Jul 29, 2013 at 11:48:06AM +0200, Paolo Bonzini wrote:
> >> Il 25/07/2013 12:59, Gleb Natapov ha scritto:
> >>> From: Nadav Har'El <nyh@il.ibm.com>
> >>>
> >>> This is the first patch in a series which adds nested EPT support to KVM's
> >>> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
> >>> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
> >>> to set its own cr3 and take its own page faults without either of L0 or L1
> >>> getting involved. This often significanlty improves L2's performance over the
> >>> previous two alternatives (shadow page tables over EPT, and shadow page
> >>> tables over shadow page tables).
> >>>
> >>> This patch adds EPT support to paging_tmpl.h.
> >>>
> >>> paging_tmpl.h contains the code for reading and writing page tables. The code
> >>> for 32-bit and 64-bit tables is very similar, but not identical, so
> >>> paging_tmpl.h is #include'd twice in mmu.c, once with PTTTYPE=32 and once
> >>> with PTTYPE=64, and this generates the two sets of similar functions.
> >>>
> >>> There are subtle but important differences between the format of EPT tables
> >>> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> >>> third set of functions to read the guest EPT table and to write the shadow
> >>> EPT table.
> >>>
> >>> So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
> >>> with "EPT") which correctly read and write EPT tables.
> >>>
> >>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> >>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> >>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> >>> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> >>> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> >>> ---
> >>>  arch/x86/kvm/mmu.c         |    5 +++++
> >>>  arch/x86/kvm/paging_tmpl.h |   43 +++++++++++++++++++++++++++++++++++++++----
> >>>  2 files changed, 44 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>> index 4c4274d..b5273c3 100644
> >>> --- a/arch/x86/kvm/mmu.c
> >>> +++ b/arch/x86/kvm/mmu.c
> >>> @@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
> >>>  	return mmu->last_pte_bitmap & (1 << index);
> >>>  }
> >>>  
> >>> +#define PTTYPE_EPT 18 /* arbitrary */
> >>> +#define PTTYPE PTTYPE_EPT
> >>> +#include "paging_tmpl.h"
> >>> +#undef PTTYPE
> >>> +
> >>>  #define PTTYPE 64
> >>>  #include "paging_tmpl.h"
> >>>  #undef PTTYPE
> >>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> >>> index 7581395..e38b3c0 100644
> >>> --- a/arch/x86/kvm/paging_tmpl.h
> >>> +++ b/arch/x86/kvm/paging_tmpl.h
> >>> @@ -58,6 +58,21 @@
> >>>  	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
> >>>  	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
> >>>  	#define CMPXCHG cmpxchg
> >>> +#elif PTTYPE == PTTYPE_EPT
> >>> +	#define pt_element_t u64
> >>> +	#define guest_walker guest_walkerEPT
> >>> +	#define FNAME(name) ept_##name
> >>> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> >>> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> >>> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> >>> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> >>> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> >>> +	#define PT_GUEST_ACCESSED_MASK 0
> >>> +	#define PT_GUEST_DIRTY_MASK 0
> >>> +	#define PT_GUEST_DIRTY_SHIFT 0
> >>> +	#define PT_GUEST_ACCESSED_SHIFT 0
> >>> +	#define CMPXCHG cmpxchg64
> >>> +	#define PT_MAX_FULL_LEVELS 4
> >>>  #else
> >>>  	#error Invalid PTTYPE value
> >>>  #endif
> >>> @@ -90,6 +105,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
> >>>  
> >>>  static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> >>>  {
> >>> +#if PT_GUEST_DIRTY_MASK == 0
> >>> +	/* dirty bit is not supported, so no need to track it */
> >>> +	return;
> >>> +#else
> >>>  	unsigned mask;
> >>>  
> >>>  	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
> >>> @@ -99,6 +118,7 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> >>>  	mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
> >>>  		PT_WRITABLE_MASK;
> >>>  	*access &= mask;
> >>> +#endif
> >>
> >> Please put this #if/#else/#endif in the previous patch.  (See also
> >> below on leaving out protect_clean_gpte altogether).
> >>
> > Why? This change does not make much sense before EPT is introduced. The
> > previous patch is just a rename that should be easily verifiable by any
> > reviewer to be NOP.
> 
> My initial impression to this patch was "everything's ready after the
> previous patch, you just have to set the mask to 0".  Which is not quite
> true.  Maybe you need three patches instead of two.
> 
Or change commit message for patch 5 to make it more clear that it is a
preparation patch?

> >>         if (!write_fault)
> >>                 protect_clean_gpte(&pte_access, pte);
> >>         else
> >>                 /*
> >>                  * On a write fault, fold the dirty bit into accessed_dirty by
> >>                  * shifting it one place right.
> >>                  */
> >>  		accessed_dirty &= pte >>
> >>  			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
> >>
> >> should be compiled out (in the previous patch) if dirty bits are not in use.
> >> The "then" branch does nothing in that case, and the "else" branch is dead
> >> code that makes no sense.
> >
> > I disagree, there ifdef was there and it was ugly. protect_clean_gpte
> > and update_accessed_dirty_bits had to be ifdefed too.
> 
> protect_clean_gpte is still #ifdef'ed, so we're talking of a net gain of
> 2 lines.
Please no need to count lines :) protect_clean_gpte() _has_ ifdef, it
is not ifdefed.

> 
> Something like this:
> 
> +       /* if dirty bit is not supported, no need to track it */
> +#if PT_GUEST_DIRTY_MASK == 0
>         if (!write_fault)
>                  protect_clean_gpte(&pte_access, pte);
> ...
>         if (unlikely(!accessed_dirty)) {
> ...
>         }
> +#endif
> 
I will have to do the same for update_accessed_dirty_bits(). The problem
with ifdefs is that they spread around.

> doesn't look bad at all.  With the old check on EPT it looked ugly, but
> with the new check on PT_GUEST_DIRTY_MASK it is quite natural.  Also
> because you have anyway a reference to PT_GUEST_DIRTY_MASK in the "if".
>  If I see
> 
>         if (!write_fault)
>                 protect_clean_gpte(&pte_access, pte);
>         else
>                 /*
>                  * On a write fault, fold the dirty bit into
> 		 * accessed_dirty by
>                  * shifting it one place right.
>                  */
>                 accessed_dirty &=
> 			pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
> 
>         if (PT_GUEST_DIRTY_MASK != 0 && unlikely(!accessed_dirty)) {
> 
> the obvious reaction is "what, is there a case where I'm using
> accessed_dirty if PT_GUEST_DIRTY_MASK == 0?"  Of course it makes sense
In this case accessed_dirty has the correct value of 0 :) The if() below just
tells you that since A/D is not supported there is nothing to be done
about the zero value of accessed_dirty, but the value itself is correct!

> to leave the otherwise dead assignments to accessed_dirty in the loop
> (the compiler removes them anyway); but here the assignment is too close
> to the user to not raise such questions.
> 
I would think that since the assignment is close to the loop it is _easier_
to see that it is dead, not harder.

> Perhaps it's obvious to you because you're more familiar with this code.
> 
> > Compiler should be
> > smart enough to get rid of all of this code when PT_GUEST_DIRTY_MASK is 0.
> > Doing it like that was Xiao idea and it looks much nicer. 
> > 
> >> Once you do this, you can add a
> >>
> >> 	BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT < PT_GUEST_ACCESSED_SHIFT);
> >>
> >> before the shift.
> >>
> >> Please check if, with these changes, you can avoid defining
> >> PT_GUEST_{DIRTY,ACCESSED}_SHIFT altogether in the EPT case.
> >> This is safer because you are sure you left no undefined
> >> behaviors when a bit is being folded onto another.
> > You basically ask me to get back to the patch how it was before I
> > addressed Xiao comment and add some more idfefs because previously not
> > all places where A/D bits were used were protected by it. IMO this would
> > be a step backward especially as the method in this patch series is a
> > preparation for A/D support for EPT. When those bits are supported with
> > EPT they are different than in regular page tables.
> 
> Yes, and if they will put the dirty bit below the accessed bit (rather
> than above), you will silently get undefined behavior for a shift by
> negative value.  Adding BUILD_BUG_ONs for this, and ensuring
> PT_GUEST_{DIRTY,ACCESSED}_SHIFT are never used in the EPT case, is
> important in my opinion.
> 
The format is defined. Dirty bit is above Accessed.

> I very much like what you did in the previous patch, allowing
> customization of the masks instead of using EPT ifdefs all over the
> place.  On the other hand, once you did that the benefits of Xiao's
> proposed change pretty much go away.
> 
> >> In principle, with these changes you could leave protect_clean_gpte in mmu.c.
> > 
> > Only if I ifdef all other uses of in in the file.
> 
> Yeah, scrap that.
> 
> Paolo

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support
  2013-07-29 12:05           ` Paolo Bonzini
@ 2013-07-29 12:34             ` Gleb Natapov
  2013-07-29 13:11               ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 12:34 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 02:05:44PM +0200, Paolo Bonzini wrote:
> Il 29/07/2013 13:43, Gleb Natapov ha scritto:
> > On Mon, Jul 29, 2013 at 12:59:54PM +0200, Paolo Bonzini wrote:
> >> Il 29/07/2013 12:52, Gleb Natapov ha scritto:
> >>> On Mon, Jul 29, 2013 at 10:59:31AM +0200, Paolo Bonzini wrote:
> >>>> Il 25/07/2013 12:59, Gleb Natapov ha scritto:
> >>>>> +#if PTTYPE == PTTYPE_EPT
> >>>>> +#define CHECK_BAD_MT_XWR(G) mmu->bad_mt_xwr & (1ull << ((G) & 0x3f));
> >>>>> +#else
> >>>>> +#define CHECK_BAD_MT_XWR(G) 0;
> >>
> >> No semicolons here, BTW.
> >>
> >>>>> +#endif
> >>>>> +
> >>>>>  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> >>>>>  {
> >>>>>  	int bit7;
> >>>>>  
> >>>>>  	bit7 = (gpte >> 7) & 1;
> >>>>> -	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> >>>>> +	return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) ||
> >>>>> +		CHECK_BAD_MT_XWR(gpte);
> >>>>>  }
> >>>>>  
> >>>>> +#undef CHECK_BAD_MT_XWR
> >>>>
> >>>> Instead of a macro, you can do
> >>>>
> >>>> 	if (...)
> >>>> 		return true;
> >>>> #if PTTYPE == PTTYPE_EPT
> >>>> 	if (...)
> >>>> 		return true;
> >>>> #endif
> >>>> 	return false;
> >>>>
> >>>> The compiler should be smart enough to generate the same code for
> >>>> non-EPT PTTYPE.
> >>>>
> >>> The idea behind this rsvd_bits_mask trickery is to  produce code that
> >>> does not have conditional branches.
> >>
> >> If you want to have no conditional branches, you need to use "|" not
> >> "||" (and you also need an "!= 0" in CHECK_BAD_MT_XWR).  As you wrote
> >> it, the compiler is most likely to generate exactly the same code that I
> >> suggested.
> > OK. I can add that :) I still prefer to have ifdefs outside the
> > function, not inside.
> > 
> >>
> >> But I think what you _really_ want is not avoiding conditional branches.
> > The idea is that it is hard for branch prediction to predict correct
> > result when correct result depends on guest's page table that can
> > contain anything, so in some places shadow paging code uses boolean
> > logic to avoid branches, in this case it is hard to avoid if() anyway
> > since the function invocation is in the if().
> 
> Yes, I get the idea, but is_rsvd_bits_set should be predicted unlikely,
> no?  If the guest has to run, it must use mostly valid ptes. :)
> 
You see, you are confused and you want branch prediction not to be? :)
If your guest is KVM, is_rsvd_bits_set() will be likely much more than
unlikely, because KVM misconfigures EPT entries to cache MMIO addresses,
so all the "unlikely" cases will be fixed by shadow pages and will not
reappear (until shadow pages are zapped), but misconfigured entries will
continue to produce violations.

> Especially if you change prefetch_invalid_gpte to do the reserved bits
> test after the present test (so that is_rsvd_bits_set is only called on
> present pagetables), is_rsvd_bits_set's result should be really
> well-predicted. 
Nope, for ept page tables present is not a single bit, it is three bits
which by themselves can have invalid values. For regular paging you are
probably right.

>                   At this point (and especially since function invocation
> is always in "if"s), using boolean logic to avoid branches does not make
> much sense anymore for this function.
That's true.

> 
> The compiler may still use setCC and boolean logic, even if you write it
> as "if"s; and in simple cases like this, after inlining and seeing an
> "if" the compiler may undo all your careful choice of "|" vs. "||".  So
> it's better anyway to ensure is_rsvd_bits_set is called only when it's
> unlikely.
> 
> Paolo

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support
  2013-07-29 12:34             ` Gleb Natapov
@ 2013-07-29 13:11               ` Paolo Bonzini
  2013-07-29 13:20                 ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29 13:11 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 29/07/2013 14:34, Gleb Natapov ha scritto:
>>>> But I think what you _really_ want is not avoiding conditional branches.
>>> The idea is that it is hard for branch prediction to predict correct
>>> result when correct result depends on guest's page table that can
>>> contain anything, so in some places shadow paging code uses boolean
>>> logic to avoid branches, in this case it is hard to avoid if() anyway
>>> since the function invocation is in the if().
>>
>> Yes, I get the idea, but is_rsvd_bits_set should be predicted unlikely,
>> no?  If the guest has to run, it must use mostly valid ptes. :)
>>
> You see, you are confused and you want branch prediction not to be? :)
> If your guest is KVM is_rsvd_bits_set() will be likely much more then
> unlikely because KVM misconfigures EPT entries to cache MMIO addresses,
> so all the "unlikely" cases will be fixed by shadow pages and will not
> reappear (until shadow pages are zapped), but misconfigured entries will
> continue to produces violations.

But then:

1) MMIO is a slow path anyway, losing 10 cycles on a mispredicted branch
is not going to help much.  Fast page faults are all I would optimize for.

2) in cases like this you just do not use likely/unlikely; the branch
will be very unlikely in the beginning, and very likely once shadow
pages are filled or in the no-EPT case.  Just let the branch predictor
adjust, it will probably do better than boolean tricks.

>> Especially if you change prefetch_invalid_gpte to do the reserved bits
>> test after the present test (so that is_rsvd_bits_set is only called on
>> present pagetables), is_rsvd_bits_set's result should be really
>> well-predicted. 
> Nope, for ept page tables present is not a single bit, it is three bits
> which by themselves can have invalid values.

We're not checking the validity of the bits in the is_present_gpte test,
we're checking it in the is_rsvd_bits_set test (is_present_gpte is doing
just "(pte & 7) != 0").  It doesn't change anything in the outcome of
prefetch_invalid_gpte, and it makes the ordering consistent with
walk_addr_generic which already tests presence before reserved bits.

So doing this swap should be a win anyway.
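
Something along these lines in prefetch_invalid_gpte (only a sketch of the
reordered checks; the FNAME() spelling, argument passing and the existing
no_present label are assumed from the current code):

	if (!FNAME(is_present_gpte)(gpte))
		goto no_present;

	/* only present entries reach the heavier reserved-bits test, which
	 * also matches the order walk_addr_generic already uses */
	if (FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
		goto no_present;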

>>                   At this point (and especially since function invocation
>> is always in "if"s), using boolean logic to avoid branches does not make
>> much sense anymore for this function.
> 
> That's true.

So are you going to change to "if"s?

Paolo


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-07-29  8:32   ` Paolo Bonzini
@ 2013-07-29 13:12     ` Gleb Natapov
  2013-07-29 14:13       ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 13:12 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 10:32:58AM +0200, Paolo Bonzini wrote:
> Il 25/07/2013 12:59, Gleb Natapov ha scritto:
> > +	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
> > +	 * bits are further modified by vmx_set_efer() below.
> 
> IA32E_MODE, but the last sentence is not relevant anyway for this
> vmcs_write32; in fact, the next command basically duplicates it.
> 
IA32E_MODE is not relevant, it never changes in VM_EXIT_CONTROLS, but
LOAD_IA32_EFER refers to VM_EXIT_LOAD_IA32_EFER, which can be set by
vmx_set_efer() and is different from the VM_ENTRY_LOAD_IA32_EFER that the
comment below talks about.
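
E.g. the relevant part of the comment could read (wording only a suggestion):

	/*
	 * Note that VM_EXIT_LOAD_IA32_EFER may still be set later by
	 * vmx_set_efer(); IA32E_MODE never changes in VM_EXIT_CONTROLS,
	 * so it needs no special handling here.
	 */
	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);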

> This can be fixed while applying.
> 
> Paolo
> 
> > +	 */
> > +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
> > +
> > +	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
> > +	 * emulated by vmx_set_efer(), below.
> > +	 */
> > +	vmcs_write32(VM_ENTRY_CONTROLS,
> > +		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
> > +			~VM_ENTRY_IA32E_MODE) |
> >  		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
> >  
> >  	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
> > -- 1.7.10.4

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-29 12:24         ` Gleb Natapov
@ 2013-07-29 13:19           ` Paolo Bonzini
  2013-07-29 13:27             ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29 13:19 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 29/07/2013 14:24, Gleb Natapov ha scritto:
>> My initial impression to this patch was "everything's ready after the
>> previous patch, you just have to set the mask to 0".  Which is not quite
>> true.  Maybe you need three patches instead of two.
>>
> Or change commit message for patch 5 to make it more clear that it is a
> preparation patch?

Or both.  Just give it a try.

>>
>> Something like this:
>>
>> +       /* if dirty bit is not supported, no need to track it */
>> +#if PT_GUEST_DIRTY_MASK == 0
>>         if (!write_fault)
>>                  protect_clean_gpte(&pte_access, pte);
>> ...
>>         if (unlikely(!accessed_dirty)) {
>> ...
>>         }
>> +#endif
>>
> I will have to do the same for update_accessed_dirty_bits(). The problem
> of idfdefs they spread around.

Putting update_accessed_dirty_bits() with "#ifdef do we have
accessed_dirty_bits at all" sounds just fine.

But if you do not like #ifdefs you can use __maybe_unused and the
compiler will elide it.
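
E.g. (a sketch; the parameter list is elided here):

	/* never called when the A/D path compiles out; __maybe_unused keeps
	 * -Wunused-function quiet and the compiler drops the dead function */
	static int __maybe_unused FNAME(update_accessed_dirty_bits)(...)
	{
		...
	}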

>> doesn't look bad at all.  With the old check on EPT it looked ugly, but
>> with the new check on PT_GUEST_DIRTY_MASK it is quite natural.  Also
>> because you have anyway a reference to PT_GUEST_DIRTY_MASK in the "if".
>>  If I see
>>
>>         if (!write_fault)
>>                 protect_clean_gpte(&pte_access, pte);
>>         else
>>                 /*
>>                  * On a write fault, fold the dirty bit into
>> 		 * accessed_dirty by
>>                  * shifting it one place right.
>>                  */
>>                 accessed_dirty &=
>> 			pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
>>
>>         if (PT_GUEST_DIRTY_MASK != 0 && unlikely(!accessed_dirty)) {
>>
>> the obvious reaction is "what, is there a case where I'm using
>> accessed_dirty if PT_GUEST_DIRTY_MASK == 0?"  Of course it makes sense
> In this case accessed_dirty has correct value of 0 :) The if() bellow just
> tells you that since A/D is not supported there is nothing to be done
> about zero value of accessed_dirty, but the value itself is correct!

It is correct because accessed_dirty is initialized to 0.  But the "&"
with a bit taken out of thin air (bit 0 of the PTE)?  That's just
disgusting. :)

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support
  2013-07-29 13:11               ` Paolo Bonzini
@ 2013-07-29 13:20                 ` Gleb Natapov
  2013-07-29 14:12                   ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 13:20 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 03:11:39PM +0200, Paolo Bonzini wrote:
> Il 29/07/2013 14:34, Gleb Natapov ha scritto:
> >>>> But I think what you _really_ want is not avoiding conditional branches.
> >>> The idea is that it is hard for branch prediction to predict correct
> >>> result when correct result depends on guest's page table that can
> >>> contain anything, so in some places shadow paging code uses boolean
> >>> logic to avoid branches, in this case it is hard to avoid if() anyway
> >>> since the function invocation is in the if().
> >>
> >> Yes, I get the idea, but is_rsvd_bits_set should be predicted unlikely,
> >> no?  If the guest has to run, it must use mostly valid ptes. :)
> >>
> > You see, you are confused and you want branch prediction not to be? :)
> > If your guest is KVM is_rsvd_bits_set() will be likely much more then
> > unlikely because KVM misconfigures EPT entries to cache MMIO addresses,
> > so all the "unlikely" cases will be fixed by shadow pages and will not
> > reappear (until shadow pages are zapped), but misconfigured entries will
> > continue to produces violations.
> 
> But then:
> 
> 1) MMIO is a slow path anyway, losing 10 cycles on a mispredicted branch
> is not going to help much.  Fast page faults are all I would optimize for.
> 
Of course, for that the check should be fast.

> 2) in cases like this you just do not use likely/unlikely; the branch
> will be very unlikely in the beginning, and very likely once shadow
> pages are filled or in the no-EPT case.  Just let the branch predictor
> adjust, it will probably do better than boolean tricks.
> 
likely/unlikely are usually useless anyway. If you can avoid if()
altogether this is a win since there is no branch to predict.

> >> Especially if you change prefetch_invalid_gpte to do the reserved bits
> >> test after the present test (so that is_rsvd_bits_set is only called on
> >> present pagetables), is_rsvd_bits_set's result should be really
> >> well-predicted. 
> > Nope, for ept page tables present is not a single bit, it is three bits
> > which by themselves can have invalid values.
> 
> We're not checking the validity of the bits in the is_present_gpte test,
> we're checking it in the is_rsvd_bits_set test (is_present_gpte is doing
> just "(pte & 7) != 0").  It doesn't change anything in the outcome of
> prefetch_invalid_gpte, and it makes the ordering consistent with
> walk_addr_generic which already tests presence before reserved bits.
> 
> So doing this swap should be a win anyway.
> 
> >>                   At this point (and especially since function invocation
> >> is always in "if"s), using boolean logic to avoid branches does not make
> >> much sense anymore for this function.
> > 
> > That's true.
> 
> So are you going to change to "if"s?
> 
I think it will be better just to check mmu->bad_mt_xwr always. (I
dislike ifdefs if you haven't noticed :)).
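
I.e. something like this (a sketch derived from the function and macro quoted
above, assuming bad_mt_xwr simply stays zero for the non-EPT mmus):

	static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
	{
		int bit7 = (gpte >> 7) & 1;

		/* no #ifdef and no branch: the bad_mt_xwr term costs one
		 * load and an OR, and is a no-op when the bitmap is zero */
		return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) |
			((mmu->bad_mt_xwr & (1ull << (gpte & 0x3f))) != 0);
	}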

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-29 13:19           ` Paolo Bonzini
@ 2013-07-29 13:27             ` Gleb Natapov
  2013-07-29 14:15               ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 13:27 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 03:19:01PM +0200, Paolo Bonzini wrote:
> Il 29/07/2013 14:24, Gleb Natapov ha scritto:
> >> My initial impression to this patch was "everything's ready after the
> >> previous patch, you just have to set the mask to 0".  Which is not quite
> >> true.  Maybe you need three patches instead of two.
> >>
> > Or change commit message for patch 5 to make it more clear that it is a
> > preparation patch?
> 
> Or both.  Just give it a try.
> 
It is not hard to imagine without trying :) Will do.

> >>
> >> Something like this:
> >>
> >> +       /* if dirty bit is not supported, no need to track it */
> >> +#if PT_GUEST_DIRTY_MASK == 0
> >>         if (!write_fault)
> >>                  protect_clean_gpte(&pte_access, pte);
> >> ...
> >>         if (unlikely(!accessed_dirty)) {
> >> ...
> >>         }
> >> +#endif
> >>
> > I will have to do the same for update_accessed_dirty_bits(). The problem
> > of idfdefs they spread around.
> 
> Putting update_accessed_dirty_bits() with "#ifdef do we have
> accessed_dirty_bits at all" sounds just fine.
> 
> But if you do not like #ifdefs you can use __maybe_unused and the
> compiler will elide it.
> 
But this is unnecessary. That's the point. If an #ifdef is unavoidable I
have no problem using it even though I dislike it, but in this case it
is just unnecessary.

> >> doesn't look bad at all.  With the old check on EPT it looked ugly, but
> >> with the new check on PT_GUEST_DIRTY_MASK it is quite natural.  Also
> >> because you have anyway a reference to PT_GUEST_DIRTY_MASK in the "if".
> >>  If I see
> >>
> >>         if (!write_fault)
> >>                 protect_clean_gpte(&pte_access, pte);
> >>         else
> >>                 /*
> >>                  * On a write fault, fold the dirty bit into
> >> 		 * accessed_dirty by
> >>                  * shifting it one place right.
> >>                  */
> >>                 accessed_dirty &=
> >> 			pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
> >>
> >>         if (PT_GUEST_DIRTY_MASK != 0 && unlikely(!accessed_dirty)) {
> >>
> >> the obvious reaction is "what, is there a case where I'm using
> >> accessed_dirty if PT_GUEST_DIRTY_MASK == 0?"  Of course it makes sense
> > In this case accessed_dirty has correct value of 0 :) The if() bellow just
> > tells you that since A/D is not supported there is nothing to be done
> > about zero value of accessed_dirty, but the value itself is correct!
> 
> It is correct because accessed_dirty is initialized to 0.  But the "&"
> with a bit taken out of thin air (bit 0 of the PTE)?  That's just
> disgusting. :)
> 
Sorry to disgust you, but the code relies on this "&" trick with or
without the patch. It clears all unrelated bits from pte this way. No
new disgusting tricks are added by the patch.

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support
  2013-07-29 13:20                 ` Gleb Natapov
@ 2013-07-29 14:12                   ` Paolo Bonzini
  2013-07-29 16:24                     ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29 14:12 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 29/07/2013 15:20, Gleb Natapov ha scritto:
>> 2) in cases like this you just do not use likely/unlikely; the branch
>> will be very unlikely in the beginning, and very likely once shadow
>> pages are filled or in the no-EPT case.  Just let the branch predictor
>> adjust, it will probably do better than boolean tricks.
>>
> likely/unlikely are usually useless anyway. If you can avoid if()
> altogether this is a win since there is no branch to predict.

However, if the branches are dynamically well-predicted,

   if (simple)
       ...
   if (complex)
       ...

is likely faster than

   if (simple | complex)

because the branches then are very very cheap, and it pays off to not
always evaluate the complex branch.

In this case, the reserved bit test is the relatively complex one, it
has a couple memory accesses and a longish chain of dependencies.

>>>> Especially if you change prefetch_invalid_gpte to do the reserved bits
>>>> test after the present test (so that is_rsvd_bits_set is only called on
>>>> present pagetables), is_rsvd_bits_set's result should be really
>>>> well-predicted. 
>>> Nope, for ept page tables present is not a single bit, it is three bits
>>> which by themselves can have invalid values.
>>
>> We're not checking the validity of the bits in the is_present_gpte test,
>> we're checking it in the is_rsvd_bits_set test (is_present_gpte is doing
>> just "(pte & 7) != 0").  It doesn't change anything in the outcome of
>> prefetch_invalid_gpte, and it makes the ordering consistent with
>> walk_addr_generic which already tests presence before reserved bits.
>>
>> So doing this swap should be a win anyway.
>>
>>>>                   At this point (and especially since function invocation
>>>> is always in "if"s), using boolean logic to avoid branches does not make
>>>> much sense anymore for this function.
>>>
>>> That's true.
>>
>> So are you going to change to "if"s?
>>
> I think it will be better just to check mmu->bad_mt_xwr always. (I
> dislike ifdefs if you haven't noticed :)).

Yeah, I also thought of always checking bad_mt_xwr and even using it to
subsume the present check too, i.e. turning it into
is_rsvd_bits_set_or_nonpresent.  It checks the same bits that are used
in the present check (well, a superset).  You can then check for
presence separately if you care, which you don't in
prefetch_invalid_gpte.  It requires small changes in the callers but
nothing major.

But it still seems to me that we're in the above "if (simple ||
complex)" case and having a separate "if (!present)" check will be faster.

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
  2013-07-29 13:12     ` Gleb Natapov
@ 2013-07-29 14:13       ` Paolo Bonzini
  0 siblings, 0 replies; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29 14:13 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 29/07/2013 15:12, Gleb Natapov ha scritto:
>> Il 25/07/2013 12:59, Gleb Natapov ha scritto:
>>> > > +	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
>>> > > +	 * bits are further modified by vmx_set_efer() below.
>> > 
>> > IA32E_MODE, but the last sentence is not relevant anyway for this
>> > vmcs_write32; in fact, the next command basically duplicates it.
>> > 
> IA32E_MODE is not relevant, it never changes in VM_EXIT_CONTROLS, but
> LOAD_IA32_EFER refers to VM_EXIT_LOAD_IA32_EFER, which can be set by
> vmx_set_efer() and is different from VM_ENTRY_LOAD_IA32_EFER that
> comment bellow talks about.
> 

Yes.  Just fix the comment so that it is clear.

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-29 13:27             ` Gleb Natapov
@ 2013-07-29 14:15               ` Paolo Bonzini
  2013-07-29 16:14                 ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29 14:15 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 29/07/2013 15:27, Gleb Natapov ha scritto:
>>>> doesn't look bad at all.  With the old check on EPT it looked ugly, but
>>>> with the new check on PT_GUEST_DIRTY_MASK it is quite natural.  Also
>>>> because you have anyway a reference to PT_GUEST_DIRTY_MASK in the "if".
>>>>  If I see
>>>>
>>>>         if (!write_fault)
>>>>                 protect_clean_gpte(&pte_access, pte);
>>>>         else
>>>>                 /*
>>>>                  * On a write fault, fold the dirty bit into
>>>> 		 * accessed_dirty by
>>>>                  * shifting it one place right.
>>>>                  */
>>>>                 accessed_dirty &=
>>>> 			pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
>>>>
>>>>         if (PT_GUEST_DIRTY_MASK != 0 && unlikely(!accessed_dirty)) {
>>>>
>>>> the obvious reaction is "what, is there a case where I'm using
>>>> accessed_dirty if PT_GUEST_DIRTY_MASK == 0?"  Of course it makes sense
>>> In this case accessed_dirty has correct value of 0 :) The if() bellow just
>>> tells you that since A/D is not supported there is nothing to be done
>>> about zero value of accessed_dirty, but the value itself is correct!
>>
>> It is correct because accessed_dirty is initialized to 0.  But the "&"
>> with a bit taken out of thin air (bit 0 of the PTE)?  That's just
>> disgusting. :)
>>
> Sorry to disgust you, but the code relies on this "&" trick with or
> without the patch. It clears all unrelated bits from pte this way. No
> new disgusting tricks are added by the patch.

Oh the code is not disgusting at all!  It is very nice to follow.

The new disgusting ;) trick is that here in the EPT case you're
effectively doing

	accessed_dirty &= pte;

where bit 0 is the "R" bit (iirc) and has absolutely nothing to do with
dirty or accessed.

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-29 14:15               ` Paolo Bonzini
@ 2013-07-29 16:14                 ` Gleb Natapov
  2013-07-29 16:28                   ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 16:14 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 04:15:16PM +0200, Paolo Bonzini wrote:
> Il 29/07/2013 15:27, Gleb Natapov ha scritto:
> >>>> doesn't look bad at all.  With the old check on EPT it looked ugly, but
> >>>> with the new check on PT_GUEST_DIRTY_MASK it is quite natural.  Also
> >>>> because you have anyway a reference to PT_GUEST_DIRTY_MASK in the "if".
> >>>>  If I see
> >>>>
> >>>>         if (!write_fault)
> >>>>                 protect_clean_gpte(&pte_access, pte);
> >>>>         else
> >>>>                 /*
> >>>>                  * On a write fault, fold the dirty bit into
> >>>> 		 * accessed_dirty by
> >>>>                  * shifting it one place right.
> >>>>                  */
> >>>>                 accessed_dirty &=
> >>>> 			pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
> >>>>
> >>>>         if (PT_GUEST_DIRTY_MASK != 0 && unlikely(!accessed_dirty)) {
> >>>>
> >>>> the obvious reaction is "what, is there a case where I'm using
> >>>> accessed_dirty if PT_GUEST_DIRTY_MASK == 0?"  Of course it makes sense
> >>> In this case accessed_dirty has correct value of 0 :) The if() bellow just
> >>> tells you that since A/D is not supported there is nothing to be done
> >>> about zero value of accessed_dirty, but the value itself is correct!
> >>
> >> It is correct because accessed_dirty is initialized to 0.  But the "&"
> >> with a bit taken out of thin air (bit 0 of the PTE)?  That's just
> >> disgusting. :)
> >>
> > Sorry to disgust you, but the code relies on this "&" trick with or
> > without the patch. It clears all unrelated bits from pte this way. No
> > new disgusting tricks are added by the patch.
> 
> Oh the code is not disgusting at all!  It is very nice to follow.
> 
> The new disgusting ;) trick is that here in the EPT case you're
> effectively doing
> 
> 	accessed_dirty &= pte;
> 
> where bit 0 is the "R" bit (iirc) and has absolutely nothing to do with
> dirty or accessed.
> 
What does bit 0 have to do with anything? The non-EPT code also has random
bits in random places after the shift, as does EPT (R at the P place, U at
the R place); the trick is that accessed_dirty masks out the bits we are not
interested in and captures only those we want to follow (accessed in the
regular case, none in the EPT case). This is exactly what the original code
does, so they are either both disgusting or both very nice to follow :)
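
Spelled out (a sketch using the names from the patch):

	/* walk_addr_generic starts with the bit we want to track ... */
	accessed_dirty = PT_GUEST_ACCESSED_MASK;	/* 0 for EPT */
	...
	/* ... and "&=" can only clear bits, never set them, so for EPT
	 * accessed_dirty stays 0 no matter which random pte bit happens to
	 * line up with it after the shift */
	accessed_dirty &= pte >>
		(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);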

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support
  2013-07-29 14:12                   ` Paolo Bonzini
@ 2013-07-29 16:24                     ` Gleb Natapov
  2013-07-29 16:36                       ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 16:24 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 04:12:33PM +0200, Paolo Bonzini wrote:
> Il 29/07/2013 15:20, Gleb Natapov ha scritto:
> >> 2) in cases like this you just do not use likely/unlikely; the branch
> >> will be very unlikely in the beginning, and very likely once shadow
> >> pages are filled or in the no-EPT case.  Just let the branch predictor
> >> adjust, it will probably do better than boolean tricks.
> >>
> > likely/unlikely are usually useless anyway. If you can avoid if()
> > altogether this is a win since there is no branch to predict.
> 
> However, if the branches are dynamically well-predicted,
> 
>    if (simple)
>        ...
>    if (complex)
>        ...
> 
> is likely faster than
> 
>    if (simple | complex)
> 
> because the branches then are very very cheap, and it pays off to not
> always evaluate the complex branch.
> 
Good point about "|" always evaluating both. Is this the case
with if (simple != 0 | complex != 0) too, where theoretically the compiler
may see that if simple != 0 is true there is no need to evaluate the second one?
 

> In this case, the reserved bit test is the relatively complex one, it
> has a couple memory accesses and a longish chain of dependencies.
> 
> >>>> Especially if you change prefetch_invalid_gpte to do the reserved bits
> >>>> test after the present test (so that is_rsvd_bits_set is only called on
> >>>> present pagetables), is_rsvd_bits_set's result should be really
> >>>> well-predicted. 
> >>> Nope, for ept page tables present is not a single bit, it is three bits
> >>> which by themselves can have invalid values.
> >>
> >> We're not checking the validity of the bits in the is_present_gpte test,
> >> we're checking it in the is_rsvd_bits_set test (is_present_gpte is doing
> >> just "(pte & 7) != 0").  It doesn't change anything in the outcome of
> >> prefetch_invalid_gpte, and it makes the ordering consistent with
> >> walk_addr_generic which already tests presence before reserved bits.
> >>
> >> So doing this swap should be a win anyway.
> >>
> >>>>                   At this point (and especially since function invocation
> >>>> is always in "if"s), using boolean logic to avoid branches does not make
> >>>> much sense anymore for this function.
> >>>
> >>> That's true.
> >>
> >> So are you going to change to "if"s?
> >>
> > I think it will be better just to check mmu->bad_mt_xwr always. (I
> > dislike ifdefs if you haven't noticed :)).
> 
> Yeah, I also thought of always checking bad_mt_xwr and even using it to
> subsume the present check too, i.e. turning it into
> is_rsvd_bits_set_or_nonpresent.  It checks the same bits that are used
> in the present check (well, a superset).  You can then check for
> presence separately if you care, which you don't in
> prefetch_invalid_gpte.  It requires small changes in the callers but
> nothing major.
I do not get what is_rsvd_bits_set_or_nonpresent() will check exactly
or why we need it; there are two places where we check
present/reserved and in one of them we need to know which one it is.

Anyway, the order of checks in prefetch_invalid_gpte() is not relevant to
this patchset, so let's leave it to a separate discussion.

> 
> But it still seems to me that we're in the above "if (simple ||
> complex)" case and having a separate "if (!present)" check will be faster.
> 
> Paolo

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-29 16:14                 ` Gleb Natapov
@ 2013-07-29 16:28                   ` Paolo Bonzini
  2013-07-29 16:43                     ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29 16:28 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 29/07/2013 18:14, Gleb Natapov ha scritto:
>>>>>> > >>>>                 accessed_dirty &=
>>>>>> > >>>> 			pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
>>>>>> > >>>>
>>>>>> > >>>>         if (PT_GUEST_DIRTY_MASK != 0 && unlikely(!accessed_dirty)) {
>>>>>> > >>>>
>>>>>> > >>>> the obvious reaction is "what, is there a case where I'm using
>>>>>> > >>>> accessed_dirty if PT_GUEST_DIRTY_MASK == 0?"  Of course it makes sense
>>>>> > >>> In this case accessed_dirty has correct value of 0 :) The if() bellow just
>>>>> > >>> tells you that since A/D is not supported there is nothing to be done
>>>>> > >>> about zero value of accessed_dirty, but the value itself is correct!
>>>> > >>
>>>> > >> It is correct because accessed_dirty is initialized to 0.  But the "&"
>>>> > >> with a bit taken out of thin air (bit 0 of the PTE)?  That's just
>>>> > >> disgusting. :)
>>>> > >>
>>> > > Sorry to disgust you, but the code relies on this "&" trick with or
>>> > > without the patch. It clears all unrelated bits from pte this way. No
>>> > > new disgusting tricks are added by the patch.
>> > 
>> > Oh the code is not disgusting at all!  It is very nice to follow.
>> > 
>> > The new disgusting ;) trick is that here in the EPT case you're
>> > effectively doing
>> > 
>> > 	accessed_dirty &= pte;
>> > 
>> > where bit 0 is the "R" bit (iirc) and has absolutely nothing to do with
>> > dirty or accessed.
>
> What bit 0 has to do with anything? Non ept code after shift also has
> random bits and random places in ept (R at P place, U at R place), the
> trick is that accessed_dirty masks bits we are not interesting in and
> capture only those we want to follow (accessed in regular case, non in
> ept case). This is exactly what original code is doing, so they are
> either both disgusting or both very nice to follow :)

The comment is clear: "fold the dirty bit into accessed_dirty by
shifting it one place right".  In the EPT case the comment makes no
sense and it is not obvious that you rely on accessed_dirty=0 even
before that line.

That's why I'd rather have that code out of the PT_GUEST_DIRTY_MASK==0 case.

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support
  2013-07-29 16:24                     ` Gleb Natapov
@ 2013-07-29 16:36                       ` Paolo Bonzini
  2013-07-29 16:54                         ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29 16:36 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 29/07/2013 18:24, Gleb Natapov ha scritto:
> On Mon, Jul 29, 2013 at 04:12:33PM +0200, Paolo Bonzini wrote:
>> Il 29/07/2013 15:20, Gleb Natapov ha scritto:
>>>> 2) in cases like this you just do not use likely/unlikely; the branch
>>>> will be very unlikely in the beginning, and very likely once shadow
>>>> pages are filled or in the no-EPT case.  Just let the branch predictor
>>>> adjust, it will probably do better than boolean tricks.
>>>>
>>> likely/unlikely are usually useless anyway. If you can avoid if()
>>> altogether this is a win since there is no branch to predict.
>>
>> However, if the branches are dynamically well-predicted,
>>
>>    if (simple)
>>        ...
>>    if (complex)
>>        ...
>>
>> is likely faster than
>>
>>    if (simple | complex)
>>
>> because the branches then are very very cheap, and it pays off to not
>> always evaluate the complex branch.
>
> Good point about about "|" always evaluating both. Is this the case
> with if (simple !=0 | complex != 0) too where theoretically compiler may
> see that if simple !=0 is true no need to evaluate the second one?

Yes (only if complex doesn't have any side effects, which is the case here).

>> Yeah, I also thought of always checking bad_mt_xwr and even using it to
>> subsume the present check too, i.e. turning it into
>> is_rsvd_bits_set_or_nonpresent.  It checks the same bits that are used
>> in the present check (well, a superset).  You can then check for
>> presence separately if you care, which you don't in
>> prefetch_invalid_gpte.  It requires small changes in the callers but
>> nothing major.
> 
> I do not get what is_rsvd_bits_set_or_nonpresent() will check exactly
> and why do we needed it, there are two places where we check
> present/reserved and in one of them we need to know which one it is.

You can OR bad_mt_xwr with 0x5555555555555555ULL (I think).  Then your
implementation of is_rsvd_bits_set() using bad_mt_xwr will return true
in all cases where the pte is non-present.  You can then call
is_present_pte to discriminate the two cases.

    if (is_rsvd_bits_set_or_nonpresent) {
        if (!present)
            ...
        else
            ...
    }

In more abstract terms this is:

    if (simple)
        ...
    if (complex)
        ...

to

    if (simple_or_complex) {
        if (simple)
            ...
        else
            ...
    }

This can actually make sense if simple is almost always false, because
then you save something from not evaluating it on the fast path.

But in this case, adding bad_mt_xwr to the non-EPT case is a small loss.

> Anyway order of checks in prefetch_invalid_gpte() is not relevant to
> that patchset, so lets better leave it to a separate discussion.

Yes.

Paolo

>>
>> But it still seems to me that we're in the above "if (simple ||
>> complex)" case and having a separate "if (!present)" check will be faster.
>>
>> Paolo
> 
> --
> 			Gleb.
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-29 16:28                   ` Paolo Bonzini
@ 2013-07-29 16:43                     ` Gleb Natapov
  2013-07-29 17:06                       ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 16:43 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 06:28:37PM +0200, Paolo Bonzini wrote:
> Il 29/07/2013 18:14, Gleb Natapov ha scritto:
> >>>>>> > >>>>                 accessed_dirty &=
> >>>>>> > >>>> 			pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
> >>>>>> > >>>>
> >>>>>> > >>>>         if (PT_GUEST_DIRTY_MASK != 0 && unlikely(!accessed_dirty)) {
> >>>>>> > >>>>
> >>>>>> > >>>> the obvious reaction is "what, is there a case where I'm using
> >>>>>> > >>>> accessed_dirty if PT_GUEST_DIRTY_MASK == 0?"  Of course it makes sense
> >>>>> > >>> In this case accessed_dirty has correct value of 0 :) The if() bellow just
> >>>>> > >>> tells you that since A/D is not supported there is nothing to be done
> >>>>> > >>> about zero value of accessed_dirty, but the value itself is correct!
> >>>> > >>
> >>>> > >> It is correct because accessed_dirty is initialized to 0.  But the "&"
> >>>> > >> with a bit taken out of thin air (bit 0 of the PTE)?  That's just
> >>>> > >> disgusting. :)
> >>>> > >>
> >>> > > Sorry to disgust you, but the code relies on this "&" trick with or
> >>> > > without the patch. It clears all unrelated bits from pte this way. No
> >>> > > new disgusting tricks are added by the patch.
> >> > 
> >> > Oh the code is not disgusting at all!  It is very nice to follow.
> >> > 
> >> > The new disgusting ;) trick is that here in the EPT case you're
> >> > effectively doing
> >> > 
> >> > 	accessed_dirty &= pte;
> >> > 
> >> > where bit 0 is the "R" bit (iirc) and has absolutely nothing to do with
> >> > dirty or accessed.
> >
> > What bit 0 has to do with anything? Non ept code after shift also has
> > random bits and random places in ept (R at P place, U at R place), the
> > trick is that accessed_dirty masks bits we are not interesting in and
> > capture only those we want to follow (accessed in regular case, non in
> > ept case). This is exactly what original code is doing, so they are
> > either both disgusting or both very nice to follow :)
> 
> The comment is clear: "fold the dirty bit into accessed_dirty by
> shifting it one place right".  In the EPT case the comment makes no
> sense and it is not obvious that you rely on accessed_dirty=0 even
> before that line.
It is not obvious that the code relies on accessed_dirty being initialized
to the bits the code wants to track at the start of the function. It
wasn't for me. With if() it would have been much clearer, but the
current way is faster.

> 
> That's why I'd rather have that code out of the PT_GUEST_DIRTY_MASK==0 case.
> 
What problem does the current code have that you are trying to fix? What
_technical_ justification do you have? There is no point in adding ifdefs
where they are clearly not needed just because.

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support
  2013-07-29 16:36                       ` Paolo Bonzini
@ 2013-07-29 16:54                         ` Gleb Natapov
  0 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 16:54 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 06:36:12PM +0200, Paolo Bonzini wrote:
> Il 29/07/2013 18:24, Gleb Natapov ha scritto:
> > On Mon, Jul 29, 2013 at 04:12:33PM +0200, Paolo Bonzini wrote:
> >> Il 29/07/2013 15:20, Gleb Natapov ha scritto:
> >>>> 2) in cases like this you just do not use likely/unlikely; the branch
> >>>> will be very unlikely in the beginning, and very likely once shadow
> >>>> pages are filled or in the no-EPT case.  Just let the branch predictor
> >>>> adjust, it will probably do better than boolean tricks.
> >>>>
> >>> likely/unlikely are usually useless anyway. If you can avoid if()
> >>> altogether this is a win since there is no branch to predict.
> >>
> >> However, if the branches are dynamically well-predicted,
> >>
> >>    if (simple)
> >>        ...
> >>    if (complex)
> >>        ...
> >>
> >> is likely faster than
> >>
> >>    if (simple | complex)
> >>
> >> because the branches then are very very cheap, and it pays off to not
> >> always evaluate the complex branch.
> >
> > Good point about about "|" always evaluating both. Is this the case
> > with if (simple !=0 | complex != 0) too where theoretically compiler may
> > see that if simple !=0 is true no need to evaluate the second one?
> 
> Yes (only if complex doesn't have any side effects, which is the case here).
> 
> >> Yeah, I also thought of always checking bad_mt_xwr and even using it to
> >> subsume the present check too, i.e. turning it into
> >> is_rsvd_bits_set_or_nonpresent.  It checks the same bits that are used
> >> in the present check (well, a superset).  You can then check for
> >> presence separately if you care, which you don't in
> >> prefetch_invalid_gpte.  It requires small changes in the callers but
> >> nothing major.
> > 
> > I do not get what is_rsvd_bits_set_or_nonpresent() will check exactly
> > and why do we needed it, there are two places where we check
> > present/reserved and in one of them we need to know which one it is.
> 
> You can OR bad_mt_xwr with 0x5555555555555555ULL (I think).  Then your
With 0x1010101.

> implementation of is_rsvd_bits_set() using bad_mt_xwr will return true
> in all cases where the pte is non-present.  You can then call
> is_present_pte to discriminate the two cases.
> 
>     if (is_rsvd_bits_set_or_nonpresent) {
>         if (!present)
>             ...
>         else
>             ...
>     }
> 
> In more abstract terms this is:
> 
>     if (simple)
>         ...
>     if (complex)
>         ...
> 
> to
> 
>     if (simple_or_complex) {
>         if (simple)
>             ...
>         else
>             ...
>     }
> 
> This can actually make sense if simple is almost always false, because
> then you save something from not evaluating it on the fast path.
> 
> But in this case, adding bad_mt_xwr to the non-EPT case is a small loss.
> 
> > Anyway order of checks in prefetch_invalid_gpte() is not relevant to
> > that patchset, so lets better leave it to a separate discussion.
> 
> Yes.
> 
> Paolo
> 
> >>
> >> But it still seems to me that we're in the above "if (simple ||
> >> complex)" case and having a separate "if (!present)" check will be faster.
> >>
> >> Paolo
> > 
> > --
> > 			Gleb.
> > 

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-29 16:43                     ` Gleb Natapov
@ 2013-07-29 17:06                       ` Paolo Bonzini
  2013-07-29 17:11                         ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-29 17:06 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

Il 29/07/2013 18:43, Gleb Natapov ha scritto:
> On Mon, Jul 29, 2013 at 06:28:37PM +0200, Paolo Bonzini wrote:
>> Il 29/07/2013 18:14, Gleb Natapov ha scritto:
>>>>>>>>>>>>>                 accessed_dirty &=
>>>>>>>>>>>>> 			pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
>>>>>>>>>>>>>
>>>>>>>>>>>>>         if (PT_GUEST_DIRTY_MASK != 0 && unlikely(!accessed_dirty)) {
>>>>>>>>>>>>>
>>>>>>>>>>>>> the obvious reaction is "what, is there a case where I'm using
>>>>>>>>>>>>> accessed_dirty if PT_GUEST_DIRTY_MASK == 0?"  Of course it makes sense
>>>>>>>>>>> In this case accessed_dirty has correct value of 0 :) The if() bellow just
>>>>>>>>>>> tells you that since A/D is not supported there is nothing to be done
>>>>>>>>>>> about zero value of accessed_dirty, but the value itself is correct!
>>>>>>>>>
>>>>>>>>> It is correct because accessed_dirty is initialized to 0.  But the "&"
>>>>>>>>> with a bit taken out of thin air (bit 0 of the PTE)?  That's just
>>>>>>>>> disgusting. :)
>>>>>>>>>
>>>>>>> Sorry to disgust you, but the code relies on this "&" trick with or
>>>>>>> without the patch. It clears all unrelated bits from pte this way. No
>>>>>>> new disgusting tricks are added by the patch.
>>>>>
>>>>> Oh the code is not disgusting at all!  It is very nice to follow.
>>>>>
>>>>> The new disgusting ;) trick is that here in the EPT case you're
>>>>> effectively doing
>>>>>
>>>>> 	accessed_dirty &= pte;
>>>>>
>>>>> where bit 0 is the "R" bit (iirc) and has absolutely nothing to do with
>>>>> dirty or accessed.
>>>
>>> What bit 0 has to do with anything? Non ept code after shift also has
>>> random bits and random places in ept (R at P place, U at R place), the
>>> trick is that accessed_dirty masks bits we are not interesting in and
>>> capture only those we want to follow (accessed in regular case, non in
>>> ept case). This is exactly what original code is doing, so they are
>>> either both disgusting or both very nice to follow :)
>>
>> The comment is clear: "fold the dirty bit into accessed_dirty by
>> shifting it one place right".  In the EPT case the comment makes no
>> sense and it is not obvious that you rely on accessed_dirty=0 even
>> before that line.
> It is not obvious that the code relies on accessed_dirty been initialized
> to the bits the code wants to track at the start of the function. It
> wasn't for me. With if() it would have been much clearer, but the
> current way is faster. 

Sure it is not obvious.  But relying on the mask being zero is a whole
different level of non-obviousness.

>> That's why I'd rather have that code out of the PT_GUEST_DIRTY_MASK==0 case.
>>
> What problem current code has that you are trying to fix? What _technical_
> justification you provide?

Of course there is no technical justification.  Did I ever say otherwise?

> There is no point adding ifdefs where they
> are clearly not needed just because.

If you loathe ifdefs so much, you can of course wrap the whole code
we're talking about with an if().  That takes an extra level of
indentation, of course.
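
To be concrete, I mean something like this (untested sketch, your code
unchanged apart from the outer if):

        if (PT_GUEST_DIRTY_MASK) {
                /* on a write fault, fold the dirty bit into accessed_dirty */
                accessed_dirty &= pte >>
                        (PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);

                if (unlikely(!accessed_dirty)) {
                        int ret;

                        ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu,
                                                walker, write_fault);
                        if (unlikely(ret < 0))
                                goto error;
                        /* ... rest of the slow path unchanged ... */
                }
        }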

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-29 17:06                       ` Paolo Bonzini
@ 2013-07-29 17:11                         ` Gleb Natapov
  0 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-29 17:11 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Mon, Jul 29, 2013 at 07:06:21PM +0200, Paolo Bonzini wrote:
> Il 29/07/2013 18:43, Gleb Natapov ha scritto:
> > On Mon, Jul 29, 2013 at 06:28:37PM +0200, Paolo Bonzini wrote:
> >> Il 29/07/2013 18:14, Gleb Natapov ha scritto:
> >>>>>>>>>>>>>                 accessed_dirty &=
> >>>>>>>>>>>>> 			pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>         if (PT_GUEST_DIRTY_MASK != 0 && unlikely(!accessed_dirty)) {
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> the obvious reaction is "what, is there a case where I'm using
> >>>>>>>>>>>>> accessed_dirty if PT_GUEST_DIRTY_MASK == 0?"  Of course it makes sense
> >>>>>>>>>>> In this case accessed_dirty has correct value of 0 :) The if() bellow just
> >>>>>>>>>>> tells you that since A/D is not supported there is nothing to be done
> >>>>>>>>>>> about zero value of accessed_dirty, but the value itself is correct!
> >>>>>>>>>
> >>>>>>>>> It is correct because accessed_dirty is initialized to 0.  But the "&"
> >>>>>>>>> with a bit taken out of thin air (bit 0 of the PTE)?  That's just
> >>>>>>>>> disgusting. :)
> >>>>>>>>>
> >>>>>>> Sorry to disgust you, but the code relies on this "&" trick with or
> >>>>>>> without the patch. It clears all unrelated bits from pte this way. No
> >>>>>>> new disgusting tricks are added by the patch.
> >>>>>
> >>>>> Oh the code is not disgusting at all!  It is very nice to follow.
> >>>>>
> >>>>> The new disgusting ;) trick is that here in the EPT case you're
> >>>>> effectively doing
> >>>>>
> >>>>> 	accessed_dirty &= pte;
> >>>>>
> >>>>> where bit 0 is the "R" bit (iirc) and has absolutely nothing to do with
> >>>>> dirty or accessed.
> >>>
> >>> What bit 0 has to do with anything? Non ept code after shift also has
> >>> random bits and random places in ept (R at P place, U at R place), the
> >>> trick is that accessed_dirty masks bits we are not interesting in and
> >>> capture only those we want to follow (accessed in regular case, non in
> >>> ept case). This is exactly what original code is doing, so they are
> >>> either both disgusting or both very nice to follow :)
> >>
> >> The comment is clear: "fold the dirty bit into accessed_dirty by
> >> shifting it one place right".  In the EPT case the comment makes no
> >> sense and it is not obvious that you rely on accessed_dirty=0 even
> >> before that line.
> > It is not obvious that the code relies on accessed_dirty been initialized
> > to the bits the code wants to track at the start of the function. It
> > wasn't for me. With if() it would have been much clearer, but the
> > current way is faster. 
> 
> Sure it is not obvious.  But relying on the mask being zero is a whole
> different level of non-obviousness.
> 
I disagree.

> >> That's why I'd rather have that code out of the PT_GUEST_DIRTY_MASK==0 case.
> >>
> > What problem current code has that you are trying to fix? What _technical_
> > justification you provide?
> 
> Of course there is no technical justification.  Did I ever say otherwise?
> 
I just want to be absolutely sure that we are bikeshedding here.

> > There is no point adding ifdefs where they
> > are clearly not needed just because.
> 
> If you loathe ifdefs so much, you can of course wrap the whole code
> we're talking about with an if().  That takes an extra level of
> indentation, of course.
> 
And the point of doing it is...?

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-25 10:59 ` [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h Gleb Natapov
  2013-07-29  9:48   ` Paolo Bonzini
@ 2013-07-30 10:03   ` Paolo Bonzini
  2013-07-30 11:56     ` Gleb Natapov
  1 sibling, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-30 10:03 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On 25/07/2013 12:59, Gleb Natapov wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> This is the first patch in a series which adds nested EPT support to KVM's
> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
> to set its own cr3 and take its own page faults without either of L0 or L1
> getting involved. This often significanlty improves L2's performance over the
> previous two alternatives (shadow page tables over EPT, and shadow page
> tables over shadow page tables).
> 
> This patch adds EPT support to paging_tmpl.h.
> 
> paging_tmpl.h contains the code for reading and writing page tables. The code
> for 32-bit and 64-bit tables is very similar, but not identical, so
> paging_tmpl.h is #include'd twice in mmu.c, once with PTTTYPE=32 and once
> with PTTYPE=64, and this generates the two sets of similar functions.
> 
> There are subtle but important differences between the format of EPT tables
> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> third set of functions to read the guest EPT table and to write the shadow
> EPT table.
> 
> So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
> with "EPT") which correctly read and write EPT tables.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  arch/x86/kvm/mmu.c         |    5 +++++
>  arch/x86/kvm/paging_tmpl.h |   43 +++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 44 insertions(+), 4 deletions(-)

Ok, let's rewind and retry.

> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 4c4274d..b5273c3 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
>  	return mmu->last_pte_bitmap & (1 << index);
>  }
>  
> +#define PTTYPE_EPT 18 /* arbitrary */
> +#define PTTYPE PTTYPE_EPT
> +#include "paging_tmpl.h"
> +#undef PTTYPE
> +
>  #define PTTYPE 64
>  #include "paging_tmpl.h"
>  #undef PTTYPE
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 7581395..e38b3c0 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -58,6 +58,21 @@
>  	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
>  	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
>  	#define CMPXCHG cmpxchg
> +#elif PTTYPE == PTTYPE_EPT
> +	#define pt_element_t u64
> +	#define guest_walker guest_walkerEPT
> +	#define FNAME(name) ept_##name
> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> +	#define PT_GUEST_ACCESSED_MASK 0
> +	#define PT_GUEST_DIRTY_MASK 0
> +	#define PT_GUEST_DIRTY_SHIFT 0
> +	#define PT_GUEST_ACCESSED_SHIFT 0
> +	#define CMPXCHG cmpxchg64
> +	#define PT_MAX_FULL_LEVELS 4
>  #else
>  	#error Invalid PTTYPE value
>  #endif

Please add a

BUILD_BUG_ON(!!PT_GUEST_ACCESSED_MASK != !!PT_GUEST_DIRTY_MASK);
#if PT_GUEST_ACCESSED_MASK
BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT <= PT_GUEST_ACCESSED_SHIFT);
BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT <= PT_WRITABLE_SHIFT);
BUILD_BUG_ON(PT_GUEST_DIRTY_MASK != (1 << PT_GUEST_DIRTY_SHIFT));
BUILD_BUG_ON(PT_GUEST_ACCESSED_MASK != (1 << PT_GUEST_ACCESSED_SHIFT));
#endif

here.

> @@ -90,6 +105,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
>  
>  static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
>  {
> +#if PT_GUEST_DIRTY_MASK == 0
> +	/* dirty bit is not supported, so no need to track it */
> +	return;
> +#else

Here you can use a regular "if" instead of the preprocessor if.  Also
please move this and all other PT_GUEST_*-related changes to a separate
patch.
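
Something like this, to be concrete (untested sketch; it assumes the _SHIFT
values keep meaningful bit positions, see the 8/9 suggestion below, so the
compiler only has to discard the dead body when PT_GUEST_DIRTY_MASK is 0):

        static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
        {
                unsigned mask;

                /* dirty bit is not supported, so no need to track it */
                if (!PT_GUEST_DIRTY_MASK)
                        return;

                BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);

                mask = (unsigned)~ACC_WRITE_MASK;
                /* Allow write access to dirty gptes */
                mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
                        PT_WRITABLE_MASK;
                *access &= mask;
        }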

>  	unsigned mask;
>  
>  	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
> @@ -99,6 +118,7 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
>  	mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
>  		PT_WRITABLE_MASK;
>  	*access &= mask;
> +#endif
>  }
>  
>  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> @@ -111,7 +131,11 @@ static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
>  
>  static inline int FNAME(is_present_gpte)(unsigned long pte)
>  {
> +#if PTTYPE != PTTYPE_EPT
>  	return is_present_gpte(pte);
> +#else
> +	return pte & 7;
> +#endif
>  }
>  
>  static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> @@ -147,7 +171,8 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
>  	if (!FNAME(is_present_gpte)(gpte))
>  		goto no_present;
>  
> -	if (!(gpte & PT_GUEST_ACCESSED_MASK))
> +	/* if accessed bit is not supported prefetch non accessed gpte */
> +	if (PT_GUEST_ACCESSED_MASK && !(gpte & PT_GUEST_ACCESSED_MASK))
>  		goto no_present;
>  
>  	return false;
> @@ -160,9 +185,14 @@ no_present:
>  static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
>  {
>  	unsigned access;
> -
> +#if PTTYPE == PTTYPE_EPT
> +	BUILD_BUG_ON(ACC_WRITE_MASK != VMX_EPT_WRITABLE_MASK);
> +	access = (gpte & VMX_EPT_WRITABLE_MASK) | ACC_USER_MASK |
> +		((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0);

You can use a ?: here even for writable.  The compiler will simplify (a
& b ? b : 0) to just "a & b" for single-bit b.
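
That is (sketch):

        access = ((gpte & VMX_EPT_WRITABLE_MASK) ? ACC_WRITE_MASK : 0) |
                 ((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0) |
                 ACC_USER_MASK;

and then the ACC_WRITE_MASK != VMX_EPT_WRITABLE_MASK BUILD_BUG_ON can go
away as well.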

> +#else
>  	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
>  	access &= ~(gpte >> PT64_NX_SHIFT);
> +#endif
>  
>  	return access;
>  }
> @@ -212,7 +242,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  				    gva_t addr, u32 access)
>  {
> -	int ret;

This change is not needed anymore since you removed the #ifdef.

>  	pt_element_t pte;
>  	pt_element_t __user *uninitialized_var(ptep_user);
>  	gfn_t table_gfn;
> @@ -322,7 +351,9 @@ retry_walk:
>  		accessed_dirty &= pte >>
>  			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);

This shift is one of two things that bugged me.  I dislike including
"wrong" code just because it is dead.  Perhaps you can #define the
shifts to 8 and 9 already now, even if the masks stay 0?

Then when you implement nEPT A/D bits you only have to flip the masks
from 0 to (1 << PT_GUEST_*_SHIFT).
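
For example (sketch, using the architectural EPT accessed/dirty bit positions):

        #define PT_GUEST_ACCESSED_SHIFT 8
        #define PT_GUEST_DIRTY_SHIFT 9
        #define PT_GUEST_ACCESSED_MASK 0
        #define PT_GUEST_DIRTY_MASK 0

and later only the two masks change, to (1 << PT_GUEST_ACCESSED_SHIFT) and
(1 << PT_GUEST_DIRTY_SHIFT).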

>  
> -	if (unlikely(!accessed_dirty)) {
> +	if (PT_GUEST_DIRTY_MASK && unlikely(!accessed_dirty)) {

If we want to drop the dead-code elimination burden on the compiler,
let's go all the way.

So, instead of this "if", please make update_accessed_dirty_bits return
0 at once if PT_GUEST_DIRTY_MASK == 0; this way you can do the same
thing for protect_clean_gpte and update_accessed_dirty_bits.
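
In other words, something like (untested sketch):

        static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
                                                     struct kvm_mmu *mmu,
                                                     struct guest_walker *walker,
                                                     int write_fault)
        {
                /* A/D bits are not tracked for this page table type */
                if (!PT_GUEST_DIRTY_MASK)
                        return 0;
                /* ... the existing body stays as it is ... */
        }

and the caller keeps a plain "if (unlikely(!accessed_dirty))".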

Paolo

> +		int ret;
> +
>  		ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, write_fault);
>  		if (unlikely(ret < 0))
>  			goto error;
> @@ -359,6 +390,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
>  					access);
>  }
>  
> +#if PTTYPE != PTTYPE_EPT
>  static int FNAME(walk_addr_nested)(struct guest_walker *walker,
>  				   struct kvm_vcpu *vcpu, gva_t addr,
>  				   u32 access)
> @@ -366,6 +398,7 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker,
>  	return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
>  					addr, access);
>  }
> +#endif
>  
>  static bool
>  FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> @@ -793,6 +826,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
>  	return gpa;
>  }
>  
> +#if PTTYPE != PTTYPE_EPT
>  static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
>  				      u32 access,
>  				      struct x86_exception *exception)
> @@ -811,6 +845,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
>  
>  	return gpa;
>  }
> +#endif
>  
>  /*
>   * Using the cached information from sp->gfns is safe because:
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-30 10:03   ` Paolo Bonzini
@ 2013-07-30 11:56     ` Gleb Natapov
  2013-07-30 12:13       ` Paolo Bonzini
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-30 11:56 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Tue, Jul 30, 2013 at 12:03:07PM +0200, Paolo Bonzini wrote:
> Il 25/07/2013 12:59, Gleb Natapov ha scritto:
> > From: Nadav Har'El <nyh@il.ibm.com>
> > 
> > This is the first patch in a series which adds nested EPT support to KVM's
> > nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
> > EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
> > to set its own cr3 and take its own page faults without either of L0 or L1
> > getting involved. This often significanlty improves L2's performance over the
> > previous two alternatives (shadow page tables over EPT, and shadow page
> > tables over shadow page tables).
> > 
> > This patch adds EPT support to paging_tmpl.h.
> > 
> > paging_tmpl.h contains the code for reading and writing page tables. The code
> > for 32-bit and 64-bit tables is very similar, but not identical, so
> > paging_tmpl.h is #include'd twice in mmu.c, once with PTTTYPE=32 and once
> > with PTTYPE=64, and this generates the two sets of similar functions.
> > 
> > There are subtle but important differences between the format of EPT tables
> > and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> > third set of functions to read the guest EPT table and to write the shadow
> > EPT table.
> > 
> > So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
> > with "EPT") which correctly read and write EPT tables.
> > 
> > Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> > Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > ---
> >  arch/x86/kvm/mmu.c         |    5 +++++
> >  arch/x86/kvm/paging_tmpl.h |   43 +++++++++++++++++++++++++++++++++++++++----
> >  2 files changed, 44 insertions(+), 4 deletions(-)
> 
> Ok, let's rewind and retry.
> 
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 4c4274d..b5273c3 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
> >  	return mmu->last_pte_bitmap & (1 << index);
> >  }
> >  
> > +#define PTTYPE_EPT 18 /* arbitrary */
> > +#define PTTYPE PTTYPE_EPT
> > +#include "paging_tmpl.h"
> > +#undef PTTYPE
> > +
> >  #define PTTYPE 64
> >  #include "paging_tmpl.h"
> >  #undef PTTYPE
> > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > index 7581395..e38b3c0 100644
> > --- a/arch/x86/kvm/paging_tmpl.h
> > +++ b/arch/x86/kvm/paging_tmpl.h
> > @@ -58,6 +58,21 @@
> >  	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
> >  	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
> >  	#define CMPXCHG cmpxchg
> > +#elif PTTYPE == PTTYPE_EPT
> > +	#define pt_element_t u64
> > +	#define guest_walker guest_walkerEPT
> > +	#define FNAME(name) ept_##name
> > +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> > +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> > +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> > +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> > +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> > +	#define PT_GUEST_ACCESSED_MASK 0
> > +	#define PT_GUEST_DIRTY_MASK 0
> > +	#define PT_GUEST_DIRTY_SHIFT 0
> > +	#define PT_GUEST_ACCESSED_SHIFT 0
> > +	#define CMPXCHG cmpxchg64
> > +	#define PT_MAX_FULL_LEVELS 4
> >  #else
> >  	#error Invalid PTTYPE value
> >  #endif
> 
> Please add a
> 
> BUILD_BUG_ON(!!PT_GUEST_ACCESSED_MASK != !!PT_GUEST_DIRTY_MASK);
> #if PT_GUEST_ACCESSED_MASK
> BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT <= PT_GUEST_ACCESSED_SHIFT);
> BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT <= PT_WRITABLE_SHIFT);
> BUILD_BUG_ON(PT_GUEST_DIRTY_MASK != (1 << PT_GUEST_DIRTY_SHIFT));
> BUILD_BUG_ON(PT_GUEST_ACCESSED_MASK != (1 << PT_GUEST_ACCESSED_SHIFT));
This will not work if I define _SHIFT to be 8/9 now. But we do not use
BUILD_BUG_ON() to check values from the same define "namespace". It is
implied that they are correct, and when they change the whole "namespace"
stays consistent. If you look at the BUILD_BUG_ON()s that we have
(and that this patch adds), they are of the form:
  PT_WRITABLE_MASK != ACC_WRITE_MASK
  ACC_WRITE_MASK != VMX_EPT_WRITABLE_MASK
  VMX_EPT_READABLE_MASK != PT_PRESENT_MASK
  VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK

i.e. they compare values from different "namespaces".

> #endif
> 
> here.
> 
> > @@ -90,6 +105,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
> >  
> >  static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> >  {
> > +#if PT_GUEST_DIRTY_MASK == 0
> > +	/* dirty bit is not supported, so no need to track it */
> > +	return;
> > +#else
> 
> Here you can use a regular "if" instead of the preprocessor if.  Also
Yeah, for some reason I thought BUILD_BUG_ON() would prevent me from doing so,
but now I see that it does not.

> please move this and all other PT_GUEST_*-related changes to a separate
> patch.
Did already.

> 
> >  	unsigned mask;
> >  
> >  	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
> > @@ -99,6 +118,7 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> >  	mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
> >  		PT_WRITABLE_MASK;
> >  	*access &= mask;
> > +#endif
> >  }
> >  
> >  static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> > @@ -111,7 +131,11 @@ static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> >  
> >  static inline int FNAME(is_present_gpte)(unsigned long pte)
> >  {
> > +#if PTTYPE != PTTYPE_EPT
> >  	return is_present_gpte(pte);
> > +#else
> > +	return pte & 7;
> > +#endif
> >  }
> >  
> >  static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > @@ -147,7 +171,8 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
> >  	if (!FNAME(is_present_gpte)(gpte))
> >  		goto no_present;
> >  
> > -	if (!(gpte & PT_GUEST_ACCESSED_MASK))
> > +	/* if accessed bit is not supported prefetch non accessed gpte */
> > +	if (PT_GUEST_ACCESSED_MASK && !(gpte & PT_GUEST_ACCESSED_MASK))
> >  		goto no_present;
> >  
> >  	return false;
> > @@ -160,9 +185,14 @@ no_present:
> >  static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
> >  {
> >  	unsigned access;
> > -
> > +#if PTTYPE == PTTYPE_EPT
> > +	BUILD_BUG_ON(ACC_WRITE_MASK != VMX_EPT_WRITABLE_MASK);
> > +	access = (gpte & VMX_EPT_WRITABLE_MASK) | ACC_USER_MASK |
> > +		((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0);
> 
> You can use a ?: here even for writable.  The compiler will simplify (a
> & b ? b : 0) to just "a & b" for single-bit b.
> 
Good, one less BUILD_BUG_ON().

> > +#else
> >  	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
> >  	access &= ~(gpte >> PT64_NX_SHIFT);
> > +#endif
> >  
> >  	return access;
> >  }
> > @@ -212,7 +242,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
> >  				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >  				    gva_t addr, u32 access)
> >  {
> > -	int ret;
> 
> This change is not needed anymore since you removed the #ifdef.
> 
Ack.

> >  	pt_element_t pte;
> >  	pt_element_t __user *uninitialized_var(ptep_user);
> >  	gfn_t table_gfn;
> > @@ -322,7 +351,9 @@ retry_walk:
> >  		accessed_dirty &= pte >>
> >  			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
> 
> This shift is one of two things that bugged me.  I dislike including
> "wrong" code just because it is dead.  Perhaps you can #define the
Meanwhile I changed comment above this to be:
           /*
            * On a write fault, fold the dirty bit into accessed_dirty.
            * For modes without A/D bits support accessed_dirty will be
            * always clear.
            */

> shifts to 8 and 9 already now, even if the masks stay 0?
> 
Currently I do not see any problem with that, but we will have to be careful
that the *_SHIFT values do not leak into code where they could matter.

> Then when you implement nEPT A/D bits you only have to flip the masks
> from 0 to (1 << PT_GUEST_*_SHIFT).
> 
> >  
> > -	if (unlikely(!accessed_dirty)) {
> > +	if (PT_GUEST_DIRTY_MASK && unlikely(!accessed_dirty)) {
> 
> If we want to drop the dead-code elimination burden on the compiler,
> let's go all the way.
> 
> So, instead of this "if", please make update_accessed_dirty_bits return
> 0 at once if PT_GUEST_DIRTY_MASK == 0; this way you can do the same
> thing for protect_clean_gpte and update_accessed_dirty_bits.
> 
OK.

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-30 11:56     ` Gleb Natapov
@ 2013-07-30 12:13       ` Paolo Bonzini
  2013-07-30 14:22         ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2013-07-30 12:13 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On 30/07/2013 13:56, Gleb Natapov wrote:
>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>> index 4c4274d..b5273c3 100644
>>> --- a/arch/x86/kvm/mmu.c
>>> +++ b/arch/x86/kvm/mmu.c
>>> @@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
>>>  	return mmu->last_pte_bitmap & (1 << index);
>>>  }
>>>  
>>> +#define PTTYPE_EPT 18 /* arbitrary */
>>> +#define PTTYPE PTTYPE_EPT
>>> +#include "paging_tmpl.h"
>>> +#undef PTTYPE
>>> +
>>>  #define PTTYPE 64
>>>  #include "paging_tmpl.h"
>>>  #undef PTTYPE
>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>>> index 7581395..e38b3c0 100644
>>> --- a/arch/x86/kvm/paging_tmpl.h
>>> +++ b/arch/x86/kvm/paging_tmpl.h
>>> @@ -58,6 +58,21 @@
>>>  	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
>>>  	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
>>>  	#define CMPXCHG cmpxchg
>>> +#elif PTTYPE == PTTYPE_EPT
>>> +	#define pt_element_t u64
>>> +	#define guest_walker guest_walkerEPT
>>> +	#define FNAME(name) ept_##name
>>> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
>>> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
>>> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>>> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>>> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
>>> +	#define PT_GUEST_ACCESSED_MASK 0
>>> +	#define PT_GUEST_DIRTY_MASK 0
>>> +	#define PT_GUEST_DIRTY_SHIFT 0
>>> +	#define PT_GUEST_ACCESSED_SHIFT 0
>>> +	#define CMPXCHG cmpxchg64
>>> +	#define PT_MAX_FULL_LEVELS 4
>>>  #else
>>>  	#error Invalid PTTYPE value
>>>  #endif
>>
>> Please add a
>>
>> BUILD_BUG_ON(!!PT_GUEST_ACCESSED_MASK != !!PT_GUEST_DIRTY_MASK);
>> #if PT_GUEST_ACCESSED_MASK
>> BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT <= PT_GUEST_ACCESSED_SHIFT);
>> BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT <= PT_WRITABLE_SHIFT);
>> BUILD_BUG_ON(PT_GUEST_DIRTY_MASK != (1 << PT_GUEST_DIRTY_SHIFT));
>> BUILD_BUG_ON(PT_GUEST_ACCESSED_MASK != (1 << PT_GUEST_ACCESSED_SHIFT));
>
> This will not work if I define _SHIFT to be 8/9 now.

They will because I put them under an appropriate "#if". :)

OTOH if you define _SHIFT to be 8/9 you can move the #if so that it only
covers the last two checks.
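
I.e. (same checks, just regrouped):

        BUILD_BUG_ON(!!PT_GUEST_ACCESSED_MASK != !!PT_GUEST_DIRTY_MASK);
        BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT <= PT_GUEST_ACCESSED_SHIFT);
        BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT <= PT_WRITABLE_SHIFT);
        #if PT_GUEST_ACCESSED_MASK
        BUILD_BUG_ON(PT_GUEST_DIRTY_MASK != (1 << PT_GUEST_DIRTY_SHIFT));
        BUILD_BUG_ON(PT_GUEST_ACCESSED_MASK != (1 << PT_GUEST_ACCESSED_SHIFT));
        #endif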

> But we do not use
> BUILD_BUG_ON() to check values from the same define "namespace". It is
> implied that they are correct and when they change all "namespace"
> remains to be consistent. If you look at BUILD_BUG_ON() that we have
> (and this patch adds) they are from the form:
>   PT_WRITABLE_MASK != ACC_WRITE_MASK
>   ACC_WRITE_MASK != VMX_EPT_WRITABLE_MASK
>   VMX_EPT_READABLE_MASK != PT_PRESENT_MASK
>   VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK
> 
> i.e they compare value from different "namespaces".

Yes, all these BUILD_BUG_ONs make sense.

But I think BUILD_BUG_ON() is more generically for consistency checks
and enforcing invariants that the code expects.  Our invariants are:

* A/D bits are enabled or disabled in pairs

* dirty is to the left of accessed and writable

* masks should either be zero or match the corresponding shifts

The alternative to BUILD_BUG_ON would be a comment that explains the
invariants, but there's no need to use English if C can do it better! :)

>>>  	pt_element_t pte;
>>>  	pt_element_t __user *uninitialized_var(ptep_user);
>>>  	gfn_t table_gfn;
>>> @@ -322,7 +351,9 @@ retry_walk:
>>>  		accessed_dirty &= pte >>
>>>  			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
>>
>> This shift is one of two things that bugged me.  I dislike including
>> "wrong" code just because it is dead.  Perhaps you can #define the
> Meanwhile I changed comment above this to be:
>            /*
>             * On a write fault, fold the dirty bit into accessed_dirty.
>             * For modes without A/D bits support accessed_dirty will be
>             * always clear.
>             */

Good.

>> shifts to 8 and 9 already now, even if the masks stay 0?
>>
> Currently I do not see any problem with that, but we will have to be careful
> that *_SHIFT values will not leak into a code where it could matter.

They would leak with PT_GUEST_*_SHIFT == 0 too, I think?  (And with
worse effects, because they would use bit 0 and/or do shifts with
negative counts).

Perhaps PT_GUEST_*_SHIFT can instead be defined to a function like
__using_nonexistent_pte_bit(), so that compilation fails unless all such
references are optimized away.  See arch/x86/include/asm/cmpxchg.h for
an example:

/*
 * Non-existant functions to indicate usage errors at link time
 * (or compile-time if the compiler implements __compiletime_error().
 */
extern void __xchg_wrong_size(void)
        __compiletime_error("Bad argument size for xchg");

But using 8/9 would be fine for me.  Choose the one that you prefer.

>> Then when you implement nEPT A/D bits you only have to flip the masks
>> from 0 to (1 << PT_GUEST_*_SHIFT).
>>
>>>  
>>> -	if (unlikely(!accessed_dirty)) {
>>> +	if (PT_GUEST_DIRTY_MASK && unlikely(!accessed_dirty)) {
>>
>> If we want to drop the dead-code elimination burden on the compiler,
>> let's go all the way.
>>
>> So, instead of this "if", please make update_accessed_dirty_bits return
>> 0 at once if PT_GUEST_DIRTY_MASK == 0; this way you can do the same
>> thing for protect_clean_gpte and update_accessed_dirty_bits.
>
> OK.

Good. :)

Paolo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-30 12:13       ` Paolo Bonzini
@ 2013-07-30 14:22         ` Gleb Natapov
  2013-07-30 14:36           ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Gleb Natapov @ 2013-07-30 14:22 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Tue, Jul 30, 2013 at 02:13:38PM +0200, Paolo Bonzini wrote:
> Il 30/07/2013 13:56, Gleb Natapov ha scritto:
> >>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>> index 4c4274d..b5273c3 100644
> >>> --- a/arch/x86/kvm/mmu.c
> >>> +++ b/arch/x86/kvm/mmu.c
> >>> @@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
> >>>  	return mmu->last_pte_bitmap & (1 << index);
> >>>  }
> >>>  
> >>> +#define PTTYPE_EPT 18 /* arbitrary */
> >>> +#define PTTYPE PTTYPE_EPT
> >>> +#include "paging_tmpl.h"
> >>> +#undef PTTYPE
> >>> +
> >>>  #define PTTYPE 64
> >>>  #include "paging_tmpl.h"
> >>>  #undef PTTYPE
> >>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> >>> index 7581395..e38b3c0 100644
> >>> --- a/arch/x86/kvm/paging_tmpl.h
> >>> +++ b/arch/x86/kvm/paging_tmpl.h
> >>> @@ -58,6 +58,21 @@
> >>>  	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
> >>>  	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
> >>>  	#define CMPXCHG cmpxchg
> >>> +#elif PTTYPE == PTTYPE_EPT
> >>> +	#define pt_element_t u64
> >>> +	#define guest_walker guest_walkerEPT
> >>> +	#define FNAME(name) ept_##name
> >>> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> >>> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> >>> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> >>> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> >>> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> >>> +	#define PT_GUEST_ACCESSED_MASK 0
> >>> +	#define PT_GUEST_DIRTY_MASK 0
> >>> +	#define PT_GUEST_DIRTY_SHIFT 0
> >>> +	#define PT_GUEST_ACCESSED_SHIFT 0
> >>> +	#define CMPXCHG cmpxchg64
> >>> +	#define PT_MAX_FULL_LEVELS 4
> >>>  #else
> >>>  	#error Invalid PTTYPE value
> >>>  #endif
> >>
> >> Please add a
> >>
> >> BUILD_BUG_ON(!!PT_GUEST_ACCESSED_MASK != !!PT_GUEST_DIRTY_MASK);
> >> #if PT_GUEST_ACCESSED_MASK
> >> BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT <= PT_GUEST_ACCESSED_SHIFT);
> >> BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT <= PT_WRITABLE_SHIFT);
> >> BUILD_BUG_ON(PT_GUEST_DIRTY_MASK != (1 << PT_GUEST_DIRTY_SHIFT));
> >> BUILD_BUG_ON(PT_GUEST_ACCESSED_MASK != (1 << PT_GUEST_ACCESSED_SHIFT));
> >
> > This will not work if I define _SHIFT to be 8/9 now.
> 
> They will because I put them under an appropriate "#if". :)
> 
True, missed that.

> OTOH if you define _SHIFT to be 8/9 you can move the #if so that it only
> covers the last two checks.
> 
> > But we do not use
> > BUILD_BUG_ON() to check values from the same define "namespace". It is
> > implied that they are correct and when they change all "namespace"
> > remains to be consistent. If you look at BUILD_BUG_ON() that we have
> > (and this patch adds) they are from the form:
> >   PT_WRITABLE_MASK != ACC_WRITE_MASK
> >   ACC_WRITE_MASK != VMX_EPT_WRITABLE_MASK
> >   VMX_EPT_READABLE_MASK != PT_PRESENT_MASK
> >   VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK
> > 
> > i.e they compare value from different "namespaces".
> 
> Yes, all these BUILD_BUG_ONs make sense.
> 
> But I think BUILD_BUG_ON() is more generically for consistency checks
> and enforcing invariants that the code expects.  Our invariants are:
> 
> * A/D bits are enabled or disabled in pairs
> 
> * dirty is the left of accessed and writable
> 
> * masks should either be zero or match the corresponding shifts
> 
> The alternative to BUILD_BUG_ON would be a comment that explains the
> invariants, but there's no need to use English if C can do it better! :)
> 
OK, will add this in separate patch.

> >>>  	pt_element_t pte;
> >>>  	pt_element_t __user *uninitialized_var(ptep_user);
> >>>  	gfn_t table_gfn;
> >>> @@ -322,7 +351,9 @@ retry_walk:
> >>>  		accessed_dirty &= pte >>
> >>>  			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
> >>
> >> This shift is one of two things that bugged me.  I dislike including
> >> "wrong" code just because it is dead.  Perhaps you can #define the
> > Meanwhile I changed comment above this to be:
> >            /*
> >             * On a write fault, fold the dirty bit into accessed_dirty.
> >             * For modes without A/D bits support accessed_dirty will be
> >             * always clear.
> >             */
> 
> Good.
> 
> >> shifts to 8 and 9 already now, even if the masks stay 0?
> >>
> > Currently I do not see any problem with that, but we will have to be careful
> > that *_SHIFT values will not leak into a code where it could matter.
> 
> They would leak with PT_GUEST_*_SHIFT == 0 too, I think?  (And with
> worse effects, because they would use bit 0 and/or do shifts with
> negative counts).
> 
> Perhaps PT_GUEST_*_SHIFT can instead be defined to a function like
> __using_nonexistent_pte_bit(), so that compilation fails unless all such
> references are optimized away.  See arch/x86/include/asm/cmpxchg.h for
> an example:
> 
> /*
>  * Non-existant functions to indicate usage errors at link time
>  * (or compile-time if the compiler implements __compiletime_error().
>  */
> extern void __xchg_wrong_size(void)
>         __compiletime_error("Bad argument size for xchg");
> 
Nice! But not all of them get optimized away at the right stage. The one
in the accessed_dirty calculation is reachable, but its result is thrown away.
If I define *_SHIFT to be a function, that code cannot be optimized out, since
the compiler thinks the function call has side effects. Adding the pure
function attribute solves that though, so I'll go with that.
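
Roughly (sketch; the name is the one you suggested, the pure attribute is the
bit I am adding so that the dead uses get discarded before the error attribute
can trigger):

        /* Never really called: A/D bits do not exist for this PTTYPE. */
        extern u64 __using_nonexistent_pte_bit(void)
                __compiletime_error("wrong use of PT_GUEST_*_SHIFT")
                __attribute__((pure));

        #define PT_GUEST_ACCESSED_SHIFT __using_nonexistent_pte_bit()
        #define PT_GUEST_DIRTY_SHIFT __using_nonexistent_pte_bit()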
 
--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h
  2013-07-30 14:22         ` Gleb Natapov
@ 2013-07-30 14:36           ` Gleb Natapov
  0 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-30 14:36 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang

On Tue, Jul 30, 2013 at 05:22:32PM +0300, Gleb Natapov wrote:
> > >>
> > >> BUILD_BUG_ON(!!PT_GUEST_ACCESSED_MASK != !!PT_GUEST_DIRTY_MASK);
> > >> #if PT_GUEST_ACCESSED_MASK
> > >> BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT <= PT_GUEST_ACCESSED_SHIFT);
> > >> BUILD_BUG_ON(PT_GUEST_DIRTY_SHIFT <= PT_WRITABLE_SHIFT);
> > >> BUILD_BUG_ON(PT_GUEST_DIRTY_MASK != (1 << PT_GUEST_DIRTY_SHIFT));
> > >> BUILD_BUG_ON(PT_GUEST_ACCESSED_MASK != (1 << PT_GUEST_ACCESSED_SHIFT));
> > >
> > > This will not work if I define _SHIFT to be 8/9 now.
> > 
> > They will because I put them under an appropriate "#if". :)
> > 
> True, missed that.
> 
> > OTOH if you define _SHIFT to be 8/9 you can move the #if so that it only
> > covers the last two checks.
> > 
> > > But we do not use
> > > BUILD_BUG_ON() to check values from the same define "namespace". It is
> > > implied that they are correct and when they change all "namespace"
> > > remains to be consistent. If you look at BUILD_BUG_ON() that we have
> > > (and this patch adds) they are from the form:
> > >   PT_WRITABLE_MASK != ACC_WRITE_MASK
> > >   ACC_WRITE_MASK != VMX_EPT_WRITABLE_MASK
> > >   VMX_EPT_READABLE_MASK != PT_PRESENT_MASK
> > >   VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK
> > > 
> > > i.e they compare value from different "namespaces".
> > 
> > Yes, all these BUILD_BUG_ONs make sense.
> > 
> > But I think BUILD_BUG_ON() is more generically for consistency checks
> > and enforcing invariants that the code expects.  Our invariants are:
> > 
> > * A/D bits are enabled or disabled in pairs
> > 
> > * dirty is the left of accessed and writable
> > 
> > * masks should either be zero or match the corresponding shifts
> > 
> > The alternative to BUILD_BUG_ON would be a comment that explains the
> > invariants, but there's no need to use English if C can do it better! :)
> > 
> OK, will add this in separate patch.
> 
Not really, BUILD_BUG_ON() is not meant to be used outside of a
function.

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 03/13] nEPT: Fix wrong test in kvm_set_cr3
  2013-07-25 10:59 ` [PATCH v4 03/13] nEPT: Fix wrong test in kvm_set_cr3 Gleb Natapov
  2013-07-29  8:36   ` Paolo Bonzini
@ 2013-07-31  8:02   ` Xiao Guangrong
  1 sibling, 0 replies; 55+ messages in thread
From: Xiao Guangrong @ 2013-07-31  8:02 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini

On 07/25/2013 06:59 PM, Gleb Natapov wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
> address. The problem is that with nested EPT, cr3 is an *L2* physical
> address, not an L1 physical address as this test expects.
> 
> As the comment above this test explains, it isn't necessary, and doesn't
> correspond to anything a real processor would do. So this patch removes it.
> 
> Note that this wrong test could have also theoretically caused problems
> in nested NPT, not just in nested EPT. However, in practice, the problem
> was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
> nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
> circumventing the problem. Additional potential calls to the buggy function
> are avoided in that we don't trap cr3 modifications when nested NPT is
> enabled. However, because in nested VMX we did want to use kvm_set_cr3()
> (as requested in Avi Kivity's review of the original nested VMX patches),
> we can't avoid this problem and need to fix it.
> 

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 04/13] nEPT: Move common code to paging_tmpl.h
  2013-07-25 10:59 ` [PATCH v4 04/13] nEPT: Move common code to paging_tmpl.h Gleb Natapov
@ 2013-07-31  8:02   ` Xiao Guangrong
  2013-07-31  8:36     ` Gleb Natapov
  0 siblings, 1 reply; 55+ messages in thread
From: Xiao Guangrong @ 2013-07-31  8:02 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini

On 07/25/2013 06:59 PM, Gleb Natapov wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
> 
> For preparation, we just move gpte_access(), prefetch_invalid_gpte(),
> s_rsvd_bits_set(), protect_clean_gpte() and is_dirty_gpte() from mmu.c
> to paging_tmpl.h.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  arch/x86/kvm/mmu.c         |   55 ------------------------------
>  arch/x86/kvm/paging_tmpl.h |   80 +++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 68 insertions(+), 67 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 3a9493a..4c4274d 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -331,11 +331,6 @@ static int is_large_pte(u64 pte)
>  	return pte & PT_PAGE_SIZE_MASK;
>  }
> 
> -static int is_dirty_gpte(unsigned long pte)
> -{
> -	return pte & PT_DIRTY_MASK;
> -}
> -
>  static int is_rmap_spte(u64 pte)
>  {
>  	return is_shadow_present_pte(pte);
> @@ -2574,14 +2569,6 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
>  	mmu_free_roots(vcpu);
>  }
> 
> -static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
> -{
> -	int bit7;
> -
> -	bit7 = (gpte >> 7) & 1;
> -	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> -}
> -
>  static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
>  				     bool no_dirty_log)
>  {
> @@ -2594,26 +2581,6 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
>  	return gfn_to_pfn_memslot_atomic(slot, gfn);
>  }
> 
> -static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu,
> -				  struct kvm_mmu_page *sp, u64 *spte,
> -				  u64 gpte)
> -{
> -	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
> -		goto no_present;
> -
> -	if (!is_present_gpte(gpte))
> -		goto no_present;
> -
> -	if (!(gpte & PT_ACCESSED_MASK))
> -		goto no_present;
> -
> -	return false;
> -
> -no_present:
> -	drop_spte(vcpu->kvm, spte);
> -	return true;
> -}
> -
>  static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
>  				    struct kvm_mmu_page *sp,
>  				    u64 *start, u64 *end)
> @@ -3501,18 +3468,6 @@ static void paging_free(struct kvm_vcpu *vcpu)
>  	nonpaging_free(vcpu);
>  }
> 
> -static inline void protect_clean_gpte(unsigned *access, unsigned gpte)
> -{
> -	unsigned mask;
> -
> -	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
> -
> -	mask = (unsigned)~ACC_WRITE_MASK;
> -	/* Allow write access to dirty gptes */
> -	mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
> -	*access &= mask;
> -}
> -
>  static bool sync_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
>  			   unsigned access, int *nr_present)
>  {
> @@ -3530,16 +3485,6 @@ static bool sync_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
>  	return false;
>  }
> 
> -static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte)
> -{
> -	unsigned access;
> -
> -	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
> -	access &= ~(gpte >> PT64_NX_SHIFT);
> -
> -	return access;
> -}
> -
>  static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gpte)
>  {
>  	unsigned index;
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 7769699..fb26ca9 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -80,6 +80,31 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
>  	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
>  }
> 
> +static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> +{
> +	unsigned mask;
> +
> +	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
> +
> +	mask = (unsigned)~ACC_WRITE_MASK;
> +	/* Allow write access to dirty gptes */
> +	mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
> +	*access &= mask;
> +}
> +
> +static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> +{
> +	int bit7;
> +
> +	bit7 = (gpte >> 7) & 1;
> +	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> +}
> +
> +static inline int FNAME(is_present_gpte)(unsigned long pte)
> +{
> +	return is_present_gpte(pte);
> +}

It is adding a new function, not just moving code, and the change log fails
to mention it. :)

> +
>  static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  			       pt_element_t __user *ptep_user, unsigned index,
>  			       pt_element_t orig_pte, pt_element_t new_pte)
> @@ -103,6 +128,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  	return (ret != orig_pte);
>  }
> 
> +static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
> +				  struct kvm_mmu_page *sp, u64 *spte,
> +				  u64 gpte)
> +{
> +	if (FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
> +		goto no_present;
> +
> +	if (!FNAME(is_present_gpte)(gpte))
> +		goto no_present;
> +
> +	if (!(gpte & PT_ACCESSED_MASK))
> +		goto no_present;
> +
> +	return false;
> +
> +no_present:
> +	drop_spte(vcpu->kvm, spte);
> +	return true;
> +}
> +
> +static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
> +{
> +	unsigned access;
> +
> +	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
> +	access &= ~(gpte >> PT64_NX_SHIFT);
> +
> +	return access;
> +}
> +
>  static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
>  					     struct kvm_mmu *mmu,
>  					     struct guest_walker *walker,
> @@ -123,7 +178,8 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
>  			trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte));
>  			pte |= PT_ACCESSED_MASK;
>  		}
> -		if (level == walker->level && write_fault && !is_dirty_gpte(pte)) {
> +		if (level == walker->level && write_fault &&
> +				!(pte & PT_DIRTY_MASK)) {

Why use the raw code instead of the function? Is it because this is the only
user of the function, so the function can just be dropped?

Others look good to me.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 04/13] nEPT: Move common code to paging_tmpl.h
  2013-07-31  8:02   ` Xiao Guangrong
@ 2013-07-31  8:36     ` Gleb Natapov
  0 siblings, 0 replies; 55+ messages in thread
From: Gleb Natapov @ 2013-07-31  8:36 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini

On Wed, Jul 31, 2013 at 04:02:49PM +0800, Xiao Guangrong wrote:
> On 07/25/2013 06:59 PM, Gleb Natapov wrote:
> > From: Nadav Har'El <nyh@il.ibm.com>
> > 
> > For preparation, we just move gpte_access(), prefetch_invalid_gpte(),
> > s_rsvd_bits_set(), protect_clean_gpte() and is_dirty_gpte() from mmu.c
> > to paging_tmpl.h.
> > 
> > Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> > Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> > Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > ---
> >  arch/x86/kvm/mmu.c         |   55 ------------------------------
> >  arch/x86/kvm/paging_tmpl.h |   80 +++++++++++++++++++++++++++++++++++++-------
> >  2 files changed, 68 insertions(+), 67 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 3a9493a..4c4274d 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -331,11 +331,6 @@ static int is_large_pte(u64 pte)
> >  	return pte & PT_PAGE_SIZE_MASK;
> >  }
> > 
> > -static int is_dirty_gpte(unsigned long pte)
> > -{
> > -	return pte & PT_DIRTY_MASK;
> > -}
> > -
> >  static int is_rmap_spte(u64 pte)
> >  {
> >  	return is_shadow_present_pte(pte);
> > @@ -2574,14 +2569,6 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
> >  	mmu_free_roots(vcpu);
> >  }
> > 
> > -static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
> > -{
> > -	int bit7;
> > -
> > -	bit7 = (gpte >> 7) & 1;
> > -	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> > -}
> > -
> >  static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
> >  				     bool no_dirty_log)
> >  {
> > @@ -2594,26 +2581,6 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
> >  	return gfn_to_pfn_memslot_atomic(slot, gfn);
> >  }
> > 
> > -static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu,
> > -				  struct kvm_mmu_page *sp, u64 *spte,
> > -				  u64 gpte)
> > -{
> > -	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
> > -		goto no_present;
> > -
> > -	if (!is_present_gpte(gpte))
> > -		goto no_present;
> > -
> > -	if (!(gpte & PT_ACCESSED_MASK))
> > -		goto no_present;
> > -
> > -	return false;
> > -
> > -no_present:
> > -	drop_spte(vcpu->kvm, spte);
> > -	return true;
> > -}
> > -
> >  static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
> >  				    struct kvm_mmu_page *sp,
> >  				    u64 *start, u64 *end)
> > @@ -3501,18 +3468,6 @@ static void paging_free(struct kvm_vcpu *vcpu)
> >  	nonpaging_free(vcpu);
> >  }
> > 
> > -static inline void protect_clean_gpte(unsigned *access, unsigned gpte)
> > -{
> > -	unsigned mask;
> > -
> > -	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
> > -
> > -	mask = (unsigned)~ACC_WRITE_MASK;
> > -	/* Allow write access to dirty gptes */
> > -	mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
> > -	*access &= mask;
> > -}
> > -
> >  static bool sync_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
> >  			   unsigned access, int *nr_present)
> >  {
> > @@ -3530,16 +3485,6 @@ static bool sync_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
> >  	return false;
> >  }
> > 
> > -static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte)
> > -{
> > -	unsigned access;
> > -
> > -	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
> > -	access &= ~(gpte >> PT64_NX_SHIFT);
> > -
> > -	return access;
> > -}
> > -
> >  static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gpte)
> >  {
> >  	unsigned index;
> > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > index 7769699..fb26ca9 100644
> > --- a/arch/x86/kvm/paging_tmpl.h
> > +++ b/arch/x86/kvm/paging_tmpl.h
> > @@ -80,6 +80,31 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
> >  	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
> >  }
> > 
> > +static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> > +{
> > +	unsigned mask;
> > +
> > +	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
> > +
> > +	mask = (unsigned)~ACC_WRITE_MASK;
> > +	/* Allow write access to dirty gptes */
> > +	mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
> > +	*access &= mask;
> > +}
> > +
> > +static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> > +{
> > +	int bit7;
> > +
> > +	bit7 = (gpte >> 7) & 1;
> > +	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> > +}
> > +
> > +static inline int FNAME(is_present_gpte)(unsigned long pte)
> > +{
> > +	return is_present_gpte(pte);
> > +}
> 
> It is adding a new function not just move and failed to mention it in the
> change log. :)
> 
> > +
> >  static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >  			       pt_element_t __user *ptep_user, unsigned index,
> >  			       pt_element_t orig_pte, pt_element_t new_pte)
> > @@ -103,6 +128,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> >  	return (ret != orig_pte);
> >  }
> > 
> > +static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
> > +				  struct kvm_mmu_page *sp, u64 *spte,
> > +				  u64 gpte)
> > +{
> > +	if (FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
> > +		goto no_present;
> > +
> > +	if (!FNAME(is_present_gpte)(gpte))
> > +		goto no_present;
> > +
> > +	if (!(gpte & PT_ACCESSED_MASK))
> > +		goto no_present;
> > +
> > +	return false;
> > +
> > +no_present:
> > +	drop_spte(vcpu->kvm, spte);
> > +	return true;
> > +}
> > +
> > +static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
> > +{
> > +	unsigned access;
> > +
> > +	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
> > +	access &= ~(gpte >> PT64_NX_SHIFT);
> > +
> > +	return access;
> > +}
> > +
> >  static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
> >  					     struct kvm_mmu *mmu,
> >  					     struct guest_walker *walker,
> > @@ -123,7 +178,8 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
> >  			trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte));
> >  			pte |= PT_ACCESSED_MASK;
> >  		}
> > -		if (level == walker->level && write_fault && !is_dirty_gpte(pte)) {
> > +		if (level == walker->level && write_fault &&
> > +				!(pte & PT_DIRTY_MASK)) {
> 
> Why use the raw code instead of the function? Since it is the only user of
> this function, so drop the function?
> 
Yes, I have two choices: either move the function to paging_tmpl.h and
make it FNAME(is_dirty_gpte)(), or just drop it, since the other bits we check
in the pte are open coded anyway. Dropping the function seems more consistent
with the rest of the code.
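
(For reference, the first option would just have been the trivial wrapper,
something like

        static inline int FNAME(is_dirty_gpte)(unsigned long pte)
        {
                return pte & PT_DIRTY_MASK;
        }

so nothing much is lost by open coding it.)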


> Others look good to me.
> 
> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>

--
			Gleb.

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2013-07-31  8:37 UTC | newest]

Thread overview: 55+ messages
2013-07-25 10:59 [PATCH v4 00/13] Nested EPT Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Gleb Natapov
2013-07-29  8:32   ` Paolo Bonzini
2013-07-29 13:12     ` Gleb Natapov
2013-07-29 14:13       ` Paolo Bonzini
2013-07-25 10:59 ` [PATCH v4 02/13] nEPT: Fix cr3 handling in nested exit and entry Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 03/13] nEPT: Fix wrong test in kvm_set_cr3 Gleb Natapov
2013-07-29  8:36   ` Paolo Bonzini
2013-07-29 10:43     ` Gleb Natapov
2013-07-31  8:02   ` Xiao Guangrong
2013-07-25 10:59 ` [PATCH v4 04/13] nEPT: Move common code to paging_tmpl.h Gleb Natapov
2013-07-31  8:02   ` Xiao Guangrong
2013-07-31  8:36     ` Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 05/13] nEPT: make guest's A/D bits depends on guest's paging mode Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 06/13] nEPT: Add EPT tables support to paging_tmpl.h Gleb Natapov
2013-07-29  9:48   ` Paolo Bonzini
2013-07-29 11:33     ` Gleb Natapov
2013-07-29 11:55       ` Paolo Bonzini
2013-07-29 12:24         ` Gleb Natapov
2013-07-29 13:19           ` Paolo Bonzini
2013-07-29 13:27             ` Gleb Natapov
2013-07-29 14:15               ` Paolo Bonzini
2013-07-29 16:14                 ` Gleb Natapov
2013-07-29 16:28                   ` Paolo Bonzini
2013-07-29 16:43                     ` Gleb Natapov
2013-07-29 17:06                       ` Paolo Bonzini
2013-07-29 17:11                         ` Gleb Natapov
2013-07-30 10:03   ` Paolo Bonzini
2013-07-30 11:56     ` Gleb Natapov
2013-07-30 12:13       ` Paolo Bonzini
2013-07-30 14:22         ` Gleb Natapov
2013-07-30 14:36           ` Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 07/13] nEPT: Redefine EPT-specific link_shadow_page() Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 08/13] nEPT: Nested INVEPT Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 09/13] nEPT: Add nEPT violation/misconfigration support Gleb Natapov
2013-07-29  8:59   ` Paolo Bonzini
2013-07-29 10:52     ` Gleb Natapov
2013-07-29 10:59       ` Paolo Bonzini
2013-07-29 11:43         ` Gleb Natapov
2013-07-29 12:05           ` Paolo Bonzini
2013-07-29 12:34             ` Gleb Natapov
2013-07-29 13:11               ` Paolo Bonzini
2013-07-29 13:20                 ` Gleb Natapov
2013-07-29 14:12                   ` Paolo Bonzini
2013-07-29 16:24                     ` Gleb Natapov
2013-07-29 16:36                       ` Paolo Bonzini
2013-07-29 16:54                         ` Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 10/13] nEPT: MMU context for nested EPT Gleb Natapov
2013-07-25 10:59 ` [PATCH v4 11/13] nEPT: Advertise EPT to L1 Gleb Natapov
2013-07-29  9:21   ` Paolo Bonzini
2013-07-29 11:11     ` Gleb Natapov
2013-07-29 11:33       ` Paolo Bonzini
2013-07-29 11:35         ` Gleb Natapov
2013-07-25 11:00 ` [PATCH v4 12/13] nEPT: Some additional comments Gleb Natapov
2013-07-25 11:00 ` [PATCH v4 13/13] nEPT: Miscelleneous cleanups Gleb Natapov
