* [PATCH 0/4] Lockless Access Tracking for Intel CPUs without EPT A bits
@ 2016-10-27  2:19 Junaid Shahid
  2016-10-27  2:19 ` [PATCH 1/4] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
                   ` (5 more replies)
  0 siblings, 6 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-10-27  2:19 UTC (permalink / raw)
  To: kvm; +Cc: pbonzini, andreslc, pfeiner

Hi,

This patch series implements a lockless access tracking mechanism for KVM
when running on Intel CPUs that do not have EPT A/D bits. 

Currently, KVM tracks accesses on these machines by just clearing the PTEs
and then remapping them when they are accessed again. However, the remapping
requires acquiring the MMU lock in order to look up the information needed to
construct the PTE. On high core count VMs, this can result in significant MMU
lock contention when running some memory-intensive workloads.

This new mechanism just marks the PTEs as not-present, but keeps all the
information within the PTE instead of clearing it. When the page is accessed
again, the PTE can thus be restored without needing to acquire the MMU lock.
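
As a rough standalone sketch of the idea (not KVM code; the bit positions
mirror the VMX patch later in the series, but the helper names are made up
for illustration), marking and restoring a PTE is just a pair of bit shuffles:

/*
 * Standalone sketch, not KVM code: stash the permission bits in ignored
 * high bits together with a marker instead of zeroing the PTE, so a later
 * fault can rebuild the original value without taking the MMU lock.
 */
#include <stdint.h>
#include <stdio.h>

#define RWX_MASK		0x7ull
#define RWX_SAVE_SHIFT		52
#define ACC_TRACK_MARKER	(1ull << 62)

static uint64_t mark_for_access_track(uint64_t pte)
{
	/* Copy RWX into the save area, then clear it so the next access faults. */
	pte &= ~(RWX_MASK << RWX_SAVE_SHIFT);
	pte |= (pte & RWX_MASK) << RWX_SAVE_SHIFT;
	pte &= ~RWX_MASK;
	return pte | ACC_TRACK_MARKER;
}

static uint64_t restore_from_access_track(uint64_t pte)
{
	/* Move the saved RWX bits back down and drop the marker. */
	uint64_t saved = (pte >> RWX_SAVE_SHIFT) & RWX_MASK;

	pte &= ~(ACC_TRACK_MARKER | (RWX_MASK << RWX_SAVE_SHIFT));
	return pte | saved;
}

int main(void)
{
	uint64_t pte = 0x12345000ull | RWX_MASK;	/* pfn bits + RWX */
	uint64_t tracked = mark_for_access_track(pte);

	printf("original %#llx\n", (unsigned long long)pte);
	printf("tracked  %#llx\n", (unsigned long long)tracked);
	printf("restored %#llx\n",
	       (unsigned long long)restore_from_access_track(tracked));
	return 0;
}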


Junaid Shahid (4):
  kvm: x86: mmu: Use symbolic constants for EPT Violation Exit
    Qualifications
  kvm: x86: mmu: Rename spte_is_locklessly_modifiable()
  kvm: x86: mmu: Fast Page Fault path retries
  kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A
    bits.

 arch/x86/include/asm/vmx.h |  55 +++++++
 arch/x86/kvm/mmu.c         | 399 +++++++++++++++++++++++++++++++++------------
 arch/x86/kvm/mmu.h         |   2 +
 arch/x86/kvm/vmx.c         |  40 ++++-
 4 files changed, 382 insertions(+), 114 deletions(-)

-- 
2.8.0.rc3.226.g39d4020



* [PATCH 1/4] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications
  2016-10-27  2:19 [PATCH 0/4] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
@ 2016-10-27  2:19 ` Junaid Shahid
  2016-11-02 18:03   ` Paolo Bonzini
  2016-10-27  2:19 ` [PATCH 2/4] kvm: x86: mmu: Rename spte_is_locklessly_modifiable() Junaid Shahid
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-10-27  2:19 UTC (permalink / raw)
  To: kvm; +Cc: pbonzini, andreslc, pfeiner

This change adds some symbolic constants for VM Exit Qualifications
related to EPT Violations and updates handle_ept_violation() to use
these constants instead of hard-coded numbers.
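
The mapping from exit-qualification bits to page-fault error-code bits can be
exercised on its own; a minimal userspace illustration (the PFERR_* positions
below simply restate the x86 error-code layout and are not part of this patch):

#include <stdio.h>

#define EPT_VIOLATION_READ_BIT		0
#define EPT_VIOLATION_WRITE_BIT		1
#define EPT_VIOLATION_INSTR_BIT		2
#define EPT_VIOLATION_READABLE_BIT	3
#define EPT_VIOLATION_WRITE		(1 << EPT_VIOLATION_WRITE_BIT)
#define EPT_VIOLATION_READABLE		(1 << EPT_VIOLATION_READABLE_BIT)

#define PFERR_PRESENT_BIT	0
#define PFERR_WRITE_BIT		1
#define PFERR_USER_BIT		2
#define PFERR_FETCH_BIT		4

static unsigned int decode(unsigned long exit_qualification)
{
	unsigned int error_code;

	error_code  = ((exit_qualification >> EPT_VIOLATION_READ_BIT) & 1)
		      << PFERR_USER_BIT;
	error_code |= ((exit_qualification >> EPT_VIOLATION_WRITE_BIT) & 1)
		      << PFERR_WRITE_BIT;
	error_code |= ((exit_qualification >> EPT_VIOLATION_INSTR_BIT) & 1)
		      << PFERR_FETCH_BIT;
	error_code |= ((exit_qualification >> EPT_VIOLATION_READABLE_BIT) & 1)
		      << PFERR_PRESENT_BIT;
	return error_code;
}

int main(void)
{
	/* Write fault on a mapping that is currently readable (present). */
	unsigned long eq = EPT_VIOLATION_WRITE | EPT_VIOLATION_READABLE;

	printf("error_code = %#x\n", decode(eq));	/* prints 0x3 */
	return 0;
}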

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/include/asm/vmx.h | 16 ++++++++++++++++
 arch/x86/kvm/vmx.c         | 20 ++++++++++++--------
 2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index a002b07..60991fb 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -465,6 +465,22 @@ struct vmx_msr_entry {
 #define ENTRY_FAIL_VMCS_LINK_PTR	4
 
 /*
+ * Exit Qualifications for EPT Violations
+ */
+#define EPT_VIOLATION_READ_BIT		0
+#define EPT_VIOLATION_WRITE_BIT		1
+#define EPT_VIOLATION_INSTR_BIT		2
+#define EPT_VIOLATION_READABLE_BIT	3
+#define EPT_VIOLATION_WRITABLE_BIT	4
+#define EPT_VIOLATION_EXECUTABLE_BIT	5
+#define EPT_VIOLATION_READ		(1 << EPT_VIOLATION_READ_BIT)
+#define EPT_VIOLATION_WRITE		(1 << EPT_VIOLATION_WRITE_BIT)
+#define EPT_VIOLATION_INSTR		(1 << EPT_VIOLATION_INSTR_BIT)
+#define EPT_VIOLATION_READABLE		(1 << EPT_VIOLATION_READABLE_BIT)
+#define EPT_VIOLATION_WRITABLE		(1 << EPT_VIOLATION_WRITABLE_BIT)
+#define EPT_VIOLATION_EXECUTABLE	(1 << EPT_VIOLATION_EXECUTABLE_BIT)
+
+/*
  * VM-instruction error numbers
  */
 enum vm_instruction_error_number {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index cf1b16d..859da8e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6170,14 +6170,18 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
 	trace_kvm_page_fault(gpa, exit_qualification);
 
-	/* it is a read fault? */
-	error_code = (exit_qualification << 2) & PFERR_USER_MASK;
-	/* it is a write fault? */
-	error_code |= exit_qualification & PFERR_WRITE_MASK;
-	/* It is a fetch fault? */
-	error_code |= (exit_qualification << 2) & PFERR_FETCH_MASK;
-	/* ept page table is present? */
-	error_code |= (exit_qualification & 0x38) != 0;
+	/* Is it a read fault? */
+	error_code = ((exit_qualification >> EPT_VIOLATION_READ_BIT) & 1)
+		     << PFERR_USER_BIT;
+	/* Is it a write fault? */
+	error_code |= ((exit_qualification >> EPT_VIOLATION_WRITE_BIT) & 1)
+		      << PFERR_WRITE_BIT;
+	/* Is it a fetch fault? */
+	error_code |= ((exit_qualification >> EPT_VIOLATION_INSTR_BIT) & 1)
+		      << PFERR_FETCH_BIT;
+	/* ept page table entry is present? */
+	error_code |= ((exit_qualification >> EPT_VIOLATION_READABLE_BIT) & 1)
+		      << PFERR_PRESENT_BIT;
 
 	vcpu->arch.exit_qualification = exit_qualification;
 
-- 
2.8.0.rc3.226.g39d4020



* [PATCH 2/4] kvm: x86: mmu: Rename spte_is_locklessly_modifiable()
  2016-10-27  2:19 [PATCH 0/4] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
  2016-10-27  2:19 ` [PATCH 1/4] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
@ 2016-10-27  2:19 ` Junaid Shahid
  2016-10-27  2:19 ` [PATCH 3/4] kvm: x86: mmu: Fast Page Fault path retries Junaid Shahid
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-10-27  2:19 UTC (permalink / raw)
  To: kvm; +Cc: pbonzini, andreslc, pfeiner

This change renames spte_is_locklessly_modifiable() to
spte_can_locklessly_be_made_writable() to distinguish it from other
forms of lockless modifications. The full set of lockless modifications
is covered by spte_has_volatile_bits().

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/kvm/mmu.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d9c7e98..e580134 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -473,7 +473,7 @@ retry:
 }
 #endif
 
-static bool spte_is_locklessly_modifiable(u64 spte)
+static bool spte_can_locklessly_be_made_writable(u64 spte)
 {
 	return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
 		(SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
@@ -487,7 +487,7 @@ static bool spte_has_volatile_bits(u64 spte)
 	 * also, it can help us to get a stable is_writable_pte()
 	 * to ensure tlb flush is not missed.
 	 */
-	if (spte_is_locklessly_modifiable(spte))
+	if (spte_can_locklessly_be_made_writable(spte))
 		return true;
 
 	if (!shadow_accessed_mask)
@@ -556,7 +556,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	 * we always atomically update it, see the comments in
 	 * spte_has_volatile_bits().
 	 */
-	if (spte_is_locklessly_modifiable(old_spte) &&
+	if (spte_can_locklessly_be_made_writable(old_spte) &&
 	      !is_writable_pte(new_spte))
 		ret = true;
 
@@ -1212,7 +1212,7 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
 	u64 spte = *sptep;
 
 	if (!is_writable_pte(spte) &&
-	      !(pt_protect && spte_is_locklessly_modifiable(spte)))
+	      !(pt_protect && spte_can_locklessly_be_made_writable(spte)))
 		return false;
 
 	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
@@ -2973,7 +2973,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 	 * Currently, to simplify the code, only the spte write-protected
 	 * by dirty-log can be fast fixed.
 	 */
-	if (!spte_is_locklessly_modifiable(spte))
+	if (!spte_can_locklessly_be_made_writable(spte))
 		goto exit;
 
 	/*
-- 
2.8.0.rc3.226.g39d4020



* [PATCH 3/4] kvm: x86: mmu: Fast Page Fault path retries
  2016-10-27  2:19 [PATCH 0/4] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
  2016-10-27  2:19 ` [PATCH 1/4] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
  2016-10-27  2:19 ` [PATCH 2/4] kvm: x86: mmu: Rename spte_is_locklessly_modifiable() Junaid Shahid
@ 2016-10-27  2:19 ` Junaid Shahid
  2016-10-27  2:19 ` [PATCH 4/4] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-10-27  2:19 UTC (permalink / raw)
  To: kvm; +Cc: pbonzini, andreslc, pfeiner

This change adds retries to the Fast Page Fault path. Without the retries
the code still works, but if a retry does end up being needed, it results
in a second page fault for the same memory access, which incurs much more
overhead than simply retrying within the original fault.

This would be especially useful with the upcoming fast access tracking
change, as that would make it more likely for retries to be needed
(e.g. due to read and write faults happening on different CPUs at
the same time).
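
A minimal userspace sketch of the retry pattern, assuming nothing beyond C11
atomics (the kernel code below uses cmpxchg64() instead, and the names here
are illustrative, not the actual KVM functions):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PT_WRITABLE_MASK	(1ull << 1)

static _Atomic uint64_t spte = 0x1001;	/* readable, not yet writable */

/* Stand-in for fast_pf_fix_direct_spte(): make the entry writable iff it
 * still holds the value we read earlier. */
static bool fix_spte(uint64_t old)
{
	return atomic_compare_exchange_strong(&spte, &old,
					      old | PT_WRITABLE_MASK);
}

static bool fast_fault(void)
{
	uint64_t cur = atomic_load(&spte);
	unsigned int retry_count = 0;

	do {
		if (cur & PT_WRITABLE_MASK)
			return true;	/* someone else already fixed it */

		if (fix_spte(cur))
			return true;	/* we fixed it */

		if (++retry_count > 4)
			return false;	/* give up, take a new fault */

		cur = atomic_load(&spte);	/* re-read and retry */
	} while (true);
}

int main(void)
{
	printf("fast_fault: %s\n", fast_fault() ? "handled" : "not handled");
	printf("spte now %#llx\n", (unsigned long long)atomic_load(&spte));
	return 0;
}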

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/kvm/mmu.c | 117 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 69 insertions(+), 48 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e580134..a22a8a2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2889,6 +2889,10 @@ static bool page_fault_can_be_fast(u32 error_code)
 	return true;
 }
 
+/*
+ * Returns true if the SPTE was fixed successfully. Otherwise,
+ * someone else modified the SPTE from its original value.
+ */
 static bool
 fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			u64 *sptep, u64 spte)
@@ -2915,8 +2919,10 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	 *
 	 * Compare with set_spte where instead shadow_dirty_mask is set.
 	 */
-	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
-		kvm_vcpu_mark_page_dirty(vcpu, gfn);
+	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) != spte)
+		return false;
+
+	kvm_vcpu_mark_page_dirty(vcpu, gfn);
 
 	return true;
 }
@@ -2933,6 +2939,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 	struct kvm_mmu_page *sp;
 	bool ret = false;
 	u64 spte = 0ull;
+	uint retry_count = 0;
 
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return false;
@@ -2945,57 +2952,71 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 		if (!is_shadow_present_pte(spte) || iterator.level < level)
 			break;
 
-	/*
-	 * If the mapping has been changed, let the vcpu fault on the
-	 * same address again.
-	 */
-	if (!is_shadow_present_pte(spte)) {
-		ret = true;
-		goto exit;
-	}
+	do {
+		/*
+		 * If the mapping has been changed, let the vcpu fault on the
+		 * same address again.
+		 */
+		if (!is_shadow_present_pte(spte)) {
+			ret = true;
+			break;
+		}
 
-	sp = page_header(__pa(iterator.sptep));
-	if (!is_last_spte(spte, sp->role.level))
-		goto exit;
+		sp = page_header(__pa(iterator.sptep));
+		if (!is_last_spte(spte, sp->role.level))
+			break;
 
-	/*
-	 * Check if it is a spurious fault caused by TLB lazily flushed.
-	 *
-	 * Need not check the access of upper level table entries since
-	 * they are always ACC_ALL.
-	 */
-	 if (is_writable_pte(spte)) {
-		ret = true;
-		goto exit;
-	}
+		/*
+		 * Check if it is a spurious fault caused by TLB lazily flushed.
+		 *
+		 * Need not check the access of upper level table entries since
+		 * they are always ACC_ALL.
+		 */
+		if (is_writable_pte(spte)) {
+			ret = true;
+			break;
+		}
 
-	/*
-	 * Currently, to simplify the code, only the spte write-protected
-	 * by dirty-log can be fast fixed.
-	 */
-	if (!spte_can_locklessly_be_made_writable(spte))
-		goto exit;
+		/*
+		 * Currently, to simplify the code, only the spte
+		 * write-protected by dirty-log can be fast fixed.
+		 */
+		if (!spte_can_locklessly_be_made_writable(spte))
+			break;
 
-	/*
-	 * Do not fix write-permission on the large spte since we only dirty
-	 * the first page into the dirty-bitmap in fast_pf_fix_direct_spte()
-	 * that means other pages are missed if its slot is dirty-logged.
-	 *
-	 * Instead, we let the slow page fault path create a normal spte to
-	 * fix the access.
-	 *
-	 * See the comments in kvm_arch_commit_memory_region().
-	 */
-	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
-		goto exit;
+		/*
+		 * Do not fix write-permission on the large spte since we only
+		 * dirty the first page into the dirty-bitmap in
+		 * fast_pf_fix_direct_spte() that means other pages are missed
+		 * if its slot is dirty-logged.
+		 *
+		 * Instead, we let the slow page fault path create a normal spte
+		 * to fix the access.
+		 *
+		 * See the comments in kvm_arch_commit_memory_region().
+		 */
+		if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+			break;
+
+		/*
+		 * Currently, fast page fault only works for direct mapping
+		 * since the gfn is not stable for indirect shadow page. See
+		 * Documentation/virtual/kvm/locking.txt to get more detail.
+		 */
+		ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
+		if (ret)
+			break;
+
+		if (++retry_count > 4) {
+			printk_once(KERN_WARNING
+				    "Fast #PF retrying more than 4 times.\n");
+			break;
+		}
+
+		spte = mmu_spte_get_lockless(iterator.sptep);
+
+	} while (true);
 
-	/*
-	 * Currently, fast page fault only works for direct mapping since
-	 * the gfn is not stable for indirect shadow page.
-	 * See Documentation/virtual/kvm/locking.txt to get more detail.
-	 */
-	ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
-exit:
 	trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
 			      spte, ret);
 	walk_shadow_page_lockless_end(vcpu);
-- 
2.8.0.rc3.226.g39d4020



* [PATCH 4/4] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-10-27  2:19 [PATCH 0/4] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
                   ` (2 preceding siblings ...)
  2016-10-27  2:19 ` [PATCH 3/4] kvm: x86: mmu: Fast Page Fault path retries Junaid Shahid
@ 2016-10-27  2:19 ` Junaid Shahid
  2016-11-02 18:01   ` Paolo Bonzini
  2016-11-08 23:00 ` [PATCH v2 0/5] Lockless Access Tracking " Junaid Shahid
  2016-12-07  0:46 ` [PATCH v3 0/8] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
  5 siblings, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-10-27  2:19 UTC (permalink / raw)
  To: kvm; +Cc: pbonzini, andreslc, pfeiner

This change implements lockless access tracking for Intel CPUs without EPT
A bits. This is achieved by marking the PTEs as not-present (but not
completely clearing them) when clear_flush_young() is called after marking
the pages as accessed. When an EPT Violation is generated as a result of
the VM accessing those pages, the PTEs are restored to their original values.
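
A small standalone illustration of how such a not-present, access-tracked PTE
is told apart from both a normal present entry and an MMIO entry (the constant
values restate what is added below; everything else is illustrative, not KVM
code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define VMX_EPT_RWX_MASK		0x7ull
#define VMX_EPT_TRACK_TYPE_SHIFT	62
#define VMX_EPT_TRACK_TYPE_MASK		(3ull << VMX_EPT_TRACK_TYPE_SHIFT)
#define VMX_EPT_TRACK_ACCESS		(1ull << VMX_EPT_TRACK_TYPE_SHIFT)
#define VMX_EPT_TRACK_MMIO		(3ull << VMX_EPT_TRACK_TYPE_SHIFT)
#define VMX_EPT_MISCONFIG_WX_VALUE	0x6ull	/* writable + executable */

static const uint64_t acc_track_mask = VMX_EPT_RWX_MASK |
				       VMX_EPT_TRACK_TYPE_MASK;
static const uint64_t acc_track_value = VMX_EPT_TRACK_ACCESS;

/* True only if RWX is clear and the track type is exactly TRACK_ACCESS. */
static bool is_access_track_spte(uint64_t spte)
{
	return (spte & acc_track_mask) == acc_track_value;
}

int main(void)
{
	uint64_t present = 0x12345000ull | 0x7ull;		   /* normal RWX entry */
	uint64_t tracked = 0x12345000ull | VMX_EPT_TRACK_ACCESS;   /* access-tracked   */
	uint64_t mmio = VMX_EPT_MISCONFIG_WX_VALUE | VMX_EPT_TRACK_MMIO;

	printf("present: %d tracked: %d mmio: %d\n",
	       is_access_track_spte(present),
	       is_access_track_spte(tracked),
	       is_access_track_spte(mmio));	/* prints 0 1 0 */
	return 0;
}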

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/include/asm/vmx.h |  39 ++++++
 arch/x86/kvm/mmu.c         | 314 ++++++++++++++++++++++++++++++++++-----------
 arch/x86/kvm/mmu.h         |   2 +
 arch/x86/kvm/vmx.c         |  20 ++-
 4 files changed, 301 insertions(+), 74 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 60991fb..3d63098 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -434,6 +434,45 @@ enum vmcs_field {
 #define VMX_EPT_IPAT_BIT    			(1ull << 6)
 #define VMX_EPT_ACCESS_BIT				(1ull << 8)
 #define VMX_EPT_DIRTY_BIT				(1ull << 9)
+#define VMX_EPT_RWX_MASK                        (VMX_EPT_READABLE_MASK |       \
+						 VMX_EPT_WRITABLE_MASK |       \
+						 VMX_EPT_EXECUTABLE_MASK)
+#define VMX_EPT_MT_MASK				(7ull << VMX_EPT_MT_EPTE_SHIFT)
+
+/* The mask to use to trigger an EPT Misconfiguration in order to track MMIO */
+#define VMX_EPT_MISCONFIG_WX_VALUE		(VMX_EPT_WRITABLE_MASK |       \
+						 VMX_EPT_EXECUTABLE_MASK)
+
+/*
+ * The shift to use for saving the original RWX value when marking the PTE as
+ * not-present for tracking purposes.
+ */
+#define VMX_EPT_RWX_SAVE_SHIFT			52
+
+/*
+ * The shift/mask for determining the type of tracking (if any) being used for a
+ * not-present PTE. Currently, only two bits are used, but more can be added.
+ *
+ * NOTE: Bit 63 is an architecturally ignored bit (and hence can be used for our
+ *       purpose) when the EPT PTE is in a misconfigured state. However, it is
+ *       not necessarily an ignored bit otherwise (even in a not-present state).
+ *       Since the existing MMIO code already uses this bit and KVM doesn't
+ *       currently use #VEs (where this bit comes into play), we can continue
+ *       to use it for storing the type. But to be on the safe side,
+ *       we should not set it to 1 in those TRACK_TYPEs where the tracking is
+ *       done via EPT Violations instead of EPT Misconfigurations.
+ */
+#define VMX_EPT_TRACK_TYPE_SHIFT		62
+#define VMX_EPT_TRACK_TYPE_MASK			(3ull <<                       \
+						 VMX_EPT_TRACK_TYPE_SHIFT)
+
+/* Sets only bit 62 as the tracking is done by EPT Violations. See note above */
+#define VMX_EPT_TRACK_ACCESS			(1ull <<                       \
+						 VMX_EPT_TRACK_TYPE_SHIFT)
+/* Sets bits 62 and 63. See note above */
+#define VMX_EPT_TRACK_MMIO			(3ull <<                       \
+						 VMX_EPT_TRACK_TYPE_SHIFT)
+
 
 #define VMX_EPT_IDENTITY_PAGETABLE_ADDR		0xfffbc000ul
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a22a8a2..8ea1618 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -37,6 +37,7 @@
 #include <linux/srcu.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
+#include <linux/kern_levels.h>
 
 #include <asm/page.h>
 #include <asm/cmpxchg.h>
@@ -177,6 +178,10 @@ static u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
 static u64 __read_mostly shadow_mmio_mask;
 static u64 __read_mostly shadow_present_mask;
+static u64 __read_mostly shadow_acc_track_mask;
+static u64 __read_mostly shadow_acc_track_value;
+static u64 __read_mostly shadow_acc_track_saved_bits_mask;
+static u64 __read_mostly shadow_acc_track_saved_bits_shift;
 
 static void mmu_spte_set(u64 *sptep, u64 spte);
 static void mmu_free_roots(struct kvm_vcpu *vcpu);
@@ -187,6 +192,26 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
 
+void kvm_mmu_set_access_track_masks(u64 acc_track_mask, u64 acc_track_value,
+				    u64 saved_bits_mask, u64 saved_bits_shift)
+{
+	shadow_acc_track_mask = acc_track_mask;
+	shadow_acc_track_value = acc_track_value;
+	shadow_acc_track_saved_bits_mask = saved_bits_mask;
+	shadow_acc_track_saved_bits_shift = saved_bits_shift;
+
+	BUG_ON((~acc_track_mask & acc_track_value) != 0);
+	BUG_ON((~acc_track_mask & saved_bits_mask) != 0);
+	BUG_ON(shadow_accessed_mask != 0);
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_set_access_track_masks);
+
+static inline bool is_access_track_spte(u64 spte)
+{
+	return shadow_acc_track_mask != 0 &&
+	       (spte & shadow_acc_track_mask) == shadow_acc_track_value;
+}
+
 /*
  * the low bit of the generation number is always presumed to be zero.
  * This disables mmio caching during memslot updates.  The concept is
@@ -292,9 +317,25 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 	shadow_nx_mask = nx_mask;
 	shadow_x_mask = x_mask;
 	shadow_present_mask = p_mask;
+	BUG_ON(shadow_accessed_mask != 0 && shadow_acc_track_mask != 0);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
 
+void kvm_mmu_clear_all_pte_masks(void)
+{
+	shadow_user_mask = 0;
+	shadow_accessed_mask = 0;
+	shadow_dirty_mask = 0;
+	shadow_nx_mask = 0;
+	shadow_x_mask = 0;
+	shadow_mmio_mask = 0;
+	shadow_present_mask = 0;
+	shadow_acc_track_mask = 0;
+	shadow_acc_track_value = 0;
+	shadow_acc_track_saved_bits_mask = 0;
+	shadow_acc_track_saved_bits_shift = 0;
+}
+
 static int is_cpuid_PSE36(void)
 {
 	return 1;
@@ -307,7 +348,8 @@ static int is_nx(struct kvm_vcpu *vcpu)
 
 static int is_shadow_present_pte(u64 pte)
 {
-	return (pte & 0xFFFFFFFFull) && !is_mmio_spte(pte);
+	return ((pte & 0xFFFFFFFFull) && !is_mmio_spte(pte)) ||
+	       is_access_track_spte(pte);
 }
 
 static int is_large_pte(u64 pte)
@@ -490,6 +532,9 @@ static bool spte_has_volatile_bits(u64 spte)
 	if (spte_can_locklessly_be_made_writable(spte))
 		return true;
 
+	if (is_access_track_spte(spte))
+		return true;
+
 	if (!shadow_accessed_mask)
 		return false;
 
@@ -533,17 +578,21 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
  * will find a read-only spte, even though the writable spte
  * might be cached on a CPU's TLB, the return value indicates this
  * case.
+ *
+ * Returns true if the TLB needs to be flushed
  */
 static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 {
 	u64 old_spte = *sptep;
-	bool ret = false;
+	bool flush = false;
+	bool writable_cleared;
+	bool acc_track_enabled;
 
 	WARN_ON(!is_shadow_present_pte(new_spte));
 
 	if (!is_shadow_present_pte(old_spte)) {
 		mmu_spte_set(sptep, new_spte);
-		return ret;
+		return flush;
 	}
 
 	if (!spte_has_volatile_bits(old_spte))
@@ -551,24 +600,16 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	else
 		old_spte = __update_clear_spte_slow(sptep, new_spte);
 
+	BUG_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte));
+
 	/*
 	 * For the spte updated out of mmu-lock is safe, since
 	 * we always atomically update it, see the comments in
 	 * spte_has_volatile_bits().
 	 */
 	if (spte_can_locklessly_be_made_writable(old_spte) &&
-	      !is_writable_pte(new_spte))
-		ret = true;
-
-	if (!shadow_accessed_mask) {
-		/*
-		 * We don't set page dirty when dropping non-writable spte.
-		 * So do it now if the new spte is becoming non-writable.
-		 */
-		if (ret)
-			kvm_set_pfn_dirty(spte_to_pfn(old_spte));
-		return ret;
-	}
+	    !is_writable_pte(new_spte))
+		flush = true;
 
 	/*
 	 * Flush TLB when accessed/dirty bits are changed in the page tables,
@@ -576,20 +617,34 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	 */
 	if (spte_is_bit_changed(old_spte, new_spte,
                                 shadow_accessed_mask | shadow_dirty_mask))
-		ret = true;
+		flush = true;
 
-	if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask))
+	writable_cleared = is_writable_pte(old_spte) &&
+			   !is_writable_pte(new_spte);
+	acc_track_enabled = !is_access_track_spte(old_spte) &&
+			    is_access_track_spte(new_spte);
+
+	if (writable_cleared || acc_track_enabled)
+		flush = true;
+
+	if (shadow_accessed_mask ?
+	    spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask) :
+	    acc_track_enabled)
 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
-	if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
+
+	if (shadow_dirty_mask ?
+	    spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask) :
+	    writable_cleared)
 		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
 
-	return ret;
+	return flush;
 }
 
 /*
  * Rules for using mmu_spte_clear_track_bits:
  * It sets the sptep from present to nonpresent, and track the
  * state bits, it is used to clear the last level sptep.
+ * Returns non-zero if the PTE was previously valid.
  */
 static int mmu_spte_clear_track_bits(u64 *sptep)
 {
@@ -604,6 +659,13 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
 	if (!is_shadow_present_pte(old_spte))
 		return 0;
 
+	/*
+	 * For access tracking SPTEs, the pfn was already marked accessed/dirty
+	 * when the SPTE was marked for access tracking, so nothing to do here.
+	 */
+	if (is_access_track_spte(old_spte))
+		return 1;
+
 	pfn = spte_to_pfn(old_spte);
 
 	/*
@@ -618,6 +680,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
 	if (old_spte & (shadow_dirty_mask ? shadow_dirty_mask :
 					    PT_WRITABLE_MASK))
 		kvm_set_pfn_dirty(pfn);
+
 	return 1;
 }
 
@@ -636,6 +699,52 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
 	return __get_spte_lockless(sptep);
 }
 
+static u64 mark_spte_for_access_track(u64 spte)
+{
+	if (shadow_acc_track_mask == 0)
+		return spte;
+
+	/*
+	 * Verify that the write-protection that we do below will be fixable
+	 * via the fast page fault path. Currently, that is always the case, at
+	 * least when using EPT (which is when access tracking would be used).
+	 */
+	WARN_ONCE((spte & PT_WRITABLE_MASK) &&
+		  !spte_can_locklessly_be_made_writable(spte),
+		  "Writable SPTE is not locklessly dirty-trackable\n");
+
+	/*
+	 * Any PTE marked for access tracking should also be marked for dirty
+	 * tracking (by being non-writable)
+	 */
+	spte &= ~PT_WRITABLE_MASK;
+
+	spte &= ~(shadow_acc_track_saved_bits_mask <<
+		  shadow_acc_track_saved_bits_shift);
+	spte |= (spte & shadow_acc_track_saved_bits_mask) <<
+		shadow_acc_track_saved_bits_shift;
+	spte &= ~shadow_acc_track_mask;
+	spte |= shadow_acc_track_value;
+
+	return spte;
+}
+
+/* Returns true if the TLB needs to be flushed */
+static bool mmu_spte_enable_access_track(u64 *sptep)
+{
+	u64 spte = mmu_spte_get_lockless(sptep);
+
+	if (is_access_track_spte(spte))
+		return false;
+
+	/* Access tracking should not be enabled if CPU supports A/D bits */
+	BUG_ON(shadow_accessed_mask != 0);
+
+	spte = mark_spte_for_access_track(spte);
+
+	return mmu_spte_update(sptep, spte);
+}
+
 static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
 {
 	/*
@@ -1403,6 +1512,25 @@ static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	return kvm_zap_rmapp(kvm, rmap_head);
 }
 
+static int kvm_acc_track_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+			       struct kvm_memory_slot *slot, gfn_t gfn,
+			       int level, unsigned long data)
+{
+	u64 *sptep;
+	struct rmap_iterator iter;
+	int need_tlb_flush = 0;
+
+	for_each_rmap_spte(rmap_head, &iter, sptep) {
+
+		rmap_printk("kvm_acc_track_rmapp: spte %p %llx gfn %llx (%d)\n",
+			    sptep, *sptep, gfn, level);
+
+		need_tlb_flush |= mmu_spte_enable_access_track(sptep);
+	}
+
+	return need_tlb_flush;
+}
+
 static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 			     struct kvm_memory_slot *slot, gfn_t gfn, int level,
 			     unsigned long data)
@@ -1419,8 +1547,9 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 restart:
 	for_each_rmap_spte(rmap_head, &iter, sptep) {
+
 		rmap_printk("kvm_set_pte_rmapp: spte %p %llx gfn %llx (%d)\n",
-			     sptep, *sptep, gfn, level);
+			    sptep, *sptep, gfn, level);
 
 		need_flush = 1;
 
@@ -1435,6 +1564,8 @@ restart:
 			new_spte &= ~SPTE_HOST_WRITEABLE;
 			new_spte &= ~shadow_accessed_mask;
 
+			new_spte = mark_spte_for_access_track(new_spte);
+
 			mmu_spte_clear_track_bits(sptep);
 			mmu_spte_set(sptep, new_spte);
 		}
@@ -1615,24 +1746,14 @@ static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
-	int young = 0;
-
-	/*
-	 * If there's no access bit in the secondary pte set by the
-	 * hardware it's up to gup-fast/gup to set the access bit in
-	 * the primary pte or in the page structure.
-	 */
-	if (!shadow_accessed_mask)
-		goto out;
 
 	for_each_rmap_spte(rmap_head, &iter, sptep) {
-		if (*sptep & shadow_accessed_mask) {
-			young = 1;
-			break;
-		}
+		if ((*sptep & shadow_accessed_mask) ||
+		    (!shadow_accessed_mask && !is_access_track_spte(*sptep)))
+			return 1;
 	}
-out:
-	return young;
+
+	return 0;
 }
 
 #define RMAP_RECYCLE_THRESHOLD 1000
@@ -1669,7 +1790,9 @@ int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 		 */
 		kvm->mmu_notifier_seq++;
 		return kvm_handle_hva_range(kvm, start, end, 0,
-					    kvm_unmap_rmapp);
+					    shadow_acc_track_mask != 0
+					    ? kvm_acc_track_rmapp
+					    : kvm_unmap_rmapp);
 	}
 
 	return kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);
@@ -2591,6 +2714,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		spte |= shadow_dirty_mask;
 	}
 
+	if (speculative)
+		spte = mark_spte_for_access_track(spte);
+
 set_pte:
 	if (mmu_spte_update(sptep, spte))
 		kvm_flush_remote_tlbs(vcpu->kvm);
@@ -2644,7 +2770,7 @@ static bool mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned pte_access,
 	pgprintk("%s: setting spte %llx\n", __func__, *sptep);
 	pgprintk("instantiating %s PTE (%s) at %llx (%llx) addr %p\n",
 		 is_large_pte(*sptep)? "2MB" : "4kB",
-		 *sptep & PT_PRESENT_MASK ?"RW":"R", gfn,
+		 *sptep & PT_WRITABLE_MASK ? "RW" : "R", gfn,
 		 *sptep, sptep);
 	if (!was_rmapped && is_large_pte(*sptep))
 		++vcpu->kvm->stat.lpages;
@@ -2877,16 +3003,27 @@ static bool page_fault_can_be_fast(u32 error_code)
 	if (unlikely(error_code & PFERR_RSVD_MASK))
 		return false;
 
-	/*
-	 * #PF can be fast only if the shadow page table is present and it
-	 * is caused by write-protect, that means we just need change the
-	 * W bit of the spte which can be done out of mmu-lock.
-	 */
-	if (!(error_code & PFERR_PRESENT_MASK) ||
-	      !(error_code & PFERR_WRITE_MASK))
+	/* See if the page fault is due to an NX violation */
+	if (unlikely(((error_code & (PFERR_FETCH_MASK | PFERR_PRESENT_MASK))
+		      == (PFERR_FETCH_MASK | PFERR_PRESENT_MASK))))
 		return false;
 
-	return true;
+	/*
+	 * #PF can be fast if:
+	 * 1. The shadow page table entry is not present, which could mean that
+	 *    the fault is potentially caused by access tracking (if enabled).
+	 * 2. The shadow page table entry is present and the fault
+	 *    is caused by write-protect, that means we just need change the W
+	 *    bit of the spte which can be done out of mmu-lock.
+	 *
+	 * However, if Access Tracking is disabled, then the first condition
+	 * above cannot be handled by the fast path. So if access tracking is
+	 * disabled, we return true only if the second condition is met.
+	 */
+
+	return shadow_acc_track_mask != 0 ||
+	       ((error_code & (PFERR_WRITE_MASK | PFERR_PRESENT_MASK))
+		== (PFERR_WRITE_MASK | PFERR_PRESENT_MASK));
 }
 
 /*
@@ -2895,17 +3032,24 @@ static bool page_fault_can_be_fast(u32 error_code)
  */
 static bool
 fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
-			u64 *sptep, u64 spte)
+			u64 *sptep, u64 old_spte,
+			bool remove_write_prot, bool remove_acc_track)
 {
 	gfn_t gfn;
+	u64 new_spte = old_spte;
 
 	WARN_ON(!sp->role.direct);
 
-	/*
-	 * The gfn of direct spte is stable since it is calculated
-	 * by sp->gfn.
-	 */
-	gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
+	if (remove_acc_track) {
+		u64 saved_bits = old_spte & (shadow_acc_track_saved_bits_mask <<
+					     shadow_acc_track_saved_bits_shift);
+
+		new_spte &= ~shadow_acc_track_mask;
+		new_spte |= saved_bits >> shadow_acc_track_saved_bits_shift;
+	}
+
+	if (remove_write_prot)
+		new_spte |= PT_WRITABLE_MASK;
 
 	/*
 	 * Theoretically we could also set dirty bit (and flush TLB) here in
@@ -2919,10 +3063,17 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	 *
 	 * Compare with set_spte where instead shadow_dirty_mask is set.
 	 */
-	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) != spte)
+	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
 		return false;
 
-	kvm_vcpu_mark_page_dirty(vcpu, gfn);
+	if (remove_write_prot) {
+		/*
+		 * The gfn of direct spte is stable since it is
+		 * calculated by sp->gfn.
+		 */
+		gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
+		kvm_vcpu_mark_page_dirty(vcpu, gfn);
+	}
 
 	return true;
 }
@@ -2937,7 +3088,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 {
 	struct kvm_shadow_walk_iterator iterator;
 	struct kvm_mmu_page *sp;
-	bool ret = false;
+	bool fault_handled = false;
 	u64 spte = 0ull;
 	uint retry_count = 0;
 
@@ -2953,36 +3104,43 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 			break;
 
 	do {
-		/*
-		 * If the mapping has been changed, let the vcpu fault on the
-		 * same address again.
-		 */
-		if (!is_shadow_present_pte(spte)) {
-			ret = true;
-			break;
-		}
+		bool remove_write_prot = (error_code & PFERR_WRITE_MASK) &&
+					 !(spte & PT_WRITABLE_MASK);
+		bool remove_acc_track;
+		bool valid_exec_access = (error_code & PFERR_FETCH_MASK) &&
+					 (spte & shadow_x_mask);
 
 		sp = page_header(__pa(iterator.sptep));
 		if (!is_last_spte(spte, sp->role.level))
 			break;
 
 		/*
-		 * Check if it is a spurious fault caused by TLB lazily flushed.
+		 * Check whether the memory access that caused the fault would
+		 * still cause it if it were to be performed right now. If not,
+		 * then this is a spurious fault caused by TLB lazily flushed,
+		 * or some other CPU has already fixed the PTE after the
+		 * current CPU took the fault.
+		 *
+		 * If Write-Only mappings ever become supported, then the
+		 * condition below would need to be changed appropriately.
 		 *
 		 * Need not check the access of upper level table entries since
 		 * they are always ACC_ALL.
 		 */
-		if (is_writable_pte(spte)) {
-			ret = true;
+		if (((spte & PT_PRESENT_MASK) && !remove_write_prot) ||
+		    valid_exec_access) {
+			fault_handled = true;
 			break;
 		}
 
+		remove_acc_track = is_access_track_spte(spte);
+
 		/*
-		 * Currently, to simplify the code, only the spte
-		 * write-protected by dirty-log can be fast fixed.
+		 * Currently, to simplify the code, write-protection can be
+		 * removed in the fast path only if the SPTE was write-protected
+		 * for dirty-logging.
 		 */
-		if (!spte_can_locklessly_be_made_writable(spte))
-			break;
+		remove_write_prot &= spte_can_locklessly_be_made_writable(spte);
 
 		/*
 		 * Do not fix write-permission on the large spte since we only
@@ -2998,13 +3156,20 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 		if (sp->role.level > PT_PAGE_TABLE_LEVEL)
 			break;
 
+		/* Verify that the fault can be handled in the fast path */
+		if (!remove_acc_track && !remove_write_prot)
+			break;
+
 		/*
 		 * Currently, fast page fault only works for direct mapping
 		 * since the gfn is not stable for indirect shadow page. See
 		 * Documentation/virtual/kvm/locking.txt to get more detail.
 		 */
-		ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
-		if (ret)
+		fault_handled = fast_pf_fix_direct_spte(vcpu, sp,
+							iterator.sptep, spte,
+							remove_write_prot,
+							remove_acc_track);
+		if (fault_handled)
 			break;
 
 		if (++retry_count > 4) {
@@ -3018,10 +3183,10 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 	} while (true);
 
 	trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
-			      spte, ret);
+			      spte, fault_handled);
 	walk_shadow_page_lockless_end(vcpu);
 
-	return ret;
+	return fault_handled;
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
@@ -4300,6 +4465,7 @@ static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu,
 	vcpu->arch.mmu.update_pte(vcpu, sp, spte, new);
 }
 
+/* This is only supposed to be used for non-EPT mappings */
 static bool need_remote_flush(u64 old, u64 new)
 {
 	if (!is_shadow_present_pte(old))
@@ -5067,6 +5233,8 @@ static void mmu_destroy_caches(void)
 
 int kvm_mmu_module_init(void)
 {
+	kvm_mmu_clear_all_pte_masks();
+
 	pte_list_desc_cache = kmem_cache_create("pte_list_desc",
 					    sizeof(struct pte_list_desc),
 					    0, 0, NULL);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index ddc56e9..dfd3056 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -52,6 +52,8 @@ static inline u64 rsvd_bits(int s, int e)
 }
 
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
+void kvm_mmu_set_access_track_masks(u64 acc_track_mask, u64 acc_track_value,
+				    u64 saved_bits_mask, u64 saved_bits_shift);
 
 void
 reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 859da8e..9cbfc56 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5019,7 +5019,22 @@ static void ept_set_mmio_spte_mask(void)
 	 * Also, magic bits (0x3ull << 62) is set to quickly identify mmio
 	 * spte.
 	 */
-	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
+	kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE |
+				   VMX_EPT_TRACK_MMIO);
+}
+
+static void ept_set_acc_track_spte_mask(void)
+{
+	/*
+	 * For access track PTEs we use a non-present PTE to trigger an EPT
+	 * Violation. The original RWX value is saved in some unused bits in
+	 * the PTE and restored when the violation is fixed.
+	 */
+	kvm_mmu_set_access_track_masks(VMX_EPT_RWX_MASK |
+				       VMX_EPT_TRACK_TYPE_MASK,
+				       VMX_EPT_TRACK_ACCESS,
+				       VMX_EPT_RWX_MASK,
+				       VMX_EPT_RWX_SAVE_SHIFT);
 }
 
 #define VMX_XSS_EXIT_BITMAP 0
@@ -6549,6 +6564,9 @@ static __init int hardware_setup(void)
 				      0ull : VMX_EPT_READABLE_MASK);
 		ept_set_mmio_spte_mask();
 		kvm_enable_tdp();
+
+		if (!enable_ept_ad_bits)
+			ept_set_acc_track_spte_mask();
 	} else
 		kvm_disable_tdp();
 
-- 
2.8.0.rc3.226.g39d4020



* Re: [PATCH 4/4] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-10-27  2:19 ` [PATCH 4/4] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
@ 2016-11-02 18:01   ` Paolo Bonzini
  2016-11-02 21:42     ` Junaid Shahid
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2016-11-02 18:01 UTC (permalink / raw)
  To: Junaid Shahid, kvm; +Cc: andreslc, pfeiner, Xiao Guangrong



On 27/10/2016 04:19, Junaid Shahid wrote:
> This change implements lockless access tracking for Intel CPUs without EPT
> A bits. This is achieved by marking the PTEs as not-present (but not
> completely clearing them) when clear_flush_young() is called after marking
> the pages as accessed. When an EPT Violation is generated as a result of
> the VM accessing those pages, the PTEs are restored to their original values.
> 
> Signed-off-by: Junaid Shahid <junaids@google.com>

Can you please modify Documentation/virtual/kvm/locking.txt to document
this?  Also CCing Guangrong to get a review from the king. :)

Thanks,

Paolo


* Re: [PATCH 1/4] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications
  2016-10-27  2:19 ` [PATCH 1/4] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
@ 2016-11-02 18:03   ` Paolo Bonzini
  2016-11-02 21:40     ` Junaid Shahid
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2016-11-02 18:03 UTC (permalink / raw)
  To: Junaid Shahid, kvm; +Cc: andreslc, pfeiner



On 27/10/2016 04:19, Junaid Shahid wrote:
> This change adds some symbolic constants for VM Exit Qualifications
> related to EPT Violations and updates handle_ept_violation() to use
> these constants instead of hard-coded numbers.
> 
> Signed-off-by: Junaid Shahid <junaids@google.com>
> ---
>  arch/x86/include/asm/vmx.h | 16 ++++++++++++++++
>  arch/x86/kvm/vmx.c         | 20 ++++++++++++--------
>  2 files changed, 28 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index a002b07..60991fb 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -465,6 +465,22 @@ struct vmx_msr_entry {
>  #define ENTRY_FAIL_VMCS_LINK_PTR	4
>  
>  /*
> + * Exit Qualifications for EPT Violations
> + */
> +#define EPT_VIOLATION_READ_BIT		0
> +#define EPT_VIOLATION_WRITE_BIT		1
> +#define EPT_VIOLATION_INSTR_BIT		2
> +#define EPT_VIOLATION_READABLE_BIT	3
> +#define EPT_VIOLATION_WRITABLE_BIT	4
> +#define EPT_VIOLATION_EXECUTABLE_BIT	5
> +#define EPT_VIOLATION_READ		(1 << EPT_VIOLATION_READ_BIT)
> +#define EPT_VIOLATION_WRITE		(1 << EPT_VIOLATION_WRITE_BIT)
> +#define EPT_VIOLATION_INSTR		(1 << EPT_VIOLATION_INSTR_BIT)
> +#define EPT_VIOLATION_READABLE		(1 << EPT_VIOLATION_READABLE_BIT)
> +#define EPT_VIOLATION_WRITABLE		(1 << EPT_VIOLATION_WRITABLE_BIT)
> +#define EPT_VIOLATION_EXECUTABLE	(1 << EPT_VIOLATION_EXECUTABLE_BIT)
> +
> +/*
>   * VM-instruction error numbers
>   */
>  enum vm_instruction_error_number {
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index cf1b16d..859da8e 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6170,14 +6170,18 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
>  	trace_kvm_page_fault(gpa, exit_qualification);
>  
> -	/* it is a read fault? */
> -	error_code = (exit_qualification << 2) & PFERR_USER_MASK;
> -	/* it is a write fault? */
> -	error_code |= exit_qualification & PFERR_WRITE_MASK;
> -	/* It is a fetch fault? */
> -	error_code |= (exit_qualification << 2) & PFERR_FETCH_MASK;
> -	/* ept page table is present? */
> -	error_code |= (exit_qualification & 0x38) != 0;
> +	/* Is it a read fault? */
> +	error_code = ((exit_qualification >> EPT_VIOLATION_READ_BIT) & 1)
> +		     << PFERR_USER_BIT;
> +	/* Is it a write fault? */
> +	error_code |= ((exit_qualification >> EPT_VIOLATION_WRITE_BIT) & 1)
> +		      << PFERR_WRITE_BIT;
> +	/* Is it a fetch fault? */
> +	error_code |= ((exit_qualification >> EPT_VIOLATION_INSTR_BIT) & 1)
> +		      << PFERR_FETCH_BIT;
> +	/* ept page table entry is present? */
> +	error_code |= ((exit_qualification >> EPT_VIOLATION_READABLE_BIT) & 1)

This last line is not enough now that nested VMX supports execute-only
pages.

Paolo

> +		      << PFERR_PRESENT_BIT;
>  
>  	vcpu->arch.exit_qualification = exit_qualification;
>  
> 


* Re: [PATCH 1/4] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications
  2016-11-02 18:03   ` Paolo Bonzini
@ 2016-11-02 21:40     ` Junaid Shahid
  0 siblings, 0 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-11-02 21:40 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, andreslc, pfeiner

On Wednesday, November 02, 2016 07:03:45 PM Paolo Bonzini wrote:
> This last line is not enough now that nested VMX supports execute-only
> pages.

Yes, I missed that while rebasing the change. I’ll update it.

Thanks,
Junaid


* Re: [PATCH 4/4] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-11-02 18:01   ` Paolo Bonzini
@ 2016-11-02 21:42     ` Junaid Shahid
  0 siblings, 0 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-11-02 21:42 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, andreslc, pfeiner, Xiao Guangrong

On Wednesday, November 02, 2016 07:01:57 PM Paolo Bonzini wrote:
> Can you please modify Documentation/virtual/kvm/locking.txt to document
> this?

Sure, I’ll add another patch with updated documentation.

Thanks,
Junaid


* [PATCH v2 0/5] Lockless Access Tracking for Intel CPUs without EPT A bits
  2016-10-27  2:19 [PATCH 0/4] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
                   ` (3 preceding siblings ...)
  2016-10-27  2:19 ` [PATCH 4/4] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
@ 2016-11-08 23:00 ` Junaid Shahid
  2016-11-08 23:00   ` [PATCH v2 1/5] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
                     ` (4 more replies)
  2016-12-07  0:46 ` [PATCH v3 0/8] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
  5 siblings, 5 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-11-08 23:00 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

Changes from v1:
* Patch 1 correctly maps to the current codebase by setting the Present bit
  in the page fault error code if any of the Readable, Writable or Executable
  bits are set in the Exit Qualification (see the sketch after this list).
* Added Patch 5 to update Documentation/virtual/kvm/locking.txt
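
A small standalone check of the patch 1 fix, assuming only the bit layouts
below (not kernel code): for an execute-only mapping, the v1 computation would
have reported the entry as not present, while the v2 one reports it present:

#include <stdio.h>

#define EPT_VIOLATION_READABLE_BIT	3
#define EPT_VIOLATION_WRITABLE_BIT	4
#define EPT_VIOLATION_EXECUTABLE_BIT	5
#define PFERR_PRESENT_BIT		0

static unsigned int present_v1(unsigned long eq)
{
	return ((eq >> EPT_VIOLATION_READABLE_BIT) & 1) << PFERR_PRESENT_BIT;
}

static unsigned int present_v2(unsigned long eq)
{
	return (((eq >> EPT_VIOLATION_READABLE_BIT) |
		 (eq >> EPT_VIOLATION_WRITABLE_BIT) |
		 (eq >> EPT_VIOLATION_EXECUTABLE_BIT)) & 1)
	       << PFERR_PRESENT_BIT;
}

int main(void)
{
	/* Fault on an execute-only mapping: only the EXECUTABLE bit of the
	 * "permissions allowed" group is set in the exit qualification. */
	unsigned long eq = 1ul << EPT_VIOLATION_EXECUTABLE_BIT;

	printf("v1 present bit: %u, v2 present bit: %u\n",
	       present_v1(eq), present_v2(eq));	/* prints 0 and 1 */
	return 0;
}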

This patch series implements a lockless access tracking mechanism for KVM
when running on Intel CPUs that do not have EPT A/D bits. 

Currently, KVM tracks accesses on these machines by just clearing the PTEs
and then remapping them when they are accessed again. However, the remapping
requires acquiring the MMU lock in order to look up the information needed to
construct the PTE. On high core count VMs, this can result in significant MMU
lock contention when running some memory-intensive workloads.

This new mechanism just marks the PTEs as not-present, but keeps all the
information within the PTE instead of clearing it. When the page is accessed
again, the PTE can thus be restored without needing to acquire the MMU lock.

Junaid Shahid (5):
  kvm: x86: mmu: Use symbolic constants for EPT Violation Exit
    Qualifications
  kvm: x86: mmu: Rename spte_is_locklessly_modifiable()
  kvm: x86: mmu: Fast Page Fault path retries
  kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A
    bits.
  kvm: x86: mmu: Update documentation for fast page fault mechanism

 Documentation/virtual/kvm/locking.txt |  27 ++-
 arch/x86/include/asm/vmx.h            |  55 +++++
 arch/x86/kvm/mmu.c                    | 399 +++++++++++++++++++++++++---------
 arch/x86/kvm/mmu.h                    |   2 +
 arch/x86/kvm/vmx.c                    |  42 +++-
 5 files changed, 407 insertions(+), 118 deletions(-)

-- 
2.8.0.rc3.226.g39d4020



* [PATCH v2 1/5] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications
  2016-11-08 23:00 ` [PATCH v2 0/5] Lockless Access Tracking " Junaid Shahid
@ 2016-11-08 23:00   ` Junaid Shahid
  2016-11-21 13:06     ` Paolo Bonzini
  2016-11-08 23:00   ` [PATCH v2 2/5] kvm: x86: mmu: Rename spte_is_locklessly_modifiable() Junaid Shahid
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-11-08 23:00 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

This change adds some symbolic constants for VM Exit Qualifications
related to EPT Violations and updates handle_ept_violation() to use
these constants instead of hard-coded numbers.

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/include/asm/vmx.h | 16 ++++++++++++++++
 arch/x86/kvm/vmx.c         | 22 ++++++++++++++--------
 2 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index a002b07..60991fb 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -465,6 +465,22 @@ struct vmx_msr_entry {
 #define ENTRY_FAIL_VMCS_LINK_PTR	4
 
 /*
+ * Exit Qualifications for EPT Violations
+ */
+#define EPT_VIOLATION_READ_BIT		0
+#define EPT_VIOLATION_WRITE_BIT		1
+#define EPT_VIOLATION_INSTR_BIT		2
+#define EPT_VIOLATION_READABLE_BIT	3
+#define EPT_VIOLATION_WRITABLE_BIT	4
+#define EPT_VIOLATION_EXECUTABLE_BIT	5
+#define EPT_VIOLATION_READ		(1 << EPT_VIOLATION_READ_BIT)
+#define EPT_VIOLATION_WRITE		(1 << EPT_VIOLATION_WRITE_BIT)
+#define EPT_VIOLATION_INSTR		(1 << EPT_VIOLATION_INSTR_BIT)
+#define EPT_VIOLATION_READABLE		(1 << EPT_VIOLATION_READABLE_BIT)
+#define EPT_VIOLATION_WRITABLE		(1 << EPT_VIOLATION_WRITABLE_BIT)
+#define EPT_VIOLATION_EXECUTABLE	(1 << EPT_VIOLATION_EXECUTABLE_BIT)
+
+/*
  * VM-instruction error numbers
  */
 enum vm_instruction_error_number {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index cf1b16d..88e3b02 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6170,14 +6170,20 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
 	trace_kvm_page_fault(gpa, exit_qualification);
 
-	/* it is a read fault? */
-	error_code = (exit_qualification << 2) & PFERR_USER_MASK;
-	/* it is a write fault? */
-	error_code |= exit_qualification & PFERR_WRITE_MASK;
-	/* It is a fetch fault? */
-	error_code |= (exit_qualification << 2) & PFERR_FETCH_MASK;
-	/* ept page table is present? */
-	error_code |= (exit_qualification & 0x38) != 0;
+	/* Is it a read fault? */
+	error_code = ((exit_qualification >> EPT_VIOLATION_READ_BIT) & 1)
+		     << PFERR_USER_BIT;
+	/* Is it a write fault? */
+	error_code |= ((exit_qualification >> EPT_VIOLATION_WRITE_BIT) & 1)
+		      << PFERR_WRITE_BIT;
+	/* Is it a fetch fault? */
+	error_code |= ((exit_qualification >> EPT_VIOLATION_INSTR_BIT) & 1)
+		      << PFERR_FETCH_BIT;
+	/* ept page table entry is present? */
+	error_code |= (((exit_qualification >> EPT_VIOLATION_READABLE_BIT) |
+			(exit_qualification >> EPT_VIOLATION_WRITABLE_BIT) |
+			(exit_qualification >> EPT_VIOLATION_EXECUTABLE_BIT))
+		       & 1) << PFERR_PRESENT_BIT;
 
 	vcpu->arch.exit_qualification = exit_qualification;
 
-- 
2.8.0.rc3.226.g39d4020



* [PATCH v2 2/5] kvm: x86: mmu: Rename spte_is_locklessly_modifiable()
  2016-11-08 23:00 ` [PATCH v2 0/5] Lockless Access Tracking " Junaid Shahid
  2016-11-08 23:00   ` [PATCH v2 1/5] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
@ 2016-11-08 23:00   ` Junaid Shahid
  2016-11-21 13:07     ` Paolo Bonzini
  2016-11-08 23:00   ` [PATCH v2 3/5] kvm: x86: mmu: Fast Page Fault path retries Junaid Shahid
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-11-08 23:00 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

This change renames spte_is_locklessly_modifiable() to
spte_can_locklessly_be_made_writable() to distinguish it from other
forms of lockless modifications. The full set of lockless modifications
is covered by spte_has_volatile_bits().

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/kvm/mmu.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d9c7e98..e580134 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -473,7 +473,7 @@ retry:
 }
 #endif
 
-static bool spte_is_locklessly_modifiable(u64 spte)
+static bool spte_can_locklessly_be_made_writable(u64 spte)
 {
 	return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
 		(SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
@@ -487,7 +487,7 @@ static bool spte_has_volatile_bits(u64 spte)
 	 * also, it can help us to get a stable is_writable_pte()
 	 * to ensure tlb flush is not missed.
 	 */
-	if (spte_is_locklessly_modifiable(spte))
+	if (spte_can_locklessly_be_made_writable(spte))
 		return true;
 
 	if (!shadow_accessed_mask)
@@ -556,7 +556,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	 * we always atomically update it, see the comments in
 	 * spte_has_volatile_bits().
 	 */
-	if (spte_is_locklessly_modifiable(old_spte) &&
+	if (spte_can_locklessly_be_made_writable(old_spte) &&
 	      !is_writable_pte(new_spte))
 		ret = true;
 
@@ -1212,7 +1212,7 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
 	u64 spte = *sptep;
 
 	if (!is_writable_pte(spte) &&
-	      !(pt_protect && spte_is_locklessly_modifiable(spte)))
+	      !(pt_protect && spte_can_locklessly_be_made_writable(spte)))
 		return false;
 
 	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
@@ -2973,7 +2973,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 	 * Currently, to simplify the code, only the spte write-protected
 	 * by dirty-log can be fast fixed.
 	 */
-	if (!spte_is_locklessly_modifiable(spte))
+	if (!spte_can_locklessly_be_made_writable(spte))
 		goto exit;
 
 	/*
-- 
2.8.0.rc3.226.g39d4020



* [PATCH v2 3/5] kvm: x86: mmu: Fast Page Fault path retries
  2016-11-08 23:00 ` [PATCH v2 0/5] Lockless Access Tracking " Junaid Shahid
  2016-11-08 23:00   ` [PATCH v2 1/5] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
  2016-11-08 23:00   ` [PATCH v2 2/5] kvm: x86: mmu: Rename spte_is_locklessly_modifiable() Junaid Shahid
@ 2016-11-08 23:00   ` Junaid Shahid
  2016-11-21 13:13     ` Paolo Bonzini
  2016-11-08 23:00   ` [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
  2016-11-08 23:00   ` [PATCH v2 5/5] kvm: x86: mmu: Update documentation for fast page fault mechanism Junaid Shahid
  4 siblings, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-11-08 23:00 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

This change adds retries to the Fast Page Fault path. Without the retries
the code still works, but if a retry does end up being needed, it results
in a second page fault for the same memory access, which incurs much more
overhead than simply retrying within the original fault.

This would be especially useful with the upcoming fast access tracking
change, as that would make it more likely for retries to be needed
(e.g. due to read and write faults happening on different CPUs at
the same time).

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/kvm/mmu.c | 117 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 69 insertions(+), 48 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e580134..a22a8a2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2889,6 +2889,10 @@ static bool page_fault_can_be_fast(u32 error_code)
 	return true;
 }
 
+/*
+ * Returns true if the SPTE was fixed successfully. Otherwise,
+ * someone else modified the SPTE from its original value.
+ */
 static bool
 fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			u64 *sptep, u64 spte)
@@ -2915,8 +2919,10 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	 *
 	 * Compare with set_spte where instead shadow_dirty_mask is set.
 	 */
-	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
-		kvm_vcpu_mark_page_dirty(vcpu, gfn);
+	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) != spte)
+		return false;
+
+	kvm_vcpu_mark_page_dirty(vcpu, gfn);
 
 	return true;
 }
@@ -2933,6 +2939,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 	struct kvm_mmu_page *sp;
 	bool ret = false;
 	u64 spte = 0ull;
+	uint retry_count = 0;
 
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return false;
@@ -2945,57 +2952,71 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 		if (!is_shadow_present_pte(spte) || iterator.level < level)
 			break;
 
-	/*
-	 * If the mapping has been changed, let the vcpu fault on the
-	 * same address again.
-	 */
-	if (!is_shadow_present_pte(spte)) {
-		ret = true;
-		goto exit;
-	}
+	do {
+		/*
+		 * If the mapping has been changed, let the vcpu fault on the
+		 * same address again.
+		 */
+		if (!is_shadow_present_pte(spte)) {
+			ret = true;
+			break;
+		}
 
-	sp = page_header(__pa(iterator.sptep));
-	if (!is_last_spte(spte, sp->role.level))
-		goto exit;
+		sp = page_header(__pa(iterator.sptep));
+		if (!is_last_spte(spte, sp->role.level))
+			break;
 
-	/*
-	 * Check if it is a spurious fault caused by TLB lazily flushed.
-	 *
-	 * Need not check the access of upper level table entries since
-	 * they are always ACC_ALL.
-	 */
-	 if (is_writable_pte(spte)) {
-		ret = true;
-		goto exit;
-	}
+		/*
+		 * Check if it is a spurious fault caused by TLB lazily flushed.
+		 *
+		 * Need not check the access of upper level table entries since
+		 * they are always ACC_ALL.
+		 */
+		if (is_writable_pte(spte)) {
+			ret = true;
+			break;
+		}
 
-	/*
-	 * Currently, to simplify the code, only the spte write-protected
-	 * by dirty-log can be fast fixed.
-	 */
-	if (!spte_can_locklessly_be_made_writable(spte))
-		goto exit;
+		/*
+		 * Currently, to simplify the code, only the spte
+		 * write-protected by dirty-log can be fast fixed.
+		 */
+		if (!spte_can_locklessly_be_made_writable(spte))
+			break;
 
-	/*
-	 * Do not fix write-permission on the large spte since we only dirty
-	 * the first page into the dirty-bitmap in fast_pf_fix_direct_spte()
-	 * that means other pages are missed if its slot is dirty-logged.
-	 *
-	 * Instead, we let the slow page fault path create a normal spte to
-	 * fix the access.
-	 *
-	 * See the comments in kvm_arch_commit_memory_region().
-	 */
-	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
-		goto exit;
+		/*
+		 * Do not fix write-permission on the large spte since we only
+		 * dirty the first page into the dirty-bitmap in
+		 * fast_pf_fix_direct_spte() that means other pages are missed
+		 * if its slot is dirty-logged.
+		 *
+		 * Instead, we let the slow page fault path create a normal spte
+		 * to fix the access.
+		 *
+		 * See the comments in kvm_arch_commit_memory_region().
+		 */
+		if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+			break;
+
+		/*
+		 * Currently, fast page fault only works for direct mapping
+		 * since the gfn is not stable for indirect shadow page. See
+		 * Documentation/virtual/kvm/locking.txt to get more detail.
+		 */
+		ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
+		if (ret)
+			break;
+
+		if (++retry_count > 4) {
+			printk_once(KERN_WARNING
+				    "Fast #PF retrying more than 4 times.\n");
+			break;
+		}
+
+		spte = mmu_spte_get_lockless(iterator.sptep);
+
+	} while (true);
 
-	/*
-	 * Currently, fast page fault only works for direct mapping since
-	 * the gfn is not stable for indirect shadow page.
-	 * See Documentation/virtual/kvm/locking.txt to get more detail.
-	 */
-	ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
-exit:
 	trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
 			      spte, ret);
 	walk_shadow_page_lockless_end(vcpu);
-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-11-08 23:00 ` [PATCH v2 0/5] Lockless Access Tracking " Junaid Shahid
                     ` (2 preceding siblings ...)
  2016-11-08 23:00   ` [PATCH v2 3/5] kvm: x86: mmu: Fast Page Fault path retries Junaid Shahid
@ 2016-11-08 23:00   ` Junaid Shahid
  2016-11-21 14:42     ` Paolo Bonzini
  2016-11-08 23:00   ` [PATCH v2 5/5] kvm: x86: mmu: Update documentation for fast page fault mechanism Junaid Shahid
  4 siblings, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-11-08 23:00 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

This change implements lockless access tracking for Intel CPUs without EPT
A bits. This is achieved by marking the PTEs as not-present (but not
completely clearing them) when clear_flush_young() is called after marking
the pages as accessed. When an EPT Violation is generated as a result of
the VM accessing those pages, the PTEs are restored to their original values.
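
As a rough sketch (illustration only; RWX_MASK, SAVED_BITS_SHIFT and
ACC_TRACK_VALUE are simplified stand-ins for the masks defined below), the
PTE transformation amounts to:

	/* Mark for access tracking: stash RWX into ignored high bits. */
	spte &= ~(RWX_MASK << SAVED_BITS_SHIFT);
	spte |= (spte & RWX_MASK) << SAVED_BITS_SHIFT;
	spte &= ~RWX_MASK;
	spte |= ACC_TRACK_VALUE;

	/* Restore on the resulting EPT Violation (fast page fault path). */
	spte &= ~ACC_TRACK_VALUE;
	spte |= (spte >> SAVED_BITS_SHIFT) & RWX_MASK;

The actual code additionally clears the writable bit before saving, so that
the restored PTE stays write-protected for dirty tracking; see
mark_spte_for_access_track() and fast_pf_fix_direct_spte() below.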

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/include/asm/vmx.h |  39 ++++++
 arch/x86/kvm/mmu.c         | 314 ++++++++++++++++++++++++++++++++++-----------
 arch/x86/kvm/mmu.h         |   2 +
 arch/x86/kvm/vmx.c         |  20 ++-
 4 files changed, 301 insertions(+), 74 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 60991fb..3d63098 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -434,6 +434,45 @@ enum vmcs_field {
 #define VMX_EPT_IPAT_BIT    			(1ull << 6)
 #define VMX_EPT_ACCESS_BIT				(1ull << 8)
 #define VMX_EPT_DIRTY_BIT				(1ull << 9)
+#define VMX_EPT_RWX_MASK                        (VMX_EPT_READABLE_MASK |       \
+						 VMX_EPT_WRITABLE_MASK |       \
+						 VMX_EPT_EXECUTABLE_MASK)
+#define VMX_EPT_MT_MASK				(7ull << VMX_EPT_MT_EPTE_SHIFT)
+
+/* The mask to use to trigger an EPT Misconfiguration in order to track MMIO */
+#define VMX_EPT_MISCONFIG_WX_VALUE		(VMX_EPT_WRITABLE_MASK |       \
+						 VMX_EPT_EXECUTABLE_MASK)
+
+/*
+ * The shift to use for saving the original RWX value when marking the PTE as
+ * not-present for tracking purposes.
+ */
+#define VMX_EPT_RWX_SAVE_SHIFT			52
+
+/*
+ * The shift/mask for determining the type of tracking (if any) being used for a
+ * not-present PTE. Currently, only two bits are used, but more can be added.
+ *
+ * NOTE: Bit 63 is an architecturally ignored bit (and hence can be used for our
+ *       purpose) when the EPT PTE is in a misconfigured state. However, it is
+ *       not necessarily an ignored bit otherwise (even in a not-present state).
+ *       Since the existing MMIO code already uses this bit and since KVM
+ *       doesn't use #VEs currently (where this bit comes into play), we can
+ *       continue to use it for storing the type. But to be on the safe side,
+ *       we should not set it to 1 in those TRACK_TYPEs where the tracking is
+ *       done via EPT Violations instead of EPT Misconfigurations.
+ */
+#define VMX_EPT_TRACK_TYPE_SHIFT		62
+#define VMX_EPT_TRACK_TYPE_MASK			(3ull <<                       \
+						 VMX_EPT_TRACK_TYPE_SHIFT)
+
+/* Sets only bit 62 as the tracking is done by EPT Violations. See note above */
+#define VMX_EPT_TRACK_ACCESS			(1ull <<                       \
+						 VMX_EPT_TRACK_TYPE_SHIFT)
+/* Sets bits 62 and 63. See note above */
+#define VMX_EPT_TRACK_MMIO			(3ull <<                       \
+						 VMX_EPT_TRACK_TYPE_SHIFT)
+
 
 #define VMX_EPT_IDENTITY_PAGETABLE_ADDR		0xfffbc000ul
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a22a8a2..8ea1618 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -37,6 +37,7 @@
 #include <linux/srcu.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
+#include <linux/kern_levels.h>
 
 #include <asm/page.h>
 #include <asm/cmpxchg.h>
@@ -177,6 +178,10 @@ static u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
 static u64 __read_mostly shadow_mmio_mask;
 static u64 __read_mostly shadow_present_mask;
+static u64 __read_mostly shadow_acc_track_mask;
+static u64 __read_mostly shadow_acc_track_value;
+static u64 __read_mostly shadow_acc_track_saved_bits_mask;
+static u64 __read_mostly shadow_acc_track_saved_bits_shift;
 
 static void mmu_spte_set(u64 *sptep, u64 spte);
 static void mmu_free_roots(struct kvm_vcpu *vcpu);
@@ -187,6 +192,26 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
 
+void kvm_mmu_set_access_track_masks(u64 acc_track_mask, u64 acc_track_value,
+				    u64 saved_bits_mask, u64 saved_bits_shift)
+{
+	shadow_acc_track_mask = acc_track_mask;
+	shadow_acc_track_value = acc_track_value;
+	shadow_acc_track_saved_bits_mask = saved_bits_mask;
+	shadow_acc_track_saved_bits_shift = saved_bits_shift;
+
+	BUG_ON((~acc_track_mask & acc_track_value) != 0);
+	BUG_ON((~acc_track_mask & saved_bits_mask) != 0);
+	BUG_ON(shadow_accessed_mask != 0);
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_set_access_track_masks);
+
+static inline bool is_access_track_spte(u64 spte)
+{
+	return shadow_acc_track_mask != 0 &&
+	       (spte & shadow_acc_track_mask) == shadow_acc_track_value;
+}
+
 /*
  * the low bit of the generation number is always presumed to be zero.
  * This disables mmio caching during memslot updates.  The concept is
@@ -292,9 +317,25 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 	shadow_nx_mask = nx_mask;
 	shadow_x_mask = x_mask;
 	shadow_present_mask = p_mask;
+	BUG_ON(shadow_accessed_mask != 0 && shadow_acc_track_mask != 0);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
 
+void kvm_mmu_clear_all_pte_masks(void)
+{
+	shadow_user_mask = 0;
+	shadow_accessed_mask = 0;
+	shadow_dirty_mask = 0;
+	shadow_nx_mask = 0;
+	shadow_x_mask = 0;
+	shadow_mmio_mask = 0;
+	shadow_present_mask = 0;
+	shadow_acc_track_mask = 0;
+	shadow_acc_track_value = 0;
+	shadow_acc_track_saved_bits_mask = 0;
+	shadow_acc_track_saved_bits_shift = 0;
+}
+
 static int is_cpuid_PSE36(void)
 {
 	return 1;
@@ -307,7 +348,8 @@ static int is_nx(struct kvm_vcpu *vcpu)
 
 static int is_shadow_present_pte(u64 pte)
 {
-	return (pte & 0xFFFFFFFFull) && !is_mmio_spte(pte);
+	return ((pte & 0xFFFFFFFFull) && !is_mmio_spte(pte)) ||
+	       is_access_track_spte(pte);
 }
 
 static int is_large_pte(u64 pte)
@@ -490,6 +532,9 @@ static bool spte_has_volatile_bits(u64 spte)
 	if (spte_can_locklessly_be_made_writable(spte))
 		return true;
 
+	if (is_access_track_spte(spte))
+		return true;
+
 	if (!shadow_accessed_mask)
 		return false;
 
@@ -533,17 +578,21 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
  * will find a read-only spte, even though the writable spte
  * might be cached on a CPU's TLB, the return value indicates this
  * case.
+ *
+ * Returns true if the TLB needs to be flushed
  */
 static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 {
 	u64 old_spte = *sptep;
-	bool ret = false;
+	bool flush = false;
+	bool writable_cleared;
+	bool acc_track_enabled;
 
 	WARN_ON(!is_shadow_present_pte(new_spte));
 
 	if (!is_shadow_present_pte(old_spte)) {
 		mmu_spte_set(sptep, new_spte);
-		return ret;
+		return flush;
 	}
 
 	if (!spte_has_volatile_bits(old_spte))
@@ -551,24 +600,16 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	else
 		old_spte = __update_clear_spte_slow(sptep, new_spte);
 
+	BUG_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte));
+
 	/*
 	 * For the spte updated out of mmu-lock is safe, since
 	 * we always atomically update it, see the comments in
 	 * spte_has_volatile_bits().
 	 */
 	if (spte_can_locklessly_be_made_writable(old_spte) &&
-	      !is_writable_pte(new_spte))
-		ret = true;
-
-	if (!shadow_accessed_mask) {
-		/*
-		 * We don't set page dirty when dropping non-writable spte.
-		 * So do it now if the new spte is becoming non-writable.
-		 */
-		if (ret)
-			kvm_set_pfn_dirty(spte_to_pfn(old_spte));
-		return ret;
-	}
+	    !is_writable_pte(new_spte))
+		flush = true;
 
 	/*
 	 * Flush TLB when accessed/dirty bits are changed in the page tables,
@@ -576,20 +617,34 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	 */
 	if (spte_is_bit_changed(old_spte, new_spte,
                                 shadow_accessed_mask | shadow_dirty_mask))
-		ret = true;
+		flush = true;
 
-	if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask))
+	writable_cleared = is_writable_pte(old_spte) &&
+			   !is_writable_pte(new_spte);
+	acc_track_enabled = !is_access_track_spte(old_spte) &&
+			    is_access_track_spte(new_spte);
+
+	if (writable_cleared || acc_track_enabled)
+		flush = true;
+
+	if (shadow_accessed_mask ?
+	    spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask) :
+	    acc_track_enabled)
 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
-	if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
+
+	if (shadow_dirty_mask ?
+	    spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask) :
+	    writable_cleared)
 		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
 
-	return ret;
+	return flush;
 }
 
 /*
  * Rules for using mmu_spte_clear_track_bits:
  * It sets the sptep from present to nonpresent, and track the
  * state bits, it is used to clear the last level sptep.
+ * Returns non-zero if the PTE was previously valid.
  */
 static int mmu_spte_clear_track_bits(u64 *sptep)
 {
@@ -604,6 +659,13 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
 	if (!is_shadow_present_pte(old_spte))
 		return 0;
 
+	/*
+	 * For access tracking SPTEs, the pfn was already marked accessed/dirty
+	 * when the SPTE was marked for access tracking, so nothing to do here.
+	 */
+	if (is_access_track_spte(old_spte))
+		return 1;
+
 	pfn = spte_to_pfn(old_spte);
 
 	/*
@@ -618,6 +680,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
 	if (old_spte & (shadow_dirty_mask ? shadow_dirty_mask :
 					    PT_WRITABLE_MASK))
 		kvm_set_pfn_dirty(pfn);
+
 	return 1;
 }
 
@@ -636,6 +699,52 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
 	return __get_spte_lockless(sptep);
 }
 
+static u64 mark_spte_for_access_track(u64 spte)
+{
+	if (shadow_acc_track_mask == 0)
+		return spte;
+
+	/*
+	 * Verify that the write-protection that we do below will be fixable
+	 * via the fast page fault path. Currently, that is always the case, at
+	 * least when using EPT (which is when access tracking would be used).
+	 */
+	WARN_ONCE((spte & PT_WRITABLE_MASK) &&
+		  !spte_can_locklessly_be_made_writable(spte),
+		  "Writable SPTE is not locklessly dirty-trackable\n");
+
+	/*
+	 * Any PTE marked for access tracking should also be marked for dirty
+	 * tracking (by being non-writable)
+	 */
+	spte &= ~PT_WRITABLE_MASK;
+
+	spte &= ~(shadow_acc_track_saved_bits_mask <<
+		  shadow_acc_track_saved_bits_shift);
+	spte |= (spte & shadow_acc_track_saved_bits_mask) <<
+		shadow_acc_track_saved_bits_shift;
+	spte &= ~shadow_acc_track_mask;
+	spte |= shadow_acc_track_value;
+
+	return spte;
+}
+
+/* Returns true if the TLB needs to be flushed */
+static bool mmu_spte_enable_access_track(u64 *sptep)
+{
+	u64 spte = mmu_spte_get_lockless(sptep);
+
+	if (is_access_track_spte(spte))
+		return false;
+
+	/* Access tracking should not be enabled if CPU supports A/D bits */
+	BUG_ON(shadow_accessed_mask != 0);
+
+	spte = mark_spte_for_access_track(spte);
+
+	return mmu_spte_update(sptep, spte);
+}
+
 static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
 {
 	/*
@@ -1403,6 +1512,25 @@ static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	return kvm_zap_rmapp(kvm, rmap_head);
 }
 
+static int kvm_acc_track_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+			       struct kvm_memory_slot *slot, gfn_t gfn,
+			       int level, unsigned long data)
+{
+	u64 *sptep;
+	struct rmap_iterator iter;
+	int need_tlb_flush = 0;
+
+	for_each_rmap_spte(rmap_head, &iter, sptep) {
+
+		rmap_printk("kvm_acc_track_rmapp: spte %p %llx gfn %llx (%d)\n",
+			    sptep, *sptep, gfn, level);
+
+		need_tlb_flush |= mmu_spte_enable_access_track(sptep);
+	}
+
+	return need_tlb_flush;
+}
+
 static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 			     struct kvm_memory_slot *slot, gfn_t gfn, int level,
 			     unsigned long data)
@@ -1419,8 +1547,9 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 restart:
 	for_each_rmap_spte(rmap_head, &iter, sptep) {
+
 		rmap_printk("kvm_set_pte_rmapp: spte %p %llx gfn %llx (%d)\n",
-			     sptep, *sptep, gfn, level);
+			    sptep, *sptep, gfn, level);
 
 		need_flush = 1;
 
@@ -1435,6 +1564,8 @@ restart:
 			new_spte &= ~SPTE_HOST_WRITEABLE;
 			new_spte &= ~shadow_accessed_mask;
 
+			new_spte = mark_spte_for_access_track(new_spte);
+
 			mmu_spte_clear_track_bits(sptep);
 			mmu_spte_set(sptep, new_spte);
 		}
@@ -1615,24 +1746,14 @@ static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
-	int young = 0;
-
-	/*
-	 * If there's no access bit in the secondary pte set by the
-	 * hardware it's up to gup-fast/gup to set the access bit in
-	 * the primary pte or in the page structure.
-	 */
-	if (!shadow_accessed_mask)
-		goto out;
 
 	for_each_rmap_spte(rmap_head, &iter, sptep) {
-		if (*sptep & shadow_accessed_mask) {
-			young = 1;
-			break;
-		}
+		if ((*sptep & shadow_accessed_mask) ||
+		    (!shadow_accessed_mask && !is_access_track_spte(*sptep)))
+			return 1;
 	}
-out:
-	return young;
+
+	return 0;
 }
 
 #define RMAP_RECYCLE_THRESHOLD 1000
@@ -1669,7 +1790,9 @@ int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 		 */
 		kvm->mmu_notifier_seq++;
 		return kvm_handle_hva_range(kvm, start, end, 0,
-					    kvm_unmap_rmapp);
+					    shadow_acc_track_mask != 0
+					    ? kvm_acc_track_rmapp
+					    : kvm_unmap_rmapp);
 	}
 
 	return kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);
@@ -2591,6 +2714,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		spte |= shadow_dirty_mask;
 	}
 
+	if (speculative)
+		spte = mark_spte_for_access_track(spte);
+
 set_pte:
 	if (mmu_spte_update(sptep, spte))
 		kvm_flush_remote_tlbs(vcpu->kvm);
@@ -2644,7 +2770,7 @@ static bool mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned pte_access,
 	pgprintk("%s: setting spte %llx\n", __func__, *sptep);
 	pgprintk("instantiating %s PTE (%s) at %llx (%llx) addr %p\n",
 		 is_large_pte(*sptep)? "2MB" : "4kB",
-		 *sptep & PT_PRESENT_MASK ?"RW":"R", gfn,
+		 *sptep & PT_WRITABLE_MASK ? "RW" : "R", gfn,
 		 *sptep, sptep);
 	if (!was_rmapped && is_large_pte(*sptep))
 		++vcpu->kvm->stat.lpages;
@@ -2877,16 +3003,27 @@ static bool page_fault_can_be_fast(u32 error_code)
 	if (unlikely(error_code & PFERR_RSVD_MASK))
 		return false;
 
-	/*
-	 * #PF can be fast only if the shadow page table is present and it
-	 * is caused by write-protect, that means we just need change the
-	 * W bit of the spte which can be done out of mmu-lock.
-	 */
-	if (!(error_code & PFERR_PRESENT_MASK) ||
-	      !(error_code & PFERR_WRITE_MASK))
+	/* See if the page fault is due to an NX violation */
+	if (unlikely(((error_code & (PFERR_FETCH_MASK | PFERR_PRESENT_MASK))
+		      == (PFERR_FETCH_MASK | PFERR_PRESENT_MASK))))
 		return false;
 
-	return true;
+	/*
+	 * #PF can be fast if:
+	 * 1. The shadow page table entry is not present, which could mean that
+	 *    the fault is potentially caused by access tracking (if enabled).
+	 * 2. The shadow page table entry is present and the fault
+	 *    is caused by write-protect, that means we just need to change the W
+	 *    bit of the spte which can be done out of mmu-lock.
+	 *
+	 * However, if Access Tracking is disabled, then the first condition
+	 * above cannot be handled by the fast path. So if access tracking is
+	 * disabled, we return true only if the second condition is met.
+	 */
+
+	return shadow_acc_track_mask != 0 ||
+	       ((error_code & (PFERR_WRITE_MASK | PFERR_PRESENT_MASK))
+		== (PFERR_WRITE_MASK | PFERR_PRESENT_MASK));
 }
 
 /*
@@ -2895,17 +3032,24 @@ static bool page_fault_can_be_fast(u32 error_code)
  */
 static bool
 fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
-			u64 *sptep, u64 spte)
+			u64 *sptep, u64 old_spte,
+			bool remove_write_prot, bool remove_acc_track)
 {
 	gfn_t gfn;
+	u64 new_spte = old_spte;
 
 	WARN_ON(!sp->role.direct);
 
-	/*
-	 * The gfn of direct spte is stable since it is calculated
-	 * by sp->gfn.
-	 */
-	gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
+	if (remove_acc_track) {
+		u64 saved_bits = old_spte & (shadow_acc_track_saved_bits_mask <<
+					     shadow_acc_track_saved_bits_shift);
+
+		new_spte &= ~shadow_acc_track_mask;
+		new_spte |= saved_bits >> shadow_acc_track_saved_bits_shift;
+	}
+
+	if (remove_write_prot)
+		new_spte |= PT_WRITABLE_MASK;
 
 	/*
 	 * Theoretically we could also set dirty bit (and flush TLB) here in
@@ -2919,10 +3063,17 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	 *
 	 * Compare with set_spte where instead shadow_dirty_mask is set.
 	 */
-	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) != spte)
+	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
 		return false;
 
-	kvm_vcpu_mark_page_dirty(vcpu, gfn);
+	if (remove_write_prot) {
+		/*
+		 * The gfn of direct spte is stable since it is
+		 * calculated by sp->gfn.
+		 */
+		gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
+		kvm_vcpu_mark_page_dirty(vcpu, gfn);
+	}
 
 	return true;
 }
@@ -2937,7 +3088,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 {
 	struct kvm_shadow_walk_iterator iterator;
 	struct kvm_mmu_page *sp;
-	bool ret = false;
+	bool fault_handled = false;
 	u64 spte = 0ull;
 	uint retry_count = 0;
 
@@ -2953,36 +3104,43 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 			break;
 
 	do {
-		/*
-		 * If the mapping has been changed, let the vcpu fault on the
-		 * same address again.
-		 */
-		if (!is_shadow_present_pte(spte)) {
-			ret = true;
-			break;
-		}
+		bool remove_write_prot = (error_code & PFERR_WRITE_MASK) &&
+					 !(spte & PT_WRITABLE_MASK);
+		bool remove_acc_track;
+		bool valid_exec_access = (error_code & PFERR_FETCH_MASK) &&
+					 (spte & shadow_x_mask);
 
 		sp = page_header(__pa(iterator.sptep));
 		if (!is_last_spte(spte, sp->role.level))
 			break;
 
 		/*
-		 * Check if it is a spurious fault caused by TLB lazily flushed.
+		 * Check whether the memory access that caused the fault would
+		 * still cause it if it were to be performed right now. If not,
+		 * then this is a spurious fault caused by TLB lazily flushed,
+		 * or some other CPU has already fixed the PTE after the
+		 * current CPU took the fault.
+		 *
+		 * If Write-Only mappings ever become supported, then the
+		 * condition below would need to be changed appropriately.
 		 *
 		 * Need not check the access of upper level table entries since
 		 * they are always ACC_ALL.
 		 */
-		if (is_writable_pte(spte)) {
-			ret = true;
+		if (((spte & PT_PRESENT_MASK) && !remove_write_prot) ||
+		    valid_exec_access) {
+			fault_handled = true;
 			break;
 		}
 
+		remove_acc_track = is_access_track_spte(spte);
+
 		/*
-		 * Currently, to simplify the code, only the spte
-		 * write-protected by dirty-log can be fast fixed.
+		 * Currently, to simplify the code, write-protection can be
+		 * removed in the fast path only if the SPTE was write-protected
+		 * for dirty-logging.
 		 */
-		if (!spte_can_locklessly_be_made_writable(spte))
-			break;
+		remove_write_prot &= spte_can_locklessly_be_made_writable(spte);
 
 		/*
 		 * Do not fix write-permission on the large spte since we only
@@ -2998,13 +3156,20 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 		if (sp->role.level > PT_PAGE_TABLE_LEVEL)
 			break;
 
+		/* Verify that the fault can be handled in the fast path */
+		if (!remove_acc_track && !remove_write_prot)
+			break;
+
 		/*
 		 * Currently, fast page fault only works for direct mapping
 		 * since the gfn is not stable for indirect shadow page. See
 		 * Documentation/virtual/kvm/locking.txt to get more detail.
 		 */
-		ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
-		if (ret)
+		fault_handled = fast_pf_fix_direct_spte(vcpu, sp,
+							iterator.sptep, spte,
+							remove_write_prot,
+							remove_acc_track);
+		if (fault_handled)
 			break;
 
 		if (++retry_count > 4) {
@@ -3018,10 +3183,10 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 	} while (true);
 
 	trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
-			      spte, ret);
+			      spte, fault_handled);
 	walk_shadow_page_lockless_end(vcpu);
 
-	return ret;
+	return fault_handled;
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
@@ -4300,6 +4465,7 @@ static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu,
 	vcpu->arch.mmu.update_pte(vcpu, sp, spte, new);
 }
 
+/* This is only supposed to be used for non-EPT mappings */
 static bool need_remote_flush(u64 old, u64 new)
 {
 	if (!is_shadow_present_pte(old))
@@ -5067,6 +5233,8 @@ static void mmu_destroy_caches(void)
 
 int kvm_mmu_module_init(void)
 {
+	kvm_mmu_clear_all_pte_masks();
+
 	pte_list_desc_cache = kmem_cache_create("pte_list_desc",
 					    sizeof(struct pte_list_desc),
 					    0, 0, NULL);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index ddc56e9..dfd3056 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -52,6 +52,8 @@ static inline u64 rsvd_bits(int s, int e)
 }
 
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
+void kvm_mmu_set_access_track_masks(u64 acc_track_mask, u64 acc_track_value,
+				    u64 saved_bits_mask, u64 saved_bits_shift);
 
 void
 reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 88e3b02..363517e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5019,7 +5019,22 @@ static void ept_set_mmio_spte_mask(void)
 	 * Also, magic bits (0x3ull << 62) is set to quickly identify mmio
 	 * spte.
 	 */
-	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
+	kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE |
+				   VMX_EPT_TRACK_MMIO);
+}
+
+static void ept_set_acc_track_spte_mask(void)
+{
+	/*
+	 * For access track PTEs we use a non-present PTE to trigger an EPT
+	 * Violation. The original RWX value is saved in some unused bits in
+	 * the PTE and restored when the violation is fixed.
+	 */
+	kvm_mmu_set_access_track_masks(VMX_EPT_RWX_MASK |
+				       VMX_EPT_TRACK_TYPE_MASK,
+				       VMX_EPT_TRACK_ACCESS,
+				       VMX_EPT_RWX_MASK,
+				       VMX_EPT_RWX_SAVE_SHIFT);
 }
 
 #define VMX_XSS_EXIT_BITMAP 0
@@ -6551,6 +6566,9 @@ static __init int hardware_setup(void)
 				      0ull : VMX_EPT_READABLE_MASK);
 		ept_set_mmio_spte_mask();
 		kvm_enable_tdp();
+
+		if (!enable_ept_ad_bits)
+			ept_set_acc_track_spte_mask();
 	} else
 		kvm_disable_tdp();
 
-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v2 5/5] kvm: x86: mmu: Update documentation for fast page fault mechanism
  2016-11-08 23:00 ` [PATCH v2 0/5] Lockless Access Tracking " Junaid Shahid
                     ` (3 preceding siblings ...)
  2016-11-08 23:00   ` [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
@ 2016-11-08 23:00   ` Junaid Shahid
  4 siblings, 0 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-11-08 23:00 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

Add a brief description of the lockless access tracking mechanism
to the documentation of fast page faults in locking.txt.

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 Documentation/virtual/kvm/locking.txt | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt
index f2491a8..e7a1f7c 100644
--- a/Documentation/virtual/kvm/locking.txt
+++ b/Documentation/virtual/kvm/locking.txt
@@ -12,9 +12,16 @@ KVM Lock Overview
 Fast page fault:
 
 Fast page fault is the fast path which fixes the guest page fault out of
-the mmu-lock on x86. Currently, the page fault can be fast only if the
-shadow page table is present and it is caused by write-protect, that means
-we just need change the W bit of the spte.
+the mmu-lock on x86. Currently, the page fault can be fast in one of the
+following two cases:
+
+1. Access Tracking: The SPTE is not present, but it is marked for access
+tracking, i.e. the VMX_EPT_TRACK_ACCESS mask is set. That means we need to
+restore the saved RWX bits. This is described in more detail below.
+
+2. Write-Protection: The SPTE is present and the fault is
+caused by write-protect. That means we just need to change the W bit of the 
+spte.
 
 What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and
 SPTE_MMU_WRITEABLE bit on the spte:
@@ -24,7 +31,8 @@ SPTE_MMU_WRITEABLE bit on the spte:
   page write-protection.
 
 On fast page fault path, we will use cmpxchg to atomically set the spte W
-bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, this
+bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or 
+restore the saved RWX bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
 is safe because whenever changing these bits can be detected by cmpxchg.
 
 But we need carefully check these cases:
@@ -128,6 +136,17 @@ Since the spte is "volatile" if it can be updated out of mmu-lock, we always
 atomically update the spte, the race caused by fast page fault can be avoided,
 See the comments in spte_has_volatile_bits() and mmu_spte_update().
 
+Lockless Access Tracking:
+
+This is used for Intel CPUs that are using EPT but do not support the EPT A/D
+bits. In this case, when the KVM MMU notifier is called to track accesses to a
+page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
+by clearing the RWX bits in the PTE and storing the original bits in some
+unused/ignored bits. In addition, the VMX_EPT_TRACK_ACCESS mask is also set on
+the PTE (also using unused/ignored bits). When the VM tries to access the page
+later on, a fault is generated and the fast page fault mechanism described
+above is used to atomically restore the PTE to its original state.
+
 3. Reference
 ------------
 
-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 1/5] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications
  2016-11-08 23:00   ` [PATCH v2 1/5] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
@ 2016-11-21 13:06     ` Paolo Bonzini
  0 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2016-11-21 13:06 UTC (permalink / raw)
  To: Junaid Shahid, kvm; +Cc: andreslc, pfeiner, guangrong.xiao



On 09/11/2016 00:00, Junaid Shahid wrote:
> This change adds some symbolic constants for VM Exit Qualifications
> related to EPT Violations and updates handle_ept_violation() to use
> these constants instead of hard-coded numbers.
> 
> Signed-off-by: Junaid Shahid <junaids@google.com>
> ---
>  arch/x86/include/asm/vmx.h | 16 ++++++++++++++++
>  arch/x86/kvm/vmx.c         | 22 ++++++++++++++--------
>  2 files changed, 30 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index a002b07..60991fb 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -465,6 +465,22 @@ struct vmx_msr_entry {
>  #define ENTRY_FAIL_VMCS_LINK_PTR	4
>  
>  /*
> + * Exit Qualifications for EPT Violations
> + */
> +#define EPT_VIOLATION_READ_BIT		0
> +#define EPT_VIOLATION_WRITE_BIT		1
> +#define EPT_VIOLATION_INSTR_BIT		2
> +#define EPT_VIOLATION_READABLE_BIT	3
> +#define EPT_VIOLATION_WRITABLE_BIT	4
> +#define EPT_VIOLATION_EXECUTABLE_BIT	5
> +#define EPT_VIOLATION_READ		(1 << EPT_VIOLATION_READ_BIT)
> +#define EPT_VIOLATION_WRITE		(1 << EPT_VIOLATION_WRITE_BIT)
> +#define EPT_VIOLATION_INSTR		(1 << EPT_VIOLATION_INSTR_BIT)
> +#define EPT_VIOLATION_READABLE		(1 << EPT_VIOLATION_READABLE_BIT)
> +#define EPT_VIOLATION_WRITABLE		(1 << EPT_VIOLATION_WRITABLE_BIT)
> +#define EPT_VIOLATION_EXECUTABLE	(1 << EPT_VIOLATION_EXECUTABLE_BIT)
> +
> +/*
>   * VM-instruction error numbers
>   */
>  enum vm_instruction_error_number {
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index cf1b16d..88e3b02 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6170,14 +6170,20 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
>  	trace_kvm_page_fault(gpa, exit_qualification);
>  
> -	/* it is a read fault? */
> -	error_code = (exit_qualification << 2) & PFERR_USER_MASK;
> -	/* it is a write fault? */
> -	error_code |= exit_qualification & PFERR_WRITE_MASK;
> -	/* It is a fetch fault? */
> -	error_code |= (exit_qualification << 2) & PFERR_FETCH_MASK;
> -	/* ept page table is present? */
> -	error_code |= (exit_qualification & 0x38) != 0;
> +	/* Is it a read fault? */
> +	error_code = ((exit_qualification >> EPT_VIOLATION_READ_BIT) & 1)
> +		     << PFERR_USER_BIT;
> +	/* Is it a write fault? */
> +	error_code |= ((exit_qualification >> EPT_VIOLATION_WRITE_BIT) & 1)
> +		      << PFERR_WRITE_BIT;
> +	/* Is it a fetch fault? */
> +	error_code |= ((exit_qualification >> EPT_VIOLATION_INSTR_BIT) & 1)
> +		      << PFERR_FETCH_BIT;
> +	/* ept page table entry is present? */
> +	error_code |= (((exit_qualification >> EPT_VIOLATION_READABLE_BIT) |
> +			(exit_qualification >> EPT_VIOLATION_WRITABLE_BIT) |
> +			(exit_qualification >> EPT_VIOLATION_EXECUTABLE_BIT))
> +		       & 1) << PFERR_PRESENT_BIT;

Please don't change the shape of the condition unnecessarily.

error_code |=
	(exit_qualification &
	 (EPT_VIOLATION_READABLE | EPT_VIOLATION_WRITABLE |
	  EPT_VIOLATION_EXECUTABLE)) ? PFERR_PRESENT_MASK : 0;

The same form, with the ternary operator, is usable also for the other
cases.  GCC generates slightly worse code:

.LFB3:
	.cfi_startproc
	movl	%edi, %eax
	shrl	$2, %eax
	andl	$4, %eax
	ret
	.cfi_endproc

.LFB7:
	.cfi_startproc
	movl	%edi, %eax
	movl	$4, %edx
	andl	$16, %eax
	cmovne	%edx, %eax
	ret

but clang gets it right and it can be fixed in the compiler.

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 2/5] kvm: x86: mmu: Rename spte_is_locklessly_modifiable()
  2016-11-08 23:00   ` [PATCH v2 2/5] kvm: x86: mmu: Rename spte_is_locklessly_modifiable() Junaid Shahid
@ 2016-11-21 13:07     ` Paolo Bonzini
  0 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2016-11-21 13:07 UTC (permalink / raw)
  To: Junaid Shahid, kvm; +Cc: andreslc, pfeiner, guangrong.xiao



On 09/11/2016 00:00, Junaid Shahid wrote:
> This change renames spte_is_locklessly_modifiable() to
> spte_can_locklessly_be_made_writable() to distinguish it from other
> forms of lockless modifications. The full set of lockless modifications
> is covered by spte_has_volatile_bits().
> 
> Signed-off-by: Junaid Shahid <junaids@google.com>
> ---
>  arch/x86/kvm/mmu.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index d9c7e98..e580134 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -473,7 +473,7 @@ retry:
>  }
>  #endif
>  
> -static bool spte_is_locklessly_modifiable(u64 spte)
> +static bool spte_can_locklessly_be_made_writable(u64 spte)
>  {
>  	return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
>  		(SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
> @@ -487,7 +487,7 @@ static bool spte_has_volatile_bits(u64 spte)
>  	 * also, it can help us to get a stable is_writable_pte()
>  	 * to ensure tlb flush is not missed.
>  	 */
> -	if (spte_is_locklessly_modifiable(spte))
> +	if (spte_can_locklessly_be_made_writable(spte))
>  		return true;
>  
>  	if (!shadow_accessed_mask)
> @@ -556,7 +556,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>  	 * we always atomically update it, see the comments in
>  	 * spte_has_volatile_bits().
>  	 */
> -	if (spte_is_locklessly_modifiable(old_spte) &&
> +	if (spte_can_locklessly_be_made_writable(old_spte) &&
>  	      !is_writable_pte(new_spte))
>  		ret = true;
>  
> @@ -1212,7 +1212,7 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
>  	u64 spte = *sptep;
>  
>  	if (!is_writable_pte(spte) &&
> -	      !(pt_protect && spte_is_locklessly_modifiable(spte)))
> +	      !(pt_protect && spte_can_locklessly_be_made_writable(spte)))
>  		return false;
>  
>  	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
> @@ -2973,7 +2973,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  	 * Currently, to simplify the code, only the spte write-protected
>  	 * by dirty-log can be fast fixed.
>  	 */
> -	if (!spte_is_locklessly_modifiable(spte))
> +	if (!spte_can_locklessly_be_made_writable(spte))
>  		goto exit;
>  
>  	/*
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 3/5] kvm: x86: mmu: Fast Page Fault path retries
  2016-11-08 23:00   ` [PATCH v2 3/5] kvm: x86: mmu: Fast Page Fault path retries Junaid Shahid
@ 2016-11-21 13:13     ` Paolo Bonzini
  0 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2016-11-21 13:13 UTC (permalink / raw)
  To: Junaid Shahid, kvm; +Cc: andreslc, pfeiner, guangrong.xiao



On 09/11/2016 00:00, Junaid Shahid wrote:
> +
> +		if (++retry_count > 4) {
> +			printk_once(KERN_WARNING
> +				    "Fast #PF retrying more than 4 times.\n");

This needs to say "kvm:", but it can be fixed when committing.

Paolo

> +			break;
> +		}
> +

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-11-08 23:00   ` [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
@ 2016-11-21 14:42     ` Paolo Bonzini
  2016-11-24  3:50       ` Junaid Shahid
  2016-12-01 22:54       ` Junaid Shahid
  0 siblings, 2 replies; 56+ messages in thread
From: Paolo Bonzini @ 2016-11-21 14:42 UTC (permalink / raw)
  To: Junaid Shahid, kvm; +Cc: andreslc, pfeiner, guangrong.xiao



On 09/11/2016 00:00, Junaid Shahid wrote:
> This change implements lockless access tracking for Intel CPUs without EPT
> A bits. This is achieved by marking the PTEs as not-present (but not
> completely clearing them) when clear_flush_young() is called after marking
> the pages as accessed. When an EPT Violation is generated as a result of
> the VM accessing those pages, the PTEs are restored to their original values.
> 
> Signed-off-by: Junaid Shahid <junaids@google.com>
> ---
>  arch/x86/include/asm/vmx.h |  39 ++++++
>  arch/x86/kvm/mmu.c         | 314 ++++++++++++++++++++++++++++++++++-----------
>  arch/x86/kvm/mmu.h         |   2 +
>  arch/x86/kvm/vmx.c         |  20 ++-
>  4 files changed, 301 insertions(+), 74 deletions(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 60991fb..3d63098 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -434,6 +434,45 @@ enum vmcs_field {
>  #define VMX_EPT_IPAT_BIT    			(1ull << 6)
>  #define VMX_EPT_ACCESS_BIT				(1ull << 8)
>  #define VMX_EPT_DIRTY_BIT				(1ull << 9)
> +#define VMX_EPT_RWX_MASK                        (VMX_EPT_READABLE_MASK |       \
> +						 VMX_EPT_WRITABLE_MASK |       \
> +						 VMX_EPT_EXECUTABLE_MASK)
> +#define VMX_EPT_MT_MASK				(7ull << VMX_EPT_MT_EPTE_SHIFT)
> +
> +/* The mask to use to trigger an EPT Misconfiguration in order to track MMIO */
> +#define VMX_EPT_MISCONFIG_WX_VALUE		(VMX_EPT_WRITABLE_MASK |       \
> +						 VMX_EPT_EXECUTABLE_MASK)
> +

So far so good.

> + * The shift to use for saving the original RWX value when marking the PTE as
> + * not-present for tracking purposes.
> + */
> +#define VMX_EPT_RWX_SAVE_SHIFT			52
> +
> +/*
> + * The shift/mask for determining the type of tracking (if any) being used for a
> + * not-present PTE. Currently, only two bits are used, but more can be added.
> + *
> + * NOTE: Bit 63 is an architecturally ignored bit (and hence can be used for our
> + *       purpose) when the EPT PTE is in a misconfigured state. However, it is
> + *       not necessarily an ignored bit otherwise (even in a not-present state).
> + *       Since the existing MMIO code already uses this bit and since KVM
> + *       doesn't use #VEs currently (where this bit comes into play), so we can
> + *       continue to use it for storing the type. But to be on the safe side,
> + *       we should not set it to 1 in those TRACK_TYPEs where the tracking is
> + *       done via EPT Violations instead of EPT Misconfigurations.
> + */
> +#define VMX_EPT_TRACK_TYPE_SHIFT		62
> +#define VMX_EPT_TRACK_TYPE_MASK			(3ull <<                       \
> +						 VMX_EPT_TRACK_TYPE_SHIFT)
> +
> +/* Sets only bit 62 as the tracking is done by EPT Violations. See note above */
> +#define VMX_EPT_TRACK_ACCESS			(1ull <<                       \
> +						 VMX_EPT_TRACK_TYPE_SHIFT)
> +/* Sets bits 62 and 63. See note above */
> +#define VMX_EPT_TRACK_MMIO			(3ull <<                       \
> +						 VMX_EPT_TRACK_TYPE_SHIFT)
> +

I think this is overengineered.  Let's just say bit 62 in the SPTE 
denotes "special SPTE" violations, of both the access tracking and MMIO 
kinds; no need to use bit 63 as well.  Let's call it SPTE_SPECIAL_MASK 
and, first of all, make KVM use it for the fast MMIO case.

Second, is_shadow_present_pte can just be

	return (pte != 0) && !is_mmio_spte(pte);

Third, we can add this on top, but let's keep the cases simple.  You can 
fix:

- shadow_acc_track_saved_bits_mask to 7 (for XWR in the EPT case, but it would work just as well for UWP for "traditional" page tables)

- shadow_acc_track_saved_bits_shift to PT64_SECOND_AVAIL_BITS_SHIFT

- shadow_acc_track_value to SPTE_SPECIAL_MASK aka bit 62

In the end, all that needs customization is shadow_acc_track_mask.  
That one is 0 if fast access tracking is not used, and 
VMX_EPT_RWX_MASK|SPTE_SPECIAL_MASK for ept && !eptad.  So you can add a 
single argument to kvm_mmu_set_mask_ptes.
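
Roughly, the suggested shape would be (sketch only, not actual code):

	/* fixed, no longer configured by the vendor module: */
	shadow_acc_track_value            = SPTE_SPECIAL_MASK;
	shadow_acc_track_saved_bits_mask  = 7ull;	/* XWR (or UWP) */
	shadow_acc_track_saved_bits_shift = PT64_SECOND_AVAIL_BITS_SHIFT;

	/* the only knob, passed as a new kvm_mmu_set_mask_ptes() argument: */
	shadow_acc_track_mask = enable_ept && !enable_ept_ad_bits
				? VMX_EPT_RWX_MASK | SPTE_SPECIAL_MASK : 0ull;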

> @@ -490,6 +532,9 @@ static bool spte_has_volatile_bits(u64 spte)
>  	if (spte_can_locklessly_be_made_writable(spte))
>  		return true;
>  
> +	if (is_access_track_spte(spte))
> +		return true;
> +
>  	if (!shadow_accessed_mask)
>  		return false;

I think it's worth rewriting the whole function:

	if (spte_is_locklessly_modifiable(spte))
		return true;

	if (!is_shadow_present_pte(spte))
		return false;

	if (shadow_accessed_mask)
		return ((spte & shadow_accessed_mask) == 0 ||
			(is_writable_pte(spte) &&
			 (spte & shadow_dirty_mask) == 0))
        else
		return is_access_track_spte(spte);

> @@ -533,17 +578,21 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
>   * will find a read-only spte, even though the writable spte
>   * might be cached on a CPU's TLB, the return value indicates this
>   * case.
> + *
> + * Returns true if the TLB needs to be flushed
>   */
>  static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>  {
>  	u64 old_spte = *sptep;
> -	bool ret = false;
> +	bool flush = false;
> +	bool writable_cleared;
> +	bool acc_track_enabled;
>  
>  	WARN_ON(!is_shadow_present_pte(new_spte));
>  
>  	if (!is_shadow_present_pte(old_spte)) {
>  		mmu_spte_set(sptep, new_spte);
> -		return ret;
> +		return flush;
>  	}
>  
>  	if (!spte_has_volatile_bits(old_spte))
> @@ -551,24 +600,16 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>  	else
>  		old_spte = __update_clear_spte_slow(sptep, new_spte);
>  
> +	BUG_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte));

WARN_ON, please.

>  	/*
>  	 * For the spte updated out of mmu-lock is safe, since
>  	 * we always atomically update it, see the comments in
>  	 * spte_has_volatile_bits().
>  	 */
>  	if (spte_can_locklessly_be_made_writable(old_spte) &&
> -	      !is_writable_pte(new_spte))
> -		ret = true;
> -
> -	if (!shadow_accessed_mask) {
> -		/*
> -		 * We don't set page dirty when dropping non-writable spte.
> -		 * So do it now if the new spte is becoming non-writable.
> -		 */
> -		if (ret)
> -			kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> -		return ret;
> -	}
> +	    !is_writable_pte(new_spte))
> +		flush = true;
>  
>  	/*
>  	 * Flush TLB when accessed/dirty bits are changed in the page tables,
> @@ -576,20 +617,34 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>  	 */
>  	if (spte_is_bit_changed(old_spte, new_spte,
>                                  shadow_accessed_mask | shadow_dirty_mask))
> -		ret = true;
> +		flush = true;
>  
> -	if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask))
> +	writable_cleared = is_writable_pte(old_spte) &&
> +			   !is_writable_pte(new_spte);
> +	acc_track_enabled = !is_access_track_spte(old_spte) &&
> +			    is_access_track_spte(new_spte);
> +
> +	if (writable_cleared || acc_track_enabled)
> +		flush = true;
> +
> +	if (shadow_accessed_mask ?
> +	    spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask) :
> +	    acc_track_enabled)

Please introduce a new function spte_is_access_tracking_enabled(u64 
old_spte, u64 new_spte) and use it here:

	if (shadow_accessed_mask ?
	    spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask) :
	    spte_is_access_tracking_enabled(old_spte, new_spte)) {
		flush |= !shadow_accessed_mask;
		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
	}
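
where the helper would just be (sketch):

static bool spte_is_access_tracking_enabled(u64 old_spte, u64 new_spte)
{
	return !is_access_track_spte(old_spte) &&
	       is_access_track_spte(new_spte);
}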


>  		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> -	if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
> +
> +	if (shadow_dirty_mask ?
> +	    spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask) :
> +	    writable_cleared)

writable_cleared can be inline here and written as

	spte_is_bit_cleared(old_spte, new_spte, PT_WRITABLE_MASK)

so

	if (shadow_dirty_mask ?
	    spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask) :
	    spte_is_bit_cleared(old_spte, new_spte, PT_WRITABLE_MASK)) {
		flush |= !shadow_dirty_mask;
		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
	}
>  		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
>  
> -	return ret;
> +	return flush;
>  }

But please anticipate the changes to this function, except for 
introducing is_access_track_spte of course, to a separate function.

>  /*
>   * Rules for using mmu_spte_clear_track_bits:
>   * It sets the sptep from present to nonpresent, and track the
>   * state bits, it is used to clear the last level sptep.
> + * Returns non-zero if the PTE was previously valid.
>   */
>  static int mmu_spte_clear_track_bits(u64 *sptep)
>  {
> @@ -604,6 +659,13 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
>  	if (!is_shadow_present_pte(old_spte))
>  		return 0;
>  
> +	/*
> +	 * For access tracking SPTEs, the pfn was already marked accessed/dirty
> +	 * when the SPTE was marked for access tracking, so nothing to do here.
> +	 */
> +	if (is_access_track_spte(old_spte))
> +		return 1;

This should go after the "WARN_ON", since that's a valuable check.  In 
addition, I think it's a good idea to keep similar idioms between 
mmu_spte_update and mmu_spte_clear_track_bits, like this:

	if (shadow_accessed_mask
	    ? old_spte & shadow_accessed_mask
	    : !is_access_track_spte(old_spte))
		kvm_set_pfn_accessed(pfn);
        if (shadow_dirty_mask
	    ? old_spte & shadow_dirty_mask
	    : old_spte & PT_WRITABLE_MASK)
                kvm_set_pfn_dirty(pfn);

	return 1;

or (you pick)

	if (shadow_accessed_mask
	    ? !(old_spte & shadow_accessed_mask)
	    : is_access_track_spte(old_spte))
		return 1;

	kvm_set_pfn_accessed(pfn);
        if (shadow_dirty_mask
	    ? old_spte & shadow_dirty_mask
	    : old_spte & PT_WRITABLE_MASK)
                kvm_set_pfn_dirty(pfn);

	return 1;

>  	pfn = spte_to_pfn(old_spte);
>  
>  	/*
> @@ -618,6 +680,7 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
>  	if (old_spte & (shadow_dirty_mask ? shadow_dirty_mask :
>  					    PT_WRITABLE_MASK))
>  		kvm_set_pfn_dirty(pfn);
> +
>  	return 1;
>  }
>  
> @@ -636,6 +699,52 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
>  	return __get_spte_lockless(sptep);
>  }
>  
> +static u64 mark_spte_for_access_track(u64 spte)
> +{
> +	if (shadow_acc_track_mask == 0)
> +		return spte;

Should this return spte & ~shadow_accessed_mask if shadow_accessed_mask 
is nonzero?  See for example:

 			new_spte &= ~shadow_accessed_mask;
 
+			new_spte = mark_spte_for_access_track(new_spte);

> +	/*
> +	 * Verify that the write-protection that we do below will be fixable
> +	 * via the fast page fault path. Currently, that is always the case, at
> +	 * least when using EPT (which is when access tracking would be used).
> +	 */
> +	WARN_ONCE((spte & PT_WRITABLE_MASK) &&
> +		  !spte_can_locklessly_be_made_writable(spte),
> +		  "Writable SPTE is not locklessly dirty-trackable\n");
> +
> +	/*
> +	 * Any PTE marked for access tracking should also be marked for dirty
> +	 * tracking (by being non-writable)
> +	 */
> +	spte &= ~PT_WRITABLE_MASK;

This should be implicit in the definition of 
shadow_acc_track_mask/value, so it's not necessary (or it can be a 
WARN_ONCE).

> +	spte &= ~(shadow_acc_track_saved_bits_mask <<
> +		  shadow_acc_track_saved_bits_shift);

Should there be a WARN_ON if these bits are not zero?

> +	spte |= (spte & shadow_acc_track_saved_bits_mask) <<
> +		shadow_acc_track_saved_bits_shift;
> +	spte &= ~shadow_acc_track_mask;
> +	spte |= shadow_acc_track_value;
> +
> +	return spte;
> +}
> +
> +/* Returns true if the TLB needs to be flushed */
> +static bool mmu_spte_enable_access_track(u64 *sptep)
> +{
> +	u64 spte = mmu_spte_get_lockless(sptep);
> +
> +	if (is_access_track_spte(spte))
> +		return false;
> +
> +	/* Access tracking should not be enabled if CPU supports A/D bits */
> +	BUG_ON(shadow_accessed_mask != 0);

WARN_ONCE please.  However, I think mmu_spte_enable_access_track should 
be renamed to mmu_spte_age and handle the shadow_accessed_mask case as

	clear_bit((ffs(shadow_accessed_mask) - 1),
		  (unsigned long *)sptep);

similar to kvm_age_rmapp.  See below about kvm_age_rmapp.
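
i.e. roughly (sketch):

static bool mmu_spte_age(u64 *sptep)
{
	u64 spte = mmu_spte_get_lockless(sptep);

	if (shadow_accessed_mask) {
		if (!(spte & shadow_accessed_mask))
			return false;

		clear_bit((ffs(shadow_accessed_mask) - 1),
			  (unsigned long *)sptep);
		return true;
	}

	if (is_access_track_spte(spte))
		return false;

	spte = mark_spte_for_access_track(spte);
	return mmu_spte_update(sptep, spte);
}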


> +	spte = mark_spte_for_access_track(spte);
> +
> +	return mmu_spte_update(sptep, spte);
> +}
> +
>  static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
>  {
>  	/*
> @@ -1403,6 +1512,25 @@ static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>  	return kvm_zap_rmapp(kvm, rmap_head);
>  }
>  
> +static int kvm_acc_track_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
> +			       struct kvm_memory_slot *slot, gfn_t gfn,
> +			       int level, unsigned long data)
> +{
> +	u64 *sptep;
> +	struct rmap_iterator iter;
> +	int need_tlb_flush = 0;
> +
> +	for_each_rmap_spte(rmap_head, &iter, sptep) {
> +

Unnecessary blank line---but see below about kvm_acc_track_rmapp.

> +		rmap_printk("kvm_acc_track_rmapp: spte %p %llx gfn %llx (%d)\n",
> +			    sptep, *sptep, gfn, level);
> +
> +		need_tlb_flush |= mmu_spte_enable_access_track(sptep);
> +	}
> +
> +	return need_tlb_flush;
> +}
> +
>  static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>  			     struct kvm_memory_slot *slot, gfn_t gfn, int level,
>  			     unsigned long data)
> @@ -1419,8 +1547,9 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>  
>  restart:
>  	for_each_rmap_spte(rmap_head, &iter, sptep) {
> +

Unnecessary blank line.

>  		rmap_printk("kvm_set_pte_rmapp: spte %p %llx gfn %llx (%d)\n",
> -			     sptep, *sptep, gfn, level);
> +			    sptep, *sptep, gfn, level);
>  
>  		need_flush = 1;
>  
> @@ -1435,6 +1564,8 @@ restart:
>  			new_spte &= ~SPTE_HOST_WRITEABLE;
>  			new_spte &= ~shadow_accessed_mask;
>  
> +			new_spte = mark_spte_for_access_track(new_spte);
> +
>  			mmu_spte_clear_track_bits(sptep);
>  			mmu_spte_set(sptep, new_spte);
>  		}
> @@ -1615,24 +1746,14 @@ static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>  {
>  	u64 *sptep;
>  	struct rmap_iterator iter;
> -	int young = 0;
> -
> -	/*
> -	 * If there's no access bit in the secondary pte set by the
> -	 * hardware it's up to gup-fast/gup to set the access bit in
> -	 * the primary pte or in the page structure.
> -	 */
> -	if (!shadow_accessed_mask)
> -		goto out;
>  
>  	for_each_rmap_spte(rmap_head, &iter, sptep) {
> -		if (*sptep & shadow_accessed_mask) {
> -			young = 1;
> -			break;
> -		}
> +		if ((*sptep & shadow_accessed_mask) ||
> +		    (!shadow_accessed_mask && !is_access_track_spte(*sptep)))

This can also be written, like before, as

	if (shadow_accessed_mask
	    ? *sptep & shadow_accessed_mask
	    : !is_access_track_spte(*sptep))

Introducing a new helper is_accessed_spte starts to look like a good idea!
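
Something like the following, presumably (sketch):

static bool is_accessed_spte(u64 spte)
{
	return shadow_accessed_mask ? (spte & shadow_accessed_mask) != 0
				    : !is_access_track_spte(spte);
}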

> +			return 1;
>  	}
> -out:
> -	return young;
> +
> +	return 0;
>  }
>  
>  #define RMAP_RECYCLE_THRESHOLD 1000
> @@ -1669,7 +1790,9 @@ int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
>  		 */
>  		kvm->mmu_notifier_seq++;
>  		return kvm_handle_hva_range(kvm, start, end, 0,
> -					    kvm_unmap_rmapp);
> +					    shadow_acc_track_mask != 0
> +					    ? kvm_acc_track_rmapp
> +					    : kvm_unmap_rmapp);

Please rewrite kvm_age_rmapp to use the new mmu_spte_age instead, and test

	if (shadow_accessed_mask || shadow_acc_track_mask)

in kvm_age_hva.
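
i.e. something along these lines (sketch; the existing tracepoint call is
omitted):

static int kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
			 struct kvm_memory_slot *slot, gfn_t gfn, int level,
			 unsigned long data)
{
	u64 *sptep;
	struct rmap_iterator iter;
	int young = 0;

	for_each_rmap_spte(rmap_head, &iter, sptep)
		young |= mmu_spte_age(sptep);

	return young;
}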


> @@ -2877,16 +3003,27 @@ static bool page_fault_can_be_fast(u32 error_code)
>  	if (unlikely(error_code & PFERR_RSVD_MASK))
>  		return false;
>  
> -	/*
> -	 * #PF can be fast only if the shadow page table is present and it
> -	 * is caused by write-protect, that means we just need change the
> -	 * W bit of the spte which can be done out of mmu-lock.
> -	 */
> -	if (!(error_code & PFERR_PRESENT_MASK) ||
> -	      !(error_code & PFERR_WRITE_MASK))
> +	/* See if the page fault is due to an NX violation */
> +	if (unlikely(((error_code & (PFERR_FETCH_MASK | PFERR_PRESENT_MASK))
> +		      == (PFERR_FETCH_MASK | PFERR_PRESENT_MASK))))
>  		return false;
>  
> -	return true;
> +	/*
> +	 * #PF can be fast if:
> +	 * 1. The shadow page table entry is not present, which could mean that
> +	 *    the fault is potentially caused by access tracking (if enabled).
> +	 * 2. The shadow page table entry is present and the fault
> +	 *    is caused by write-protect, that means we just need to change the W
> +	 *    bit of the spte which can be done out of mmu-lock.
> +	 *
> +	 * However, if Access Tracking is disabled, then the first condition
> +	 * above cannot be handled by the fast path. So if access tracking is
> +	 * disabled, we return true only if the second condition is met.

Better:

However, if access tracking is disabled, we know that a non-present page 
must be a genuine page fault where we have to create a new SPTE.  So, 
if access tracking is disabled, we return true only for write accesses 
to a present page.

> +	 */
> +
> +	return shadow_acc_track_mask != 0 ||
> +	       ((error_code & (PFERR_WRITE_MASK | PFERR_PRESENT_MASK))
> +		== (PFERR_WRITE_MASK | PFERR_PRESENT_MASK));
>  }
>  
>  /*
> @@ -2895,17 +3032,24 @@ static bool page_fault_can_be_fast(u32 error_code)
>   */
>  static bool
>  fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> -			u64 *sptep, u64 spte)
> +			u64 *sptep, u64 old_spte,
> +			bool remove_write_prot, bool remove_acc_track)
>  {
>  	gfn_t gfn;
> +	u64 new_spte = old_spte;
>  
>  	WARN_ON(!sp->role.direct);
>  
> -	/*
> -	 * The gfn of direct spte is stable since it is calculated
> -	 * by sp->gfn.
> -	 */
> -	gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> +	if (remove_acc_track) {
> +		u64 saved_bits = old_spte & (shadow_acc_track_saved_bits_mask <<
> +					     shadow_acc_track_saved_bits_shift);

You can shift right old_spte here...

> +		new_spte &= ~shadow_acc_track_mask;
> +		new_spte |= saved_bits >> shadow_acc_track_saved_bits_shift;

... instead of shifting saved_bits left here.

> +	}
> +
> +	if (remove_write_prot)
> +		new_spte |= PT_WRITABLE_MASK;
>  
>  	/*
>  	 * Theoretically we could also set dirty bit (and flush TLB) here in
> @@ -2919,10 +3063,17 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	 *
>  	 * Compare with set_spte where instead shadow_dirty_mask is set.
>  	 */
> -	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) != spte)
> +	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
>  		return false;
>  
> -	kvm_vcpu_mark_page_dirty(vcpu, gfn);
> +	if (remove_write_prot) {

Stupid question ahead, why not call kvm_vcpu_mark_page_accessed in this 
function?

> +		/*
> +		 * The gfn of direct spte is stable since it is
> +		 * calculated by sp->gfn.
> +		 */
> +		gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> +		kvm_vcpu_mark_page_dirty(vcpu, gfn);
> +	}
>  
>  	return true;
>  }
> @@ -2937,7 +3088,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  {
>  	struct kvm_shadow_walk_iterator iterator;
>  	struct kvm_mmu_page *sp;
> -	bool ret = false;
> +	bool fault_handled = false;
>  	u64 spte = 0ull;
>  	uint retry_count = 0;
>  
> @@ -2953,36 +3104,43 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  			break;
>  
>  	do {
> -		/*
> -		 * If the mapping has been changed, let the vcpu fault on the
> -		 * same address again.
> -		 */
> -		if (!is_shadow_present_pte(spte)) {
> -			ret = true;
> -			break;
> -		}
> +		bool remove_write_prot = (error_code & PFERR_WRITE_MASK) &&
> +					 !(spte & PT_WRITABLE_MASK);
> +		bool remove_acc_track;
> +		bool valid_exec_access = (error_code & PFERR_FETCH_MASK) &&
> +					 (spte & shadow_x_mask);
>  
>  		sp = page_header(__pa(iterator.sptep));
>  		if (!is_last_spte(spte, sp->role.level))
>  			break;
>  
>  		/*
> -		 * Check if it is a spurious fault caused by TLB lazily flushed.
> +		 * Check whether the memory access that caused the fault would
> +		 * still cause it if it were to be performed right now. If not,
> +		 * then this is a spurious fault caused by TLB lazily flushed,
> +		 * or some other CPU has already fixed the PTE after the
> +		 * current CPU took the fault.
> +		 *
> +		 * If Write-Only mappings ever become supported, then the
> +		 * condition below would need to be changed appropriately.
>  		 *
>  		 * Need not check the access of upper level table entries since
>  		 * they are always ACC_ALL.
>  		 */
> -		if (is_writable_pte(spte)) {
> -			ret = true;
> +		if (((spte & PT_PRESENT_MASK) && !remove_write_prot) ||
> +		    valid_exec_access) {
> +			fault_handled = true;
>  			break;
>  		}

Let's separate the three conditions (R/W/X):

		if ((error_code & PFERR_FETCH_MASK) {
			if ((spte & (shadow_x_mask|shadow_nx_mask))
			    == shadow_x_mask) {
				fault_handled = true;
				break;
			}
		}
		if (error_code & PFERR_WRITE_MASK) {
			if (is_writable_pte(spte)) {
				fault_handled = true;
				break;
			}
			remove_write_prot =
				spte_can_locklessly_be_made_writable(spte);
		}
		if (!(error_code & PFERR_PRESENT_MASK)) {
			if (!is_access_track_spte(spte)) {
				fault_handled = true;
				break;
			}
			remove_acc_track = true;
		}

> +		remove_acc_track = is_access_track_spte(spte);
> +
>  		/*
> -		 * Currently, to simplify the code, only the spte
> -		 * write-protected by dirty-log can be fast fixed.
> +		 * Currently, to simplify the code, write-protection can be
> +		 * removed in the fast path only if the SPTE was write-protected
> +		 * for dirty-logging.
>  		 */
> -		if (!spte_can_locklessly_be_made_writable(spte))
> -			break;
> +		remove_write_prot &= spte_can_locklessly_be_made_writable(spte);
>  
>  		/*
>  		 * Do not fix write-permission on the large spte since we only
> @@ -2998,13 +3156,20 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  		if (sp->role.level > PT_PAGE_TABLE_LEVEL)
>  			break;
>  
> +		/* Verify that the fault can be handled in the fast path */
> +		if (!remove_acc_track && !remove_write_prot)
> +			break;
> +
>  		/*
>  		 * Currently, fast page fault only works for direct mapping
>  		 * since the gfn is not stable for indirect shadow page. See
>  		 * Documentation/virtual/kvm/locking.txt to get more detail.
>  		 */
> -		ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
> -		if (ret)
> +		fault_handled = fast_pf_fix_direct_spte(vcpu, sp,
> +							iterator.sptep, spte,
> +							remove_write_prot,
> +							remove_acc_track);
> +		if (fault_handled)
>  			break;
>  
>  		if (++retry_count > 4) {
> @@ -3018,10 +3183,10 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  	} while (true);
>  
>  	trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
> -			      spte, ret);
> +			      spte, fault_handled);
>  	walk_shadow_page_lockless_end(vcpu);
>  
> -	return ret;
> +	return fault_handled;
>  }
>  
>  static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> @@ -4300,6 +4465,7 @@ static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu,
>  	vcpu->arch.mmu.update_pte(vcpu, sp, spte, new);
>  }
>  
> +/* This is only supposed to be used for non-EPT mappings */

It's only used for non-EPT mappings, why is it only *supposed* to be 
used for non-EPT mappings?  It seems to me that it would work.

>  static bool need_remote_flush(u64 old, u64 new)
>  {
>  	if (!is_shadow_present_pte(old))
> @@ -5067,6 +5233,8 @@ static void mmu_destroy_caches(void)
>  
>  int kvm_mmu_module_init(void)
>  {
> +	kvm_mmu_clear_all_pte_masks();
> +
>  	pte_list_desc_cache = kmem_cache_create("pte_list_desc",
>  					    sizeof(struct pte_list_desc),
>  					    0, 0, NULL);
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index ddc56e9..dfd3056 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -52,6 +52,8 @@ static inline u64 rsvd_bits(int s, int e)
>  }
>  
>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
> +void kvm_mmu_set_access_track_masks(u64 acc_track_mask, u64 acc_track_value,
> +				    u64 saved_bits_mask, u64 saved_bits_shift);
>  
>  void
>  reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 88e3b02..363517e 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -5019,7 +5019,22 @@ static void ept_set_mmio_spte_mask(void)
>  	 * Also, magic bits (0x3ull << 62) is set to quickly identify mmio
>  	 * spte.
>  	 */
> -	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
> +	kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE |
> +				   VMX_EPT_TRACK_MMIO);
> +}
> +
> +static void ept_set_acc_track_spte_mask(void)
> +{
> +	/*
> +	 * For access track PTEs we use a non-present PTE to trigger an EPT
> +	 * Violation. The original RWX value is saved in some unused bits in
> +	 * the PTE and restored when the violation is fixed.
> +	 */
> +	kvm_mmu_set_access_track_masks(VMX_EPT_RWX_MASK |
> +				       VMX_EPT_TRACK_TYPE_MASK,
> +				       VMX_EPT_TRACK_ACCESS,
> +				       VMX_EPT_RWX_MASK,
> +				       VMX_EPT_RWX_SAVE_SHIFT);
>  }
>  
>  #define VMX_XSS_EXIT_BITMAP 0
> @@ -6551,6 +6566,9 @@ static __init int hardware_setup(void)
>  				      0ull : VMX_EPT_READABLE_MASK);
>  		ept_set_mmio_spte_mask();
>  		kvm_enable_tdp();
> +
> +		if (!enable_ept_ad_bits)
> +			ept_set_acc_track_spte_mask();

Let's put the whole "then" block in a single function vmx_enable_tdp.

Thanks,

Paolo

>  	} else
>  		kvm_disable_tdp();
>  
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-11-21 14:42     ` Paolo Bonzini
@ 2016-11-24  3:50       ` Junaid Shahid
  2016-11-25  9:45         ` Paolo Bonzini
  2016-12-01 22:54       ` Junaid Shahid
  1 sibling, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-11-24  3:50 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, andreslc, pfeiner, guangrong.xiao

Hi Paolo,

Thank you for the detailed feedback. I will send an updated version of the patch soon. A few comments below:

On Monday, November 21, 2016 03:42:23 PM Paolo Bonzini wrote:
> > @@ -576,20 +617,34 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
> >  	 */
> >  	if (spte_is_bit_changed(old_spte, new_spte,
> >                                  shadow_accessed_mask | shadow_dirty_mask))
> > -		ret = true;
> > +		flush = true;
> >  
> > -	if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask))
> > +	writable_cleared = is_writable_pte(old_spte) &&
> > +			   !is_writable_pte(new_spte);
> > +	acc_track_enabled = !is_access_track_spte(old_spte) &&
> > +			    is_access_track_spte(new_spte);
> > +
> > +	if (writable_cleared || acc_track_enabled)
> > +		flush = true;
> > +
> > +	if (shadow_accessed_mask ?
> > +	    spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask) :
> > +	    acc_track_enabled)
> 
> Please introduce a new function spte_is_access_tracking_enabled(u64 
> old_spte, u64 new_spte) and use it here:
> 
> 	if (shadow_accessed_mask ?
> 	    spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask) :
> 	    spte_is_access_tracking_enabled(old_spte, new_spte)) {
> 		flush |= !shadow_accessed_mask;
> 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> 	}
> 

I think we can just set flush = true in the then block instead of 
flush |= !shadow_accessed_mask.

And while we are at it, is there any reason to flush the TLB when setting the
A or D bit in the PTE? If not, we can remove this earlier block since the
clearing case is now handled in the separate if blocks for accessed and dirty:

	if (spte_is_bit_changed(old_spte, new_spte,
                           shadow_accessed_mask | shadow_dirty_mask))
		flush = true;

Also, instead of spte_is_access_tracking_enabled(), I’ve added is_accessed_spte
as you suggested later and used that here as well.

> > -	if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
> > +
> > +	if (shadow_dirty_mask ?
> > +	    spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask) :
> > +	    writable_cleared)
> 
> writable_cleared can be inline here and written as
> 
> 	spte_is_bit_cleared(old_spte, new_spte, PT_WRITABLE_MASK)
> 
> so
> 
> 	if (shadow_dirty_mask ?
> 	    spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask) :
> 	    spte_is_bit_cleared(old_spte, new_spte, PT_WRITABLE_MASK)) {
> 		flush |= !shadow_dirty_mask;
> 		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> 	}
> >  		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> >  
> > -	return ret;
> > +	return flush;
> >  }
> 
> But please anticipate the changes to this function, except for 
> introducing is_access_track_spte of course, to a separate function.

Sorry, I didn’t exactly understand what you meant by the last line. But I have
made it like this:

	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
		flush = true;
		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
	}

	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
		flush = true;
		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
	}

> >  static int mmu_spte_clear_track_bits(u64 *sptep)
> >  {
> > @@ -604,6 +659,13 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
> >  	if (!is_shadow_present_pte(old_spte))
> >  		return 0;
> >  
> > +	/*
> > +	 * For access tracking SPTEs, the pfn was already marked accessed/dirty
> > +	 * when the SPTE was marked for access tracking, so nothing to do here.
> > +	 */
> > +	if (is_access_track_spte(old_spte))
> > +		return 1;
> 
> This should go after the "WARN_ON", since that's a valuable check.  In 
> addition, I think it's a good idea to keep similar idioms between 
> mmu_spte_update and mmu_spte_clear_track_bits, like this:
> 
> 	if (shadow_accessed_mask
> 	    ? old_spte & shadow_accessed_mask
> 	    : !is_access_track_spte(old_spte))
> 		kvm_set_pfn_accessed(pfn);
>         if (shadow_dirty_mask
> 	    ? old_spte & shadow_dirty_mask
> 	    : old_spte & PT_WRITABLE_MASK)
>                 kvm_set_pfn_dirty(pfn);
> 
> 	return 1;
> 
> or (you pick)
> 
> 	if (shadow_accessed_mask
> 	    ? !(old_spte & shadow_accessed_mask)
> 	    : is_access_track_spte(old_spte))
> 		return 1;
> 
> 	kvm_set_pfn_accessed(pfn);
>         if (shadow_dirty_mask
> 	    ? old_spte & shadow_dirty_mask
> 	    : old_spte & PT_WRITABLE_MASK)
>                 kvm_set_pfn_dirty(pfn);
> 
> 	return 1;

I’ve replaced the checks with if(is_accessed_spte...) / if (is_dirty_spte...)

> > +	/*
> > +	 * Any PTE marked for access tracking should also be marked for dirty
> > +	 * tracking (by being non-writable)
> > +	 */
> > +	spte &= ~PT_WRITABLE_MASK;
> 
> This should be implicit in the definition of 
> shadow_acc_track_mask/value, so it's not necessary (or it can be a 
> WARN_ONCE).

It can't be handled by shadow_acc_track_mask/value, but I have changed
shadow_acc_track_saved_bits_mask to save only the R/X bits, which achieves the
same result.
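
For concreteness, the mark step then boils down to something like this (a
rough sketch using the shadow_acc_track_* names from this series, not the
exact patch code):

	static u64 mark_spte_for_access_track(u64 spte)
	{
		/*
		 * Save the current R/X permissions in the unused high bits.
		 * Since the saved-bits mask no longer covers W, the W bit is
		 * never saved and so never comes back on restore, which makes
		 * the explicit "spte &= ~PT_WRITABLE_MASK" unnecessary.
		 */
		spte |= (spte & shadow_acc_track_saved_bits_mask) <<
			shadow_acc_track_saved_bits_shift;

		/* Clear R/W/X and apply the access-track marker. */
		spte &= ~shadow_acc_track_mask;
		spte |= shadow_acc_track_value;

		return spte;
	}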

> 
> > +	spte &= ~(shadow_acc_track_saved_bits_mask <<
> > +		  shadow_acc_track_saved_bits_shift);
> 
> Should there be a WARN_ON if these bits are not zero?

No, these bits can be non-zero from a previous instance of 
mark_spte_for_access_track. They are not cleared when the PTE is restored to
normal state. (Though we could do that and then have a WARN_ONCE here.)

> > @@ -2919,10 +3063,17 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >  	 *
> >  	 * Compare with set_spte where instead shadow_dirty_mask is set.
> >  	 */
> > -	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) != spte)
> > +	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
> >  		return false;
> >  
> > -	kvm_vcpu_mark_page_dirty(vcpu, gfn);
> > +	if (remove_write_prot) {
> 
> Stupid question ahead, why not call kvm_vcpu_mark_page_accessed in this 
> function?

We could, but I kept that call in the same paths as before for consistency with
the other cases. Plus, even though it is a cheap call, why add it to the fast 
PF path if it is not necessary?

> > @@ -2953,36 +3104,43 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
> >  
> >  		/*
> > -		 * Check if it is a spurious fault caused by TLB lazily flushed.
> > +		 * Check whether the memory access that caused the fault would
> > +		 * still cause it if it were to be performed right now. If not,
> > +		 * then this is a spurious fault caused by TLB lazily flushed,
> > +		 * or some other CPU has already fixed the PTE after the
> > +		 * current CPU took the fault.
> > +		 *
> > +		 * If Write-Only mappings ever become supported, then the
> > +		 * condition below would need to be changed appropriately.
> >  		 *
> >  		 * Need not check the access of upper level table entries since
> >  		 * they are always ACC_ALL.
> >  		 */
> > -		if (is_writable_pte(spte)) {
> > -			ret = true;
> > +		if (((spte & PT_PRESENT_MASK) && !remove_write_prot) ||
> > +		    valid_exec_access) {
> > +			fault_handled = true;
> >  			break;
> >  		}
> 
> Let's separate the three conditions (R/W/X):
> 
> 		if (error_code & PFERR_FETCH_MASK) {
> 			if ((spte & (shadow_x_mask|shadow_nx_mask))
> 			    == shadow_x_mask) {
> 				fault_handled = true;
> 				break;
> 			}
> 		}
> 		if (error_code & PFERR_WRITE_MASK) {
> 			if (is_writable_pte(spte)) {
> 				fault_handled = true;
> 				break;
> 			}
> 			remove_write_prot =
> 				spte_can_locklessly_be_made_writable(spte);
> 		}
> 		if (!(error_code & PFERR_PRESENT_MASK)) {
> 			if (!is_access_track_spte(spte)) {
> 				fault_handled = true;
> 				break;
> 			}
> 			remove_acc_track = true;
> 		}

I think the third block is incorrect, e.g. it will set fault_handled = true even
for a completely zero PTE. I have replaced it with the following:

		if ((error_code & PFERR_USER_MASK) &&
		    (spte & PT_PRESENT_MASK)) {
			fault_handled = true;
			break;
		}

		remove_acc_track = is_access_track_spte(spte);

> >  
> > +/* This is only supposed to be used for non-EPT mappings */
> 
> It's only used for non-EPT mappings, why is it only *supposed* to be 
> used for non-EPT mappings?  It seems to me that it would work.
> 
> >  static bool need_remote_flush(u64 old, u64 new)
> >  {

It would work, but it will return true in at least one (probably the only)
case where it doesn’t need to, e.g. going from an acc-track PTE to a zeroed
PTE. Though maybe that is not a big deal. Of course, we could also just
update it to handle acc-track.
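
Just to illustrate the "update it to handle acc-track" option, a purely
hypothetical wrapper (not something this series adds) could short-circuit
the check, relying on the fact that an acc-track SPTE is non-present in
hardware and that the TLB was already flushed when the SPTE was marked:

	static bool need_remote_flush_acc_track(u64 old, u64 new)
	{
		/*
		 * An access-track SPTE cannot be cached in any TLB, so
		 * changing it (e.g. zapping it) needs no remote flush.
		 */
		if (is_access_track_spte(old))
			return false;

		return need_remote_flush(old, new);
	}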



Thanks,
Junaid


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-11-24  3:50       ` Junaid Shahid
@ 2016-11-25  9:45         ` Paolo Bonzini
  2016-11-29  2:43           ` Junaid Shahid
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2016-11-25  9:45 UTC (permalink / raw)
  To: Junaid Shahid; +Cc: kvm, andreslc, pfeiner, guangrong.xiao



On 24/11/2016 04:50, Junaid Shahid wrote:
> On Monday, November 21, 2016 03:42:23 PM Paolo Bonzini wrote:
>> Please introduce a new function spte_is_access_tracking_enabled(u64 
>> old_spte, u64 new_spte) and use it here:
>>
>> 	if (shadow_accessed_mask ?
>> 	    spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask) :
>> 	    spte_is_access_tracking_enabled(old_spte, new_spte)) {
>> 		flush |= !shadow_accessed_mask;
>> 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
>> 	}
>>
> 
> I think we can just set flush = true in the then block instead of 
> flush |= !shadow_accessed_mask.
> 
> And while we are at it, is there any reason to flush the TLB when setting the
> A or D bit in the PTE? If not, we can remove this earlier block since the
> clearing case is now handled in the separate if blocks for accessed and dirty:

Hmm, flushing the TLB is expensive.  I would have thought that we want to avoid
a shootdown (kvm_flush_remote_tlbs) for the A and D bits.  But it's probably rare
enough that it doesn't matter, and the existing code has

        /*
         * Flush TLB when accessed/dirty bits are changed in the page tables,
         * to guarantee consistency between TLB and page tables.
         */
        if (spte_is_bit_changed(old_spte, new_spte,
                                shadow_accessed_mask | shadow_dirty_mask))
                ret = true;

so yeah.

> 	if (spte_is_bit_changed(old_spte, new_spte,
>                            shadow_accessed_mask | shadow_dirty_mask))
> 		flush = true;
> 
> Also, instead of spte_is_access_tracking_enabled(), I’ve added is_accessed_spte
> as you suggested later and used that here as well.

Yes, that makes more sense.  Thanks!

>>> -	if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
>>> +
>>> +	if (shadow_dirty_mask ?
>>> +	    spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask) :
>>> +	    writable_cleared)
>>
>> writable_cleared can be inline here and written as
>>
>> 	spte_is_bit_cleared(old_spte, new_spte, PT_WRITABLE_MASK)
>>
>> so
>>
>> 	if (shadow_dirty_mask ?
>> 	    spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask) :
>> 	    spte_is_bit_cleared(old_spte, new_spte, PT_WRITABLE_MASK)) {
>> 		flush |= !shadow_dirty_mask;
>> 		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
>> 	}
>>>  		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
>>>  
>>> -	return ret;
>>> +	return flush;
>>>  }
>>
>> But please anticipate the changes to this function, except for 
>> introducing is_access_track_spte of course, to a separate function.
> 
> Sorry, I didn’t exactly understand what you meant by the last line. But I have
> made it like this:
> 
> 	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
> 		flush = true;
> 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> 	}
> 	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
> 		flush = true;
> 		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> 	}

The idea is to split this patch in two.  You can first refactor the function to
have the above code and introduce is_accessed_spte/is_dirty_spte.  Then, you
add the lockless access tracking on top.  But it's okay if we leave it for a
subsequent review.  There are going to be many changes already between v2 and v3!

> I’ve replaced the checks with if(is_accessed_spte...) / if (is_dirty_spte...)

Good!

>>> +	/*
>>> +	 * Any PTE marked for access tracking should also be marked for dirty
>>> +	 * tracking (by being non-writable)
>>> +	 */
>>> +	spte &= ~PT_WRITABLE_MASK;
>>
>> This should be implicit in the definition of 
>> shadow_acc_track_mask/value, so it's not necessary (or it can be a 
>> WARN_ONCE).
> 
> It can't be handled by shadow_acc_track_mask/value, but I have changed
> shadow_acc_track_saved_bits_mask to save only the R/X bits, which achieves the
> same result.
> 
>>
>>> +	spte &= ~(shadow_acc_track_saved_bits_mask <<
>>> +		  shadow_acc_track_saved_bits_shift);
>>
>> Should there be a WARN_ON if these bits are not zero?
> 
> No, these bits can be non-zero from a previous instance of 
> mark_spte_for_access_track. They are not cleared when the PTE is restored to
> normal state. (Though we could do that and then have a WARN_ONCE here.)

Ok, if it's not too hard, do add it.  I think it's worth having
more self-checks.

>>> @@ -2919,10 +3063,17 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>>>  	 *
>>>  	 * Compare with set_spte where instead shadow_dirty_mask is set.
>>>  	 */
>>> -	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) != spte)
>>> +	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
>>>  		return false;
>>>  
>>> -	kvm_vcpu_mark_page_dirty(vcpu, gfn);
>>> +	if (remove_write_prot) {
>>
>> Stupid question ahead, why not call kvm_vcpu_mark_page_accessed in this 
>> function?
> 
> We could, but I kept that call in the same paths as before for consistency with
> the other cases. Plus, even though it is a cheap call, why add it to the fast 
> PF path if it is not necessary?

Makes sense, I said it was a stupid question. :)

>>> @@ -2953,36 +3104,43 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>>>  
>>>  		/*
>>> -		 * Check if it is a spurious fault caused by TLB lazily flushed.
>>> +		 * Check whether the memory access that caused the fault would
>>> +		 * still cause it if it were to be performed right now. If not,
>>> +		 * then this is a spurious fault caused by TLB lazily flushed,
>>> +		 * or some other CPU has already fixed the PTE after the
>>> +		 * current CPU took the fault.
>>> +		 *
>>> +		 * If Write-Only mappings ever become supported, then the
>>> +		 * condition below would need to be changed appropriately.
>>>  		 *
>>>  		 * Need not check the access of upper level table entries since
>>>  		 * they are always ACC_ALL.
>>>  		 */
>>> -		if (is_writable_pte(spte)) {
>>> -			ret = true;
>>> +		if (((spte & PT_PRESENT_MASK) && !remove_write_prot) ||
>>> +		    valid_exec_access) {
>>> +			fault_handled = true;
>>>  			break;
>>>  		}
>>
>> Let's separate the three conditions (R/W/X):
>>
>> 		if (error_code & PFERR_FETCH_MASK) {
>> 			if ((spte & (shadow_x_mask|shadow_nx_mask))
>> 			    == shadow_x_mask) {
>> 				fault_handled = true;
>> 				break;
>> 			}
>> 		}
>> 		if (error_code & PFERR_WRITE_MASK) {
>> 			if (is_writable_pte(spte)) {
>> 				fault_handled = true;
>> 				break;
>> 			}
>> 			remove_write_prot =
>> 				spte_can_locklessly_be_made_writable(spte);
>> 		}
>> 		if (!(error_code & PFERR_PRESENT_MASK)) {
>> 			if (!is_access_track_spte(spte)) {
>> 				fault_handled = true;
>> 				break;
>> 			}
>> 			remove_acc_track = true;
>> 		}
> 
> I think the third block is incorrect e.g. it will set fault_handled = true even
> for a completely zero PTE.

A completely zero PTE would have been filtered before by the
is_shadow_present_pte check, wouldn't it?

>>>  
>>> +/* This is only supposed to be used for non-EPT mappings */
>>
>> It's only used for non-EPT mappings, why is it only *supposed* to be 
>> used for non-EPT mappings?  It seems to me that it would work.
>>
>>>  static bool need_remote_flush(u64 old, u64 new)
>>>  {
> 
> It would work but it will return true in at least one (probably only) case 
> where it doesn’t need to e.g. going from an acc-track PTE to a zeroed PTE.
> Though maybe that is not a big deal. Of course, we could also just update it 
> to handle acc-track. 

Yeah, I think it's not a big deal.  But actually is it really used only
for non-EPT?  Nested virtualization uses kvm_mmu_pte_write for EPT as well.

Paolo


> 
> 
> Thanks,
> Junaid
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-11-25  9:45         ` Paolo Bonzini
@ 2016-11-29  2:43           ` Junaid Shahid
  2016-11-29  8:09             ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-11-29  2:43 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, andreslc, pfeiner, guangrong.xiao

On Friday, November 25, 2016 10:45:28 AM Paolo Bonzini wrote:
> 
> On 24/11/2016 04:50, Junaid Shahid wrote:
> > On Monday, November 21, 2016 03:42:23 PM Paolo Bonzini wrote:
> >> Please introduce a new function spte_is_access_tracking_enabled(u64 
> >> old_spte, u64 new_spte) and use it here:
> >>
> >> 	if (shadow_accessed_mask ?
> >> 	    spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask) :
> >> 	    spte_is_access_tracking_enabled(old_spte, new_spte)) {
> >> 		flush |= !shadow_accessed_mask;
> >> 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> >> 	}
> >>
> > 
> > I think we can just set flush = true in the then block instead of 
> > flush |= !shadow_accessed_mask.
> > 
> > And while we are at it, is there any reason to flush the TLB when setting the
> > A or D bit in the PTE? If not, we can remove this earlier block since the
> > clearing case is now handled in the separate if blocks for accessed and dirty:
> 
> Hmm, flushing the TLB is expensive.  I would have thought that we want to avoid
> a shootdown (kvm_flush_remote_tlbs) for the A and D bits.  But it's probably rare
> enough that it doesn't matter, and the existing code has
> 
>         /*
>          * Flush TLB when accessed/dirty bits are changed in the page tables,
>          * to guarantee consistency between TLB and page tables.
>          */
>         if (spte_is_bit_changed(old_spte, new_spte,
>                                 shadow_accessed_mask | shadow_dirty_mask))
>                 ret = true;
> 
> so yeah.

Ok. So I’ll remove the existing spte_is_bit_changed block and set flush = true
inside the separate blocks that check accessed and dirty masks.

> >> But please anticipate the changes to this function, except for 
> >> introducing is_access_track_spte of course, to a separate function.
> > 
> > Sorry, I didn’t exactly understand what you meant by the last line. But I have
> > made it like this:
> > 
> > 	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
> > 		flush = true;
> > 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> > 	}
> > 	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
> > 		flush = true;
> > 		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> > 	}
> 
> The idea is to split this patch in two.  You can first refactor the function to
> have the above code and introduce is_accessed_spte/is_dirty_spte.  Then, you
> add the lockless access tracking on top.  But it's okay if we leave it for a
> subsequent review.  There are going to be many changes already between v2 and v3!

Thanks for the clarification. I guess I can just add the other patch now to
separate out the refactoring from the rest of the access tracking changes.

> >>> +	spte &= ~(shadow_acc_track_saved_bits_mask <<
> >>> +		  shadow_acc_track_saved_bits_shift);
> >>
> >> Should there be a WARN_ON if these bits are not zero?
> > 
> > No, these bits can be non-zero from a previous instance of 
> > mark_spte_for_access_track. They are not cleared when the PTE is restored to
> > normal state. (Though we could do that and then have a WARN_ONCE here.)
> 
> Ok, if it's not too hard, do add it.  I think it's worth having
> more self-checks.

Sure. Will do.
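
Something along these lines at the top of mark_spte_for_access_track() (a
sketch; it assumes the restore helper, call it restore_acc_track_spte(),
is changed to clear the saved-bits area when restoring):

	WARN_ONCE(spte & (shadow_acc_track_saved_bits_mask <<
			  shadow_acc_track_saved_bits_shift),
		  "kvm: Access Tracking saved bit locations are not zero\n");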

> 
> >>> @@ -2953,36 +3104,43 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
> >>>  
> >>>  		/*
> >>> -		 * Check if it is a spurious fault caused by TLB lazily flushed.
> >>> +		 * Check whether the memory access that caused the fault would
> >>> +		 * still cause it if it were to be performed right now. If not,
> >>> +		 * then this is a spurious fault caused by TLB lazily flushed,
> >>> +		 * or some other CPU has already fixed the PTE after the
> >>> +		 * current CPU took the fault.
> >>> +		 *
> >>> +		 * If Write-Only mappings ever become supported, then the
> >>> +		 * condition below would need to be changed appropriately.
> >>>  		 *
> >>>  		 * Need not check the access of upper level table entries since
> >>>  		 * they are always ACC_ALL.
> >>>  		 */
> >>> -		if (is_writable_pte(spte)) {
> >>> -			ret = true;
> >>> +		if (((spte & PT_PRESENT_MASK) && !remove_write_prot) ||
> >>> +		    valid_exec_access) {
> >>> +			fault_handled = true;
> >>>  			break;
> >>>  		}
> >>
> >> Let's separate the three conditions (R/W/X):
> >>
> >> 		if (error_code & PFERR_FETCH_MASK) {
> >> 			if ((spte & (shadow_x_mask|shadow_nx_mask))
> >> 			    == shadow_x_mask) {
> >> 				fault_handled = true;
> >> 				break;
> >> 			}
> >> 		}
> >> 		if (error_code & PFERR_WRITE_MASK) {
> >> 			if (is_writable_pte(spte)) {
> >> 				fault_handled = true;
> >> 				break;
> >> 			}
> >> 			remove_write_prot =
> >> 				spte_can_locklessly_be_made_writable(spte);
> >> 		}
> >> 		if (!(error_code & PFERR_PRESENT_MASK)) {
> >> 			if (!is_access_track_spte(spte)) {
> >> 				fault_handled = true;
> >> 				break;
> >> 			}
> >> 			remove_acc_track = true;
> >> 		}
> > 
> > I think the third block is incorrect e.g. it will set fault_handled = true even
> > for a completely zero PTE.
> 
> A completely zero PTE would have been filtered before by the
> is_shadow_present_pte check, wouldn't it?

Oh, the is_shadow_present_pte check was actually removed in the patch. We could
add it back, minus the ret = true statement, and then it would filter the zero 
PTE case. But I still think that the other form:

                if ((error_code & PFERR_USER_MASK) &&
                    (spte & PT_PRESENT_MASK)) {
                        fault_handled = true;
                        break;
                }

is simpler as it is directly analogous to the cases for fetch and write.
Please let me know if you think otherwise.

> >>> +/* This is only supposed to be used for non-EPT mappings */
> >>
> >> It's only used for non-EPT mappings, why is it only *supposed* to be 
> >> used for non-EPT mappings?  It seems to me that it would work.
> >>
> >>>  static bool need_remote_flush(u64 old, u64 new)
> >>>  {
> > 
> > It would work but it will return true in at least one (probably only) case 
> > where it doesn’t need to e.g. going from an acc-track PTE to a zeroed PTE.
> > Though maybe that is not a big deal. Of course, we could also just update it 
> > to handle acc-track. 
> 
> Yeah, I think it's not a big deal.  But actually is it really used only
> for non-EPT?  Nested virtualization uses kvm_mmu_pte_write for EPT as well.
 
Ok. I guess it might be more accurate to say indirect mappings instead of
non-EPT mappings. In any case, I’ll just remove the comment.

Thanks,
Junaid


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-11-29  2:43           ` Junaid Shahid
@ 2016-11-29  8:09             ` Paolo Bonzini
  2016-11-30  0:59               ` Junaid Shahid
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2016-11-29  8:09 UTC (permalink / raw)
  To: Junaid Shahid; +Cc: kvm, andreslc, pfeiner, guangrong xiao


> > >> Let's separate the three conditions (R/W/X):
> > >>
> > >> 		if (error_code & PFERR_FETCH_MASK) {
> > >> 			if ((spte & (shadow_x_mask|shadow_nx_mask))
> > >> 			    == shadow_x_mask) {
> > >> 				fault_handled = true;
> > >> 				break;
> > >> 			}
> > >> 		}
> > >> 		if (error_code & PFERR_WRITE_MASK) {
> > >> 			if (is_writable_pte(spte)) {
> > >> 				fault_handled = true;
> > >> 				break;
> > >> 			}
> > >> 			remove_write_prot =
> > >> 				spte_can_locklessly_be_made_writable(spte);
> > >> 		}
> > >> 		if (!(error_code & PFERR_PRESENT_MASK)) {
> > >> 			if (!is_access_track_spte(spte)) {
> > >> 				fault_handled = true;
> > >> 				break;
> > >> 			}
> > >> 			remove_acc_track = true;
> > >> 		}
> > > 
> > > I think the third block is incorrect e.g. it will set fault_handled =
> > > true even
> > > for a completely zero PTE.
> > 
> > A completely zero PTE would have been filtered before by the
> > is_shadow_present_pte check, wouldn't it?
> 
> Oh, the is_shadow_present_pte check was actually removed in the patch. We could
> add it back, minus the ret = true statement, and then it would filter the zero
> PTE case. But I still think that the other form:
> 
>                 if ((error_code & PFERR_USER_MASK) &&
>                     (spte & PT_PRESENT_MASK)) {
>                         fault_handled = true;
>                         break;
>                 }
> 
> is simpler as it is directly analogous to the cases for fetch and write.
> Please let me know if you think otherwise.

Fair enough, but add a comment to explain the error_code check because I
don't get it. :)

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-11-29  8:09             ` Paolo Bonzini
@ 2016-11-30  0:59               ` Junaid Shahid
  2016-11-30 11:09                 ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-11-30  0:59 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, andreslc, pfeiner, guangrong xiao

On Tuesday, November 29, 2016 03:09:31 AM Paolo Bonzini wrote:
> 
> > > >> Let's separate the three conditions (R/W/X):
> > > >>
> > > >> 		if (error_code & PFERR_FETCH_MASK) {
> > > >> 			if ((spte & (shadow_x_mask|shadow_nx_mask))
> > > >> 			    == shadow_x_mask) {
> > > >> 				fault_handled = true;
> > > >> 				break;
> > > >> 			}
> > > >> 		}
> > > >> 		if (error_code & PFERR_WRITE_MASK) {
> > > >> 			if (is_writable_pte(spte)) {
> > > >> 				fault_handled = true;
> > > >> 				break;
> > > >> 			}
> > > >> 			remove_write_prot =
> > > >> 				spte_can_locklessly_be_made_writable(spte);
> > > >> 		}
> > > >> 		if (!(error_code & PFERR_PRESENT_MASK)) {
> > > >> 			if (!is_access_track_spte(spte)) {
> > > >> 				fault_handled = true;
> > > >> 				break;
> > > >> 			}
> > > >> 			remove_acc_track = true;
> > > >> 		}
> > > > 
> > > > I think the third block is incorrect e.g. it will set fault_handled =
> > > > true even
> > > > for a completely zero PTE.
> > > 
> > > A completely zero PTE would have been filtered before by the
> > > is_shadow_present_pte check, wouldn't it?
> > 
> > Oh, the is_shadow_present_pte check was actually removed in the patch. We could
> > add it back, minus the ret = true statement, and then it would filter the zero
> > PTE case. But I still think that the other form:
> > 
> >                 if ((error_code & PFERR_USER_MASK) &&
> >                     (spte & PT_PRESENT_MASK)) {
> >                         fault_handled = true;
> >                         break;
> >                 }
> > 
> > is simpler as it is directly analogous to the cases for fetch and write.
> > Please let me know if you think otherwise.
> 
> Fair enough, but add a comment to explain the error_code check because I
> don't get it. :)

The error_code check verifies that it was a Read access, as PFERR_USER_MASK
is mapped to EPT_VIOLATION_READ. However, I have just realized that this isn’t
the case when not using EPT. So I’ll just use the following instead, which 
works for both EPT and non-EPT:

		if (error_code & PFERR_FETCH_MASK) {
			if ((spte & (shadow_x_mask | shadow_nx_mask))
			    == shadow_x_mask) {
				fault_handled = true;
				break;
			}
		} else if (error_code & PFERR_WRITE_MASK) {
			if (is_writable_pte(spte)) {
				fault_handled = true;
				break;
			}
			remove_write_prot =
				spte_can_locklessly_be_made_writable(spte);
		} else {
			/* Fault was on Read access */
			if (spte & PT_PRESENT_MASK) {
				fault_handled = true;
				break;
			}
		}

		remove_acc_track = is_access_track_spte(spte);



Thanks,
Junaid

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-11-30  0:59               ` Junaid Shahid
@ 2016-11-30 11:09                 ` Paolo Bonzini
  0 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2016-11-30 11:09 UTC (permalink / raw)
  To: Junaid Shahid; +Cc: kvm, andreslc, pfeiner, guangrong xiao



On 30/11/2016 01:59, Junaid Shahid wrote:
>> > 
>> > Fair enough, but add a comment to explain the error_code check because I
>> > don't get it. :)
> The error_code check verifies that it was a Read access, as PFERR_USER_MASK
> is mapped to EPT_VIOLATION_READ. However, I have just realized that this isn’t
> the case when not using EPT. So I’ll just use the following instead, which 
> works for both EPT and non-EPT:

Looks good, thanks!

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-11-21 14:42     ` Paolo Bonzini
  2016-11-24  3:50       ` Junaid Shahid
@ 2016-12-01 22:54       ` Junaid Shahid
  2016-12-02  8:33         ` Paolo Bonzini
  1 sibling, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-12-01 22:54 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, andreslc, pfeiner, guangrong.xiao

On Monday, November 21, 2016 03:42:23 PM Paolo Bonzini wrote:
> Please rewrite kvm_age_rmapp to use the new mmu_spte_age instead

Hi Paolo,

While updating kvm_age_rmapp/mmu_spte_age, I noticed an inconsistency in the
existing kvm code between the A/D and non-A/D cases. When using A/D bits,
kvm_age_hva calls kvm_age_rmapp, which does NOT call kvm_set_pfn_accessed.
However, when using EPT without A/D bits, kvm_unmap_rmapp is called, which
does internally end up in a call to kvm_set_pfn_accessed. Do you know if
this difference is deliberate? If not, should we call kvm_set_pfn_accessed
in the A/D case as well, or should we leave that as is? Does it make any
difference?

Thanks,
Junaid

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-01 22:54       ` Junaid Shahid
@ 2016-12-02  8:33         ` Paolo Bonzini
  2016-12-05 22:57           ` Junaid Shahid
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2016-12-02  8:33 UTC (permalink / raw)
  To: Junaid Shahid; +Cc: kvm, andreslc, pfeiner, guangrong xiao

> On Monday, November 21, 2016 03:42:23 PM Paolo Bonzini wrote:
> > Please rewrite kvm_age_rmapp to use the new mmu_spte_age instead
> 
> Hi Paolo,
> 
> While updating kvm_age_rmapp/mmu_spte_age, I noticed an inconsistency in the
> existing kvm code between the A/D and non-A/D cases. When using A/D bits,
> kvm_age_hva calls kvm_age_rmapp, which does NOT call kvm_set_pfn_accessed.
> However, when using EPT without A/D bits, kvm_unmap_rmapp is called, which
> does internally end up in a call to kvm_set_pfn_accessed. Do you know if
> this difference is deliberate? If not, should we call kvm_set_pfn_accessed
> in the A/D case as well, or should we leave that as is? Does it make any
> difference?

I think it's correct _not_ to call kvm_set_pfn_accessed, because the
clear_flush_young MMU notifier is called when you want to clear the
accessed bit.  So your patch would be fixing a bug in the case where
EPT A/D bits aren't available.
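
For reference, the suggested kvm_age_rmapp rewrite boils down to something
like this (untested sketch, assuming an mmu_spte_age() helper that does not
call kvm_set_pfn_accessed):

	static int kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
				 struct kvm_memory_slot *slot, gfn_t gfn,
				 int level, unsigned long data)
	{
		u64 *sptep;
		struct rmap_iterator iter;
		int young = 0;

		for_each_rmap_spte(rmap_head, &iter, sptep)
			young |= mmu_spte_age(sptep);

		trace_kvm_age_page(gfn, level, slot, young);
		return young;
	}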

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-02  8:33         ` Paolo Bonzini
@ 2016-12-05 22:57           ` Junaid Shahid
  0 siblings, 0 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-12-05 22:57 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, andreslc, pfeiner, guangrong xiao

On Friday, December 02, 2016 03:33:38 AM Paolo Bonzini wrote:
> 
> I think it's correct _not_ to call kvm_set_pfn_accessed, because the
> clear_flush_young MMU notifier is called when you want to clear the
> accessed bit.  So your patch would be fixing a bug in the case where
> EPT A/D bits aren't available.
 
Thanks for the clarification. The existing patch actually didn’t fix it
because mmu_spte_update calls kvm_set_pfn_accessed when switching to an
acc-track PTE (or clearing the A bit). But I guess I can introduce a
parameter to avoid doing that when mmu_spte_update is being called from
mmu_spte_age.
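
Roughly, mmu_spte_age() would then look something like this (a sketch of
the direction, using a separate mmu_spte_update_no_track() helper rather
than a flag parameter; not the exact code):

	static bool mmu_spte_age(u64 *sptep)
	{
		u64 spte = mmu_spte_get_lockless(sptep);

		if (!is_accessed_spte(spte))
			return false;

		if (shadow_accessed_mask) {
			clear_bit((ffs(shadow_accessed_mask) - 1),
				  (unsigned long *)sptep);
		} else {
			/*
			 * Capture the dirty status of the page, so that it
			 * doesn't get lost when the SPTE is marked for access
			 * tracking.
			 */
			if (is_writable_pte(spte))
				kvm_set_pfn_dirty(spte_to_pfn(spte));

			spte = mark_spte_for_access_track(spte);

			/* Deliberately no kvm_set_pfn_accessed() from here. */
			mmu_spte_update_no_track(sptep, spte);
		}

		return true;
	}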

Thanks,
Junaid


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v3 0/8] Lockless Access Tracking for Intel CPUs without EPT A bits
  2016-10-27  2:19 [PATCH 0/4] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
                   ` (4 preceding siblings ...)
  2016-11-08 23:00 ` [PATCH v2 0/5] Lockless Access Tracking " Junaid Shahid
@ 2016-12-07  0:46 ` Junaid Shahid
  2016-12-07  0:46   ` [PATCH v3 1/8] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
                     ` (7 more replies)
  5 siblings, 8 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-12-07  0:46 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

Changes from v2:
* Added two patches to refactor mmu_spte_update/clear and add a no_track
  version of mmu_spte_update.
* Ensured that fast_page_fault handles non-write faults to large pages that
  are being access tracked.
* Several minor changes based on code review feedback from v2.

Changes from v1:
* Patch 1 correctly maps to the current codebase by setting the Present bit
  in the page fault error code if any of the Readable, Writeable or Executable
  bits are set in the Exit Qualification.
* Added Patch 5 to update Documentation/virtual/kvm/locking.txt

This patch series implements a lockless access tracking mechanism for KVM
when running on Intel CPUs that do not have EPT A/D bits. 

Currently, KVM tracks accesses on these machines by just clearing the PTEs
and then remapping them when they are accessed again. However, the remapping
requires acquiring the MMU lock in order to look up the information needed to
construct the PTE. On high core count VMs, this can result in significant MMU
lock contention when running some memory-intensive workloads.

This new mechanism just marks the PTEs as not-present, but keeps all the
information within the PTE instead of clearing it. When the page is accessed
again, the PTE can thus be restored without needing to acquire the MMU lock.
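
To illustrate, restoring such a PTE amounts to shifting the saved permission
bits back into their normal positions and dropping the access-track marking,
so a single cmpxchg64 in the fast page fault path is enough to fix it up
(a simplified sketch, not the exact code from the access tracking patch):

	static u64 restore_acc_track_spte(u64 spte)
	{
		u64 saved_bits = (spte >> shadow_acc_track_saved_bits_shift) &
				 shadow_acc_track_saved_bits_mask;

		spte &= ~shadow_acc_track_mask;
		spte &= ~(shadow_acc_track_saved_bits_mask <<
			  shadow_acc_track_saved_bits_shift);
		spte |= saved_bits;

		return spte;
	}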

Junaid Shahid (8):
  kvm: x86: mmu: Use symbolic constants for EPT Violation Exit
    Qualifications
  kvm: x86: mmu: Rename spte_is_locklessly_modifiable()
  kvm: x86: mmu: Fast Page Fault path retries
  kvm: x86: mmu: Refactor accessed/dirty checks in mmu_spte_update/clear
  kvm: x86: mmu: Introduce a no-tracking version of mmu_spte_update
  kvm: x86: mmu: Do not use bit 63 for tracking special SPTEs
  kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A
    bits.
  kvm: x86: mmu: Update documentation for fast page fault mechanism

 Documentation/virtual/kvm/locking.txt |  31 ++-
 arch/x86/include/asm/kvm_host.h       |  10 +-
 arch/x86/include/asm/vmx.h            |  28 +-
 arch/x86/kvm/mmu.c                    | 464 +++++++++++++++++++++++-----------
 arch/x86/kvm/vmx.c                    |  54 ++--
 arch/x86/kvm/x86.c                    |   2 +-
 6 files changed, 419 insertions(+), 170 deletions(-)

-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v3 1/8] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications
  2016-12-07  0:46 ` [PATCH v3 0/8] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
@ 2016-12-07  0:46   ` Junaid Shahid
  2016-12-15  6:50     ` Xiao Guangrong
  2016-12-07  0:46   ` [PATCH v3 2/8] kvm: x86: mmu: Rename spte_is_locklessly_modifiable() Junaid Shahid
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-12-07  0:46 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

This change adds some symbolic constants for VM Exit Qualifications
related to EPT Violations and updates handle_ept_violation() to use
these constants instead of hard-coded numbers.

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/include/asm/vmx.h | 16 ++++++++++++++++
 arch/x86/kvm/vmx.c         | 22 ++++++++++++++--------
 2 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 20e5e31..659e402 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -469,6 +469,22 @@ struct vmx_msr_entry {
 #define ENTRY_FAIL_VMCS_LINK_PTR	4
 
 /*
+ * Exit Qualifications for EPT Violations
+ */
+#define EPT_VIOLATION_READ_BIT		0
+#define EPT_VIOLATION_WRITE_BIT		1
+#define EPT_VIOLATION_INSTR_BIT		2
+#define EPT_VIOLATION_READABLE_BIT	3
+#define EPT_VIOLATION_WRITABLE_BIT	4
+#define EPT_VIOLATION_EXECUTABLE_BIT	5
+#define EPT_VIOLATION_READ		(1 << EPT_VIOLATION_READ_BIT)
+#define EPT_VIOLATION_WRITE		(1 << EPT_VIOLATION_WRITE_BIT)
+#define EPT_VIOLATION_INSTR		(1 << EPT_VIOLATION_INSTR_BIT)
+#define EPT_VIOLATION_READABLE		(1 << EPT_VIOLATION_READABLE_BIT)
+#define EPT_VIOLATION_WRITABLE		(1 << EPT_VIOLATION_WRITABLE_BIT)
+#define EPT_VIOLATION_EXECUTABLE	(1 << EPT_VIOLATION_EXECUTABLE_BIT)
+
+/*
  * VM-instruction error numbers
  */
 enum vm_instruction_error_number {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0e86219..eb6b589 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6113,14 +6113,20 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
 	trace_kvm_page_fault(gpa, exit_qualification);
 
-	/* it is a read fault? */
-	error_code = (exit_qualification << 2) & PFERR_USER_MASK;
-	/* it is a write fault? */
-	error_code |= exit_qualification & PFERR_WRITE_MASK;
-	/* It is a fetch fault? */
-	error_code |= (exit_qualification << 2) & PFERR_FETCH_MASK;
-	/* ept page table is present? */
-	error_code |= (exit_qualification & 0x38) != 0;
+	/* Is it a read fault? */
+	error_code = (exit_qualification & EPT_VIOLATION_READ)
+		     ? PFERR_USER_MASK : 0;
+	/* Is it a write fault? */
+	error_code |= (exit_qualification & EPT_VIOLATION_WRITE)
+		      ? PFERR_WRITE_MASK : 0;
+	/* Is it a fetch fault? */
+	error_code |= (exit_qualification & EPT_VIOLATION_INSTR)
+		      ? PFERR_FETCH_MASK : 0;
+	/* ept page table entry is present? */
+	error_code |= (exit_qualification &
+		       (EPT_VIOLATION_READABLE | EPT_VIOLATION_WRITABLE |
+			EPT_VIOLATION_EXECUTABLE))
+		      ? PFERR_PRESENT_MASK : 0;
 
 	vcpu->arch.exit_qualification = exit_qualification;
 
-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v3 2/8] kvm: x86: mmu: Rename spte_is_locklessly_modifiable()
  2016-12-07  0:46 ` [PATCH v3 0/8] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
  2016-12-07  0:46   ` [PATCH v3 1/8] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
@ 2016-12-07  0:46   ` Junaid Shahid
  2016-12-15  6:51     ` Xiao Guangrong
  2016-12-07  0:46   ` [PATCH v3 3/8] kvm: x86: mmu: Fast Page Fault path retries Junaid Shahid
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-12-07  0:46 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

This change renames spte_is_locklessly_modifiable() to
spte_can_locklessly_be_made_writable() to distinguish it from other
forms of lockless modifications. The full set of lockless modifications
is covered by spte_has_volatile_bits().

Signed-off-by: Junaid Shahid <junaids@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7012de4..4d33275 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -473,7 +473,7 @@ static u64 __get_spte_lockless(u64 *sptep)
 }
 #endif
 
-static bool spte_is_locklessly_modifiable(u64 spte)
+static bool spte_can_locklessly_be_made_writable(u64 spte)
 {
 	return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
 		(SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
@@ -487,7 +487,7 @@ static bool spte_has_volatile_bits(u64 spte)
 	 * also, it can help us to get a stable is_writable_pte()
 	 * to ensure tlb flush is not missed.
 	 */
-	if (spte_is_locklessly_modifiable(spte))
+	if (spte_can_locklessly_be_made_writable(spte))
 		return true;
 
 	if (!shadow_accessed_mask)
@@ -556,7 +556,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	 * we always atomically update it, see the comments in
 	 * spte_has_volatile_bits().
 	 */
-	if (spte_is_locklessly_modifiable(old_spte) &&
+	if (spte_can_locklessly_be_made_writable(old_spte) &&
 	      !is_writable_pte(new_spte))
 		ret = true;
 
@@ -1212,7 +1212,7 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
 	u64 spte = *sptep;
 
 	if (!is_writable_pte(spte) &&
-	      !(pt_protect && spte_is_locklessly_modifiable(spte)))
+	      !(pt_protect && spte_can_locklessly_be_made_writable(spte)))
 		return false;
 
 	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
@@ -2965,7 +2965,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 	 * Currently, to simplify the code, only the spte write-protected
 	 * by dirty-log can be fast fixed.
 	 */
-	if (!spte_is_locklessly_modifiable(spte))
+	if (!spte_can_locklessly_be_made_writable(spte))
 		goto exit;
 
 	/*
-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v3 3/8] kvm: x86: mmu: Fast Page Fault path retries
  2016-12-07  0:46 ` [PATCH v3 0/8] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
  2016-12-07  0:46   ` [PATCH v3 1/8] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
  2016-12-07  0:46   ` [PATCH v3 2/8] kvm: x86: mmu: Rename spte_is_locklessly_modifiable() Junaid Shahid
@ 2016-12-07  0:46   ` Junaid Shahid
  2016-12-15  7:20     ` Xiao Guangrong
  2016-12-07  0:46   ` [PATCH v3 4/8] kvm: x86: mmu: Refactor accessed/dirty checks in mmu_spte_update/clear Junaid Shahid
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-12-07  0:46 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

This change adds retries into the Fast Page Fault path. Without the
retries, the code still works, but if a retry does end up being needed,
then it will result in a second page fault for the same memory access,
which will cause much more overhead compared to just retrying within the
original fault.

This would be especially useful with the upcoming fast access tracking
change, as that would make it more likely for retries to be needed
(e.g. due to read and write faults happening on different CPUs at
the same time).

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/kvm/mmu.c | 124 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 73 insertions(+), 51 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4d33275..bcf1b95 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2881,6 +2881,10 @@ static bool page_fault_can_be_fast(u32 error_code)
 	return true;
 }
 
+/*
+ * Returns true if the SPTE was fixed successfully. Otherwise,
+ * someone else modified the SPTE from its original value.
+ */
 static bool
 fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			u64 *sptep, u64 spte)
@@ -2907,8 +2911,10 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	 *
 	 * Compare with set_spte where instead shadow_dirty_mask is set.
 	 */
-	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
-		kvm_vcpu_mark_page_dirty(vcpu, gfn);
+	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) != spte)
+		return false;
+
+	kvm_vcpu_mark_page_dirty(vcpu, gfn);
 
 	return true;
 }
@@ -2923,8 +2929,9 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 {
 	struct kvm_shadow_walk_iterator iterator;
 	struct kvm_mmu_page *sp;
-	bool ret = false;
+	bool fault_handled = false;
 	u64 spte = 0ull;
+	uint retry_count = 0;
 
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return false;
@@ -2937,62 +2944,77 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 		if (!is_shadow_present_pte(spte) || iterator.level < level)
 			break;
 
-	/*
-	 * If the mapping has been changed, let the vcpu fault on the
-	 * same address again.
-	 */
-	if (!is_shadow_present_pte(spte)) {
-		ret = true;
-		goto exit;
-	}
+	do {
+		/*
+		 * If the mapping has been changed, let the vcpu fault on the
+		 * same address again.
+		 */
+		if (!is_shadow_present_pte(spte)) {
+			fault_handled = true;
+			break;
+		}
 
-	sp = page_header(__pa(iterator.sptep));
-	if (!is_last_spte(spte, sp->role.level))
-		goto exit;
+		sp = page_header(__pa(iterator.sptep));
+		if (!is_last_spte(spte, sp->role.level))
+			break;
 
-	/*
-	 * Check if it is a spurious fault caused by TLB lazily flushed.
-	 *
-	 * Need not check the access of upper level table entries since
-	 * they are always ACC_ALL.
-	 */
-	 if (is_writable_pte(spte)) {
-		ret = true;
-		goto exit;
-	}
+		/*
+		 * Check if it is a spurious fault caused by TLB lazily flushed.
+		 *
+		 * Need not check the access of upper level table entries since
+		 * they are always ACC_ALL.
+		 */
+		if (is_writable_pte(spte)) {
+			fault_handled = true;
+			break;
+		}
 
-	/*
-	 * Currently, to simplify the code, only the spte write-protected
-	 * by dirty-log can be fast fixed.
-	 */
-	if (!spte_can_locklessly_be_made_writable(spte))
-		goto exit;
+		/*
+		 * Currently, to simplify the code, only the spte
+		 * write-protected by dirty-log can be fast fixed.
+		 */
+		if (!spte_can_locklessly_be_made_writable(spte))
+			break;
 
-	/*
-	 * Do not fix write-permission on the large spte since we only dirty
-	 * the first page into the dirty-bitmap in fast_pf_fix_direct_spte()
-	 * that means other pages are missed if its slot is dirty-logged.
-	 *
-	 * Instead, we let the slow page fault path create a normal spte to
-	 * fix the access.
-	 *
-	 * See the comments in kvm_arch_commit_memory_region().
-	 */
-	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
-		goto exit;
+		/*
+		 * Do not fix write-permission on the large spte since we only
+		 * dirty the first page into the dirty-bitmap in
+		 * fast_pf_fix_direct_spte() that means other pages are missed
+		 * if its slot is dirty-logged.
+		 *
+		 * Instead, we let the slow page fault path create a normal spte
+		 * to fix the access.
+		 *
+		 * See the comments in kvm_arch_commit_memory_region().
+		 */
+		if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+			break;
+
+		/*
+		 * Currently, fast page fault only works for direct mapping
+		 * since the gfn is not stable for indirect shadow page. See
+		 * Documentation/virtual/kvm/locking.txt to get more detail.
+		 */
+		fault_handled = fast_pf_fix_direct_spte(vcpu, sp,
+							iterator.sptep, spte);
+		if (fault_handled)
+			break;
+
+		if (++retry_count > 4) {
+			printk_once(KERN_WARNING
+				"kvm: Fast #PF retrying more than 4 times.\n");
+			break;
+		}
+
+		spte = mmu_spte_get_lockless(iterator.sptep);
+
+	} while (true);
 
-	/*
-	 * Currently, fast page fault only works for direct mapping since
-	 * the gfn is not stable for indirect shadow page.
-	 * See Documentation/virtual/kvm/locking.txt to get more detail.
-	 */
-	ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
-exit:
 	trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
-			      spte, ret);
+			      spte, fault_handled);
 	walk_shadow_page_lockless_end(vcpu);
 
-	return ret;
+	return fault_handled;
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v3 4/8] kvm: x86: mmu: Refactor accessed/dirty checks in mmu_spte_update/clear
  2016-12-07  0:46 ` [PATCH v3 0/8] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
                     ` (2 preceding siblings ...)
  2016-12-07  0:46   ` [PATCH v3 3/8] kvm: x86: mmu: Fast Page Fault path retries Junaid Shahid
@ 2016-12-07  0:46   ` Junaid Shahid
  2016-12-07  0:46   ` [PATCH v3 5/8] kvm: x86: mmu: Introduce a no-tracking version of mmu_spte_update Junaid Shahid
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-12-07  0:46 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

This simplifies mmu_spte_update() a little bit.
The checks for clearing of accessed and dirty bits are refactored into
separate functions, which are used inside both mmu_spte_update() and
mmu_spte_clear_track_bits(), as well as kvm_test_age_rmapp(). The new
helper functions handle both the case when A/D bits are supported in
hardware and the case when they are not.

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/kvm/mmu.c | 68 +++++++++++++++++++++++++-----------------------------
 1 file changed, 32 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index bcf1b95..a9cd1df 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -503,14 +503,16 @@ static bool spte_has_volatile_bits(u64 spte)
 	return true;
 }
 
-static bool spte_is_bit_cleared(u64 old_spte, u64 new_spte, u64 bit_mask)
+static bool is_accessed_spte(u64 spte)
 {
-	return (old_spte & bit_mask) && !(new_spte & bit_mask);
+	return shadow_accessed_mask ? spte & shadow_accessed_mask
+				    : true;
 }
 
-static bool spte_is_bit_changed(u64 old_spte, u64 new_spte, u64 bit_mask)
+static bool is_dirty_spte(u64 spte)
 {
-	return (old_spte & bit_mask) != (new_spte & bit_mask);
+	return shadow_dirty_mask ? spte & shadow_dirty_mask
+				 : spte & PT_WRITABLE_MASK;
 }
 
 /* Rules for using mmu_spte_set:
@@ -533,17 +535,19 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
  * will find a read-only spte, even though the writable spte
  * might be cached on a CPU's TLB, the return value indicates this
  * case.
+ *
+ * Returns true if the TLB needs to be flushed
  */
 static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 {
 	u64 old_spte = *sptep;
-	bool ret = false;
+	bool flush = false;
 
 	WARN_ON(!is_shadow_present_pte(new_spte));
 
 	if (!is_shadow_present_pte(old_spte)) {
 		mmu_spte_set(sptep, new_spte);
-		return ret;
+		return flush;
 	}
 
 	if (!spte_has_volatile_bits(old_spte))
@@ -551,6 +555,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	else
 		old_spte = __update_clear_spte_slow(sptep, new_spte);
 
+	WARN_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte));
+
 	/*
 	 * For the spte updated out of mmu-lock is safe, since
 	 * we always atomically update it, see the comments in
@@ -558,38 +564,31 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	 */
 	if (spte_can_locklessly_be_made_writable(old_spte) &&
 	      !is_writable_pte(new_spte))
-		ret = true;
-
-	if (!shadow_accessed_mask) {
-		/*
-		 * We don't set page dirty when dropping non-writable spte.
-		 * So do it now if the new spte is becoming non-writable.
-		 */
-		if (ret)
-			kvm_set_pfn_dirty(spte_to_pfn(old_spte));
-		return ret;
-	}
+		flush = true;
 
 	/*
-	 * Flush TLB when accessed/dirty bits are changed in the page tables,
+	 * Flush TLB when accessed/dirty states are changed in the page tables,
 	 * to guarantee consistency between TLB and page tables.
 	 */
-	if (spte_is_bit_changed(old_spte, new_spte,
-                                shadow_accessed_mask | shadow_dirty_mask))
-		ret = true;
 
-	if (spte_is_bit_cleared(old_spte, new_spte, shadow_accessed_mask))
+	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
+		flush = true;
 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
-	if (spte_is_bit_cleared(old_spte, new_spte, shadow_dirty_mask))
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+	}
 
-	return ret;
+	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
+		flush = true;
+		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+	}
+
+	return flush;
 }
 
 /*
  * Rules for using mmu_spte_clear_track_bits:
  * It sets the sptep from present to nonpresent, and track the
  * state bits, it is used to clear the last level sptep.
+ * Returns non-zero if the PTE was previously valid.
  */
 static int mmu_spte_clear_track_bits(u64 *sptep)
 {
@@ -613,11 +612,12 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
 	 */
 	WARN_ON(!kvm_is_reserved_pfn(pfn) && !page_count(pfn_to_page(pfn)));
 
-	if (!shadow_accessed_mask || old_spte & shadow_accessed_mask)
+	if (is_accessed_spte(old_spte))
 		kvm_set_pfn_accessed(pfn);
-	if (old_spte & (shadow_dirty_mask ? shadow_dirty_mask :
-					    PT_WRITABLE_MASK))
+
+	if (is_dirty_spte(old_spte))
 		kvm_set_pfn_dirty(pfn);
+
 	return 1;
 }
 
@@ -1615,7 +1615,6 @@ static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
-	int young = 0;
 
 	/*
 	 * If there's no access bit in the secondary pte set by the
@@ -1625,14 +1624,11 @@ static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	if (!shadow_accessed_mask)
 		goto out;
 
-	for_each_rmap_spte(rmap_head, &iter, sptep) {
-		if (*sptep & shadow_accessed_mask) {
-			young = 1;
-			break;
-		}
-	}
+	for_each_rmap_spte(rmap_head, &iter, sptep)
+		if (is_accessed_spte(*sptep))
+			return 1;
 out:
-	return young;
+	return 0;
 }
 
 #define RMAP_RECYCLE_THRESHOLD 1000
-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v3 5/8] kvm: x86: mmu: Introduce a no-tracking version of mmu_spte_update
  2016-12-07  0:46 ` [PATCH v3 0/8] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
                     ` (3 preceding siblings ...)
  2016-12-07  0:46   ` [PATCH v3 4/8] kvm: x86: mmu: Refactor accessed/dirty checks in mmu_spte_update/clear Junaid Shahid
@ 2016-12-07  0:46   ` Junaid Shahid
  2016-12-07  0:46   ` [PATCH v3 6/8] kvm: x86: mmu: Do not use bit 63 for tracking special SPTEs Junaid Shahid
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-12-07  0:46 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

mmu_spte_update() tracks changes in the accessed/dirty state of
the SPTE being updated and calls kvm_set_pfn_accessed/dirty
appropriately. However, in some cases (e.g. when aging the SPTE),
this shouldn't be done. mmu_spte_update_no_track() is introduced
for use in such cases.
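
As a rough outline of the resulting split (illustrative only;
track_accessed_dirty_and_decide_flush() is a made-up name standing in for
the bookkeeping that stays in mmu_spte_update(), whose real body is in the
diff below):

/* Illustrative outline, not the literal kernel code. */
static bool mmu_spte_update(u64 *sptep, u64 new_spte)
{
	/* Raw SPTE write, fast or slow depending on volatile bits. */
	u64 old_spte = mmu_spte_update_no_track(sptep, new_spte);

	if (!is_shadow_present_pte(old_spte))
		return false;

	/*
	 * Accessed/dirty transition tracking and the TLB-flush decision
	 * stay here; aging callers bypass them by calling the no-track
	 * helper directly. (Hypothetical helper name below.)
	 */
	return track_accessed_dirty_and_decide_flush(old_spte, new_spte);
}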

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/kvm/mmu.c | 42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a9cd1df..3f66fd3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -527,6 +527,31 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
 	__set_spte(sptep, new_spte);
 }
 
+/*
+ * Update the SPTE (excluding the PFN), but do not track changes in its
+ * accessed/dirty status.
+ */
+static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
+{
+	u64 old_spte = *sptep;
+
+	WARN_ON(!is_shadow_present_pte(new_spte));
+
+	if (!is_shadow_present_pte(old_spte)) {
+		mmu_spte_set(sptep, new_spte);
+		return old_spte;
+	}
+
+	if (!spte_has_volatile_bits(old_spte))
+		__update_clear_spte_fast(sptep, new_spte);
+	else
+		old_spte = __update_clear_spte_slow(sptep, new_spte);
+
+	WARN_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte));
+
+	return old_spte;
+}
+
 /* Rules for using mmu_spte_update:
  * Update the state bits, it means the mapped pfn is not changed.
  *
@@ -540,22 +565,11 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
  */
 static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 {
-	u64 old_spte = *sptep;
 	bool flush = false;
+	u64 old_spte = mmu_spte_update_no_track(sptep, new_spte);
 
-	WARN_ON(!is_shadow_present_pte(new_spte));
-
-	if (!is_shadow_present_pte(old_spte)) {
-		mmu_spte_set(sptep, new_spte);
-		return flush;
-	}
-
-	if (!spte_has_volatile_bits(old_spte))
-		__update_clear_spte_fast(sptep, new_spte);
-	else
-		old_spte = __update_clear_spte_slow(sptep, new_spte);
-
-	WARN_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte));
+	if (!is_shadow_present_pte(old_spte))
+		return false;
 
 	/*
 	 * For the spte updated out of mmu-lock is safe, since
-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v3 6/8] kvm: x86: mmu: Do not use bit 63 for tracking special SPTEs
  2016-12-07  0:46 ` [PATCH v3 0/8] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
                     ` (4 preceding siblings ...)
  2016-12-07  0:46   ` [PATCH v3 5/8] kvm: x86: mmu: Introduce a no-tracking version of mmu_spte_update Junaid Shahid
@ 2016-12-07  0:46   ` Junaid Shahid
  2016-12-07  0:46   ` [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
  2016-12-07  0:46   ` [PATCH v3 8/8] kvm: x86: mmu: Update documentation for fast page fault mechanism Junaid Shahid
  7 siblings, 0 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-12-07  0:46 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

MMIO SPTEs currently set both bits 62 and 63 to distinguish them as special
PTEs. However, bit 63 is used as the SVE bit in Intel EPT PTEs. The SVE bit
is ignored for misconfigured PTEs but not necessarily for not-Present PTEs.
Since MMIO SPTEs use an EPT misconfiguration, using bit 63 for them is
acceptable. However, the upcoming fast access tracking feature adds another
type of special tracking PTE, which uses not-Present PTEs and hence should
not set bit 63.

In order to use common bits to distinguish both types of special PTEs, we
now use only bit 62 as the special bit.
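
For reference, the resulting use of the high bits, sketched as a comment
(spte_is_special() is a hypothetical helper shown only for illustration):

/*
 * High-bit usage after this change (illustrative summary):
 *
 *   bit 63: EPT "suppress #VE" (SVE) bit - no longer used by KVM to tag
 *           special SPTEs, since it is not ignored for not-Present PTEs.
 *   bit 62: SPTE_SPECIAL_MASK - tags MMIO SPTEs now, and access-tracking
 *           SPTEs later in this series.
 *
 * An MMIO SPTE is built as SPTE_SPECIAL_MASK | VMX_EPT_MISCONFIG_WX_VALUE,
 * i.e. W=1/X=1/R=0, which is a guaranteed EPT misconfiguration.
 */
static inline bool spte_is_special(u64 spte)	/* hypothetical helper */
{
	return spte & SPTE_SPECIAL_MASK;
}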

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/include/asm/kvm_host.h | 7 +++++++
 arch/x86/include/asm/vmx.h      | 9 +++++++--
 arch/x86/kvm/vmx.c              | 6 +++---
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 77cb3f9..5a10eb7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -208,6 +208,13 @@ enum {
 				 PFERR_WRITE_MASK |		\
 				 PFERR_PRESENT_MASK)
 
+/*
+ * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
+ * Access Tracking SPTEs. We use bit 62 instead of bit 63 to avoid conflicting
+ * with the SVE bit in EPT PTEs.
+ */
+#define SPTE_SPECIAL_MASK (1ULL << 62)
+
 /* apic attention bits */
 #define KVM_APIC_CHECK_VAPIC	0
 /*
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 659e402..45ee6d9 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -436,8 +436,13 @@ enum vmcs_field {
 #define VMX_EPT_WRITABLE_MASK			0x2ull
 #define VMX_EPT_EXECUTABLE_MASK			0x4ull
 #define VMX_EPT_IPAT_BIT    			(1ull << 6)
-#define VMX_EPT_ACCESS_BIT				(1ull << 8)
-#define VMX_EPT_DIRTY_BIT				(1ull << 9)
+#define VMX_EPT_ACCESS_BIT			(1ull << 8)
+#define VMX_EPT_DIRTY_BIT			(1ull << 9)
+
+/* The mask to use to trigger an EPT Misconfiguration in order to track MMIO */
+#define VMX_EPT_MISCONFIG_WX_VALUE           (VMX_EPT_WRITABLE_MASK |       \
+                                              VMX_EPT_EXECUTABLE_MASK)
+
 
 #define VMX_EPT_IDENTITY_PAGETABLE_ADDR		0xfffbc000ul
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index eb6b589..6a01e755 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4959,10 +4959,10 @@ static void ept_set_mmio_spte_mask(void)
 	/*
 	 * EPT Misconfigurations can be generated if the value of bits 2:0
 	 * of an EPT paging-structure entry is 110b (write/execute).
-	 * Also, magic bits (0x3ull << 62) is set to quickly identify mmio
-	 * spte.
+	 * Also, special bit (62) is set to quickly identify mmio spte.
 	 */
-	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
+	kvm_mmu_set_mmio_spte_mask(SPTE_SPECIAL_MASK |
+				   VMX_EPT_MISCONFIG_WX_VALUE);
 }
 
 #define VMX_XSS_EXIT_BITMAP 0
-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-07  0:46 ` [PATCH v3 0/8] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
                     ` (5 preceding siblings ...)
  2016-12-07  0:46   ` [PATCH v3 6/8] kvm: x86: mmu: Do not use bit 63 for tracking special SPTEs Junaid Shahid
@ 2016-12-07  0:46   ` Junaid Shahid
  2016-12-14 16:28     ` Paolo Bonzini
  2016-12-16 13:04     ` Xiao Guangrong
  2016-12-07  0:46   ` [PATCH v3 8/8] kvm: x86: mmu: Update documentation for fast page fault mechanism Junaid Shahid
  7 siblings, 2 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-12-07  0:46 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

This change implements lockless access tracking for Intel CPUs without EPT
A bits. This is achieved by marking the PTEs as not-present (but not
completely clearing them) when clear_flush_young() is called after marking
the pages as accessed. When an EPT Violation is generated as a result of
the VM accessing those pages, the PTEs are restored to their original values.
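
To make the PTE transformation concrete, here is a rough sketch of the
round trip using the masks introduced below. The helper names are
illustrative; the real logic lives in mark_spte_for_access_track() and
fast_pf_fix_direct_spte().

/*
 * Marking (sketch): save the R/X bits into the ignored "saved bits" area,
 * clear RWX plus the special bit, then tag the PTE with the special value.
 */
static u64 mark_for_access_track_sketch(u64 spte)
{
	spte |= (spte & shadow_acc_track_saved_bits_mask) <<
		shadow_acc_track_saved_bits_shift;
	spte &= ~shadow_acc_track_mask;
	spte |= shadow_acc_track_value;		/* SPTE_SPECIAL_MASK */
	return spte;
}

/*
 * Restoring (sketch, done locklessly via cmpxchg in the fast page fault
 * path): move the saved R/X bits back into place and clear the tag. The W
 * bit is not saved; it is re-added only on a write fault, through the
 * existing dirty-tracking path.
 */
static u64 restore_from_access_track_sketch(u64 spte)
{
	u64 saved = (spte >> shadow_acc_track_saved_bits_shift) &
		    shadow_acc_track_saved_bits_mask;

	spte &= ~shadow_acc_track_mask;
	spte &= ~(shadow_acc_track_saved_bits_mask <<
		  shadow_acc_track_saved_bits_shift);
	return spte | saved;
}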

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 arch/x86/include/asm/kvm_host.h |   3 +-
 arch/x86/include/asm/vmx.h      |   9 +-
 arch/x86/kvm/mmu.c              | 274 +++++++++++++++++++++++++++++++---------
 arch/x86/kvm/vmx.c              |  26 ++--
 arch/x86/kvm/x86.c              |   2 +-
 5 files changed, 237 insertions(+), 77 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5a10eb7..da1d4b9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1056,7 +1056,8 @@ void kvm_mmu_setup(struct kvm_vcpu *vcpu);
 void kvm_mmu_init_vm(struct kvm *kvm);
 void kvm_mmu_uninit_vm(struct kvm *kvm);
 void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
-		u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask);
+		u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask,
+		u64 acc_track_mask);
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 45ee6d9..9d228a8 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -438,11 +438,14 @@ enum vmcs_field {
 #define VMX_EPT_IPAT_BIT    			(1ull << 6)
 #define VMX_EPT_ACCESS_BIT			(1ull << 8)
 #define VMX_EPT_DIRTY_BIT			(1ull << 9)
+#define VMX_EPT_RWX_MASK                        (VMX_EPT_READABLE_MASK |       \
+						 VMX_EPT_WRITABLE_MASK |       \
+						 VMX_EPT_EXECUTABLE_MASK)
+#define VMX_EPT_MT_MASK				(7ull << VMX_EPT_MT_EPTE_SHIFT)
 
 /* The mask to use to trigger an EPT Misconfiguration in order to track MMIO */
-#define VMX_EPT_MISCONFIG_WX_VALUE           (VMX_EPT_WRITABLE_MASK |       \
-                                              VMX_EPT_EXECUTABLE_MASK)
-
+#define VMX_EPT_MISCONFIG_WX_VALUE		(VMX_EPT_WRITABLE_MASK |       \
+						 VMX_EPT_EXECUTABLE_MASK)
 
 #define VMX_EPT_IDENTITY_PAGETABLE_ADDR		0xfffbc000ul
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3f66fd3..6ba6220 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -37,6 +37,7 @@
 #include <linux/srcu.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
+#include <linux/kern_levels.h>
 
 #include <asm/page.h>
 #include <asm/cmpxchg.h>
@@ -129,6 +130,10 @@ module_param(dbg, bool, 0644);
 #define ACC_USER_MASK    PT_USER_MASK
 #define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
 
+/* The mask for the R/X bits in EPT PTEs */
+#define PT64_EPT_READABLE_MASK			0x1ull
+#define PT64_EPT_EXECUTABLE_MASK		0x4ull
+
 #include <trace/events/kvm.h>
 
 #define CREATE_TRACE_POINTS
@@ -178,6 +183,25 @@ static u64 __read_mostly shadow_dirty_mask;
 static u64 __read_mostly shadow_mmio_mask;
 static u64 __read_mostly shadow_present_mask;
 
+/*
+ * The mask/value to distinguish a PTE that has been marked not-present for
+ * access tracking purposes.
+ * The mask would be either 0 if access tracking is disabled, or
+ * SPTE_SPECIAL_MASK|VMX_EPT_RWX_MASK if access tracking is enabled.
+ */
+static u64 __read_mostly shadow_acc_track_mask;
+static const u64 shadow_acc_track_value = SPTE_SPECIAL_MASK;
+
+/*
+ * The mask/shift to use for saving the original R/X bits when marking the PTE
+ * as not-present for access tracking purposes. We do not save the W bit as the
+ * PTEs being access tracked also need to be dirty tracked, so the W bit will be
+ * restored only when a write is attempted to the page.
+ */
+static const u64 shadow_acc_track_saved_bits_mask = PT64_EPT_READABLE_MASK |
+						    PT64_EPT_EXECUTABLE_MASK;
+static const u64 shadow_acc_track_saved_bits_shift = PT64_SECOND_AVAIL_BITS_SHIFT;
+
 static void mmu_spte_set(u64 *sptep, u64 spte);
 static void mmu_free_roots(struct kvm_vcpu *vcpu);
 
@@ -187,6 +211,12 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
 
+static inline bool is_access_track_spte(u64 spte)
+{
+	return shadow_acc_track_mask != 0 &&
+	       (spte & shadow_acc_track_mask) == shadow_acc_track_value;
+}
+
 /*
  * the low bit of the generation number is always presumed to be zero.
  * This disables mmio caching during memslot updates.  The concept is
@@ -284,7 +314,8 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte)
 }
 
 void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
-		u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask)
+		u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask,
+		u64 acc_track_mask)
 {
 	shadow_user_mask = user_mask;
 	shadow_accessed_mask = accessed_mask;
@@ -292,9 +323,23 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 	shadow_nx_mask = nx_mask;
 	shadow_x_mask = x_mask;
 	shadow_present_mask = p_mask;
+	shadow_acc_track_mask = acc_track_mask;
+	WARN_ON(shadow_accessed_mask != 0 && shadow_acc_track_mask != 0);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
 
+void kvm_mmu_clear_all_pte_masks(void)
+{
+	shadow_user_mask = 0;
+	shadow_accessed_mask = 0;
+	shadow_dirty_mask = 0;
+	shadow_nx_mask = 0;
+	shadow_x_mask = 0;
+	shadow_mmio_mask = 0;
+	shadow_present_mask = 0;
+	shadow_acc_track_mask = 0;
+}
+
 static int is_cpuid_PSE36(void)
 {
 	return 1;
@@ -307,7 +352,7 @@ static int is_nx(struct kvm_vcpu *vcpu)
 
 static int is_shadow_present_pte(u64 pte)
 {
-	return (pte & 0xFFFFFFFFull) && !is_mmio_spte(pte);
+	return (pte != 0) && !is_mmio_spte(pte);
 }
 
 static int is_large_pte(u64 pte)
@@ -490,23 +535,20 @@ static bool spte_has_volatile_bits(u64 spte)
 	if (spte_can_locklessly_be_made_writable(spte))
 		return true;
 
-	if (!shadow_accessed_mask)
-		return false;
-
 	if (!is_shadow_present_pte(spte))
 		return false;
 
-	if ((spte & shadow_accessed_mask) &&
-	      (!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
-		return false;
+	if (!shadow_accessed_mask)
+		return is_access_track_spte(spte);
 
-	return true;
+	return (spte & shadow_accessed_mask) == 0 ||
+		(is_writable_pte(spte) && (spte & shadow_dirty_mask) == 0);
 }
 
 static bool is_accessed_spte(u64 spte)
 {
 	return shadow_accessed_mask ? spte & shadow_accessed_mask
-				    : true;
+				    : !is_access_track_spte(spte);
 }
 
 static bool is_dirty_spte(u64 spte)
@@ -650,6 +692,65 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
 	return __get_spte_lockless(sptep);
 }
 
+static u64 mark_spte_for_access_track(u64 spte)
+{
+	if (shadow_accessed_mask != 0)
+		return spte & ~shadow_accessed_mask;
+
+	if (shadow_acc_track_mask == 0 || is_access_track_spte(spte))
+		return spte;
+
+	/*
+	 * Verify that the write-protection that we do below will be fixable
+	 * via the fast page fault path. Currently, that is always the case, at
+	 * least when using EPT (which is when access tracking would be used).
+	 */
+	WARN_ONCE((spte & PT_WRITABLE_MASK) &&
+		  !spte_can_locklessly_be_made_writable(spte),
+		  "kvm: Writable SPTE is not locklessly dirty-trackable\n");
+
+	WARN_ONCE(spte & (shadow_acc_track_saved_bits_mask <<
+			  shadow_acc_track_saved_bits_shift),
+		  "kvm: Access Tracking saved bit locations are not zero\n");
+
+	spte |= (spte & shadow_acc_track_saved_bits_mask) <<
+		shadow_acc_track_saved_bits_shift;
+	spte &= ~shadow_acc_track_mask;
+	spte |= shadow_acc_track_value;
+
+	return spte;
+}
+
+/* Returns the Accessed status of the PTE and resets it at the same time. */
+static bool mmu_spte_age(u64 *sptep)
+{
+	u64 spte = mmu_spte_get_lockless(sptep);
+
+	if (spte & shadow_accessed_mask) {
+		clear_bit((ffs(shadow_accessed_mask) - 1),
+			  (unsigned long *)sptep);
+		return true;
+	}
+
+	if (shadow_accessed_mask == 0) {
+		if (is_access_track_spte(spte))
+			return false;
+
+		/*
+		 * Capture the dirty status of the page, so that it doesn't get
+		 * lost when the SPTE is marked for access tracking.
+		 */
+		if (is_writable_pte(spte))
+			kvm_set_pfn_dirty(spte_to_pfn(spte));
+
+		spte = mark_spte_for_access_track(spte);
+		mmu_spte_update_no_track(sptep, spte);
+		return true;
+	}
+
+	return false;
+}
+
 static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
 {
 	/*
@@ -1434,7 +1535,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 restart:
 	for_each_rmap_spte(rmap_head, &iter, sptep) {
 		rmap_printk("kvm_set_pte_rmapp: spte %p %llx gfn %llx (%d)\n",
-			     sptep, *sptep, gfn, level);
+			    sptep, *sptep, gfn, level);
 
 		need_flush = 1;
 
@@ -1447,7 +1548,8 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 			new_spte &= ~PT_WRITABLE_MASK;
 			new_spte &= ~SPTE_HOST_WRITEABLE;
-			new_spte &= ~shadow_accessed_mask;
+
+			new_spte = mark_spte_for_access_track(new_spte);
 
 			mmu_spte_clear_track_bits(sptep);
 			mmu_spte_set(sptep, new_spte);
@@ -1609,15 +1711,8 @@ static int kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	struct rmap_iterator uninitialized_var(iter);
 	int young = 0;
 
-	BUG_ON(!shadow_accessed_mask);
-
-	for_each_rmap_spte(rmap_head, &iter, sptep) {
-		if (*sptep & shadow_accessed_mask) {
-			young = 1;
-			clear_bit((ffs(shadow_accessed_mask) - 1),
-				 (unsigned long *)sptep);
-		}
-	}
+	for_each_rmap_spte(rmap_head, &iter, sptep)
+		young |= mmu_spte_age(sptep);
 
 	trace_kvm_age_page(gfn, level, slot, young);
 	return young;
@@ -1631,11 +1726,11 @@ static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	struct rmap_iterator iter;
 
 	/*
-	 * If there's no access bit in the secondary pte set by the
-	 * hardware it's up to gup-fast/gup to set the access bit in
-	 * the primary pte or in the page structure.
+	 * If there's no access bit in the secondary pte set by the hardware and
+	 * fast access tracking is also not enabled, it's up to gup-fast/gup to
+	 * set the access bit in the primary pte or in the page structure.
 	 */
-	if (!shadow_accessed_mask)
+	if (!shadow_accessed_mask && !shadow_acc_track_mask)
 		goto out;
 
 	for_each_rmap_spte(rmap_head, &iter, sptep)
@@ -1670,7 +1765,7 @@ int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 	 * This has some overhead, but not as much as the cost of swapping
 	 * out actively used pages or breaking up actively used hugepages.
 	 */
-	if (!shadow_accessed_mask)
+	if (!shadow_accessed_mask && !shadow_acc_track_mask)
 		return kvm_handle_hva_range(kvm, start, end, 0,
 					    kvm_unmap_rmapp);
 
@@ -2593,6 +2688,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		spte |= shadow_dirty_mask;
 	}
 
+	if (speculative)
+		spte = mark_spte_for_access_track(spte);
+
 set_pte:
 	if (mmu_spte_update(sptep, spte))
 		kvm_flush_remote_tlbs(vcpu->kvm);
@@ -2646,7 +2744,7 @@ static bool mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned pte_access,
 	pgprintk("%s: setting spte %llx\n", __func__, *sptep);
 	pgprintk("instantiating %s PTE (%s) at %llx (%llx) addr %p\n",
 		 is_large_pte(*sptep)? "2MB" : "4kB",
-		 *sptep & PT_PRESENT_MASK ?"RW":"R", gfn,
+		 *sptep & PT_WRITABLE_MASK ? "RW" : "R", gfn,
 		 *sptep, sptep);
 	if (!was_rmapped && is_large_pte(*sptep))
 		++vcpu->kvm->stat.lpages;
@@ -2879,16 +2977,28 @@ static bool page_fault_can_be_fast(u32 error_code)
 	if (unlikely(error_code & PFERR_RSVD_MASK))
 		return false;
 
-	/*
-	 * #PF can be fast only if the shadow page table is present and it
-	 * is caused by write-protect, that means we just need change the
-	 * W bit of the spte which can be done out of mmu-lock.
-	 */
-	if (!(error_code & PFERR_PRESENT_MASK) ||
-	      !(error_code & PFERR_WRITE_MASK))
+	/* See if the page fault is due to an NX violation */
+	if (unlikely(((error_code & (PFERR_FETCH_MASK | PFERR_PRESENT_MASK))
+		      == (PFERR_FETCH_MASK | PFERR_PRESENT_MASK))))
 		return false;
 
-	return true;
+	/*
+	 * #PF can be fast if:
+	 * 1. The shadow page table entry is not present, which could mean that
+	 *    the fault is potentially caused by access tracking (if enabled).
+	 * 2. The shadow page table entry is present and the fault
+	 *    is caused by write-protect, that means we just need change the W
+	 *    bit of the spte which can be done out of mmu-lock.
+	 *
+	 * However, if access tracking is disabled we know that a non-present
+	 * page must be a genuine page fault where we have to create a new SPTE.
+	 * So, if access tracking is disabled, we return true only for write
+	 * accesses to a present page.
+	 */
+
+	return shadow_acc_track_mask != 0 ||
+	       ((error_code & (PFERR_WRITE_MASK | PFERR_PRESENT_MASK))
+		== (PFERR_WRITE_MASK | PFERR_PRESENT_MASK));
 }
 
 /*
@@ -2897,17 +3007,26 @@ static bool page_fault_can_be_fast(u32 error_code)
  */
 static bool
 fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
-			u64 *sptep, u64 spte)
+			u64 *sptep, u64 old_spte,
+			bool remove_write_prot, bool remove_acc_track)
 {
 	gfn_t gfn;
+	u64 new_spte = old_spte;
 
 	WARN_ON(!sp->role.direct);
 
-	/*
-	 * The gfn of direct spte is stable since it is calculated
-	 * by sp->gfn.
-	 */
-	gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
+	if (remove_acc_track) {
+		u64 saved_bits = (old_spte >> shadow_acc_track_saved_bits_shift)
+				 & shadow_acc_track_saved_bits_mask;
+
+		new_spte &= ~shadow_acc_track_mask;
+		new_spte &= ~(shadow_acc_track_saved_bits_mask <<
+			      shadow_acc_track_saved_bits_shift);
+		new_spte |= saved_bits;
+	}
+
+	if (remove_write_prot)
+		new_spte |= PT_WRITABLE_MASK;
 
 	/*
 	 * Theoretically we could also set dirty bit (and flush TLB) here in
@@ -2921,10 +3040,17 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	 *
 	 * Compare with set_spte where instead shadow_dirty_mask is set.
 	 */
-	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) != spte)
+	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
 		return false;
 
-	kvm_vcpu_mark_page_dirty(vcpu, gfn);
+	if (remove_write_prot) {
+		/*
+		 * The gfn of direct spte is stable since it is
+		 * calculated by sp->gfn.
+		 */
+		gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
+		kvm_vcpu_mark_page_dirty(vcpu, gfn);
+	}
 
 	return true;
 }
@@ -2955,35 +3081,55 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 			break;
 
 	do {
-		/*
-		 * If the mapping has been changed, let the vcpu fault on the
-		 * same address again.
-		 */
-		if (!is_shadow_present_pte(spte)) {
-			fault_handled = true;
-			break;
-		}
+		bool remove_write_prot = false;
+		bool remove_acc_track;
 
 		sp = page_header(__pa(iterator.sptep));
 		if (!is_last_spte(spte, sp->role.level))
 			break;
 
 		/*
-		 * Check if it is a spurious fault caused by TLB lazily flushed.
+		 * Check whether the memory access that caused the fault would
+		 * still cause it if it were to be performed right now. If not,
+		 * then this is a spurious fault caused by TLB lazily flushed,
+		 * or some other CPU has already fixed the PTE after the
+		 * current CPU took the fault.
 		 *
 		 * Need not check the access of upper level table entries since
 		 * they are always ACC_ALL.
 		 */
-		if (is_writable_pte(spte)) {
-			fault_handled = true;
-			break;
+
+		if (error_code & PFERR_FETCH_MASK) {
+			if ((spte & (shadow_x_mask | shadow_nx_mask))
+			    == shadow_x_mask) {
+				fault_handled = true;
+				break;
+			}
+		} else if (error_code & PFERR_WRITE_MASK) {
+			if (is_writable_pte(spte)) {
+				fault_handled = true;
+				break;
+			}
+
+			/*
+			 * Currently, to simplify the code, write-protection can
+			 * be removed in the fast path only if the SPTE was
+			 * write-protected for dirty-logging.
+			 */
+			remove_write_prot =
+				spte_can_locklessly_be_made_writable(spte);
+		} else {
+			/* Fault was on Read access */
+			if (spte & PT_PRESENT_MASK) {
+				fault_handled = true;
+				break;
+			}
 		}
 
-		/*
-		 * Currently, to simplify the code, only the spte
-		 * write-protected by dirty-log can be fast fixed.
-		 */
-		if (!spte_can_locklessly_be_made_writable(spte))
+		remove_acc_track = is_access_track_spte(spte);
+
+		/* Verify that the fault can be handled in the fast path */
+		if (!remove_acc_track && !remove_write_prot)
 			break;
 
 		/*
@@ -2997,7 +3143,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 		 *
 		 * See the comments in kvm_arch_commit_memory_region().
 		 */
-		if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+		if (sp->role.level > PT_PAGE_TABLE_LEVEL && remove_write_prot)
 			break;
 
 		/*
@@ -3006,7 +3152,9 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 		 * Documentation/virtual/kvm/locking.txt to get more detail.
 		 */
 		fault_handled = fast_pf_fix_direct_spte(vcpu, sp,
-							iterator.sptep, spte);
+							iterator.sptep, spte,
+							remove_write_prot,
+							remove_acc_track);
 		if (fault_handled)
 			break;
 
@@ -5095,6 +5243,8 @@ static void mmu_destroy_caches(void)
 
 int kvm_mmu_module_init(void)
 {
+	kvm_mmu_clear_all_pte_masks();
+
 	pte_list_desc_cache = kmem_cache_create("pte_list_desc",
 					    sizeof(struct pte_list_desc),
 					    0, 0, NULL);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6a01e755..50fc078 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6318,6 +6318,19 @@ static void wakeup_handler(void)
 	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
 }
 
+void vmx_enable_tdp(void)
+{
+	kvm_mmu_set_mask_ptes(VMX_EPT_READABLE_MASK,
+		enable_ept_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull,
+		enable_ept_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull,
+		0ull, VMX_EPT_EXECUTABLE_MASK,
+		cpu_has_vmx_ept_execute_only() ? 0ull : VMX_EPT_READABLE_MASK,
+		enable_ept_ad_bits ? 0ull : SPTE_SPECIAL_MASK | VMX_EPT_RWX_MASK);
+
+	ept_set_mmio_spte_mask();
+	kvm_enable_tdp();
+}
+
 static __init int hardware_setup(void)
 {
 	int r = -ENOMEM, i, msr;
@@ -6443,16 +6456,9 @@ static __init int hardware_setup(void)
 	/* SELF-IPI */
 	vmx_disable_intercept_msr_x2apic(0x83f, MSR_TYPE_W, true);
 
-	if (enable_ept) {
-		kvm_mmu_set_mask_ptes(VMX_EPT_READABLE_MASK,
-			(enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
-			(enable_ept_ad_bits) ? VMX_EPT_DIRTY_BIT : 0ull,
-			0ull, VMX_EPT_EXECUTABLE_MASK,
-			cpu_has_vmx_ept_execute_only() ?
-				      0ull : VMX_EPT_READABLE_MASK);
-		ept_set_mmio_spte_mask();
-		kvm_enable_tdp();
-	} else
+	if (enable_ept)
+		vmx_enable_tdp();
+	else
 		kvm_disable_tdp();
 
 	update_ple_window_actual_max();
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ec59301..c15418d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5955,7 +5955,7 @@ int kvm_arch_init(void *opaque)
 
 	kvm_mmu_set_mask_ptes(PT_USER_MASK, PT_ACCESSED_MASK,
 			PT_DIRTY_MASK, PT64_NX_MASK, 0,
-			PT_PRESENT_MASK);
+			PT_PRESENT_MASK, 0);
 	kvm_timer_init();
 
 	perf_register_guest_info_callbacks(&kvm_guest_cbs);
-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v3 8/8] kvm: x86: mmu: Update documentation for fast page fault mechanism
  2016-12-07  0:46 ` [PATCH v3 0/8] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
                     ` (6 preceding siblings ...)
  2016-12-07  0:46   ` [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
@ 2016-12-07  0:46   ` Junaid Shahid
  7 siblings, 0 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-12-07  0:46 UTC (permalink / raw)
  To: kvm; +Cc: andreslc, pfeiner, pbonzini, guangrong.xiao

Add a brief description of the lockless access tracking mechanism
to the documentation of fast page faults in locking.txt.

Signed-off-by: Junaid Shahid <junaids@google.com>
---
 Documentation/virtual/kvm/locking.txt | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt
index e5dd9f4..51c435a 100644
--- a/Documentation/virtual/kvm/locking.txt
+++ b/Documentation/virtual/kvm/locking.txt
@@ -22,9 +22,16 @@ else is a leaf: no other lock is taken inside the critical sections.
 Fast page fault:
 
 Fast page fault is the fast path which fixes the guest page fault out of
-the mmu-lock on x86. Currently, the page fault can be fast only if the
-shadow page table is present and it is caused by write-protect, that means
-we just need change the W bit of the spte.
+the mmu-lock on x86. Currently, the page fault can be fast in one of the
+following two cases:
+
+1. Access Tracking: The SPTE is not present, but it is marked for access
+tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to
+restore the saved R/X bits. This is described in more detail later below.
+
+2. Write-Protection: The SPTE is present and the fault is
+caused by write-protect. That means we just need to change the W bit of the 
+spte.
 
 What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and
 SPTE_MMU_WRITEABLE bit on the spte:
@@ -34,7 +41,8 @@ SPTE_MMU_WRITEABLE bit on the spte:
   page write-protection.
 
 On fast page fault path, we will use cmpxchg to atomically set the spte W
-bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, this
+bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or 
+restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
 is safe because whenever changing these bits can be detected by cmpxchg.
 
 But we need carefully check these cases:
@@ -138,6 +146,21 @@ Since the spte is "volatile" if it can be updated out of mmu-lock, we always
 atomically update the spte, the race caused by fast page fault can be avoided,
 See the comments in spte_has_volatile_bits() and mmu_spte_update().
 
+Lockless Access Tracking:
+
+This is used for Intel CPUs that are using EPT but do not support the EPT A/D
+bits. In this case, when the KVM MMU notifier is called to track accesses to a
+page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
+by clearing the RWX bits in the PTE and storing the original R & X bits in
+some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
+PTE (using the ignored bit 62). When the VM tries to access the page later on,
+a fault is generated and the fast page fault mechanism described above is used
+to atomically restore the PTE to a Present state. The W bit is not saved when
+the PTE is marked for access tracking and during restoration to the Present
+state, the W bit is set depending on whether or not it was a write access. If
+it wasn't, then the W bit will remain clear until a write access happens, at 
+which time it will be set using the Dirty tracking mechanism described above.
+
 3. Reference
 ------------
 
-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-07  0:46   ` [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
@ 2016-12-14 16:28     ` Paolo Bonzini
  2016-12-14 22:36       ` Junaid Shahid
  2016-12-16 13:04     ` Xiao Guangrong
  1 sibling, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2016-12-14 16:28 UTC (permalink / raw)
  To: Junaid Shahid, kvm; +Cc: andreslc, pfeiner, guangrong.xiao



On 07/12/2016 01:46, Junaid Shahid wrote:
> This change implements lockless access tracking for Intel CPUs without EPT
> A bits. This is achieved by marking the PTEs as not-present (but not
> completely clearing them) when clear_flush_young() is called after marking
> the pages as accessed. When an EPT Violation is generated as a result of
> the VM accessing those pages, the PTEs are restored to their original values.
> 
> Signed-off-by: Junaid Shahid <junaids@google.com>

Just a few changes bordering on aesthetics...

Please review and let me know if you agree:

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6ba62200530a..6b5d8ff66026 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -213,8 +213,8 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
 
 static inline bool is_access_track_spte(u64 spte)
 {
-	return shadow_acc_track_mask != 0 &&
-	       (spte & shadow_acc_track_mask) == shadow_acc_track_value;
+	/* Always false if shadow_acc_track_mask is zero.  */
+	return (spte & shadow_acc_track_mask) == shadow_acc_track_value;
 }
 
 /*
@@ -526,23 +526,24 @@ static bool spte_can_locklessly_be_made_writable(u64 spte)
 
 static bool spte_has_volatile_bits(u64 spte)
 {
+	if (is_shadow_present_pte(spte))
+		return false;
+
 	/*
 	 * Always atomically update spte if it can be updated
 	 * out of mmu-lock, it can ensure dirty bit is not lost,
 	 * also, it can help us to get a stable is_writable_pte()
 	 * to ensure tlb flush is not missed.
 	 */
-	if (spte_can_locklessly_be_made_writable(spte))
+	if (spte_can_locklessly_be_made_writable(spte) ||
+	    is_access_track_spte(spte))
 		return true;
 
-	if (!is_shadow_present_pte(spte))
-		return false;
-
-	if (!shadow_accessed_mask)
-		return is_access_track_spte(spte);
+	if ((spte & shadow_accessed_mask) == 0 ||
+     	    (is_writable_pte(spte) && (spte & shadow_dirty_mask) == 0))
+		return true;
 
-	return (spte & shadow_accessed_mask) == 0 ||
-		(is_writable_pte(spte) && (spte & shadow_dirty_mask) == 0);
+	return false;
 }
 
 static bool is_accessed_spte(u64 spte)
@@ -726,16 +727,13 @@ static bool mmu_spte_age(u64 *sptep)
 {
 	u64 spte = mmu_spte_get_lockless(sptep);
 
-	if (spte & shadow_accessed_mask) {
+	if (!is_accessed_spte(spte))
+		return false;
+
+	if (shadow_accessed_mask) {
 		clear_bit((ffs(shadow_accessed_mask) - 1),
 			  (unsigned long *)sptep);
-		return true;
-	}
-
-	if (shadow_accessed_mask == 0) {
-		if (is_access_track_spte(spte))
-			return false;
-
+	} else {
 		/*
 		 * Capture the dirty status of the page, so that it doesn't get
 		 * lost when the SPTE is marked for access tracking.
@@ -745,10 +743,9 @@ static bool mmu_spte_age(u64 *sptep)
 
 		spte = mark_spte_for_access_track(spte);
 		mmu_spte_update_no_track(sptep, spte);
-		return true;
 	}
 
-	return false;
+	return true;
 }
 
 static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)


Thanks,

Paolo

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-14 16:28     ` Paolo Bonzini
@ 2016-12-14 22:36       ` Junaid Shahid
  2016-12-14 23:35         ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-12-14 22:36 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, andreslc, pfeiner, guangrong.xiao

On Wednesday, December 14, 2016 05:28:48 PM Paolo Bonzini wrote:
> 
> Just a few changes bordering on aesthetics...
> 
> Please review and let me know if you agree:
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 6ba62200530a..6b5d8ff66026 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -213,8 +213,8 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
>  
>  static inline bool is_access_track_spte(u64 spte)
>  {
> -	return shadow_acc_track_mask != 0 &&
> -	       (spte & shadow_acc_track_mask) == shadow_acc_track_value;
> +	/* Always false if shadow_acc_track_mask is zero.  */
> +	return (spte & shadow_acc_track_mask) == shadow_acc_track_value;
>  }

This looks good.

>  
>  /*
> @@ -526,23 +526,24 @@ static bool spte_can_locklessly_be_made_writable(u64 spte)
>  
>  static bool spte_has_volatile_bits(u64 spte)
>  {
> +	if (is_shadow_present_pte(spte))
> +		return false;
> +

This should be !is_shadow_present_pte

>  	/*
>  	 * Always atomically update spte if it can be updated
>  	 * out of mmu-lock, it can ensure dirty bit is not lost,
>  	 * also, it can help us to get a stable is_writable_pte()
>  	 * to ensure tlb flush is not missed.
>  	 */
> -	if (spte_can_locklessly_be_made_writable(spte))
> +	if (spte_can_locklessly_be_made_writable(spte) ||
> +	    is_access_track_spte(spte))
>  		return true;
>  
> -	if (!is_shadow_present_pte(spte))
> -		return false;
> -
> -	if (!shadow_accessed_mask)
> -		return is_access_track_spte(spte);
> +	if ((spte & shadow_accessed_mask) == 0 ||
> +     	    (is_writable_pte(spte) && (spte & shadow_dirty_mask) == 0))
> +		return true;

We also need a shadow_accessed_mask != 0 check here, otherwise it will always return true when shadow_accessed_mask is 0.

>  
> -	return (spte & shadow_accessed_mask) == 0 ||
> -		(is_writable_pte(spte) && (spte & shadow_dirty_mask) == 0);
> +	return false;
>  }
>  
>  static bool is_accessed_spte(u64 spte)
> @@ -726,16 +727,13 @@ static bool mmu_spte_age(u64 *sptep)
>  {
>  	u64 spte = mmu_spte_get_lockless(sptep);
>  
> -	if (spte & shadow_accessed_mask) {
> +	if (!is_accessed_spte(spte))
> +		return false;
> +
> +	if (shadow_accessed_mask) {
>  		clear_bit((ffs(shadow_accessed_mask) - 1),
>  			  (unsigned long *)sptep);
> -		return true;
> -	}
> -
> -	if (shadow_accessed_mask == 0) {
> -		if (is_access_track_spte(spte))
> -			return false;
> -
> +	} else {
>  		/*
>  		 * Capture the dirty status of the page, so that it doesn't get
>  		 * lost when the SPTE is marked for access tracking.
> @@ -745,10 +743,9 @@ static bool mmu_spte_age(u64 *sptep)
>  
>  		spte = mark_spte_for_access_track(spte);
>  		mmu_spte_update_no_track(sptep, spte);
> -		return true;
>  	}
>  
> -	return false;
> +	return true;
>  }
>  

This looks good as well.

Thanks,
Junaid
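
Folding the two corrections above into the proposed restructuring gives
roughly the following shape (a sketch of where the discussion seems to be
headed, not necessarily the final committed code):

static bool spte_has_volatile_bits(u64 spte)
{
	if (!is_shadow_present_pte(spte))
		return false;

	/*
	 * Always atomically update the spte if it can be updated out of
	 * mmu-lock, so the dirty bit is not lost and is_writable_pte()
	 * stays stable for the TLB-flush decision.
	 */
	if (spte_can_locklessly_be_made_writable(spte) ||
	    is_access_track_spte(spte))
		return true;

	if (shadow_accessed_mask) {
		if ((spte & shadow_accessed_mask) == 0 ||
		    (is_writable_pte(spte) && (spte & shadow_dirty_mask) == 0))
			return true;
	}

	return false;
}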



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-14 22:36       ` Junaid Shahid
@ 2016-12-14 23:35         ` Paolo Bonzini
  0 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2016-12-14 23:35 UTC (permalink / raw)
  To: Junaid Shahid; +Cc: kvm, andreslc, pfeiner, guangrong xiao



----- Original Message -----
> From: "Junaid Shahid" <junaids@google.com>
> To: "Paolo Bonzini" <pbonzini@redhat.com>
> Cc: kvm@vger.kernel.org, andreslc@google.com, pfeiner@google.com, "guangrong xiao" <guangrong.xiao@linux.intel.com>
> Sent: Wednesday, December 14, 2016 11:36:37 PM
> Subject: Re: [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
> 
> On Wednesday, December 14, 2016 05:28:48 PM Paolo Bonzini wrote:
> > 
> > Just a few changes bordering on aesthetics...
> > 
> > Please review and let me know if you agree:
> > 
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 6ba62200530a..6b5d8ff66026 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -213,8 +213,8 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
> >  
> >  static inline bool is_access_track_spte(u64 spte)
> >  {
> > -	return shadow_acc_track_mask != 0 &&
> > -	       (spte & shadow_acc_track_mask) == shadow_acc_track_value;
> > +	/* Always false if shadow_acc_track_mask is zero.  */
> > +	return (spte & shadow_acc_track_mask) == shadow_acc_track_value;
> >  }
> 
> This looks good.
> 
> >  
> >  /*
> > @@ -526,23 +526,24 @@ static bool spte_can_locklessly_be_made_writable(u64
> > spte)
> >  
> >  static bool spte_has_volatile_bits(u64 spte)
> >  {
> > +	if (is_shadow_present_pte(spte))
> > +		return false;
> > +
> 
> This should be !is_shadow_present_pte

Caught me sending the wrong patch. :(
 
> >  	/*
> >  	 * Always atomically update spte if it can be updated
> >  	 * out of mmu-lock, it can ensure dirty bit is not lost,
> >  	 * also, it can help us to get a stable is_writable_pte()
> >  	 * to ensure tlb flush is not missed.
> >  	 */
> > -	if (spte_can_locklessly_be_made_writable(spte))
> > +	if (spte_can_locklessly_be_made_writable(spte) ||
> > +	    is_access_track_spte(spte))
> >  		return true;
> >  
> > -	if (!is_shadow_present_pte(spte))
> > -		return false;
> > -
> > -	if (!shadow_accessed_mask)
> > -		return is_access_track_spte(spte);
> > +	if ((spte & shadow_accessed_mask) == 0 ||
> > +     	    (is_writable_pte(spte) && (spte & shadow_dirty_mask) == 0))
> > +		return true;
> 
> We also need a shadow_accessed_mask != 0 check here, otherwise it will always
> return true when shadow_accessed_mask is 0.

... but this is not the wrong patch, it's a genuine mistake.  Thanks!

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 1/8] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications
  2016-12-07  0:46   ` [PATCH v3 1/8] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
@ 2016-12-15  6:50     ` Xiao Guangrong
  2016-12-15 23:06       ` Junaid Shahid
  0 siblings, 1 reply; 56+ messages in thread
From: Xiao Guangrong @ 2016-12-15  6:50 UTC (permalink / raw)
  To: Junaid Shahid, kvm; +Cc: andreslc, pfeiner, pbonzini



On 12/07/2016 08:46 AM, Junaid Shahid wrote:
> This change adds some symbolic constants for VM Exit Qualifications
> related to EPT Violations and updates handle_ept_violation() to use
> these constants instead of hard-coded numbers.
>
> Signed-off-by: Junaid Shahid <junaids@google.com>
> ---
>  arch/x86/include/asm/vmx.h | 16 ++++++++++++++++
>  arch/x86/kvm/vmx.c         | 22 ++++++++++++++--------
>  2 files changed, 30 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 20e5e31..659e402 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -469,6 +469,22 @@ struct vmx_msr_entry {
>  #define ENTRY_FAIL_VMCS_LINK_PTR	4
>
>  /*
> + * Exit Qualifications for EPT Violations
> + */
> +#define EPT_VIOLATION_READ_BIT		0
> +#define EPT_VIOLATION_WRITE_BIT		1
> +#define EPT_VIOLATION_INSTR_BIT		2

It would be better if their names were EPT_VIOLATION_ACC_{READ,WRITE,INSTR}_BIT.

Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 2/8] kvm: x86: mmu: Rename spte_is_locklessly_modifiable()
  2016-12-07  0:46   ` [PATCH v3 2/8] kvm: x86: mmu: Rename spte_is_locklessly_modifiable() Junaid Shahid
@ 2016-12-15  6:51     ` Xiao Guangrong
  0 siblings, 0 replies; 56+ messages in thread
From: Xiao Guangrong @ 2016-12-15  6:51 UTC (permalink / raw)
  To: Junaid Shahid, kvm; +Cc: andreslc, pfeiner, pbonzini



On 12/07/2016 08:46 AM, Junaid Shahid wrote:
> This change renames spte_is_locklessly_modifiable() to
> spte_can_locklessly_be_made_writable() to distinguish it from other
> forms of lockless modifications. The full set of lockless modifications
> is covered by spte_has_volatile_bits().
>

Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 3/8] kvm: x86: mmu: Fast Page Fault path retries
  2016-12-07  0:46   ` [PATCH v3 3/8] kvm: x86: mmu: Fast Page Fault path retries Junaid Shahid
@ 2016-12-15  7:20     ` Xiao Guangrong
  2016-12-15 23:36       ` Junaid Shahid
  0 siblings, 1 reply; 56+ messages in thread
From: Xiao Guangrong @ 2016-12-15  7:20 UTC (permalink / raw)
  To: Junaid Shahid, kvm; +Cc: andreslc, pfeiner, pbonzini



On 12/07/2016 08:46 AM, Junaid Shahid wrote:
> This change adds retries into the Fast Page Fault path. Without the
> retries, the code still works, but if a retry does end up being needed,
> then it will result in a second page fault for the same memory access,
> which will cause much more overhead compared to just retrying within the
> original fault.
>
> This would be especially useful with the upcoming fast access tracking
> change, as that would make it more likely for retries to be needed
> (e.g. due to read and write faults happening on different CPUs at
> the same time).
>
> Signed-off-by: Junaid Shahid <junaids@google.com>
> ---
>  arch/x86/kvm/mmu.c | 124 +++++++++++++++++++++++++++++++----------------------
>  1 file changed, 73 insertions(+), 51 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 4d33275..bcf1b95 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2881,6 +2881,10 @@ static bool page_fault_can_be_fast(u32 error_code)
>  	return true;
>  }
>
> +/*
> + * Returns true if the SPTE was fixed successfully. Otherwise,
> + * someone else modified the SPTE from its original value.
> + */
>  static bool
>  fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  			u64 *sptep, u64 spte)
> @@ -2907,8 +2911,10 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	 *
>  	 * Compare with set_spte where instead shadow_dirty_mask is set.
>  	 */
> -	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
> -		kvm_vcpu_mark_page_dirty(vcpu, gfn);
> +	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) != spte)
> +		return false;
> +
> +	kvm_vcpu_mark_page_dirty(vcpu, gfn);
>
>  	return true;
>  }
> @@ -2923,8 +2929,9 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  {
>  	struct kvm_shadow_walk_iterator iterator;
>  	struct kvm_mmu_page *sp;
> -	bool ret = false;
> +	bool fault_handled = false;
>  	u64 spte = 0ull;
> +	uint retry_count = 0;
>
>  	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
>  		return false;
> @@ -2937,62 +2944,77 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  		if (!is_shadow_present_pte(spte) || iterator.level < level)
>  			break;
>
> -	/*
> -	 * If the mapping has been changed, let the vcpu fault on the
> -	 * same address again.
> -	 */
> -	if (!is_shadow_present_pte(spte)) {
> -		ret = true;
> -		goto exit;
> -	}
> +	do {
> +		/*
> +		 * If the mapping has been changed, let the vcpu fault on the
> +		 * same address again.
> +		 */
> +		if (!is_shadow_present_pte(spte)) {
> +			fault_handled = true;
> +			break;
> +		}

Why not include lockless_walk in the loop? Retrying 4 times for an invalid sp is expensive.

I am curious: did you see that this retry is really helpful?  :)


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 1/8] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications
  2016-12-15  6:50     ` Xiao Guangrong
@ 2016-12-15 23:06       ` Junaid Shahid
  0 siblings, 0 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-12-15 23:06 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: kvm, andreslc, pfeiner, pbonzini


On Thursday, December 15, 2016 02:50:22 PM Xiao Guangrong wrote:
> >  /*
> > + * Exit Qualifications for EPT Violations
> > + */
> > +#define EPT_VIOLATION_READ_BIT		0
> > +#define EPT_VIOLATION_WRITE_BIT		1
> > +#define EPT_VIOLATION_INSTR_BIT		2
> 
> It would be better if their names are EPT_VIOLATION_ACC_{READ,WRITE,INSTR}_BIT.

Sure. I’ll rename these.

Thanks,
Junaid
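
The rename would look roughly like this (a sketch following Xiao's
suggestion; the final names land in a later revision):

#define EPT_VIOLATION_ACC_READ_BIT	0
#define EPT_VIOLATION_ACC_WRITE_BIT	1
#define EPT_VIOLATION_ACC_INSTR_BIT	2
#define EPT_VIOLATION_ACC_READ		(1 << EPT_VIOLATION_ACC_READ_BIT)
#define EPT_VIOLATION_ACC_WRITE		(1 << EPT_VIOLATION_ACC_WRITE_BIT)
#define EPT_VIOLATION_ACC_INSTR		(1 << EPT_VIOLATION_ACC_INSTR_BIT)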

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 3/8] kvm: x86: mmu: Fast Page Fault path retries
  2016-12-15  7:20     ` Xiao Guangrong
@ 2016-12-15 23:36       ` Junaid Shahid
  2016-12-16 13:13         ` Xiao Guangrong
  0 siblings, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-12-15 23:36 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: kvm, andreslc, pfeiner, pbonzini


On Thursday, December 15, 2016 03:20:19 PM Xiao Guangrong wrote:
 
> Why not include lockless_walk into the loop, retry 4 times for a invalid sp is expensive.

Yes, we can move the page table walk inside the loop as well. But I’m sorry I don’t fully understand how an invalid sp will lead to retrying 4 times. Could you please elaborate a bit? Wouldn’t we break out of the loop in that case? Or do you mean the case when a huge page is getting broken down or built up?
 
> I am curious that did you see this retry is really helpful?  :)

No, I haven’t done a comparison with and without the retries since it seemed to be a fairly simple optimization. And it may not be straightforward to reliably reproduce the situation where it will help.

Thanks,
Junaid
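
For reference, moving the walk inside the retry loop of fast_page_fault()
would look roughly like this (a sketch of the shape only, not a tested
patch; the limit of 4 retries is taken from Xiao's remark above):

	do {
		u64 spte = 0ull;

		/* Re-do the lockless walk on every iteration, so a retry
		 * sees the current spte instead of a stale one. */
		for_each_shadow_entry_lockless(vcpu, gva, iterator, spte)
			if (!is_shadow_present_pte(spte) ||
			    iterator.level < level)
				break;

		/* ... the existing fast-path checks and the cmpxchg-based
		 * fix-up go here, setting fault_handled on success ... */

	} while (!fault_handled && ++retry_count < 4);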

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-07  0:46   ` [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
  2016-12-14 16:28     ` Paolo Bonzini
@ 2016-12-16 13:04     ` Xiao Guangrong
  2016-12-16 15:23       ` Paolo Bonzini
  2016-12-17  2:04       ` Junaid Shahid
  1 sibling, 2 replies; 56+ messages in thread
From: Xiao Guangrong @ 2016-12-16 13:04 UTC (permalink / raw)
  To: Junaid Shahid, kvm; +Cc: andreslc, pfeiner, pbonzini



On 12/07/2016 08:46 AM, Junaid Shahid wrote:
> This change implements lockless access tracking for Intel CPUs without EPT
> A bits. This is achieved by marking the PTEs as not-present (but not
> completely clearing them) when clear_flush_young() is called after marking
> the pages as accessed. When an EPT Violation is generated as a result of
> the VM accessing those pages, the PTEs are restored to their original values.
>
> Signed-off-by: Junaid Shahid <junaids@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   3 +-
>  arch/x86/include/asm/vmx.h      |   9 +-
>  arch/x86/kvm/mmu.c              | 274 +++++++++++++++++++++++++++++++---------
>  arch/x86/kvm/vmx.c              |  26 ++--
>  arch/x86/kvm/x86.c              |   2 +-
>  5 files changed, 237 insertions(+), 77 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5a10eb7..da1d4b9 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1056,7 +1056,8 @@ void kvm_mmu_setup(struct kvm_vcpu *vcpu);
>  void kvm_mmu_init_vm(struct kvm *kvm);
>  void kvm_mmu_uninit_vm(struct kvm *kvm);
>  void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
> -		u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask);
> +		u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask,
> +		u64 acc_track_mask);

Actually, this is the mask cleared by acc-track rather than _set_ by
acc-track; maybe suppress_by_acc_track_mask is a better name.

>
>  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 45ee6d9..9d228a8 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -438,11 +438,14 @@ enum vmcs_field {
>  #define VMX_EPT_IPAT_BIT    			(1ull << 6)
>  #define VMX_EPT_ACCESS_BIT			(1ull << 8)
>  #define VMX_EPT_DIRTY_BIT			(1ull << 9)
> +#define VMX_EPT_RWX_MASK                        (VMX_EPT_READABLE_MASK |       \
> +						 VMX_EPT_WRITABLE_MASK |       \
> +						 VMX_EPT_EXECUTABLE_MASK)
> +#define VMX_EPT_MT_MASK				(7ull << VMX_EPT_MT_EPTE_SHIFT)

I saw no place using this mask; it can be dropped.

>
>  /* The mask to use to trigger an EPT Misconfiguration in order to track MMIO */
> -#define VMX_EPT_MISCONFIG_WX_VALUE           (VMX_EPT_WRITABLE_MASK |       \
> -                                              VMX_EPT_EXECUTABLE_MASK)
> -
> +#define VMX_EPT_MISCONFIG_WX_VALUE		(VMX_EPT_WRITABLE_MASK |       \
> +						 VMX_EPT_EXECUTABLE_MASK)
>
>  #define VMX_EPT_IDENTITY_PAGETABLE_ADDR		0xfffbc000ul
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 3f66fd3..6ba6220 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -37,6 +37,7 @@
>  #include <linux/srcu.h>
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
> +#include <linux/kern_levels.h>
>
>  #include <asm/page.h>
>  #include <asm/cmpxchg.h>
> @@ -129,6 +130,10 @@ module_param(dbg, bool, 0644);
>  #define ACC_USER_MASK    PT_USER_MASK
>  #define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
>
> +/* The mask for the R/X bits in EPT PTEs */
> +#define PT64_EPT_READABLE_MASK			0x1ull
> +#define PT64_EPT_EXECUTABLE_MASK		0x4ull
> +

Can we move this EPT specific stuff out of mmu.c?

>  #include <trace/events/kvm.h>
>
>  #define CREATE_TRACE_POINTS
> @@ -178,6 +183,25 @@ static u64 __read_mostly shadow_dirty_mask;
>  static u64 __read_mostly shadow_mmio_mask;
>  static u64 __read_mostly shadow_present_mask;
>
> +/*
> + * The mask/value to distinguish a PTE that has been marked not-present for
> + * access tracking purposes.
> + * The mask would be either 0 if access tracking is disabled, or
> + * SPTE_SPECIAL_MASK|VMX_EPT_RWX_MASK if access tracking is enabled.
> + */
> +static u64 __read_mostly shadow_acc_track_mask;
> +static const u64 shadow_acc_track_value = SPTE_SPECIAL_MASK;
> +
> +/*
> + * The mask/shift to use for saving the original R/X bits when marking the PTE
> + * as not-present for access tracking purposes. We do not save the W bit as the
> + * PTEs being access tracked also need to be dirty tracked, so the W bit will be
> + * restored only when a write is attempted to the page.
> + */
> +static const u64 shadow_acc_track_saved_bits_mask = PT64_EPT_READABLE_MASK |
> +						    PT64_EPT_EXECUTABLE_MASK;
> +static const u64 shadow_acc_track_saved_bits_shift = PT64_SECOND_AVAIL_BITS_SHIFT;
> +
>  static void mmu_spte_set(u64 *sptep, u64 spte);
>  static void mmu_free_roots(struct kvm_vcpu *vcpu);
>
> @@ -187,6 +211,12 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
>
> +static inline bool is_access_track_spte(u64 spte)
> +{
> +	return shadow_acc_track_mask != 0 &&
> +	       (spte & shadow_acc_track_mask) == shadow_acc_track_value;
> +}

spte & SPECIAL_MASK && !is_mmio(spte) would be clearer.

> +
>  /*
>   * the low bit of the generation number is always presumed to be zero.
>   * This disables mmio caching during memslot updates.  The concept is
> @@ -284,7 +314,8 @@ static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte)
>  }
>
>  void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
> -		u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask)
> +		u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask,
> +		u64 acc_track_mask)
>  {
>  	shadow_user_mask = user_mask;
>  	shadow_accessed_mask = accessed_mask;
> @@ -292,9 +323,23 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>  	shadow_nx_mask = nx_mask;
>  	shadow_x_mask = x_mask;
>  	shadow_present_mask = p_mask;
> +	shadow_acc_track_mask = acc_track_mask;
> +	WARN_ON(shadow_accessed_mask != 0 && shadow_acc_track_mask != 0);
>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
>
> +void kvm_mmu_clear_all_pte_masks(void)
> +{
> +	shadow_user_mask = 0;
> +	shadow_accessed_mask = 0;
> +	shadow_dirty_mask = 0;
> +	shadow_nx_mask = 0;
> +	shadow_x_mask = 0;
> +	shadow_mmio_mask = 0;
> +	shadow_present_mask = 0;
> +	shadow_acc_track_mask = 0;
> +}
> +

Hmmm... why is it needed? Static values are always init-ed to zero...

>  static int is_cpuid_PSE36(void)
>  {
>  	return 1;
> @@ -307,7 +352,7 @@ static int is_nx(struct kvm_vcpu *vcpu)
>
>  static int is_shadow_present_pte(u64 pte)
>  {
> -	return (pte & 0xFFFFFFFFull) && !is_mmio_spte(pte);
> +	return (pte != 0) && !is_mmio_spte(pte);
>  }
>
>  static int is_large_pte(u64 pte)
> @@ -490,23 +535,20 @@ static bool spte_has_volatile_bits(u64 spte)
>  	if (spte_can_locklessly_be_made_writable(spte))
>  		return true;
>
> -	if (!shadow_accessed_mask)
> -		return false;
> -
>  	if (!is_shadow_present_pte(spte))
>  		return false;
>
> -	if ((spte & shadow_accessed_mask) &&
> -	      (!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
> -		return false;
> +	if (!shadow_accessed_mask)
> +		return is_access_track_spte(spte);
>
> -	return true;
> +	return (spte & shadow_accessed_mask) == 0 ||
> +		(is_writable_pte(spte) && (spte & shadow_dirty_mask) == 0);
>  }
>
>  static bool is_accessed_spte(u64 spte)
>  {
>  	return shadow_accessed_mask ? spte & shadow_accessed_mask
> -				    : true;
> +				    : !is_access_track_spte(spte);
>  }
>
>  static bool is_dirty_spte(u64 spte)
> @@ -650,6 +692,65 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
>  	return __get_spte_lockless(sptep);
>  }
>
> +static u64 mark_spte_for_access_track(u64 spte)
> +{
> +	if (shadow_accessed_mask != 0)
> +		return spte & ~shadow_accessed_mask;
> +
> +	if (shadow_acc_track_mask == 0 || is_access_track_spte(spte))
> +		return spte;
> +
> +	/*
> +	 * Verify that the write-protection that we do below will be fixable
> +	 * via the fast page fault path. Currently, that is always the case, at
> +	 * least when using EPT (which is when access tracking would be used).
> +	 */
> +	WARN_ONCE((spte & PT_WRITABLE_MASK) &&
> +		  !spte_can_locklessly_be_made_writable(spte),
> +		  "kvm: Writable SPTE is not locklessly dirty-trackable\n");

This code is right, but I cannot understand the comment here... :(

> +
> +	WARN_ONCE(spte & (shadow_acc_track_saved_bits_mask <<
> +			  shadow_acc_track_saved_bits_shift),
> +		  "kvm: Access Tracking saved bit locations are not zero\n");
> +
> +	spte |= (spte & shadow_acc_track_saved_bits_mask) <<
> +		shadow_acc_track_saved_bits_shift;
> +	spte &= ~shadow_acc_track_mask;
> +	spte |= shadow_acc_track_value;
> +
> +	return spte;
> +}
> +
> +/* Returns the Accessed status of the PTE and resets it at the same time. */
> +static bool mmu_spte_age(u64 *sptep)
> +{
> +	u64 spte = mmu_spte_get_lockless(sptep);
> +
> +	if (spte & shadow_accessed_mask) {
> +		clear_bit((ffs(shadow_accessed_mask) - 1),
> +			  (unsigned long *)sptep);
> +		return true;
> +	}
> +
> +	if (shadow_accessed_mask == 0) {
> +		if (is_access_track_spte(spte))
> +			return false;
> +
> +		/*
> +		 * Capture the dirty status of the page, so that it doesn't get
> +		 * lost when the SPTE is marked for access tracking.
> +		 */
> +		if (is_writable_pte(spte))
> +			kvm_set_pfn_dirty(spte_to_pfn(spte));
> +
> +		spte = mark_spte_for_access_track(spte);
> +		mmu_spte_update_no_track(sptep, spte);
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
>  static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
>  {
>  	/*
> @@ -1434,7 +1535,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>  restart:
>  	for_each_rmap_spte(rmap_head, &iter, sptep) {
>  		rmap_printk("kvm_set_pte_rmapp: spte %p %llx gfn %llx (%d)\n",
> -			     sptep, *sptep, gfn, level);
> +			    sptep, *sptep, gfn, level);
>
>  		need_flush = 1;
>
> @@ -1447,7 +1548,8 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>
>  			new_spte &= ~PT_WRITABLE_MASK;
>  			new_spte &= ~SPTE_HOST_WRITEABLE;
> -			new_spte &= ~shadow_accessed_mask;
> +
> +			new_spte = mark_spte_for_access_track(new_spte);
>
>  			mmu_spte_clear_track_bits(sptep);
>  			mmu_spte_set(sptep, new_spte);
> @@ -1609,15 +1711,8 @@ static int kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>  	struct rmap_iterator uninitialized_var(iter);
>  	int young = 0;
>
> -	BUG_ON(!shadow_accessed_mask);
> -
> -	for_each_rmap_spte(rmap_head, &iter, sptep) {
> -		if (*sptep & shadow_accessed_mask) {
> -			young = 1;
> -			clear_bit((ffs(shadow_accessed_mask) - 1),
> -				 (unsigned long *)sptep);
> -		}
> -	}
> +	for_each_rmap_spte(rmap_head, &iter, sptep)
> +		young |= mmu_spte_age(sptep);
>
>  	trace_kvm_age_page(gfn, level, slot, young);
>  	return young;
> @@ -1631,11 +1726,11 @@ static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>  	struct rmap_iterator iter;
>
>  	/*
> -	 * If there's no access bit in the secondary pte set by the
> -	 * hardware it's up to gup-fast/gup to set the access bit in
> -	 * the primary pte or in the page structure.
> +	 * If there's no access bit in the secondary pte set by the hardware and
> +	 * fast access tracking is also not enabled, it's up to gup-fast/gup to
> +	 * set the access bit in the primary pte or in the page structure.
>  	 */
> -	if (!shadow_accessed_mask)
> +	if (!shadow_accessed_mask && !shadow_acc_track_mask)
>  		goto out;
>
>  	for_each_rmap_spte(rmap_head, &iter, sptep)
> @@ -1670,7 +1765,7 @@ int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
>  	 * This has some overhead, but not as much as the cost of swapping
>  	 * out actively used pages or breaking up actively used hugepages.
>  	 */
> -	if (!shadow_accessed_mask)
> +	if (!shadow_accessed_mask && !shadow_acc_track_mask)
>  		return kvm_handle_hva_range(kvm, start, end, 0,
>  					    kvm_unmap_rmapp);
>
> @@ -2593,6 +2688,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  		spte |= shadow_dirty_mask;
>  	}
>
> +	if (speculative)
> +		spte = mark_spte_for_access_track(spte);
> +
>  set_pte:
>  	if (mmu_spte_update(sptep, spte))
>  		kvm_flush_remote_tlbs(vcpu->kvm);
> @@ -2646,7 +2744,7 @@ static bool mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned pte_access,
>  	pgprintk("%s: setting spte %llx\n", __func__, *sptep);
>  	pgprintk("instantiating %s PTE (%s) at %llx (%llx) addr %p\n",
>  		 is_large_pte(*sptep)? "2MB" : "4kB",
> -		 *sptep & PT_PRESENT_MASK ?"RW":"R", gfn,
> +		 *sptep & PT_WRITABLE_MASK ? "RW" : "R", gfn,
>  		 *sptep, sptep);
>  	if (!was_rmapped && is_large_pte(*sptep))
>  		++vcpu->kvm->stat.lpages;
> @@ -2879,16 +2977,28 @@ static bool page_fault_can_be_fast(u32 error_code)
>  	if (unlikely(error_code & PFERR_RSVD_MASK))
>  		return false;
>
> -	/*
> -	 * #PF can be fast only if the shadow page table is present and it
> -	 * is caused by write-protect, that means we just need change the
> -	 * W bit of the spte which can be done out of mmu-lock.
> -	 */
> -	if (!(error_code & PFERR_PRESENT_MASK) ||
> -	      !(error_code & PFERR_WRITE_MASK))
> +	/* See if the page fault is due to an NX violation */
> +	if (unlikely(((error_code & (PFERR_FETCH_MASK | PFERR_PRESENT_MASK))
> +		      == (PFERR_FETCH_MASK | PFERR_PRESENT_MASK))))
>  		return false;
>
> -	return true;
> +	/*
> +	 * #PF can be fast if:
> +	 * 1. The shadow page table entry is not present, which could mean that
> +	 *    the fault is potentially caused by access tracking (if enabled).
> +	 * 2. The shadow page table entry is present and the fault
> +	 *    is caused by write-protect, that means we just need change the W
> +	 *    bit of the spte which can be done out of mmu-lock.
> +	 *
> +	 * However, if access tracking is disabled we know that a non-present
> +	 * page must be a genuine page fault where we have to create a new SPTE.
> +	 * So, if access tracking is disabled, we return true only for write
> +	 * accesses to a present page.
> +	 */
> +
> +	return shadow_acc_track_mask != 0 ||
> +	       ((error_code & (PFERR_WRITE_MASK | PFERR_PRESENT_MASK))
> +		== (PFERR_WRITE_MASK | PFERR_PRESENT_MASK));

acc-track cannot fix a WRITE access; this should be:

!(error_code & (PFERR_WRITE_MASK)) && shadow_acc_track_mask != 0 || ...


>  }
>
>  /*
> @@ -2897,17 +3007,26 @@ static bool page_fault_can_be_fast(u32 error_code)
>   */
>  static bool
>  fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> -			u64 *sptep, u64 spte)
> +			u64 *sptep, u64 old_spte,
> +			bool remove_write_prot, bool remove_acc_track)
>  {
>  	gfn_t gfn;
> +	u64 new_spte = old_spte;
>
>  	WARN_ON(!sp->role.direct);
>
> -	/*
> -	 * The gfn of direct spte is stable since it is calculated
> -	 * by sp->gfn.
> -	 */
> -	gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> +	if (remove_acc_track) {
> +		u64 saved_bits = (old_spte >> shadow_acc_track_saved_bits_shift)
> +				 & shadow_acc_track_saved_bits_mask;
> +
> +		new_spte &= ~shadow_acc_track_mask;
> +		new_spte &= ~(shadow_acc_track_saved_bits_mask <<
> +			      shadow_acc_track_saved_bits_shift);
> +		new_spte |= saved_bits;
> +	}
> +
> +	if (remove_write_prot)
> +		new_spte |= PT_WRITABLE_MASK;
>
>  	/*
>  	 * Theoretically we could also set dirty bit (and flush TLB) here in
> @@ -2921,10 +3040,17 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	 *
>  	 * Compare with set_spte where instead shadow_dirty_mask is set.
>  	 */
> -	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) != spte)
> +	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
>  		return false;
>
> -	kvm_vcpu_mark_page_dirty(vcpu, gfn);
> +	if (remove_write_prot) {
> +		/*
> +		 * The gfn of direct spte is stable since it is
> +		 * calculated by sp->gfn.
> +		 */
> +		gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> +		kvm_vcpu_mark_page_dirty(vcpu, gfn);
> +	}
>
>  	return true;
>  }
> @@ -2955,35 +3081,55 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  			break;
>
>  	do {
> -		/*
> -		 * If the mapping has been changed, let the vcpu fault on the
> -		 * same address again.
> -		 */
> -		if (!is_shadow_present_pte(spte)) {
> -			fault_handled = true;
> -			break;
> -		}
> +		bool remove_write_prot = false;
> +		bool remove_acc_track;
>
>  		sp = page_header(__pa(iterator.sptep));
>  		if (!is_last_spte(spte, sp->role.level))
>  			break;
>
>  		/*
> -		 * Check if it is a spurious fault caused by TLB lazily flushed.
> +		 * Check whether the memory access that caused the fault would
> +		 * still cause it if it were to be performed right now. If not,
> +		 * then this is a spurious fault caused by TLB lazily flushed,
> +		 * or some other CPU has already fixed the PTE after the
> +		 * current CPU took the fault.
>  		 *
>  		 * Need not check the access of upper level table entries since
>  		 * they are always ACC_ALL.
>  		 */
> -		if (is_writable_pte(spte)) {
> -			fault_handled = true;
> -			break;
> +
> +		if (error_code & PFERR_FETCH_MASK) {
> +			if ((spte & (shadow_x_mask | shadow_nx_mask))
> +			    == shadow_x_mask) {
> +				fault_handled = true;
> +				break;
> +			}
> +		} else if (error_code & PFERR_WRITE_MASK) {
> +			if (is_writable_pte(spte)) {
> +				fault_handled = true;
> +				break;
> +			}
> +
> +			/*
> +			 * Currently, to simplify the code, write-protection can
> +			 * be removed in the fast path only if the SPTE was
> +			 * write-protected for dirty-logging.
> +			 */
> +			remove_write_prot =
> +				spte_can_locklessly_be_made_writable(spte);
> +		} else {
> +			/* Fault was on Read access */
> +			if (spte & PT_PRESENT_MASK) {
> +				fault_handled = true;
> +				break;
> +			}
>  		}
>
> -		/*
> -		 * Currently, to simplify the code, only the spte
> -		 * write-protected by dirty-log can be fast fixed.
> -		 */
> -		if (!spte_can_locklessly_be_made_writable(spte))
> +		remove_acc_track = is_access_track_spte(spte);
> +

Why not check whether the cached R/X permissions can satisfy the R/X access before going to the atomic path?


> +		/* Verify that the fault can be handled in the fast path */
> +		if (!remove_acc_track && !remove_write_prot)
>  			break;
>
>  		/*
> @@ -2997,7 +3143,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  		 *
>  		 * See the comments in kvm_arch_commit_memory_region().
>  		 */
> -		if (sp->role.level > PT_PAGE_TABLE_LEVEL)
> +		if (sp->role.level > PT_PAGE_TABLE_LEVEL && remove_write_prot)
>  			break;
>
>  		/*
> @@ -3006,7 +3152,9 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
>  		 * Documentation/virtual/kvm/locking.txt to get more detail.
>  		 */
>  		fault_handled = fast_pf_fix_direct_spte(vcpu, sp,
> -							iterator.sptep, spte);
> +							iterator.sptep, spte,
> +							remove_write_prot,
> +							remove_acc_track);
>  		if (fault_handled)
>  			break;
>
> @@ -5095,6 +5243,8 @@ static void mmu_destroy_caches(void)
>
>  int kvm_mmu_module_init(void)
>  {
> +	kvm_mmu_clear_all_pte_masks();
> +
>  	pte_list_desc_cache = kmem_cache_create("pte_list_desc",
>  					    sizeof(struct pte_list_desc),
>  					    0, 0, NULL);
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 6a01e755..50fc078 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6318,6 +6318,19 @@ static void wakeup_handler(void)
>  	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
>  }
>
> +void vmx_enable_tdp(void)
> +{
> +	kvm_mmu_set_mask_ptes(VMX_EPT_READABLE_MASK,
> +		enable_ept_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull,
> +		enable_ept_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull,
> +		0ull, VMX_EPT_EXECUTABLE_MASK,
> +		cpu_has_vmx_ept_execute_only() ? 0ull : VMX_EPT_READABLE_MASK,
> +		enable_ept_ad_bits ? 0ull : SPTE_SPECIAL_MASK | VMX_EPT_RWX_MASK);

I think commonly setting SPTE_SPECIAL_MASK (i.e. moving the setting of
SPTE_SPECIAL_MASK into mmu.c) for both the mmio-mask and the acc-track-mask
can make the code clearer...

Thanks!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 3/8] kvm: x86: mmu: Fast Page Fault path retries
  2016-12-15 23:36       ` Junaid Shahid
@ 2016-12-16 13:13         ` Xiao Guangrong
  2016-12-17  0:36           ` Junaid Shahid
  0 siblings, 1 reply; 56+ messages in thread
From: Xiao Guangrong @ 2016-12-16 13:13 UTC (permalink / raw)
  To: Junaid Shahid; +Cc: kvm, andreslc, pfeiner, pbonzini



On 12/16/2016 07:36 AM, Junaid Shahid wrote:
>
> On Thursday, December 15, 2016 03:20:19 PM Xiao Guangrong wrote:
>
>> Why not include the lockless walk in the loop? Retrying 4 times for an invalid sp is expensive.
>
> Yes, we can move the page table walk inside the loop as well. But I’m sorry I don’t fully understand how an invalid sp will lead to retrying 4 times. Could you please elaborate a bit? Wouldn’t we break out of the loop in that case? Or do you mean the case when a huge page is getting broken down or built up?

I mean that it is unlinked from the upper-level page structure.

>
>> I am curious: did you see whether this retry is really helpful?  :)
>
> No, I haven’t done a comparison with and without the retries since it seemed to be a fairly simple optimization. And it may not be straightforward to reliably reproduce the situation where it will help.

So we are not sure if the retry is really useful...

After this change, all !W page faults can go to this fast path. I think it does not hurt
performance, but we'd better have a performance test.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-16 13:04     ` Xiao Guangrong
@ 2016-12-16 15:23       ` Paolo Bonzini
  2016-12-17  0:01         ` Junaid Shahid
  2016-12-21  9:49         ` Xiao Guangrong
  2016-12-17  2:04       ` Junaid Shahid
  1 sibling, 2 replies; 56+ messages in thread
From: Paolo Bonzini @ 2016-12-16 15:23 UTC (permalink / raw)
  To: Xiao Guangrong, Junaid Shahid, kvm; +Cc: andreslc, pfeiner



On 16/12/2016 14:04, Xiao Guangrong wrote:
>> +    /*
>> +     * #PF can be fast if:
>> +     * 1. The shadow page table entry is not present, which could mean that
>> +     *    the fault is potentially caused by access tracking (if enabled).
>> +     * 2. The shadow page table entry is present and the fault
>> +     *    is caused by write-protect, that means we just need change the W
>> +     *    bit of the spte which can be done out of mmu-lock.
>> +     *
>> +     * However, if access tracking is disabled we know that a non-present
>> +     * page must be a genuine page fault where we have to create a new SPTE.
>> +     * So, if access tracking is disabled, we return true only for write
>> +     * accesses to a present page.
>> +     */
>> +
>> +    return shadow_acc_track_mask != 0 ||
>> +           ((error_code & (PFERR_WRITE_MASK | PFERR_PRESENT_MASK))
>> +        == (PFERR_WRITE_MASK | PFERR_PRESENT_MASK));
> 
> acc-track cannot fix a WRITE access; this should be:
> 
> !(error_code & (PFERR_WRITE_MASK)) && shadow_acc_track_mask != 0 || ...

Access tracking makes pages non-present, so a !W !P fault can sometimes
be fixed.

One possibility is to test is_access_track_pte, but it is handled a
little below the call to page_fault_can_be_fast:

            remove_acc_track = is_access_track_spte(spte);

            /* Verify that the fault can be handled in the fast path */
            if (!remove_acc_track && !remove_write_prot)
                    break;

It's not different from the way page_fault_can_be_fast returns true for
writes, even if spte_can_locklessly_be_made_writable will return false
later.

So I think Junaid's patch is okay.
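
To spell out the case analysis (W/P are the PFERR_WRITE/PFERR_PRESENT bits;
RSVD and NX faults are assumed to have been filtered out already, and each
entry shows whether page_fault_can_be_fast returns true, i.e. whether the
lockless walk is attempted at all):

    W=1 P=1   acc-track off: true  (may fix dirty-log write protection)
              acc-track on:  true
    W=1 P=0   acc-track off: false (a new SPTE has to be created)
              acc-track on:  true  (the SPTE may be an access-track SPTE
                                    that is also dirty-log write-protected)
    W=0 P=0   acc-track off: false (a new SPTE has to be created)
              acc-track on:  true  (restore the saved R/X bits)
    W=0 P=1   acc-track off: false
              acc-track on:  true  (detected as spurious in the first
                                    loop iteration)

An F=1 P=0 fault behaves like the W=0 P=0 row.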

Junaid, of all comments from Guangrong I'm mostly interested in
kvm_mmu_clear_all_pte_masks.  What was the intended purpose?

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-16 15:23       ` Paolo Bonzini
@ 2016-12-17  0:01         ` Junaid Shahid
  2016-12-21  9:49         ` Xiao Guangrong
  1 sibling, 0 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-12-17  0:01 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Xiao Guangrong, kvm, andreslc, pfeiner


On Friday, December 16, 2016 04:23:21 PM Paolo Bonzini wrote:
> Junaid, of all comments from Guangrong I'm mostly interested in
> kvm_mmu_clear_all_pte_masks.  What was the intended purpose?

This was needed in the original version of this patch where the shadow_acc_track_mask was set via the separate kvm_mmu_set_access_track_masks() call rather than as part of kvm_mmu_set_mask_ptes(). In that case, without the clearing during init, we could end up with both shadow_acc_track_mask and shadow_accessed_mask being set if the kvm_intel module was reloaded with different EPT parameters, e.g.:

modprobe kvm_intel ept_ad=0 
rmmod kvm_intel                 
modprobe kvm_intel ept_ad=1    

Now that we are setting both masks together through kvm_mmu_set_mask_ptes(), this problem doesn’t exist and the kvm_mmu_clear_all_pte_masks() isn’t strictly needed. However, I think it might still be a good idea to keep it because the basic issue is that these masks are expected to be set by the kvm_(intel|amd) modules but they are actually a part of the kvm module and hence they are initialized to 0 only on the (re)loading of the kvm module, but not of the kvm_(intel|amd) modules.

Thanks,
Junaid

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 3/8] kvm: x86: mmu: Fast Page Fault path retries
  2016-12-16 13:13         ` Xiao Guangrong
@ 2016-12-17  0:36           ` Junaid Shahid
  0 siblings, 0 replies; 56+ messages in thread
From: Junaid Shahid @ 2016-12-17  0:36 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: kvm, andreslc, pfeiner, pbonzini


On Friday, December 16, 2016 09:13:13 PM Xiao Guangrong wrote:
> 
> I mean that it is unlinked from the upper-level page structure.
> 

Ah, ok. But even in that case, the retries won’t happen unless somebody was actively trying to write to the just unlinked page table at the same time, which can cause bigger problems. In any case, I can move the pgtable walk inside the loop, as there doesn’t seem to be any downside in it.
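
For concreteness, a rough structural sketch of what moving the walk inside
the retry loop could look like (illustration only, not the posted patch;
declarations and the existing access checks are elided, and the retry limit
of 4 is the one mentioned above):

	walk_shadow_page_lockless_begin(vcpu);

	do {
		/*
		 * Re-walk on every iteration so that a retry observes the
		 * current paging structure instead of a possibly stale sptep
		 * from a page table that has since been unlinked.
		 */
		for_each_shadow_entry_lockless(vcpu, gva, iterator, spte)
			if (!is_shadow_present_pte(spte))
				break;

		sp = page_header(__pa(iterator.sptep));
		if (!is_last_spte(spte, sp->role.level))
			break;

		/*
		 * ... the access checks and the cmpxchg-based fix stay as in
		 * the patch, setting fault_handled on success ...
		 */

		if (fault_handled)
			break;

		if (++retry_count > 4)
			break;	/* give up and let the slow path handle it */
	} while (true);

	walk_shadow_page_lockless_end(vcpu);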

> >
> >> I am curious: did you see whether this retry is really helpful?  :)
> >
> > No, I haven’t done a comparison with and without the retries since it seemed to be a fairly simple optimization. And it may not be straightforward to reliably reproduce the situation where it will help.
> 
> So we are not sure if the retry is really useful...
>
> After this change, all !W page faults can go to this fast path. I think it does not hurt
> performance, but we'd better have a performance test.

Yes, more faults can go into this fast path now. However, all except those where we actually need to fix the acc-tracking or write-prot will break out in the first iteration, so the retries won’t happen for those. For the cases where we actually do retry, the alternative could be another VMEXIT, which would obviously be a lot more expensive. But I’ll try to think of some way to test the performance.

Thanks,
Junaid

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-16 13:04     ` Xiao Guangrong
  2016-12-16 15:23       ` Paolo Bonzini
@ 2016-12-17  2:04       ` Junaid Shahid
  2016-12-17 14:19         ` Paolo Bonzini
  1 sibling, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-12-17  2:04 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: kvm, andreslc, pfeiner, pbonzini


On Friday, December 16, 2016 09:04:56 PM Xiao Guangrong wrote:
> >  void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
> > -		u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask);
> > +		u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask,
> > +		u64 acc_track_mask);
> 
> Actually, this is the mask cleared by acc-track rather than _set_ by
> acc-track; maybe suppress_by_acc_track_mask is a better name.

Well, the original reason behind it was that a PTE is an access-track PTE if, when masked by acc_track_mask, it yields acc_track_value. But we can change the name if it is confusing. Though suppress_by_acc_track_mask isn’t quite right, since only the RWX bits are cleared while the Special bit is set, and the mask includes both of these.

> > +#define VMX_EPT_MT_MASK				(7ull << VMX_EPT_MT_EPTE_SHIFT)
> 
> I saw no space using this mask, can be dropped.

Ok. I’ll drop it.

> > +/* The mask for the R/X bits in EPT PTEs */
> > +#define PT64_EPT_READABLE_MASK			0x1ull
> > +#define PT64_EPT_EXECUTABLE_MASK		0x4ull
> > +
> 
> Can we move this EPT specific stuff out of mmu.c?

We need these in order to define the shadow_acc_track_saved_bits_mask and, since we don’t have vmx.h included in mmu.c, I had to define these here. Is adding an #include for vmx.h better? Alternatively, we can have the shadow_acc_track_saved_bits_mask passed by kvm_intel when it loads, which was the case in the original version but I had changed it to a constant based on previous feedback.

> > +static inline bool is_access_track_spte(u64 spte)
> > +{
> > +	return shadow_acc_track_mask != 0 &&
> > +	       (spte & shadow_acc_track_mask) == shadow_acc_track_value;
> > +}
> 
> spte & SPECIAL_MASK && !is_mmio(spte) is clearer.

We can change to that. But it seems less flexible as it assumes that there is never going to be a 3rd type of Special PTE.

> > +	/*
> > +	 * Verify that the write-protection that we do below will be fixable
> > +	 * via the fast page fault path. Currently, that is always the case, at
> > +	 * least when using EPT (which is when access tracking would be used).
> > +	 */
> > +	WARN_ONCE((spte & PT_WRITABLE_MASK) &&
> > +		  !spte_can_locklessly_be_made_writable(spte),
> > +		  "kvm: Writable SPTE is not locklessly dirty-trackable\n");
> 
> This code is right, but I cannot understand the comment here... :(

Basically, I was just trying to say that since making the PTE an acc-track PTE removes the write access as well, we had better have the ability to restore the write access later in fast_page_fault. I’ll try to make the comment clearer.
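
To make that concrete: the acc-track transformation saves and restores only
R/X, and W can only come back through the dirty-logging path. A minimal
sketch of the restore direction, mirroring the remove_acc_track branch of
fast_pf_fix_direct_spte() quoted earlier (the helper name is hypothetical):

	static u64 restore_acc_track_spte(u64 spte)
	{
		u64 new_spte = spte;
		u64 saved_bits = (spte >> shadow_acc_track_saved_bits_shift)
				 & shadow_acc_track_saved_bits_mask;

		/* Drop the special marker and the saved copies of R/X ... */
		new_spte &= ~shadow_acc_track_mask;
		new_spte &= ~(shadow_acc_track_saved_bits_mask <<
			      shadow_acc_track_saved_bits_shift);

		/*
		 * ... and put R/X back in their architectural positions.
		 * W is deliberately not restored here; it can only come
		 * back via the dirty-logging fast path, which is what the
		 * WARN_ONCE above asserts is always possible.
		 */
		new_spte |= saved_bits;

		return new_spte;
	}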

> >
> > -		/*
> > -		 * Currently, to simplify the code, only the spte
> > -		 * write-protected by dirty-log can be fast fixed.
> > -		 */
> > -		if (!spte_can_locklessly_be_made_writable(spte))
> > +		remove_acc_track = is_access_track_spte(spte);
> > +
> 
> Why not check whether the cached R/X permissions can satisfy the R/X access before going to the atomic path?

Yes, I guess we can do that since if the restored PTE doesn’t satisfy the access we are just going to get another fault anyway.
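
For illustration, that check could be shaped roughly like the helper below,
applied to the restored SPTE before attempting the cmpxchg (the helper name
is hypothetical; the three cases mirror the access checks already in the
fast_page_fault() hunk above):

	static bool spte_satisfies_access(u64 spte, u32 error_code)
	{
		if (error_code & PFERR_FETCH_MASK)
			return (spte & (shadow_x_mask | shadow_nx_mask))
			       == shadow_x_mask;

		if (error_code & PFERR_WRITE_MASK)
			return is_writable_pte(spte);

		/* Read access: the SPTE just needs to be present/readable. */
		return spte & PT_PRESENT_MASK;
	}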

> > +void vmx_enable_tdp(void)
> > +{
> > +	kvm_mmu_set_mask_ptes(VMX_EPT_READABLE_MASK,
> > +		enable_ept_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull,
> > +		enable_ept_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull,
> > +		0ull, VMX_EPT_EXECUTABLE_MASK,
> > +		cpu_has_vmx_ept_execute_only() ? 0ull : VMX_EPT_READABLE_MASK,
> > +		enable_ept_ad_bits ? 0ull : SPTE_SPECIAL_MASK | VMX_EPT_RWX_MASK);
> 
> I think commonly setting SPTE_SPECIAL_MASK (i.e. moving the setting of
> SPTE_SPECIAL_MASK into mmu.c) for both the mmio-mask and the acc-track-mask
> can make the code clearer...

Ok. So you mean that vmx.c should just pass VMX_EPT_RWX_MASK here and VMX_EPT_MISCONFIG_WX_VALUE for the mmio mask and then mmu.c should add in SPTE_SPECIAL_MASK before storing these values in shadow_acc_track_mask and shadow_mmio_mask?
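
As an illustrative sketch of that cleanup (one possible shape of the
follow-up, not code from this series): vmx.c would pass only the raw masks
and mmu.c would own the Special bit, e.g.

	/* mmu.c */
	void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
	{
		shadow_mmio_mask = mmio_mask | SPTE_SPECIAL_MASK;
	}

	void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
			u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask,
			u64 acc_track_mask)
	{
		/* ... the other masks as before ... */
		shadow_acc_track_mask = acc_track_mask ?
					acc_track_mask | SPTE_SPECIAL_MASK : 0;
	}

with vmx_enable_tdp() passing VMX_EPT_RWX_MASK (or 0 when EPT A/D bits are in
use) for acc_track_mask, and VMX_EPT_MISCONFIG_WX_VALUE for the mmio mask.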


Thanks,
Junaid


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-17  2:04       ` Junaid Shahid
@ 2016-12-17 14:19         ` Paolo Bonzini
  2016-12-20  3:36           ` Junaid Shahid
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2016-12-17 14:19 UTC (permalink / raw)
  To: Junaid Shahid; +Cc: Xiao Guangrong, kvm, andreslc, pfeiner



----- Original Message -----
> From: "Junaid Shahid" <junaids@google.com>
> To: "Xiao Guangrong" <guangrong.xiao@linux.intel.com>
> Cc: kvm@vger.kernel.org, andreslc@google.com, pfeiner@google.com, pbonzini@redhat.com
> Sent: Saturday, December 17, 2016 3:04:22 AM
> Subject: Re: [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
> 
> 
> On Friday, December 16, 2016 09:04:56 PM Xiao Guangrong wrote:
> > >  void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
> > > -		u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask);
> > > +		u64 dirty_mask, u64 nx_mask, u64 x_mask, u64 p_mask,
> > > +		u64 acc_track_mask);
> > 
> > Actually, this is the mask cleared by acc-track rather than _set_ by
> > acc-track; maybe suppress_by_acc_track_mask is a better name.
> 
> Well, the original reason behind it was that a PTE is an access-track PTE
> if, when masked by acc_track_mask, it yields acc_track_value. But we can
> change the name if it is confusing. Though suppress_by_acc_track_mask isn’t
> quite right, since only the RWX bits are cleared while the Special bit is
> set, and the mask includes both of these.

I agree.  The MMIO mask argument of kvm_mmu_set_mask_ptes requires some
knowledge of the inner working of mmu.c, and acc_track_mask is the same.

> > > +#define VMX_EPT_MT_MASK				(7ull << VMX_EPT_MT_EPTE_SHIFT)
> > 
> > I saw no space using this mask, can be dropped.
> 
> Ok. I’ll drop it.

Ok, I can do it too.

> > > +/* The mask for the R/X bits in EPT PTEs */
> > > +#define PT64_EPT_READABLE_MASK			0x1ull
> > > +#define PT64_EPT_EXECUTABLE_MASK		0x4ull
> > > +
> > 
> > Can we move this EPT specific stuff out of mmu.c?
> 
> We need these in order to define the shadow_acc_track_saved_bits_mask and,
> since we don’t have vmx.h included in mmu.c, I had to define these here.
> Is adding an #include for vmx.h better? Alternatively, we can have the
> shadow_acc_track_saved_bits_mask passed by kvm_intel when it loads, which
> was the case in the original version but I had changed it to a constant
> based on previous feedback.

It is a constant, it's more efficient to treat it as such.  Unless someone
else needs access tracking (they shouldn't), it's okay to have a minor
layering violation.

> > > +static inline bool is_access_track_spte(u64 spte)
> > > +{
> > > +	return shadow_acc_track_mask != 0 &&
> > > +	       (spte & shadow_acc_track_mask) == shadow_acc_track_value;
> > > +}
> > 
> > spte & SPECIAL_MASK && !is_mmio(spte) is clearer.
> 
> We can change to that. But it seems less flexible as it assumes that there is
> never going to be a 3rd type of Special PTE.
> 
> > > +	/*
> > > +	 * Verify that the write-protection that we do below will be fixable
> > > +	 * via the fast page fault path. Currently, that is always the case, at
> > > +	 * least when using EPT (which is when access tracking would be used).
> > > +	 */
> > > +	WARN_ONCE((spte & PT_WRITABLE_MASK) &&
> > > +		  !spte_can_locklessly_be_made_writable(spte),
> > > +		  "kvm: Writable SPTE is not locklessly dirty-trackable\n");
> > 
> > This code is right, but I cannot understand the comment here... :(
> 
> Basically, I was just trying to say that since making the PTE an acc-track
> PTE removes the write access as well, we had better have the ability to
> restore the write access later in fast_page_fault. I’ll try to make the
> comment clearer.
> 
> > >
> > > -		/*
> > > -		 * Currently, to simplify the code, only the spte
> > > -		 * write-protected by dirty-log can be fast fixed.
> > > -		 */
> > > -		if (!spte_can_locklessly_be_made_writable(spte))
> > > +		remove_acc_track = is_access_track_spte(spte);
> > > +
> > 
> > Why not check whether the cached R/X permissions can satisfy the R/X
> > access before going to the atomic path?
> 
> Yes, I guess we can do that since if the restored PTE doesn’t satisfy the
> access we are just going to get another fault anyway.

Please do it as a follow up, since it complicates the logic a bit.

> > > +void vmx_enable_tdp(void)
> > > +{
> > > +	kvm_mmu_set_mask_ptes(VMX_EPT_READABLE_MASK,
> > > +		enable_ept_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull,
> > > +		enable_ept_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull,
> > > +		0ull, VMX_EPT_EXECUTABLE_MASK,
> > > +		cpu_has_vmx_ept_execute_only() ? 0ull : VMX_EPT_READABLE_MASK,
> > > +		enable_ept_ad_bits ? 0ull : SPTE_SPECIAL_MASK | VMX_EPT_RWX_MASK);
> > 
> > I think commonly setting SPTE_SPECIAL_MASK (i.e. moving the setting of
> > SPTE_SPECIAL_MASK into mmu.c) for both the mmio-mask and the
> > acc-track-mask can make the code clearer...
> 
> Ok. So you mean that vmx.c should just pass VMX_EPT_RWX_MASK here and
> VMX_EPT_MISCONFIG_WX_VALUE for the mmio mask and then mmu.c should add in
> SPTE_SPECIAL_MASK before storing these values in shadow_acc_track_mask and
> shadow_mmio_mask?

I think I agree, but we can do this too as a separate follow-up cleanup patch.

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-17 14:19         ` Paolo Bonzini
@ 2016-12-20  3:36           ` Junaid Shahid
  2016-12-20  9:01             ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Junaid Shahid @ 2016-12-20  3:36 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Xiao Guangrong, kvm, andreslc, pfeiner


On Saturday, December 17, 2016 09:19:29 AM Paolo Bonzini wrote:
> > 
> > Yes, I guess we can do that since if the restored PTE doesn’t satisfy the
> > access we are just going to get another fault anyway.
> 
> Please do it as a follow up, since it complicates the logic a bit.
> 
> ....
> > 
> > Ok. So you mean that vmx.c should just pass VMX_EPT_RWX_MASK here and
> > VMX_EPT_MISCONFIG_WX_VALUE for the mmio mask and then mmu.c should add in
> > SPTE_SPECIAL_MASK before storing these values in shadow_acc_track_mask and
> > shadow_mmio_mask?
> 
> I think I agree, but we can do this too as a separate follow-up cleanup patch.
> 

Sure. I’ll defer these to follow-up patches. What about the change to move the pgtable walk inside the retry loop in fast_page_fault? Should I update the current patch-set to do that or should we defer that to a later patch as well?

Thanks,
Junaid

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-20  3:36           ` Junaid Shahid
@ 2016-12-20  9:01             ` Paolo Bonzini
  0 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2016-12-20  9:01 UTC (permalink / raw)
  To: Junaid Shahid; +Cc: Xiao Guangrong, kvm, andreslc, pfeiner



On 20/12/2016 04:36, Junaid Shahid wrote:
> 
> On Saturday, December 17, 2016 09:19:29 AM Paolo Bonzini wrote:
>>>
>>> Yes, I guess we can do that since if the restored PTE doesn’t satisfy the
>>> access we are just going to get another fault anyway.
>>
>> Please do it as a follow up, since it complicates the logic a bit.
>>
>> ....
>>>
>>> Ok. So you mean that vmx.c should just pass VMX_EPT_RWX_MASK here and
>>> VMX_EPT_MISCONFIG_WX_VALUE for the mmio mask and then mmu.c should add in
>>> SPTE_SPECIAL_MASK before storing these values in shadow_acc_track_mask and
>>> shadow_mmio_mask?
>>
>> I think I agree, but we can do this too as a separate follow-up cleanup patch.
>>
> 
> Sure. I’ll defer these to follow-up patches. What about the change to move the pgtable walk inside the retry loop in fast_page_fault? Should I update the current patch-set to do that or should we defer that to a later patch as well?

Please separate everything.

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-16 15:23       ` Paolo Bonzini
  2016-12-17  0:01         ` Junaid Shahid
@ 2016-12-21  9:49         ` Xiao Guangrong
  2016-12-21 18:00           ` Paolo Bonzini
  1 sibling, 1 reply; 56+ messages in thread
From: Xiao Guangrong @ 2016-12-21  9:49 UTC (permalink / raw)
  To: Paolo Bonzini, Junaid Shahid, kvm; +Cc: andreslc, pfeiner



On 12/16/2016 11:23 PM, Paolo Bonzini wrote:
>
>
> On 16/12/2016 14:04, Xiao Guangrong wrote:
>>> +    /*
>>> +     * #PF can be fast if:
>>> +     * 1. The shadow page table entry is not present, which could mean that
>>> +     *    the fault is potentially caused by access tracking (if enabled).
>>> +     * 2. The shadow page table entry is present and the fault
>>> +     *    is caused by write-protect, that means we just need change the W
>>> +     *    bit of the spte which can be done out of mmu-lock.
>>> +     *
>>> +     * However, if access tracking is disabled we know that a non-present
>>> +     * page must be a genuine page fault where we have to create a new SPTE.
>>> +     * So, if access tracking is disabled, we return true only for write
>>> +     * accesses to a present page.
>>> +     */
>>> +
>>> +    return shadow_acc_track_mask != 0 ||
>>> +           ((error_code & (PFERR_WRITE_MASK | PFERR_PRESENT_MASK))
>>> +        == (PFERR_WRITE_MASK | PFERR_PRESENT_MASK));
>>
>> acc-track cannot fix a WRITE access; this should be:
>>
>> !(error_code & (PFERR_WRITE_MASK)) && shadow_acc_track_mask != 0 || ...
>
> Access tracking makes pages non-present, so a !W !P fault can sometimes
> be fixed.
>
> One possibility is to test is_access_track_pte, but it is handled a
> little below the call to page_fault_can_be_fast:
>
>             remove_acc_track = is_access_track_spte(spte);
>
>             /* Verify that the fault can be handled in the fast path */
>             if (!remove_acc_track && !remove_write_prot)
>                     break;
>
> It's not different from the way page_fault_can_be_fast returns true for
> writes, even if spte_can_locklessly_be_made_writable will return false
> later.
>
> So I think Junaid's patch is okay.

Yes, it is workable.

My suggestion is just an optimization. Figuring out, earlier in
page_fault_can_be_fast(), a write access which cannot be fixed by acc-track
can stop a useless lockless page-table walk.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
  2016-12-21  9:49         ` Xiao Guangrong
@ 2016-12-21 18:00           ` Paolo Bonzini
  0 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2016-12-21 18:00 UTC (permalink / raw)
  To: Xiao Guangrong, Junaid Shahid, kvm; +Cc: andreslc, pfeiner



On 21/12/2016 10:49, Xiao Guangrong wrote:
> 
> 
> On 12/16/2016 11:23 PM, Paolo Bonzini wrote:
>>
>>
>> On 16/12/2016 14:04, Xiao Guangrong wrote:
>>>> +    /*
>>>> +     * #PF can be fast if:
>>>> +     * 1. The shadow page table entry is not present, which could
>>>> mean that
>>>> +     *    the fault is potentially caused by access tracking (if
>>>> enabled).
>>>> +     * 2. The shadow page table entry is present and the fault
>>>> +     *    is caused by write-protect, that means we just need
>>>> change the W
>>>> +     *    bit of the spte which can be done out of mmu-lock.
>>>> +     *
>>>> +     * However, if access tracking is disabled we know that a
>>>> non-present
>>>> +     * page must be a genuine page fault where we have to create a
>>>> new SPTE.
>>>> +     * So, if access tracking is disabled, we return true only for
>>>> write
>>>> +     * accesses to a present page.
>>>> +     */
>>>> +
>>>> +    return shadow_acc_track_mask != 0 ||
>>>> +           ((error_code & (PFERR_WRITE_MASK | PFERR_PRESENT_MASK))
>>>> +        == (PFERR_WRITE_MASK | PFERR_PRESENT_MASK));
>>>
>>> acc-track cannot fix a WRITE access; this should be:
>>>
>>> !(error_code & (PFERR_WRITE_MASK)) && shadow_acc_track_mask != 0 || ...
>>
>> Access tracking makes pages non-present, so a !W !P fault can sometimes
>> be fixed.
>>
>> One possibility is to test is_access_track_pte, but it is handled a
>> little below the call to page_fault_can_be_fast:
>>
>>             remove_acc_track = is_access_track_spte(spte);
>>
>>             /* Verify that the fault can be handled in the fast path */
>>             if (!remove_acc_track && !remove_write_prot)
>>                     break;
>>
>> It's not different from the way page_fault_can_be_fast returns true for
>> writes, even if spte_can_locklessly_be_made_writable will return false
>> later.
>>
>> So I think Junaid's patch is okay.
> 
> Yes, it is workable.
> 
> My suggestion is just an optimization. Figuring out, earlier in
> page_fault_can_be_fast(), a write access which cannot be fixed by acc-track
> can stop a useless lockless page-table walk.

That optimization can be done as a follow-up, but your suggestion was
not complete.  page_fault_can_be_fast must be conservative and return
true if unsure.

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2016-12-21 18:00 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-27  2:19 [PATCH 0/4] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
2016-10-27  2:19 ` [PATCH 1/4] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
2016-11-02 18:03   ` Paolo Bonzini
2016-11-02 21:40     ` Junaid Shahid
2016-10-27  2:19 ` [PATCH 2/4] kvm: x86: mmu: Rename spte_is_locklessly_modifiable() Junaid Shahid
2016-10-27  2:19 ` [PATCH 3/4] kvm: x86: mmu: Fast Page Fault path retries Junaid Shahid
2016-10-27  2:19 ` [PATCH 4/4] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
2016-11-02 18:01   ` Paolo Bonzini
2016-11-02 21:42     ` Junaid Shahid
2016-11-08 23:00 ` [PATCH v2 0/5] Lockless Access Tracking " Junaid Shahid
2016-11-08 23:00   ` [PATCH v2 1/5] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
2016-11-21 13:06     ` Paolo Bonzini
2016-11-08 23:00   ` [PATCH v2 2/5] kvm: x86: mmu: Rename spte_is_locklessly_modifiable() Junaid Shahid
2016-11-21 13:07     ` Paolo Bonzini
2016-11-08 23:00   ` [PATCH v2 3/5] kvm: x86: mmu: Fast Page Fault path retries Junaid Shahid
2016-11-21 13:13     ` Paolo Bonzini
2016-11-08 23:00   ` [PATCH v2 4/5] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
2016-11-21 14:42     ` Paolo Bonzini
2016-11-24  3:50       ` Junaid Shahid
2016-11-25  9:45         ` Paolo Bonzini
2016-11-29  2:43           ` Junaid Shahid
2016-11-29  8:09             ` Paolo Bonzini
2016-11-30  0:59               ` Junaid Shahid
2016-11-30 11:09                 ` Paolo Bonzini
2016-12-01 22:54       ` Junaid Shahid
2016-12-02  8:33         ` Paolo Bonzini
2016-12-05 22:57           ` Junaid Shahid
2016-11-08 23:00   ` [PATCH v2 5/5] kvm: x86: mmu: Update documentation for fast page fault mechanism Junaid Shahid
2016-12-07  0:46 ` [PATCH v3 0/8] Lockless Access Tracking for Intel CPUs without EPT A bits Junaid Shahid
2016-12-07  0:46   ` [PATCH v3 1/8] kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications Junaid Shahid
2016-12-15  6:50     ` Xiao Guangrong
2016-12-15 23:06       ` Junaid Shahid
2016-12-07  0:46   ` [PATCH v3 2/8] kvm: x86: mmu: Rename spte_is_locklessly_modifiable() Junaid Shahid
2016-12-15  6:51     ` Xiao Guangrong
2016-12-07  0:46   ` [PATCH v3 3/8] kvm: x86: mmu: Fast Page Fault path retries Junaid Shahid
2016-12-15  7:20     ` Xiao Guangrong
2016-12-15 23:36       ` Junaid Shahid
2016-12-16 13:13         ` Xiao Guangrong
2016-12-17  0:36           ` Junaid Shahid
2016-12-07  0:46   ` [PATCH v3 4/8] kvm: x86: mmu: Refactor accessed/dirty checks in mmu_spte_update/clear Junaid Shahid
2016-12-07  0:46   ` [PATCH v3 5/8] kvm: x86: mmu: Introduce a no-tracking version of mmu_spte_update Junaid Shahid
2016-12-07  0:46   ` [PATCH v3 6/8] kvm: x86: mmu: Do not use bit 63 for tracking special SPTEs Junaid Shahid
2016-12-07  0:46   ` [PATCH v3 7/8] kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits Junaid Shahid
2016-12-14 16:28     ` Paolo Bonzini
2016-12-14 22:36       ` Junaid Shahid
2016-12-14 23:35         ` Paolo Bonzini
2016-12-16 13:04     ` Xiao Guangrong
2016-12-16 15:23       ` Paolo Bonzini
2016-12-17  0:01         ` Junaid Shahid
2016-12-21  9:49         ` Xiao Guangrong
2016-12-21 18:00           ` Paolo Bonzini
2016-12-17  2:04       ` Junaid Shahid
2016-12-17 14:19         ` Paolo Bonzini
2016-12-20  3:36           ` Junaid Shahid
2016-12-20  9:01             ` Paolo Bonzini
2016-12-07  0:46   ` [PATCH v3 8/8] kvm: x86: mmu: Update documentation for fast page fault mechanism Junaid Shahid
