* [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU
@ 2021-06-11 23:56 David Matlack
  2021-06-11 23:56 ` [PATCH 1/8] KVM: x86/mmu: Refactor is_tdp_mmu_root() David Matlack
                   ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: David Matlack @ 2021-06-11 23:56 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, David Matlack

This patch series adds support for the TDP MMU in the fast_page_fault
path, which enables certain write-protection and access tracking faults
to be handled without taking the KVM MMU lock. This series brings the
performance of these faults up to par with the legacy MMU.

Design
------

This series enables the existing fast_page_fault handler to operate
independently of whether the TDP MMU is enabled or not by abstracting
out the details behind a new lockless page walk API. I tried an
alternative design where the TDP MMU provided its own fast_page_fault
handler and shared helper code for modifying the PTE. However, I decided
against that approach because it forced me to duplicate the retry loop,
resulted in calls back and forth between mmu.c and tdp_mmu.c, and made
passing around the RET_PF_* values complicated.
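
For reference, the resulting abstraction boils down to the shared helper
dispatching to the TDP MMU internally rather than at each call site. A
condensed sketch of patch 4's mmu.c change (the legacy shadow walk and
the struct shadow_page_walk bookkeeping are elided):

  static bool walk_shadow_page_lockless(struct kvm_vcpu *vcpu, u64 addr,
                                        struct shadow_page_walk *walk)
  {
          if (is_vcpu_using_tdp_mmu(vcpu))
                  return kvm_tdp_mmu_walk_lockless(vcpu, addr, walk);

          /* ... otherwise walk the legacy shadow page tables ... */
  }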

Testing
-------

Setup:
 - Ran all tests on a Cascade Lake machine.
 - Ran all tests with kvm_intel.eptad=N, kvm_intel.pml=N, kvm.tdp_mmu=N.
 - Ran all tests with kvm_intel.eptad=N, kvm_intel.pml=N, kvm.tdp_mmu=Y.

Tests:
 - Ran all KVM selftests with default arguments
 - ./access_tracking_perf_test -v 4
 - ./access_tracking_perf_test -v 4 -o
 - ./access_tracking_perf_test -v 4 -s anonymous_thp
 - ./access_tracking_perf_test -v 4 -s anonymous_thp -o
 - ./access_tracking_perf_test -v 64
 - ./dirty_log_perf_test -v 4 -s anonymous_thp
 - ./dirty_log_perf_test -v 4 -s anonymous_thp -o
 - ./dirty_log_perf_test -v 4 -o
 - ./dirty_log_perf_test -v 64

For certain tests I also collected the fast_page_fault tracepoint to
manually make sure it was getting triggered properly:

  perf record -e kvmmmu:fast_page_fault --filter "old_spte != 0" -- <test>

Performance Results
-------------------

To measure performance I ran dirty_log_perf_test and
access_tracking_perf_test with 64 vCPUs. For dirty_log_perf_test,
performance is measured by "Iteration 2 dirty memory time", the time it
takes for all vCPUs to write to their memory after it has been
write-protected. For access_tracking_perf_test, performance is measured
by "Writing to idle memory", the time it takes for all vCPUs to write to
their memory after it has been access-protected.

Both metrics improved by over 10x:

Metric                            | tdp_mmu=Y before   | tdp_mmu=Y after
--------------------------------- | ------------------ | --------------------
Iteration 2 dirty memory time     | 3.545234984s       | 0.312197959s
Writing to idle memory            | 3.249645416s       | 0.298275545s

The TDP MMU is now on par with the legacy MMU:

Metric                            | tdp_mmu=N          | tdp_mmu=Y
--------------------------------- | ------------------ | --------------------
Iteration 2 dirty memory time     | 0.300802793s       | 0.312197959s
Writing to idle memory            | 0.295591860s       | 0.298275545s

David Matlack (8):
  KVM: x86/mmu: Refactor is_tdp_mmu_root()
  KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault
  KVM: x86/mmu: Fix use of enums in trace_fast_page_fault
  KVM: x86/mmu: Common API for lockless shadow page walks
  KVM: x86/mmu: Also record spteps in shadow_page_walk
  KVM: x86/mmu: fast_page_fault support for the TDP MMU
  KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing
  KVM: selftests: Introduce access_tracking_perf_test

 arch/x86/kvm/mmu/mmu.c                        | 159 +++----
 arch/x86/kvm/mmu/mmu_internal.h               |  18 +
 arch/x86/kvm/mmu/mmutrace.h                   |   3 +
 arch/x86/kvm/mmu/tdp_mmu.c                    |  37 +-
 arch/x86/kvm/mmu/tdp_mmu.h                    |  14 +-
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   3 +
 .../selftests/kvm/access_tracking_perf_test.c | 419 ++++++++++++++++++
 .../selftests/kvm/dirty_log_perf_test.c       |   1 +
 9 files changed, 559 insertions(+), 96 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/access_tracking_perf_test.c

-- 
2.32.0.272.g935e593368-goog



* [PATCH 1/8] KVM: x86/mmu: Refactor is_tdp_mmu_root()
  2021-06-11 23:56 [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
@ 2021-06-11 23:56 ` David Matlack
  2021-06-14 17:56   ` Ben Gardon
  2021-06-14 19:07   ` Sean Christopherson
  2021-06-11 23:56 ` [PATCH 2/8] KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault David Matlack
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 32+ messages in thread
From: David Matlack @ 2021-06-11 23:56 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, David Matlack

Refactor is_tdp_mmu_root() into is_vcpu_using_tdp_mmu() to reduce
duplicated code at call sites and make the code more readable.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 10 +++++-----
 arch/x86/kvm/mmu/tdp_mmu.c |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.h |  8 +++++---
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0144c40d09c7..eccd889d20a5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3545,7 +3545,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
 		return reserved;
 	}
 
-	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+	if (is_vcpu_using_tdp_mmu(vcpu))
 		leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root);
 	else
 		leaf = get_walk(vcpu, addr, sptes, &root);
@@ -3729,7 +3729,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	if (page_fault_handle_page_track(vcpu, error_code, gfn))
 		return RET_PF_EMULATE;
 
-	if (!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)) {
+	if (!is_vcpu_using_tdp_mmu(vcpu)) {
 		r = fast_page_fault(vcpu, gpa, error_code);
 		if (r != RET_PF_INVALID)
 			return r;
@@ -3751,7 +3751,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 
 	r = RET_PF_RETRY;
 
-	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+	if (is_vcpu_using_tdp_mmu(vcpu))
 		read_lock(&vcpu->kvm->mmu_lock);
 	else
 		write_lock(&vcpu->kvm->mmu_lock);
@@ -3762,7 +3762,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	if (r)
 		goto out_unlock;
 
-	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+	if (is_vcpu_using_tdp_mmu(vcpu))
 		r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
 				    pfn, prefault);
 	else
@@ -3770,7 +3770,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 				 prefault, is_tdp);
 
 out_unlock:
-	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+	if (is_vcpu_using_tdp_mmu(vcpu))
 		read_unlock(&vcpu->kvm->mmu_lock);
 	else
 		write_unlock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 237317b1eddd..f4cc79dabeae 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -979,7 +979,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 
 	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
 		return RET_PF_RETRY;
-	if (WARN_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)))
+	if (WARN_ON(!is_vcpu_using_tdp_mmu(vcpu)))
 		return RET_PF_RETRY;
 
 	level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 5fdf63090451..c8cf12809fcf 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -91,16 +91,18 @@ static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { return false; }
 static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
 #endif
 
-static inline bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
+static inline bool is_vcpu_using_tdp_mmu(struct kvm_vcpu *vcpu)
 {
+	struct kvm *kvm = vcpu->kvm;
 	struct kvm_mmu_page *sp;
+	hpa_t root_hpa = vcpu->arch.mmu->root_hpa;
 
 	if (!is_tdp_mmu_enabled(kvm))
 		return false;
-	if (WARN_ON(!VALID_PAGE(hpa)))
+	if (WARN_ON(!VALID_PAGE(root_hpa)))
 		return false;
 
-	sp = to_shadow_page(hpa);
+	sp = to_shadow_page(root_hpa);
 	if (WARN_ON(!sp))
 		return false;
 
-- 
2.32.0.272.g935e593368-goog



* [PATCH 2/8] KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault
  2021-06-11 23:56 [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
  2021-06-11 23:56 ` [PATCH 1/8] KVM: x86/mmu: Refactor is_tdp_mmu_root() David Matlack
@ 2021-06-11 23:56 ` David Matlack
  2021-06-14 17:56   ` Ben Gardon
  2021-06-11 23:56 ` [PATCH 3/8] KVM: x86/mmu: Fix use of enums in trace_fast_page_fault David Matlack
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: David Matlack @ 2021-06-11 23:56 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, David Matlack

fast_page_fault() is only called from direct_page_fault(), where we know
the address is a GPA, so rename the cr2_or_gpa parameter to gpa.

Fixes: 736c291c9f36 ("KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM")
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index eccd889d20a5..1d0fe1445e04 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3007,8 +3007,7 @@ static bool is_access_allowed(u32 fault_err_code, u64 spte)
 /*
  * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS.
  */
-static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
-			   u32 error_code)
+static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 {
 	struct kvm_shadow_walk_iterator iterator;
 	struct kvm_mmu_page *sp;
@@ -3024,7 +3023,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	do {
 		u64 new_spte;
 
-		for_each_shadow_entry_lockless(vcpu, cr2_or_gpa, iterator, spte)
+		for_each_shadow_entry_lockless(vcpu, gpa, iterator, spte)
 			if (!is_shadow_present_pte(spte))
 				break;
 
@@ -3103,8 +3102,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
 	} while (true);
 
-	trace_fast_page_fault(vcpu, cr2_or_gpa, error_code, iterator.sptep,
-			      spte, ret);
+	trace_fast_page_fault(vcpu, gpa, error_code, iterator.sptep, spte, ret);
 	walk_shadow_page_lockless_end(vcpu);
 
 	return ret;
-- 
2.32.0.272.g935e593368-goog



* [PATCH 3/8] KVM: x86/mmu: Fix use of enums in trace_fast_page_fault
  2021-06-11 23:56 [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
  2021-06-11 23:56 ` [PATCH 1/8] KVM: x86/mmu: Refactor is_tdp_mmu_root() David Matlack
  2021-06-11 23:56 ` [PATCH 2/8] KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault David Matlack
@ 2021-06-11 23:56 ` David Matlack
  2021-06-11 23:56 ` [PATCH 4/8] KVM: x86/mmu: Common API for lockless shadow page walks David Matlack
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: David Matlack @ 2021-06-11 23:56 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, David Matlack

Enum values have to be exported to userspace since the tracepoint
formatting is not done in the kernel. Without doing this, perf maps
RET_PF_FIXED and RET_PF_SPURIOUS to 0, which results in incorrect
output:

  $ perf record -a -e kvmmmu:fast_page_fault --filter "ret==3" -- ./access_tracking_perf_test
  $ perf script | head -1
   [...] new 610006048d25877 spurious 0 fixed 0  <------ should be 1

Fix this by exporting the enum values to userspace with TRACE_DEFINE_ENUM.

Fixes: c4371c2a682e ("KVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed")
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmutrace.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index e798489b56b5..669b1405c60d 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -244,6 +244,9 @@ TRACE_EVENT(
 		  __entry->access)
 );
 
+TRACE_DEFINE_ENUM(RET_PF_FIXED);
+TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
+
 TRACE_EVENT(
 	fast_page_fault,
 	TP_PROTO(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u32 error_code,
-- 
2.32.0.272.g935e593368-goog



* [PATCH 4/8] KVM: x86/mmu: Common API for lockless shadow page walks
  2021-06-11 23:56 [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
                   ` (2 preceding siblings ...)
  2021-06-11 23:56 ` [PATCH 3/8] KVM: x86/mmu: Fix use of enums in trace_fast_page_fault David Matlack
@ 2021-06-11 23:56 ` David Matlack
  2021-06-14 17:56   ` Ben Gardon
  2021-06-11 23:56 ` [PATCH 5/8] KVM: x86/mmu: Also record spteps in shadow_page_walk David Matlack
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: David Matlack @ 2021-06-11 23:56 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, David Matlack

Introduce a common API for walking the shadow page tables locklessly
that abstracts away whether the TDP MMU is enabled or not. This will be
used in a follow-up patch to support the TDP MMU in fast_page_fault.

The API can be used as follows:

  struct shadow_page_walk walk;

  walk_shadow_page_lockless_begin(vcpu);
  if (!walk_shadow_page_lockless(vcpu, addr, &walk))
    goto out;

  ... use `walk` ...

out:
  walk_shadow_page_lockless_end(vcpu);

Note: Separating walk_shadow_page_lockless_begin() from
walk_shadow_page_lockless() seems superfluous at first glance but is
needed to support fast_page_fault() since it performs multiple walks
under the same begin/end block.
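
For example, once patch 6 converts fast_page_fault(), all of its walks
happen inside the retry loop under a single begin/end pair. Heavily
condensed from that patch:

  walk_shadow_page_lockless_begin(vcpu);

  do {
          struct shadow_page_walk walk;

          if (!walk_shadow_page_lockless(vcpu, gpa, &walk))
                  break;

          /* ... try to fix the spte, retrying if the cmpxchg fails ... */
  } while (true);

  walk_shadow_page_lockless_end(vcpu);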

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 96 ++++++++++++++++++++-------------
 arch/x86/kvm/mmu/mmu_internal.h | 15 ++++++
 arch/x86/kvm/mmu/tdp_mmu.c      | 34 ++++++------
 arch/x86/kvm/mmu/tdp_mmu.h      |  6 ++-
 4 files changed, 96 insertions(+), 55 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1d0fe1445e04..8140c262f4d3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -623,6 +623,11 @@ static bool mmu_spte_age(u64 *sptep)
 
 static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
 {
+	if (is_vcpu_using_tdp_mmu(vcpu)) {
+		kvm_tdp_mmu_walk_lockless_begin();
+		return;
+	}
+
 	/*
 	 * Prevent page table teardown by making any free-er wait during
 	 * kvm_flush_remote_tlbs() IPI to all active vcpus.
@@ -638,6 +643,11 @@ static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
 
 static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 {
+	if (is_vcpu_using_tdp_mmu(vcpu)) {
+		kvm_tdp_mmu_walk_lockless_end();
+		return;
+	}
+
 	/*
 	 * Make sure the write to vcpu->mode is not reordered in front of
 	 * reads to sptes.  If it does, kvm_mmu_commit_zap_page() can see us
@@ -3501,59 +3511,61 @@ static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 }
 
 /*
- * Return the level of the lowest level SPTE added to sptes.
- * That SPTE may be non-present.
+ * Walks the shadow page table for the given address until a leaf or non-present
+ * spte is encountered.
+ *
+ * Returns false if no walk could be performed, in which case `walk` does not
+ * contain any valid data.
+ *
+ * Must be called between walk_shadow_page_lockless_{begin,end}.
  */
-static int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level)
+static bool walk_shadow_page_lockless(struct kvm_vcpu *vcpu, u64 addr,
+				      struct shadow_page_walk *walk)
 {
-	struct kvm_shadow_walk_iterator iterator;
-	int leaf = -1;
+	struct kvm_shadow_walk_iterator it;
+	bool walk_ok = false;
 	u64 spte;
 
-	walk_shadow_page_lockless_begin(vcpu);
+	if (is_vcpu_using_tdp_mmu(vcpu))
+		return kvm_tdp_mmu_walk_lockless(vcpu, addr, walk);
 
-	for (shadow_walk_init(&iterator, vcpu, addr),
-	     *root_level = iterator.level;
-	     shadow_walk_okay(&iterator);
-	     __shadow_walk_next(&iterator, spte)) {
-		leaf = iterator.level;
-		spte = mmu_spte_get_lockless(iterator.sptep);
+	shadow_walk_init(&it, vcpu, addr);
+	walk->root_level = it.level;
 
-		sptes[leaf] = spte;
+	for (; shadow_walk_okay(&it); __shadow_walk_next(&it, spte)) {
+		walk_ok = true;
+
+		spte = mmu_spte_get_lockless(it.sptep);
+		walk->last_level = it.level;
+		walk->sptes[it.level] = spte;
 
 		if (!is_shadow_present_pte(spte))
 			break;
 	}
 
-	walk_shadow_page_lockless_end(vcpu);
-
-	return leaf;
+	return walk_ok;
 }
 
 /* return true if reserved bit(s) are detected on a valid, non-MMIO SPTE. */
 static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
 {
-	u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
+	struct shadow_page_walk walk;
 	struct rsvd_bits_validate *rsvd_check;
-	int root, leaf, level;
+	int last_level, level;
 	bool reserved = false;
 
-	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa)) {
-		*sptep = 0ull;
+	*sptep = 0ull;
+
+	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
 		return reserved;
-	}
 
-	if (is_vcpu_using_tdp_mmu(vcpu))
-		leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root);
-	else
-		leaf = get_walk(vcpu, addr, sptes, &root);
+	walk_shadow_page_lockless_begin(vcpu);
 
-	if (unlikely(leaf < 0)) {
-		*sptep = 0ull;
-		return reserved;
-	}
+	if (!walk_shadow_page_lockless(vcpu, addr, &walk))
+		goto out;
 
-	*sptep = sptes[leaf];
+	last_level = walk.last_level;
+	*sptep = walk.sptes[last_level];
 
 	/*
 	 * Skip reserved bits checks on the terminal leaf if it's not a valid
@@ -3561,29 +3573,37 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
 	 * design, always have reserved bits set.  The purpose of the checks is
 	 * to detect reserved bits on non-MMIO SPTEs. i.e. buggy SPTEs.
 	 */
-	if (!is_shadow_present_pte(sptes[leaf]))
-		leaf++;
+	if (!is_shadow_present_pte(walk.sptes[last_level]))
+		last_level++;
 
 	rsvd_check = &vcpu->arch.mmu->shadow_zero_check;
 
-	for (level = root; level >= leaf; level--)
+	for (level = walk.root_level; level >= last_level; level--) {
+		u64 spte = walk.sptes[level];
+
 		/*
 		 * Use a bitwise-OR instead of a logical-OR to aggregate the
 		 * reserved bit and EPT's invalid memtype/XWR checks to avoid
 		 * adding a Jcc in the loop.
 		 */
-		reserved |= __is_bad_mt_xwr(rsvd_check, sptes[level]) |
-			    __is_rsvd_bits_set(rsvd_check, sptes[level], level);
+		reserved |= __is_bad_mt_xwr(rsvd_check, spte) |
+			    __is_rsvd_bits_set(rsvd_check, spte, level);
+	}
 
 	if (reserved) {
 		pr_err("%s: reserved bits set on MMU-present spte, addr 0x%llx, hierarchy:\n",
 		       __func__, addr);
-		for (level = root; level >= leaf; level--)
+		for (level = walk.root_level; level >= last_level; level--) {
+			u64 spte = walk.sptes[level];
+
 			pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
-			       sptes[level], level,
-			       rsvd_check->rsvd_bits_mask[(sptes[level] >> 7) & 1][level-1]);
+			       spte, level,
+			       rsvd_check->rsvd_bits_mask[(spte >> 7) & 1][level-1]);
+		}
 	}
 
+out:
+	walk_shadow_page_lockless_end(vcpu);
 	return reserved;
 }
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d64ccb417c60..26da6ca30fbf 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -165,4 +165,19 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+struct shadow_page_walk {
+	/* The level of the root spte in the walk. */
+	int root_level;
+
+	/*
+	 * The level of the last spte in the walk. The last spte is either the
+	 * leaf of the walk (which may or may not be present) or the first
+	 * non-present spte encountered during the walk.
+	 */
+	int last_level;
+
+	/* The spte value at each level. */
+	u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
+};
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f4cc79dabeae..36f4844a5f95 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1504,28 +1504,32 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 	return spte_set;
 }
 
-/*
- * Return the level of the lowest level SPTE added to sptes.
- * That SPTE may be non-present.
- */
-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
-			 int *root_level)
+void kvm_tdp_mmu_walk_lockless_begin(void)
+{
+	rcu_read_lock();
+}
+
+void kvm_tdp_mmu_walk_lockless_end(void)
+{
+	rcu_read_unlock();
+}
+
+bool kvm_tdp_mmu_walk_lockless(struct kvm_vcpu *vcpu, u64 addr,
+			       struct shadow_page_walk *walk)
 {
 	struct tdp_iter iter;
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	gfn_t gfn = addr >> PAGE_SHIFT;
-	int leaf = -1;
+	bool walk_ok = false;
 
-	*root_level = vcpu->arch.mmu->shadow_root_level;
-
-	rcu_read_lock();
+	walk->root_level = vcpu->arch.mmu->shadow_root_level;
 
 	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
-		leaf = iter.level;
-		sptes[leaf] = iter.old_spte;
-	}
+		walk_ok = true;
 
-	rcu_read_unlock();
+		walk->last_level = iter.level;
+		walk->sptes[iter.level] = iter.old_spte;
+	}
 
-	return leaf;
+	return walk_ok;
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index c8cf12809fcf..772d11bbb92a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -76,8 +76,10 @@ bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn);
 
-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
-			 int *root_level);
+void kvm_tdp_mmu_walk_lockless_begin(void);
+void kvm_tdp_mmu_walk_lockless_end(void);
+bool kvm_tdp_mmu_walk_lockless(struct kvm_vcpu *vcpu, u64 addr,
+			       struct shadow_page_walk *walk);
 
 #ifdef CONFIG_X86_64
 void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
-- 
2.32.0.272.g935e593368-goog



* [PATCH 5/8] KVM: x86/mmu: Also record spteps in shadow_page_walk
  2021-06-11 23:56 [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
                   ` (3 preceding siblings ...)
  2021-06-11 23:56 ` [PATCH 4/8] KVM: x86/mmu: Common API for lockless shadow page walks David Matlack
@ 2021-06-11 23:56 ` David Matlack
  2021-06-14 17:56   ` Ben Gardon
                     ` (2 more replies)
  2021-06-11 23:56 ` [PATCH 6/8] KVM: x86/mmu: fast_page_fault support for the TDP MMU David Matlack
                   ` (3 subsequent siblings)
  8 siblings, 3 replies; 32+ messages in thread
From: David Matlack @ 2021-06-11 23:56 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, David Matlack

In order to use walk_shadow_page_lockless() in fast_page_fault() we need
to also record the spteps.
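
For context, the recorded sptep is what fast_page_fault() will hand to
fast_pf_fix_direct_spte() once patch 6 converts it to the new API.
Condensed from that patch:

  spte = walk.sptes[walk.last_level];
  sptep = walk.spteps[walk.last_level];
  ...
  if (fast_pf_fix_direct_spte(vcpu, gpa, sptep, spte, new_spte))
          ret = RET_PF_FIXED;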

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 1 +
 arch/x86/kvm/mmu/mmu_internal.h | 3 +++
 arch/x86/kvm/mmu/tdp_mmu.c      | 1 +
 3 files changed, 5 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8140c262f4d3..765f5b01768d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3538,6 +3538,7 @@ static bool walk_shadow_page_lockless(struct kvm_vcpu *vcpu, u64 addr,
 		spte = mmu_spte_get_lockless(it.sptep);
 		walk->last_level = it.level;
 		walk->sptes[it.level] = spte;
+		walk->spteps[it.level] = it.sptep;
 
 		if (!is_shadow_present_pte(spte))
 			break;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 26da6ca30fbf..0fefbd5d6c95 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -178,6 +178,9 @@ struct shadow_page_walk {
 
 	/* The spte value at each level. */
 	u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
+
+	/* The spte pointers at each level. */
+	u64 *spteps[PT64_ROOT_MAX_LEVEL + 1];
 };
 
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 36f4844a5f95..7279d17817a1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1529,6 +1529,7 @@ bool kvm_tdp_mmu_walk_lockless(struct kvm_vcpu *vcpu, u64 addr,
 
 		walk->last_level = iter.level;
 		walk->sptes[iter.level] = iter.old_spte;
+		walk->spteps[iter.level] = iter.sptep;
 	}
 
 	return walk_ok;
-- 
2.32.0.272.g935e593368-goog



* [PATCH 6/8] KVM: x86/mmu: fast_page_fault support for the TDP MMU
  2021-06-11 23:56 [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
                   ` (4 preceding siblings ...)
  2021-06-11 23:56 ` [PATCH 5/8] KVM: x86/mmu: Also record spteps in shadow_page_walk David Matlack
@ 2021-06-11 23:56 ` David Matlack
  2021-06-11 23:59   ` David Matlack
  2021-06-11 23:57 ` [PATCH 7/8] KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing David Matlack
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: David Matlack @ 2021-06-11 23:56 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, David Matlack

This commit enables the fast_page_fault handler to work when the TDP MMU
is enabled by leveraging the new walk_shadow_page_lockless* API to
collect page walks independently of the TDP MMU.

fast_page_fault was already using
walk_shadow_page_lockless_{begin,end}(); we just have to change the
actual walk to use walk_shadow_page_lockless(), which does the right
thing if the TDP MMU is in use.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 52 +++++++++++++++++-------------------------
 1 file changed, 21 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 765f5b01768d..5562727c3699 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -657,6 +657,9 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 	local_irq_enable();
 }
 
+static bool walk_shadow_page_lockless(struct kvm_vcpu *vcpu, u64 addr,
+				      struct shadow_page_walk *walk);
+
 static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 {
 	int r;
@@ -2967,14 +2970,9 @@ static bool page_fault_can_be_fast(u32 error_code)
  * Returns true if the SPTE was fixed successfully. Otherwise,
  * someone else modified the SPTE from its original value.
  */
-static bool
-fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
-			u64 *sptep, u64 old_spte, u64 new_spte)
+static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, gpa_t gpa,
+				    u64 *sptep, u64 old_spte, u64 new_spte)
 {
-	gfn_t gfn;
-
-	WARN_ON(!sp->role.direct);
-
 	/*
 	 * Theoretically we could also set dirty bit (and flush TLB) here in
 	 * order to eliminate unnecessary PML logging. See comments in
@@ -2990,14 +2988,8 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
 		return false;
 
-	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) {
-		/*
-		 * The gfn of direct spte is stable since it is
-		 * calculated by sp->gfn.
-		 */
-		gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
-		kvm_vcpu_mark_page_dirty(vcpu, gfn);
-	}
+	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
+		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
 
 	return true;
 }
@@ -3019,10 +3011,9 @@ static bool is_access_allowed(u32 fault_err_code, u64 spte)
  */
 static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 {
-	struct kvm_shadow_walk_iterator iterator;
-	struct kvm_mmu_page *sp;
 	int ret = RET_PF_INVALID;
 	u64 spte = 0ull;
+	u64 *sptep = NULL;
 	uint retry_count = 0;
 
 	if (!page_fault_can_be_fast(error_code))
@@ -3031,17 +3022,19 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 	walk_shadow_page_lockless_begin(vcpu);
 
 	do {
+		struct shadow_page_walk walk;
 		u64 new_spte;
 
-		for_each_shadow_entry_lockless(vcpu, gpa, iterator, spte)
-			if (!is_shadow_present_pte(spte))
-				break;
+		if (!walk_shadow_page_lockless(vcpu, gpa, &walk))
+			break;
+
+		spte = walk.sptes[walk.last_level];
+		sptep = walk.spteps[walk.last_level];
 
 		if (!is_shadow_present_pte(spte))
 			break;
 
-		sp = sptep_to_sp(iterator.sptep);
-		if (!is_last_spte(spte, sp->role.level))
+		if (!is_last_spte(spte, walk.last_level))
 			break;
 
 		/*
@@ -3084,7 +3077,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 			 *
 			 * See the comments in kvm_arch_commit_memory_region().
 			 */
-			if (sp->role.level > PG_LEVEL_4K)
+			if (walk.last_level > PG_LEVEL_4K)
 				break;
 		}
 
@@ -3098,8 +3091,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 		 * since the gfn is not stable for indirect shadow page. See
 		 * Documentation/virt/kvm/locking.rst to get more detail.
 		 */
-		if (fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte,
-					    new_spte)) {
+		if (fast_pf_fix_direct_spte(vcpu, gpa, sptep, spte, new_spte)) {
 			ret = RET_PF_FIXED;
 			break;
 		}
@@ -3112,7 +3104,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 
 	} while (true);
 
-	trace_fast_page_fault(vcpu, gpa, error_code, iterator.sptep, spte, ret);
+	trace_fast_page_fault(vcpu, gpa, error_code, sptep, spte, ret);
 	walk_shadow_page_lockless_end(vcpu);
 
 	return ret;
@@ -3748,11 +3740,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	if (page_fault_handle_page_track(vcpu, error_code, gfn))
 		return RET_PF_EMULATE;
 
-	if (!is_vcpu_using_tdp_mmu(vcpu)) {
-		r = fast_page_fault(vcpu, gpa, error_code);
-		if (r != RET_PF_INVALID)
-			return r;
-	}
+	r = fast_page_fault(vcpu, gpa, error_code);
+	if (r != RET_PF_INVALID)
+		return r;
 
 	r = mmu_topup_memory_caches(vcpu, false);
 	if (r)
-- 
2.32.0.272.g935e593368-goog



* [PATCH 7/8] KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing
  2021-06-11 23:56 [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
                   ` (5 preceding siblings ...)
  2021-06-11 23:56 ` [PATCH 6/8] KVM: x86/mmu: fast_page_fault support for the TDP MMU David Matlack
@ 2021-06-11 23:57 ` David Matlack
  2021-06-14 17:56   ` Ben Gardon
  2021-06-11 23:57 ` [PATCH 8/8] KVM: selftests: Introduce access_tracking_perf_test David Matlack
  2021-06-14  9:54 ` [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU Paolo Bonzini
  8 siblings, 1 reply; 32+ messages in thread
From: David Matlack @ 2021-06-11 23:57 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, David Matlack

There is a missing break statement which causes a fall through to the
next case, where optarg will be NULL and a segmentation fault will be
generated.

Fixes: 9e965bb75aae ("KVM: selftests: Add backing src parameter to dirty_log_perf_test")
Signed-off-by: David Matlack <dmatlack@google.com>
---
 tools/testing/selftests/kvm/dirty_log_perf_test.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index 04a2641261be..80cbd3a748c0 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -312,6 +312,7 @@ int main(int argc, char *argv[])
 			break;
 		case 'o':
 			p.partition_vcpu_memory_access = false;
+			break;
 		case 's':
 			p.backing_src = parse_backing_src_type(optarg);
 			break;
-- 
2.32.0.272.g935e593368-goog



* [PATCH 8/8] KVM: selftests: Introduce access_tracking_perf_test
  2021-06-11 23:56 [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
                   ` (6 preceding siblings ...)
  2021-06-11 23:57 ` [PATCH 7/8] KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing David Matlack
@ 2021-06-11 23:57 ` David Matlack
  2021-06-14 17:56   ` Ben Gardon
  2021-06-14  9:54 ` [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU Paolo Bonzini
  8 siblings, 1 reply; 32+ messages in thread
From: David Matlack @ 2021-06-11 23:57 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, David Matlack

This test measures the performance effects of KVM's access tracking.
Access tracking is driven by the MMU notifiers test_young, clear_young,
and clear_flush_young. These notifiers do not have a direct userspace
API; however, the clear_young notifier can be triggered by marking
pages as idle in /sys/kernel/mm/page_idle/bitmap. This test leverages
that mechanism to enable access tracking on guest memory.
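
Concretely, the test looks up the host PFN backing each guest page via
/proc/self/pagemap and then sets that PFN's bit in the page_idle bitmap,
which is what triggers the clear_young notifier. A condensed sketch of
the bit-setting step (the full helpers are in the patch below):

  /* Each 64-bit word in the bitmap covers 64 PFNs; set the bit for pfn. */
  uint64_t bits = 1ULL << (pfn % 64);

  pwrite(page_idle_fd, &bits, 8, 8 * (pfn / 64));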

To measure performance this test runs a VM with a configurable number of
vCPUs that each touch every page in disjoint regions of memory.
Performance is measured in the time it takes all vCPUs to finish
touching their predefined region.

Example invocation:

  $ ./access_tracking_perf_test -v 8
  Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
  guest physical test memory offset: 0xffdfffff000

  Populating memory             : 1.337752570s
  Writing to populated memory   : 0.010177640s
  Reading from populated memory : 0.009548239s
  Mark memory idle              : 23.973131748s
  Writing to idle memory        : 0.063584496s
  Mark memory idle              : 24.924652964s
  Reading from idle memory      : 0.062042814s

Breaking down the results:

 * "Populating memory": The time it takes for all vCPUs to perform the
   first write to every page in their region.

 * "Writing to populated memory" / "Reading from populated memory": The
   time it takes for all vCPUs to write and read to every page in their
   region after it has been populated. This serves as a control for the
   later results.

 * "Mark memory idle": The time it takes for every vCPU to mark every
   page in their region as idle through page_idle.

 * "Writing to idle memory" / "Reading from idle memory": The time it
   takes for all vCPUs to write and read to every page in their region
   after it has been marked idle.

This test should be portable across architectures but it is only enabled
for x86_64 since that's all I have tested.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/access_tracking_perf_test.c | 419 ++++++++++++++++++
 3 files changed, 421 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/access_tracking_perf_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index bd83158e0e0b..32a362d71e05 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -34,6 +34,7 @@
 /x86_64/xen_vmcall_test
 /x86_64/xss_msr_test
 /x86_64/vmx_pmu_msrs_test
+/access_tracking_perf_test
 /demand_paging_test
 /dirty_log_test
 /dirty_log_perf_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index e439d027939d..9f1b478da92b 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -67,6 +67,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_pmu_msrs_test
 TEST_GEN_PROGS_x86_64 += x86_64/xen_shinfo_test
 TEST_GEN_PROGS_x86_64 += x86_64/xen_vmcall_test
+TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
new file mode 100644
index 000000000000..60828f2d780f
--- /dev/null
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -0,0 +1,419 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * access_tracking_test
+ *
+ * Copyright (C) 2021, Google, Inc.
+ *
+ * This test measures the performance effects of KVM's access tracking.
+ * Access tracking is driven by the MMU notifiers test_young, clear_young, and
+ * clear_flush_young. These notifiers do not have a direct userspace API,
+ * however the clear_young notifier can be triggered by marking a page as idle
+ * in /sys/kernel/mm/page_idle/bitmap. This test leverages that mechanism to
+ * enable access tracking on guest memory.
+ *
+ * To measure performance this test runs a VM with a configurable number of
+ * vCPUs that each touch every page in disjoint regions of memory. Performance
+ * is measured in the time it takes all vCPUs to finish touching their
+ * predefined region.
+ *
+ * Note that a deterministic correctness test of access tracking is not possible
+ * by using page_idle as it exists today. This is for a few reasons:
+ *
+ * 1. page_idle only issues clear_young notifiers, which lack a TLB flush. This
+ *    means subsequent guest accesses are not guaranteed to see page table
+ *    updates made by KVM until some time in the future.
+ *
+ * 2. page_idle only operates on LRU pages. Newly allocated pages are not
+ *    immediately allocated to LRU lists. Instead they are held in a "pagevec",
+ *    which is drained to LRU lists some time in the future. There is no
+ *    userspace API to force this drain to occur.
+ *
+ * These limitations are worked around in this test by using a large enough
+ * region of memory for each vCPU such that the number of translations cached in
+ * the TLB and the number of pages held in pagevecs are a small fraction of the
+ * overall workload. And if either of those conditions are not true this test
+ * will fail rather than silently passing.
+ */
+#include <inttypes.h>
+#include <limits.h>
+#include <pthread.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include "kvm_util.h"
+#include "test_util.h"
+#include "perf_test_util.h"
+#include "guest_modes.h"
+
+/* Global variable used to synchronize all of the vCPU threads. */
+static int iteration = -1;
+
+/* Defines what vCPU threads should do during a given iteration. */
+static enum {
+	/* Run the vCPU to access all its memory. */
+	ITERATION_ACCESS_MEMORY,
+	/* Mark the vCPU's memory idle in page_idle. */
+	ITERATION_MARK_IDLE,
+} iteration_work;
+
+/* Set to true when vCPU threads should exit. */
+static bool done;
+
+/* The iteration that was last completed by each vCPU. */
+static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
+
+/* Whether to overlap the regions of memory vCPUs access. */
+static bool overlap_memory_access;
+
+struct test_params {
+	/* The backing source for the region of memory. */
+	enum vm_mem_backing_src_type backing_src;
+
+	/* The amount of memory to allocate for each vCPU. */
+	uint64_t vcpu_memory_bytes;
+
+	/* The number of vCPUs to create in the VM. */
+	int vcpus;
+};
+
+static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
+{
+	uint64_t value;
+	off_t offset = index * sizeof(value);
+
+	TEST_ASSERT(pread(fd, &value, sizeof(value), offset) == sizeof(value),
+		    "pread from %s offset 0x%" PRIx64 " failed!",
+		    filename, offset);
+
+	return value;
+
+}
+
+static uint64_t lookup_pfn(int pagemap_fd, struct kvm_vm *vm, uint64_t gva)
+{
+	uint64_t hva = (uint64_t) addr_gva2hva(vm, gva);
+	uint64_t entry;
+
+	entry = pread_uint64(pagemap_fd, "pagemap", hva / getpagesize());
+	if (!(entry & (1ULL << 63)))
+		return 0;
+
+	return (entry & ((1ULL << 55) - 1));
+}
+
+static bool is_page_idle(int page_idle_fd, uint64_t pfn)
+{
+	uint64_t bits = pread_uint64(page_idle_fd, "page_idle", pfn / 64);
+
+	return !!((bits >> (pfn % 64)) & 1);
+}
+
+static void mark_page_idle(int page_idle_fd, uint64_t pfn)
+{
+	uint64_t bits = 1ULL << (pfn % 64);
+
+	TEST_ASSERT(pwrite(page_idle_fd, &bits, 8, 8 * (pfn / 64)) == 8,
+		    "Set page_idle bits for PFN 0x%" PRIx64, pfn);
+}
+
+static void mark_vcpu_memory_idle(struct kvm_vm *vm, int vcpu_id)
+{
+	uint64_t base_gva = perf_test_args.vcpu_args[vcpu_id].gva;
+	uint64_t pages = perf_test_args.vcpu_args[vcpu_id].pages;
+	uint64_t page;
+	uint64_t still_idle = 0;
+	uint64_t no_pfn = 0;
+	int page_idle_fd;
+	int pagemap_fd;
+
+	/* If vCPUs are using an overlapping region, let vCPU 0 mark it idle. */
+	if (overlap_memory_access && vcpu_id)
+		return;
+
+	page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
+	TEST_ASSERT(page_idle_fd > 0, "Failed to open page_idle.");
+
+	pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+	TEST_ASSERT(pagemap_fd > 0, "Failed to open pagemap.");
+
+	for (page = 0; page < pages; page++) {
+		uint64_t gva = base_gva + page * perf_test_args.guest_page_size;
+		uint64_t pfn = lookup_pfn(pagemap_fd, vm, gva);
+
+		if (!pfn) {
+			no_pfn++;
+			continue;
+		}
+
+		if (is_page_idle(page_idle_fd, pfn)) {
+			still_idle++;
+			continue;
+		}
+
+		mark_page_idle(page_idle_fd, pfn);
+	}
+
+	/*
+	 * Assumption: Less than 1% of pages are going to be swapped out from
+	 * under us during this test.
+	 */
+	TEST_ASSERT(no_pfn < pages / 100,
+		    "vCPU %d: No PFN for %" PRIu64 " out of %" PRIu64 " pages.",
+		    vcpu_id, no_pfn, pages);
+
+	/*
+	 * Test that at least 90% of memory has been marked idle (the rest might
+	 * not be marked idle because the pages have not yet made it to an LRU
+	 * list or the translations are still cached in the TLB). 90% is
+	 * arbitrary; high enough that we ensure most memory access went through
+	 * access tracking but low enough as to not make the test too brittle
+	 * over time and across architectures.
+	 */
+	TEST_ASSERT(still_idle < pages / 10,
+		    "vCPU%d: Too many pages still idle (%"PRIu64 " out of %"
+		    PRIu64 ").\n",
+		    vcpu_id, still_idle, pages);
+
+	close(page_idle_fd);
+	close(pagemap_fd);
+}
+
+static void assert_ucall(struct kvm_vm *vm, uint32_t vcpu_id,
+			 uint64_t expected_ucall)
+{
+	struct ucall uc;
+	uint64_t actual_ucall = get_ucall(vm, vcpu_id, &uc);
+
+	TEST_ASSERT(expected_ucall == actual_ucall,
+		    "Guest exited unexpectedly (expected ucall %" PRIu64
+		    ", got %" PRIu64 ")",
+		    expected_ucall, actual_ucall);
+}
+
+static bool spin_wait_for_next_iteration(int *current_iteration)
+{
+	int last_iteration = *current_iteration;
+
+	do {
+		if (READ_ONCE(done))
+			return false;
+
+		*current_iteration = READ_ONCE(iteration);
+	} while (last_iteration == *current_iteration);
+
+	return true;
+}
+
+static void *vcpu_thread_main(void *arg)
+{
+	struct perf_test_vcpu_args *vcpu_args = arg;
+	struct kvm_vm *vm = perf_test_args.vm;
+	int vcpu_id = vcpu_args->vcpu_id;
+	int current_iteration = -1;
+
+	vcpu_args_set(vm, vcpu_id, 1, vcpu_id);
+
+	while (spin_wait_for_next_iteration(&current_iteration)) {
+		switch (READ_ONCE(iteration_work)) {
+		case ITERATION_ACCESS_MEMORY:
+			vcpu_run(vm, vcpu_id);
+			assert_ucall(vm, vcpu_id, UCALL_SYNC);
+			break;
+		case ITERATION_MARK_IDLE:
+			mark_vcpu_memory_idle(vm, vcpu_id);
+			break;
+		};
+
+		vcpu_last_completed_iteration[vcpu_id] = current_iteration;
+	}
+
+	return NULL;
+}
+
+static void spin_wait_for_vcpu(int vcpu_id, int target_iteration)
+{
+	while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) !=
+	       target_iteration) {
+		continue;
+	}
+}
+
+/* The type of memory accesses to perform in the VM. */
+enum access_type {
+	ACCESS_READ,
+	ACCESS_WRITE,
+};
+
+static void run_iteration(struct kvm_vm *vm, int vcpus, const char *description)
+{
+	struct timespec ts_start;
+	struct timespec ts_elapsed;
+	int next_iteration;
+	int vcpu_id;
+
+	/* Kick off the vCPUs by incrementing iteration. */
+	next_iteration = ++iteration;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+	/* Wait for all vCPUs to finish the iteration. */
+	for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++)
+		spin_wait_for_vcpu(vcpu_id, next_iteration);
+
+	ts_elapsed = timespec_elapsed(ts_start);
+	pr_info("%-30s: %ld.%09lds\n",
+		description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+}
+
+static void access_memory(struct kvm_vm *vm, int vcpus, enum access_type access,
+			  const char *description)
+{
+	perf_test_args.wr_fract = (access == ACCESS_READ) ? INT_MAX : 1;
+	sync_global_to_guest(vm, perf_test_args);
+	iteration_work = ITERATION_ACCESS_MEMORY;
+	run_iteration(vm, vcpus, description);
+}
+
+static void mark_memory_idle(struct kvm_vm *vm, int vcpus)
+{
+	/*
+	 * Even though this parallelizes the work across vCPUs, this is still a
+	 * very slow operation because page_idle forces the test to mark one pfn
+	 * at a time and the clear_young notifier serializes on the KVM MMU
+	 * lock.
+	 */
+	pr_debug("Marking VM memory idle (slow)...\n");
+	iteration_work = ITERATION_MARK_IDLE;
+	run_iteration(vm, vcpus, "Mark memory idle");
+}
+
+static pthread_t *create_vcpu_threads(int vcpus)
+{
+	pthread_t *vcpu_threads;
+	int i;
+
+	vcpu_threads = malloc(vcpus * sizeof(vcpu_threads[0]));
+	TEST_ASSERT(vcpu_threads, "Failed to allocate vcpu_threads.");
+
+	for (i = 0; i < vcpus; i++) {
+		vcpu_last_completed_iteration[i] = iteration;
+		pthread_create(&vcpu_threads[i], NULL, vcpu_thread_main,
+			       &perf_test_args.vcpu_args[i]);
+	}
+
+	return vcpu_threads;
+}
+
+static void terminate_vcpu_threads(pthread_t *vcpu_threads, int vcpus)
+{
+	int i;
+
+	/* Set done to signal the vCPU threads to exit */
+	done = true;
+
+	for (i = 0; i < vcpus; i++)
+		pthread_join(vcpu_threads[i], NULL);
+}
+
+static void run_test(enum vm_guest_mode mode, void *arg)
+{
+	struct test_params *params = arg;
+	struct kvm_vm *vm;
+	pthread_t *vcpu_threads;
+	int vcpus = params->vcpus;
+
+	vm = perf_test_create_vm(mode, vcpus, params->vcpu_memory_bytes,
+				 params->backing_src);
+
+	perf_test_setup_vcpus(vm, vcpus, params->vcpu_memory_bytes,
+			      !overlap_memory_access);
+
+	vcpu_threads = create_vcpu_threads(vcpus);
+
+	pr_info("\n");
+	access_memory(vm, vcpus, ACCESS_WRITE, "Populating memory");
+
+	/* As a control, read and write to the populated memory first. */
+	access_memory(vm, vcpus, ACCESS_WRITE, "Writing to populated memory");
+	access_memory(vm, vcpus, ACCESS_READ, "Reading from populated memory");
+
+	/* Repeat on memory that has been marked as idle. */
+	mark_memory_idle(vm, vcpus);
+	access_memory(vm, vcpus, ACCESS_WRITE, "Writing to idle memory");
+	mark_memory_idle(vm, vcpus);
+	access_memory(vm, vcpus, ACCESS_READ, "Reading from idle memory");
+
+	terminate_vcpu_threads(vcpu_threads, vcpus);
+	free(vcpu_threads);
+	perf_test_destroy_vm(vm);
+}
+
+static void help(char *name)
+{
+	puts("");
+	printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]  [-s mem_type]\n",
+	       name);
+	puts("");
+	printf(" -h: Display this help message.");
+	guest_modes_help();
+	printf(" -b: specify the size of the memory region which should be\n"
+	       "     dirtied by each vCPU. e.g. 10M or 3G.\n"
+	       "     (default: 1G)\n");
+	printf(" -v: specify the number of vCPUs to run.\n");
+	printf(" -o: Overlap guest memory accesses instead of partitioning\n"
+	       "     them into a separate region of memory for each vCPU.\n");
+	printf(" -s: specify the type of memory that should be used to\n"
+	       "     back the guest data region.\n\n");
+	backing_src_help();
+	puts("");
+	exit(0);
+}
+
+int main(int argc, char *argv[])
+{
+	struct test_params params = {
+		.backing_src = VM_MEM_SRC_ANONYMOUS,
+		.vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
+		.vcpus = 1,
+	};
+	int page_idle_fd;
+	int opt;
+
+	guest_modes_append_default();
+
+	while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
+		switch (opt) {
+		case 'm':
+			guest_modes_cmdline(optarg);
+			break;
+		case 'b':
+			params.vcpu_memory_bytes = parse_size(optarg);
+			break;
+		case 'v':
+			params.vcpus = atoi(optarg);
+			break;
+		case 'o':
+			overlap_memory_access = true;
+			break;
+		case 's':
+			params.backing_src = parse_backing_src_type(optarg);
+			break;
+		case 'h':
+		default:
+			help(argv[0]);
+			break;
+		}
+	}
+
+	page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
+	if (page_idle_fd < 0) {
+		print_skip("CONFIG_IDLE_PAGE_TRACKING is not enabled");
+		exit(KSFT_SKIP);
+	}
+	close(page_idle_fd);
+
+	for_each_guest_mode(run_test, &params);
+
+	return 0;
+}
-- 
2.32.0.272.g935e593368-goog



* Re: [PATCH 6/8] KVM: x86/mmu: fast_page_fault support for the TDP MMU
  2021-06-11 23:56 ` [PATCH 6/8] KVM: x86/mmu: fast_page_fault support for the TDP MMU David Matlack
@ 2021-06-11 23:59   ` David Matlack
  2021-06-14 17:56     ` Ben Gardon
  0 siblings, 1 reply; 32+ messages in thread
From: David Matlack @ 2021-06-11 23:59 UTC (permalink / raw)
  To: kvm list
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones

On Fri, Jun 11, 2021 at 4:57 PM David Matlack <dmatlack@google.com> wrote:
>
> This commit enables the fast_page_fault handler to work when the TDP MMU
> is enabled by leveraging the new walk_shadow_page_lockless* API to
> collect page walks independent of the TDP MMU.
>
> fast_page_fault was already using
> walk_shadow_page_lockless_{begin,end}(), we just have to change the
> actual walk to use walk_shadow_page_lockless() which does the right
> thing if the TDP MMU is in use.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Adding this feature was suggested by Ben Gardon:

Suggested-by: Ben Gardon <bgardon@google.com>

> ---
>  arch/x86/kvm/mmu/mmu.c | 52 +++++++++++++++++-------------------------
>  1 file changed, 21 insertions(+), 31 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 765f5b01768d..5562727c3699 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -657,6 +657,9 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
>         local_irq_enable();
>  }
>
> +static bool walk_shadow_page_lockless(struct kvm_vcpu *vcpu, u64 addr,
> +                                     struct shadow_page_walk *walk);
> +
>  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  {
>         int r;
> @@ -2967,14 +2970,9 @@ static bool page_fault_can_be_fast(u32 error_code)
>   * Returns true if the SPTE was fixed successfully. Otherwise,
>   * someone else modified the SPTE from its original value.
>   */
> -static bool
> -fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> -                       u64 *sptep, u64 old_spte, u64 new_spte)
> +static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, gpa_t gpa,
> +                                   u64 *sptep, u64 old_spte, u64 new_spte)
>  {
> -       gfn_t gfn;
> -
> -       WARN_ON(!sp->role.direct);
> -
>         /*
>          * Theoretically we could also set dirty bit (and flush TLB) here in
>          * order to eliminate unnecessary PML logging. See comments in
> @@ -2990,14 +2988,8 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>         if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
>                 return false;
>
> -       if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) {
> -               /*
> -                * The gfn of direct spte is stable since it is
> -                * calculated by sp->gfn.
> -                */
> -               gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> -               kvm_vcpu_mark_page_dirty(vcpu, gfn);
> -       }
> +       if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
> +               kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
>
>         return true;
>  }
> @@ -3019,10 +3011,9 @@ static bool is_access_allowed(u32 fault_err_code, u64 spte)
>   */
>  static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
>  {
> -       struct kvm_shadow_walk_iterator iterator;
> -       struct kvm_mmu_page *sp;
>         int ret = RET_PF_INVALID;
>         u64 spte = 0ull;
> +       u64 *sptep = NULL;
>         uint retry_count = 0;
>
>         if (!page_fault_can_be_fast(error_code))
> @@ -3031,17 +3022,19 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
>         walk_shadow_page_lockless_begin(vcpu);
>
>         do {
> +               struct shadow_page_walk walk;
>                 u64 new_spte;
>
> -               for_each_shadow_entry_lockless(vcpu, gpa, iterator, spte)
> -                       if (!is_shadow_present_pte(spte))
> -                               break;
> +               if (!walk_shadow_page_lockless(vcpu, gpa, &walk))
> +                       break;
> +
> +               spte = walk.sptes[walk.last_level];
> +               sptep = walk.spteps[walk.last_level];
>
>                 if (!is_shadow_present_pte(spte))
>                         break;
>
> -               sp = sptep_to_sp(iterator.sptep);
> -               if (!is_last_spte(spte, sp->role.level))
> +               if (!is_last_spte(spte, walk.last_level))
>                         break;
>
>                 /*
> @@ -3084,7 +3077,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
>                          *
>                          * See the comments in kvm_arch_commit_memory_region().
>                          */
> -                       if (sp->role.level > PG_LEVEL_4K)
> +                       if (walk.last_level > PG_LEVEL_4K)
>                                 break;
>                 }
>
> @@ -3098,8 +3091,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
>                  * since the gfn is not stable for indirect shadow page. See
>                  * Documentation/virt/kvm/locking.rst to get more detail.
>                  */
> -               if (fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte,
> -                                           new_spte)) {
> +               if (fast_pf_fix_direct_spte(vcpu, gpa, sptep, spte, new_spte)) {
>                         ret = RET_PF_FIXED;
>                         break;
>                 }
> @@ -3112,7 +3104,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
>
>         } while (true);
>
> -       trace_fast_page_fault(vcpu, gpa, error_code, iterator.sptep, spte, ret);
> +       trace_fast_page_fault(vcpu, gpa, error_code, sptep, spte, ret);
>         walk_shadow_page_lockless_end(vcpu);
>
>         return ret;
> @@ -3748,11 +3740,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>         if (page_fault_handle_page_track(vcpu, error_code, gfn))
>                 return RET_PF_EMULATE;
>
> -       if (!is_vcpu_using_tdp_mmu(vcpu)) {
> -               r = fast_page_fault(vcpu, gpa, error_code);
> -               if (r != RET_PF_INVALID)
> -                       return r;
> -       }
> +       r = fast_page_fault(vcpu, gpa, error_code);
> +       if (r != RET_PF_INVALID)
> +               return r;
>
>         r = mmu_topup_memory_caches(vcpu, false);
>         if (r)
> --
> 2.32.0.272.g935e593368-goog
>


* Re: [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU
  2021-06-11 23:56 [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
                   ` (7 preceding siblings ...)
  2021-06-11 23:57 ` [PATCH 8/8] KVM: selftests: Introduce access_tracking_perf_test David Matlack
@ 2021-06-14  9:54 ` Paolo Bonzini
  2021-06-14 21:08   ` David Matlack
  8 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2021-06-14  9:54 UTC (permalink / raw)
  To: David Matlack, kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Junaid Shahid,
	Andrew Jones

On 12/06/21 01:56, David Matlack wrote:
> This patch series adds support for the TDP MMU in the fast_page_fault
> path, which enables certain write-protection and access tracking faults
> to be handled without taking the KVM MMU lock. This series brings the
> performance of these faults up to par with the legacy MMU.

Hi David,

I have one very basic question: is the speedup due to lock contention, 
or to cacheline bouncing, or something else altogether? In other words, 
what do the profiles look like before vs. after these patches?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/8] KVM: x86/mmu: Refactor is_tdp_mmu_root()
  2021-06-11 23:56 ` [PATCH 1/8] KVM: x86/mmu: Refactor is_tdp_mmu_root() David Matlack
@ 2021-06-14 17:56   ` Ben Gardon
  2021-06-14 19:07   ` Sean Christopherson
  1 sibling, 0 replies; 32+ messages in thread
From: Ben Gardon @ 2021-06-14 17:56 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, Junaid Shahid, Andrew Jones

On Fri, Jun 11, 2021 at 4:57 PM David Matlack <dmatlack@google.com> wrote:
>
> Refactor is_tdp_mmu_root() into is_vcpu_using_tdp_mmu() to reduce
> duplicated code at call sites and make the code more readable.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Nice cleanup!

Reviewed-by: Ben Gardon <bgardon@google.com>


> ---
>  arch/x86/kvm/mmu/mmu.c     | 10 +++++-----
>  arch/x86/kvm/mmu/tdp_mmu.c |  2 +-
>  arch/x86/kvm/mmu/tdp_mmu.h |  8 +++++---
>  3 files changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0144c40d09c7..eccd889d20a5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3545,7 +3545,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
>                 return reserved;
>         }
>
> -       if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> +       if (is_vcpu_using_tdp_mmu(vcpu))
>                 leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root);
>         else
>                 leaf = get_walk(vcpu, addr, sptes, &root);
> @@ -3729,7 +3729,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>         if (page_fault_handle_page_track(vcpu, error_code, gfn))
>                 return RET_PF_EMULATE;
>
> -       if (!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)) {
> +       if (!is_vcpu_using_tdp_mmu(vcpu)) {
>                 r = fast_page_fault(vcpu, gpa, error_code);
>                 if (r != RET_PF_INVALID)
>                         return r;
> @@ -3751,7 +3751,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>
>         r = RET_PF_RETRY;
>
> -       if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> +       if (is_vcpu_using_tdp_mmu(vcpu))
>                 read_lock(&vcpu->kvm->mmu_lock);
>         else
>                 write_lock(&vcpu->kvm->mmu_lock);
> @@ -3762,7 +3762,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>         if (r)
>                 goto out_unlock;
>
> -       if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> +       if (is_vcpu_using_tdp_mmu(vcpu))
>                 r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
>                                     pfn, prefault);
>         else
> @@ -3770,7 +3770,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>                                  prefault, is_tdp);
>
>  out_unlock:
> -       if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> +       if (is_vcpu_using_tdp_mmu(vcpu))
>                 read_unlock(&vcpu->kvm->mmu_lock);
>         else
>                 write_unlock(&vcpu->kvm->mmu_lock);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 237317b1eddd..f4cc79dabeae 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -979,7 +979,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>
>         if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
>                 return RET_PF_RETRY;
> -       if (WARN_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)))
> +       if (WARN_ON(!is_vcpu_using_tdp_mmu(vcpu)))
>                 return RET_PF_RETRY;
>
>         level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 5fdf63090451..c8cf12809fcf 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -91,16 +91,18 @@ static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { return false; }
>  static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
>  #endif
>
> -static inline bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
> +static inline bool is_vcpu_using_tdp_mmu(struct kvm_vcpu *vcpu)
>  {
> +       struct kvm *kvm = vcpu->kvm;
>         struct kvm_mmu_page *sp;
> +       hpa_t root_hpa = vcpu->arch.mmu->root_hpa;
>
>         if (!is_tdp_mmu_enabled(kvm))
>                 return false;
> -       if (WARN_ON(!VALID_PAGE(hpa)))
> +       if (WARN_ON(!VALID_PAGE(root_hpa)))
>                 return false;
>
> -       sp = to_shadow_page(hpa);
> +       sp = to_shadow_page(root_hpa);
>         if (WARN_ON(!sp))
>                 return false;
>
> --
> 2.32.0.272.g935e593368-goog
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/8] KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault
  2021-06-11 23:56 ` [PATCH 2/8] KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault David Matlack
@ 2021-06-14 17:56   ` Ben Gardon
  0 siblings, 0 replies; 32+ messages in thread
From: Ben Gardon @ 2021-06-14 17:56 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, Junaid Shahid, Andrew Jones

On Fri, Jun 11, 2021 at 4:57 PM David Matlack <dmatlack@google.com> wrote:
>
> fast_page_fault is only called from direct_page_fault where we know the
> address is a gpa.
>
> Fixes: 736c291c9f36 ("KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM")
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>


> ---
>  arch/x86/kvm/mmu/mmu.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index eccd889d20a5..1d0fe1445e04 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3007,8 +3007,7 @@ static bool is_access_allowed(u32 fault_err_code, u64 spte)
>  /*
>   * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS.
>   */
> -static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> -                          u32 error_code)
> +static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
>  {
>         struct kvm_shadow_walk_iterator iterator;
>         struct kvm_mmu_page *sp;
> @@ -3024,7 +3023,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>         do {
>                 u64 new_spte;
>
> -               for_each_shadow_entry_lockless(vcpu, cr2_or_gpa, iterator, spte)
> +               for_each_shadow_entry_lockless(vcpu, gpa, iterator, spte)
>                         if (!is_shadow_present_pte(spte))
>                                 break;
>
> @@ -3103,8 +3102,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>
>         } while (true);
>
> -       trace_fast_page_fault(vcpu, cr2_or_gpa, error_code, iterator.sptep,
> -                             spte, ret);
> +       trace_fast_page_fault(vcpu, gpa, error_code, iterator.sptep, spte, ret);
>         walk_shadow_page_lockless_end(vcpu);
>
>         return ret;
> --
> 2.32.0.272.g935e593368-goog
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/8] KVM: x86/mmu: Common API for lockless shadow page walks
  2021-06-11 23:56 ` [PATCH 4/8] KVM: x86/mmu: Common API for lockless shadow page walks David Matlack
@ 2021-06-14 17:56   ` Ben Gardon
  0 siblings, 0 replies; 32+ messages in thread
From: Ben Gardon @ 2021-06-14 17:56 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, Junaid Shahid, Andrew Jones

On Fri, Jun 11, 2021 at 4:57 PM David Matlack <dmatlack@google.com> wrote:
>
> Introduce a common API for walking the shadow page tables locklessly
> that abstracts away whether the TDP MMU is enabled or not. This will be
> used in a follow-up patch to support the TDP MMU in fast_page_fault.
>
> The API can be used as follows:
>
>   struct shadow_page_walk walk;
>
>   walk_shadow_page_lockless_begin(vcpu);
>   if (!walk_shadow_page_lockless(vcpu, addr, &walk))
>     goto out;
>
>   ... use `walk` ...
>
> out:
>   walk_shadow_page_lockless_end(vcpu);
>
> Note: Separating walk_shadow_page_lockless_begin() from
> walk_shadow_page_lockless() seems superfluous at first glance but is
> needed to support fast_page_fault() since it performs multiple walks
> under the same begin/end block.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>


> ---
>  arch/x86/kvm/mmu/mmu.c          | 96 ++++++++++++++++++++-------------
>  arch/x86/kvm/mmu/mmu_internal.h | 15 ++++++
>  arch/x86/kvm/mmu/tdp_mmu.c      | 34 ++++++------
>  arch/x86/kvm/mmu/tdp_mmu.h      |  6 ++-
>  4 files changed, 96 insertions(+), 55 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1d0fe1445e04..8140c262f4d3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -623,6 +623,11 @@ static bool mmu_spte_age(u64 *sptep)
>
>  static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
>  {
> +       if (is_vcpu_using_tdp_mmu(vcpu)) {
> +               kvm_tdp_mmu_walk_lockless_begin();
> +               return;
> +       }
> +
>         /*
>          * Prevent page table teardown by making any free-er wait during
>          * kvm_flush_remote_tlbs() IPI to all active vcpus.
> @@ -638,6 +643,11 @@ static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
>
>  static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
>  {
> +       if (is_vcpu_using_tdp_mmu(vcpu)) {
> +               kvm_tdp_mmu_walk_lockless_end();
> +               return;
> +       }
> +
>         /*
>          * Make sure the write to vcpu->mode is not reordered in front of
>          * reads to sptes.  If it does, kvm_mmu_commit_zap_page() can see us
> @@ -3501,59 +3511,61 @@ static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
>  }
>
>  /*
> - * Return the level of the lowest level SPTE added to sptes.
> - * That SPTE may be non-present.
> + * Walks the shadow page table for the given address until a leaf or non-present
> + * spte is encountered.
> + *
> + * Returns false if no walk could be performed, in which case `walk` does not
> + * contain any valid data.
> + *
> + * Must be called between walk_shadow_page_lockless_{begin,end}.
>   */
> -static int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level)
> +static bool walk_shadow_page_lockless(struct kvm_vcpu *vcpu, u64 addr,
> +                                     struct shadow_page_walk *walk)
>  {
> -       struct kvm_shadow_walk_iterator iterator;
> -       int leaf = -1;
> +       struct kvm_shadow_walk_iterator it;
> +       bool walk_ok = false;
>         u64 spte;
>
> -       walk_shadow_page_lockless_begin(vcpu);
> +       if (is_vcpu_using_tdp_mmu(vcpu))
> +               return kvm_tdp_mmu_walk_lockless(vcpu, addr, walk);
>
> -       for (shadow_walk_init(&iterator, vcpu, addr),
> -            *root_level = iterator.level;
> -            shadow_walk_okay(&iterator);
> -            __shadow_walk_next(&iterator, spte)) {
> -               leaf = iterator.level;
> -               spte = mmu_spte_get_lockless(iterator.sptep);
> +       shadow_walk_init(&it, vcpu, addr);
> +       walk->root_level = it.level;
>
> -               sptes[leaf] = spte;
> +       for (; shadow_walk_okay(&it); __shadow_walk_next(&it, spte)) {
> +               walk_ok = true;
> +
> +               spte = mmu_spte_get_lockless(it.sptep);
> +               walk->last_level = it.level;
> +               walk->sptes[it.level] = spte;
>
>                 if (!is_shadow_present_pte(spte))
>                         break;
>         }
>
> -       walk_shadow_page_lockless_end(vcpu);
> -
> -       return leaf;
> +       return walk_ok;
>  }
>
>  /* return true if reserved bit(s) are detected on a valid, non-MMIO SPTE. */
>  static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
>  {
> -       u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
> +       struct shadow_page_walk walk;
>         struct rsvd_bits_validate *rsvd_check;
> -       int root, leaf, level;
> +       int last_level, level;
>         bool reserved = false;
>
> -       if (!VALID_PAGE(vcpu->arch.mmu->root_hpa)) {
> -               *sptep = 0ull;
> +       *sptep = 0ull;
> +
> +       if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
>                 return reserved;
> -       }
>
> -       if (is_vcpu_using_tdp_mmu(vcpu))
> -               leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root);
> -       else
> -               leaf = get_walk(vcpu, addr, sptes, &root);
> +       walk_shadow_page_lockless_begin(vcpu);
>
> -       if (unlikely(leaf < 0)) {
> -               *sptep = 0ull;
> -               return reserved;
> -       }
> +       if (!walk_shadow_page_lockless(vcpu, addr, &walk))
> +               goto out;
>
> -       *sptep = sptes[leaf];
> +       last_level = walk.last_level;
> +       *sptep = walk.sptes[last_level];
>
>         /*
>          * Skip reserved bits checks on the terminal leaf if it's not a valid
> @@ -3561,29 +3573,37 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
>          * design, always have reserved bits set.  The purpose of the checks is
>          * to detect reserved bits on non-MMIO SPTEs. i.e. buggy SPTEs.
>          */
> -       if (!is_shadow_present_pte(sptes[leaf]))
> -               leaf++;
> +       if (!is_shadow_present_pte(walk.sptes[last_level]))
> +               last_level++;
>
>         rsvd_check = &vcpu->arch.mmu->shadow_zero_check;
>
> -       for (level = root; level >= leaf; level--)
> +       for (level = walk.root_level; level >= last_level; level--) {
> +               u64 spte = walk.sptes[level];
> +
>                 /*
>                  * Use a bitwise-OR instead of a logical-OR to aggregate the
>                  * reserved bit and EPT's invalid memtype/XWR checks to avoid
>                  * adding a Jcc in the loop.
>                  */
> -               reserved |= __is_bad_mt_xwr(rsvd_check, sptes[level]) |
> -                           __is_rsvd_bits_set(rsvd_check, sptes[level], level);
> +               reserved |= __is_bad_mt_xwr(rsvd_check, spte) |
> +                           __is_rsvd_bits_set(rsvd_check, spte, level);
> +       }
>
>         if (reserved) {
>                 pr_err("%s: reserved bits set on MMU-present spte, addr 0x%llx, hierarchy:\n",
>                        __func__, addr);
> -               for (level = root; level >= leaf; level--)
> +               for (level = walk.root_level; level >= last_level; level--) {
> +                       u64 spte = walk.sptes[level];
> +
>                         pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
> -                              sptes[level], level,
> -                              rsvd_check->rsvd_bits_mask[(sptes[level] >> 7) & 1][level-1]);
> +                              spte, level,
> +                              rsvd_check->rsvd_bits_mask[(spte >> 7) & 1][level-1]);
> +               }
>         }
>
> +out:
> +       walk_shadow_page_lockless_end(vcpu);
>         return reserved;
>  }
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index d64ccb417c60..26da6ca30fbf 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -165,4 +165,19 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +struct shadow_page_walk {
> +       /* The level of the root spte in the walk. */
> +       int root_level;
> +
> +       /*
> +        * The level of the last spte in the walk. The last spte is either the
> +        * leaf of the walk (which may or may not be present) or the first
> +        * non-present spte encountered during the walk.
> +        */
> +       int last_level;
> +
> +       /* The spte value at each level. */
> +       u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
> +};
> +
>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f4cc79dabeae..36f4844a5f95 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1504,28 +1504,32 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>         return spte_set;
>  }
>
> -/*
> - * Return the level of the lowest level SPTE added to sptes.
> - * That SPTE may be non-present.
> - */
> -int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> -                        int *root_level)
> +void kvm_tdp_mmu_walk_lockless_begin(void)
> +{
> +       rcu_read_lock();
> +}
> +
> +void kvm_tdp_mmu_walk_lockless_end(void)
> +{
> +       rcu_read_unlock();
> +}
> +
> +bool kvm_tdp_mmu_walk_lockless(struct kvm_vcpu *vcpu, u64 addr,
> +                              struct shadow_page_walk *walk)
>  {
>         struct tdp_iter iter;
>         struct kvm_mmu *mmu = vcpu->arch.mmu;
>         gfn_t gfn = addr >> PAGE_SHIFT;
> -       int leaf = -1;
> +       bool walk_ok = false;
>
> -       *root_level = vcpu->arch.mmu->shadow_root_level;
> -
> -       rcu_read_lock();
> +       walk->root_level = vcpu->arch.mmu->shadow_root_level;
>
>         tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> -               leaf = iter.level;
> -               sptes[leaf] = iter.old_spte;
> -       }
> +               walk_ok = true;
>
> -       rcu_read_unlock();
> +               walk->last_level = iter.level;
> +               walk->sptes[iter.level] = iter.old_spte;
> +       }
>
> -       return leaf;
> +       return walk_ok;
>  }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index c8cf12809fcf..772d11bbb92a 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -76,8 +76,10 @@ bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
>  bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>                                    struct kvm_memory_slot *slot, gfn_t gfn);
>
> -int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> -                        int *root_level);
> +void kvm_tdp_mmu_walk_lockless_begin(void);
> +void kvm_tdp_mmu_walk_lockless_end(void);
> +bool kvm_tdp_mmu_walk_lockless(struct kvm_vcpu *vcpu, u64 addr,
> +                              struct shadow_page_walk *walk);
>
>  #ifdef CONFIG_X86_64
>  void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> --
> 2.32.0.272.g935e593368-goog
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/8] KVM: x86/mmu: Also record spteps in shadow_page_walk
  2021-06-11 23:56 ` [PATCH 5/8] KVM: x86/mmu: Also record spteps in shadow_page_walk David Matlack
@ 2021-06-14 17:56   ` Ben Gardon
  2021-06-14 22:27   ` David Matlack
  2021-06-14 22:59   ` Sean Christopherson
  2 siblings, 0 replies; 32+ messages in thread
From: Ben Gardon @ 2021-06-14 17:56 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, Junaid Shahid, Andrew Jones

On Fri, Jun 11, 2021 at 4:57 PM David Matlack <dmatlack@google.com> wrote:
>
> In order to use walk_shadow_page_lockless() in fast_page_fault() we need
> to also record the spteps.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>


> ---
>  arch/x86/kvm/mmu/mmu.c          | 1 +
>  arch/x86/kvm/mmu/mmu_internal.h | 3 +++
>  arch/x86/kvm/mmu/tdp_mmu.c      | 1 +
>  3 files changed, 5 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8140c262f4d3..765f5b01768d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3538,6 +3538,7 @@ static bool walk_shadow_page_lockless(struct kvm_vcpu *vcpu, u64 addr,
>                 spte = mmu_spte_get_lockless(it.sptep);
>                 walk->last_level = it.level;
>                 walk->sptes[it.level] = spte;
> +               walk->spteps[it.level] = it.sptep;
>
>                 if (!is_shadow_present_pte(spte))
>                         break;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 26da6ca30fbf..0fefbd5d6c95 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -178,6 +178,9 @@ struct shadow_page_walk {
>
>         /* The spte value at each level. */
>         u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
> +
> +       /* The spte pointers at each level. */
> +       u64 *spteps[PT64_ROOT_MAX_LEVEL + 1];
>  };
>
>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 36f4844a5f95..7279d17817a1 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1529,6 +1529,7 @@ bool kvm_tdp_mmu_walk_lockless(struct kvm_vcpu *vcpu, u64 addr,
>
>                 walk->last_level = iter.level;
>                 walk->sptes[iter.level] = iter.old_spte;
> +               walk->spteps[iter.level] = iter.sptep;
>         }
>
>         return walk_ok;
> --
> 2.32.0.272.g935e593368-goog
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/8] KVM: x86/mmu: fast_page_fault support for the TDP MMU
  2021-06-11 23:59   ` David Matlack
@ 2021-06-14 17:56     ` Ben Gardon
  2021-06-14 22:34       ` David Matlack
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Gardon @ 2021-06-14 17:56 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm list, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones

On Fri, Jun 11, 2021 at 4:59 PM David Matlack <dmatlack@google.com> wrote:
>
> On Fri, Jun 11, 2021 at 4:57 PM David Matlack <dmatlack@google.com> wrote:
> >
> > This commit enables the fast_page_fault handler to work when the TDP MMU
> > is enabled by leveraging the new walk_shadow_page_lockless* API to
> > collect page walk results regardless of whether the TDP MMU is in use.
> >
> > fast_page_fault was already using
> > walk_shadow_page_lockless_{begin,end}(); we just have to change the
> > actual walk to use walk_shadow_page_lockless(), which does the right
> > thing if the TDP MMU is in use.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
>
> Adding this feature was suggested by Ben Gardon:
>
> Suggested-by: Ben Gardon <bgardon@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>

>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 52 +++++++++++++++++-------------------------
> >  1 file changed, 21 insertions(+), 31 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 765f5b01768d..5562727c3699 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -657,6 +657,9 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> >         local_irq_enable();
> >  }
> >
> > +static bool walk_shadow_page_lockless(struct kvm_vcpu *vcpu, u64 addr,
> > +                                     struct shadow_page_walk *walk);
> > +
> >  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> >  {
> >         int r;
> > @@ -2967,14 +2970,9 @@ static bool page_fault_can_be_fast(u32 error_code)
> >   * Returns true if the SPTE was fixed successfully. Otherwise,
> >   * someone else modified the SPTE from its original value.
> >   */
> > -static bool
> > -fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> > -                       u64 *sptep, u64 old_spte, u64 new_spte)
> > +static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, gpa_t gpa,
> > +                                   u64 *sptep, u64 old_spte, u64 new_spte)
> >  {
> > -       gfn_t gfn;
> > -
> > -       WARN_ON(!sp->role.direct);
> > -
> >         /*
> >          * Theoretically we could also set dirty bit (and flush TLB) here in
> >          * order to eliminate unnecessary PML logging. See comments in
> > @@ -2990,14 +2988,8 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >         if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
> >                 return false;
> >
> > -       if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) {
> > -               /*
> > -                * The gfn of direct spte is stable since it is
> > -                * calculated by sp->gfn.
> > -                */
> > -               gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> > -               kvm_vcpu_mark_page_dirty(vcpu, gfn);
> > -       }
> > +       if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
> > +               kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);

I love how cleanly you've implemented TDP MMU fast PF support here
with so little code duplication. I think this approach will work well.

My only reservation is that this links the way we locklessly change
(and handle changes to) SPTEs between the TDP and legacy MMUs.

fast_pf_fix_direct_spte is certainly a much simpler function than
tdp_mmu_set_spte_atomic, but it might be worth adding a comment
explaining that the function can modify SPTEs in a TDP MMU paging
structure. Alternatively, fast_page_fault could call
tdp_mmu_set_spte_atomic, but I don't know if that would carry a
performance penalty. On the other hand, the handling for this
particular type of SPTE modification is unlikely to change, so it
might not matter.
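
For what it's worth, one possible shape for that comment, sketched on top
of the helper as it appears in this patch (the existing comments in the
function are elided, and the comment wording here is illustrative, not
something taken from the series):

static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, gpa_t gpa,
				    u64 *sptep, u64 old_spte, u64 new_spte)
{
	/*
	 * Note: this helper runs locklessly and, when the TDP MMU is in use,
	 * may modify SPTEs that live in a TDP MMU paging structure.  The
	 * cmpxchg64() is what keeps that safe: if anything else changed the
	 * SPTE since it was read, the exchange fails and fast_page_fault()
	 * simply retries.
	 */
	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
		return false;

	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);

	return true;
}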



> >
> >         return true;
> >  }
> > @@ -3019,10 +3011,9 @@ static bool is_access_allowed(u32 fault_err_code, u64 spte)
> >   */
> >  static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
> >  {
> > -       struct kvm_shadow_walk_iterator iterator;
> > -       struct kvm_mmu_page *sp;
> >         int ret = RET_PF_INVALID;
> >         u64 spte = 0ull;
> > +       u64 *sptep = NULL;
> >         uint retry_count = 0;
> >
> >         if (!page_fault_can_be_fast(error_code))
> > @@ -3031,17 +3022,19 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
> >         walk_shadow_page_lockless_begin(vcpu);
> >
> >         do {
> > +               struct shadow_page_walk walk;
> >                 u64 new_spte;
> >
> > -               for_each_shadow_entry_lockless(vcpu, gpa, iterator, spte)
> > -                       if (!is_shadow_present_pte(spte))
> > -                               break;
> > +               if (!walk_shadow_page_lockless(vcpu, gpa, &walk))
> > +                       break;
> > +
> > +               spte = walk.sptes[walk.last_level];
> > +               sptep = walk.spteps[walk.last_level];
> >
> >                 if (!is_shadow_present_pte(spte))
> >                         break;
> >
> > -               sp = sptep_to_sp(iterator.sptep);
> > -               if (!is_last_spte(spte, sp->role.level))
> > +               if (!is_last_spte(spte, walk.last_level))
> >                         break;
> >
> >                 /*
> > @@ -3084,7 +3077,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
> >                          *
> >                          * See the comments in kvm_arch_commit_memory_region().
> >                          */
> > -                       if (sp->role.level > PG_LEVEL_4K)
> > +                       if (walk.last_level > PG_LEVEL_4K)
> >                                 break;
> >                 }
> >
> > @@ -3098,8 +3091,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
> >                  * since the gfn is not stable for indirect shadow page. See
> >                  * Documentation/virt/kvm/locking.rst to get more detail.
> >                  */
> > -               if (fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte,
> > -                                           new_spte)) {
> > +               if (fast_pf_fix_direct_spte(vcpu, gpa, sptep, spte, new_spte)) {
> >                         ret = RET_PF_FIXED;
> >                         break;
> >                 }
> > @@ -3112,7 +3104,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
> >
> >         } while (true);
> >
> > -       trace_fast_page_fault(vcpu, gpa, error_code, iterator.sptep, spte, ret);
> > +       trace_fast_page_fault(vcpu, gpa, error_code, sptep, spte, ret);
> >         walk_shadow_page_lockless_end(vcpu);
> >
> >         return ret;
> > @@ -3748,11 +3740,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> >         if (page_fault_handle_page_track(vcpu, error_code, gfn))
> >                 return RET_PF_EMULATE;
> >
> > -       if (!is_vcpu_using_tdp_mmu(vcpu)) {
> > -               r = fast_page_fault(vcpu, gpa, error_code);
> > -               if (r != RET_PF_INVALID)
> > -                       return r;
> > -       }
> > +       r = fast_page_fault(vcpu, gpa, error_code);
> > +       if (r != RET_PF_INVALID)
> > +               return r;
> >
> >         r = mmu_topup_memory_caches(vcpu, false);
> >         if (r)
> > --
> > 2.32.0.272.g935e593368-goog
> >

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/8] KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing
  2021-06-11 23:57 ` [PATCH 7/8] KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing David Matlack
@ 2021-06-14 17:56   ` Ben Gardon
  0 siblings, 0 replies; 32+ messages in thread
From: Ben Gardon @ 2021-06-14 17:56 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, Junaid Shahid, Andrew Jones

On Fri, Jun 11, 2021 at 4:57 PM David Matlack <dmatlack@google.com> wrote:
>
> There is a missing break statement which causes a fallthrough to the
> next case ('s'), where optarg will be NULL and a segmentation fault
> will be generated.
>
> Fixes: 9e965bb75aae ("KVM: selftests: Add backing src parameter to dirty_log_perf_test")
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>


> ---
>  tools/testing/selftests/kvm/dirty_log_perf_test.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> index 04a2641261be..80cbd3a748c0 100644
> --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> @@ -312,6 +312,7 @@ int main(int argc, char *argv[])
>                         break;
>                 case 'o':
>                         p.partition_vcpu_memory_access = false;
> +                       break;
>                 case 's':
>                         p.backing_src = parse_backing_src_type(optarg);
>                         break;
> --
> 2.32.0.272.g935e593368-goog
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 8/8] KVM: selftests: Introduce access_tracking_perf_test
  2021-06-11 23:57 ` [PATCH 8/8] KVM: selftests: Introduce access_tracking_perf_test David Matlack
@ 2021-06-14 17:56   ` Ben Gardon
  2021-06-14 21:47     ` David Matlack
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Gardon @ 2021-06-14 17:56 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, Junaid Shahid, Andrew Jones

On Fri, Jun 11, 2021 at 4:57 PM David Matlack <dmatlack@google.com> wrote:
>
> This test measures the performance effects of KVM's access tracking.
> Access tracking is driven by the MMU notifiers test_young, clear_young,
> and clear_flush_young. These notifiers do not have a direct userspace
> API, however the clear_young notifier can be triggered by marking a
> pages as idle in /sys/kernel/mm/page_idle/bitmap. This test leverages
> that mechanism to enable access tracking on guest memory.
>
> To measure performance this test runs a VM with a configurable number of
> vCPUs that each touch every page in disjoint regions of memory.
> Performance is measured in the time it takes all vCPUs to finish
> touching their predefined region.
>
> Example invocation:
>
>   $ ./access_tracking_perf_test -v 8
>   Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
>   guest physical test memory offset: 0xffdfffff000
>
>   Populating memory             : 1.337752570s
>   Writing to populated memory   : 0.010177640s
>   Reading from populated memory : 0.009548239s
>   Mark memory idle              : 23.973131748s
>   Writing to idle memory        : 0.063584496s
>   Mark memory idle              : 24.924652964s
>   Reading from idle memory      : 0.062042814s
>
> Breaking down the results:
>
>  * "Populating memory": The time it takes for all vCPUs to perform the
>    first write to every page in their region.
>
>  * "Writing to populated memory" / "Reading from populated memory": The
>    time it takes for all vCPUs to write and read to every page in their
>    region after it has been populated. This serves as a control for the
>    later results.
>
>  * "Mark memory idle": The time it takes for every vCPU to mark every
>    page in their region as idle through page_idle.
>
>  * "Writing to idle memory" / "Reading from idle memory": The time it
>    takes for all vCPUs to write and read to every page in their region
>    after it has been marked idle.
>
> This test should be portable across architectures but it is only enabled
> for x86_64 since that's all I have tested.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>

> ---
>  tools/testing/selftests/kvm/.gitignore        |   1 +
>  tools/testing/selftests/kvm/Makefile          |   1 +
>  .../selftests/kvm/access_tracking_perf_test.c | 419 ++++++++++++++++++
>  3 files changed, 421 insertions(+)
>  create mode 100644 tools/testing/selftests/kvm/access_tracking_perf_test.c
>
> diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
> index bd83158e0e0b..32a362d71e05 100644
> --- a/tools/testing/selftests/kvm/.gitignore
> +++ b/tools/testing/selftests/kvm/.gitignore
> @@ -34,6 +34,7 @@
>  /x86_64/xen_vmcall_test
>  /x86_64/xss_msr_test
>  /x86_64/vmx_pmu_msrs_test
> +/access_tracking_perf_test
>  /demand_paging_test
>  /dirty_log_test
>  /dirty_log_perf_test
> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> index e439d027939d..9f1b478da92b 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -67,6 +67,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test
>  TEST_GEN_PROGS_x86_64 += x86_64/vmx_pmu_msrs_test
>  TEST_GEN_PROGS_x86_64 += x86_64/xen_shinfo_test
>  TEST_GEN_PROGS_x86_64 += x86_64/xen_vmcall_test
> +TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
>  TEST_GEN_PROGS_x86_64 += demand_paging_test
>  TEST_GEN_PROGS_x86_64 += dirty_log_test
>  TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
> diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> new file mode 100644
> index 000000000000..60828f2d780f
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> @@ -0,0 +1,419 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * access_tracking_test
> + *
> + * Copyright (C) 2021, Google, Inc.
> + *
> + * This test measures the performance effects of KVM's access tracking.
> + * Access tracking is driven by the MMU notifiers test_young, clear_young, and
> + * clear_flush_young. These notifiers do not have a direct userspace API;
> + * however, the clear_young notifier can be triggered by marking pages as idle
> + * in /sys/kernel/mm/page_idle/bitmap. This test leverages that mechanism to
> + * enable access tracking on guest memory.
> + *
> + * To measure performance this test runs a VM with a configurable number of
> + * vCPUs that each touch every page in disjoint regions of memory. Performance
> + * is measured in the time it takes all vCPUs to finish touching their
> + * predefined region.
> + *
> + * Note that a deterministic correctness test of access tracking is not possible
> + * by using page_idle as it exists today. This is for a few reasons:
> + *
> + * 1. page_idle only issues clear_young notifiers, which lack a TLB flush. This
> + *    means subsequent guest accesses are not guaranteed to see page table
> + *    updates made by KVM until some time in the future.
> + *
> + * 2. page_idle only operates on LRU pages. Newly allocated pages are not
> + *    immediately added to LRU lists. Instead they are held in a "pagevec",
> + *    which is drained to LRU lists some time in the future. There is no
> + *    userspace API to force this drain to occur.
> + *
> + * These limitations are worked around in this test by using a large enough
> + * region of memory for each vCPU such that the number of translations cached in
> + * the TLB and the number of pages held in pagevecs are a small fraction of the
> + * overall workload. If either of those conditions does not hold, this test
> + * will fail rather than silently passing.
> + */
> +#include <inttypes.h>
> +#include <limits.h>
> +#include <pthread.h>
> +#include <sys/mman.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +
> +#include "kvm_util.h"
> +#include "test_util.h"
> +#include "perf_test_util.h"
> +#include "guest_modes.h"
> +
> +/* Global variable used to synchronize all of the vCPU threads. */
> +static int iteration = -1;
> +
> +/* Defines what vCPU threads should do during a given iteration. */
> +static enum {
> +       /* Run the vCPU to access all its memory. */
> +       ITERATION_ACCESS_MEMORY,
> +       /* Mark the vCPU's memory idle in page_idle. */
> +       ITERATION_MARK_IDLE,
> +} iteration_work;
> +
> +/* Set to true when vCPU threads should exit. */
> +static bool done;
> +
> +/* The iteration that was last completed by each vCPU. */
> +static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
> +
> +/* Whether to overlap the regions of memory vCPUs access. */
> +static bool overlap_memory_access;
> +
> +struct test_params {
> +       /* The backing source for the region of memory. */
> +       enum vm_mem_backing_src_type backing_src;
> +
> +       /* The amount of memory to allocate for each vCPU. */
> +       uint64_t vcpu_memory_bytes;
> +
> +       /* The number of vCPUs to create in the VM. */
> +       int vcpus;
> +};
> +
> +static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
> +{
> +       uint64_t value;
> +       off_t offset = index * sizeof(value);
> +
> +       TEST_ASSERT(pread(fd, &value, sizeof(value), offset) == sizeof(value),
> +                   "pread from %s offset 0x%" PRIx64 " failed!",
> +                   filename, offset);
> +
> +       return value;
> +
> +}
> +
> +static uint64_t lookup_pfn(int pagemap_fd, struct kvm_vm *vm, uint64_t gva)
> +{
> +       uint64_t hva = (uint64_t) addr_gva2hva(vm, gva);
> +       uint64_t entry;
> +
> +       entry = pread_uint64(pagemap_fd, "pagemap", hva / getpagesize());
> +       if (!(entry & (1ULL << 63)))
> +               return 0;
> +
> +       return (entry & ((1ULL << 55) - 1));

It might be helpful to document these shifts and other constants in this test.
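
One way that could look, using the /proc/<pid>/pagemap entry layout from
Documentation/admin-guide/mm/pagemap.rst (the macro names below are made
up for illustration, not taken from the patch):

/* Bit 63 of a pagemap entry: the page is present in RAM. */
#define PAGEMAP_PRESENT		(1ULL << 63)
/* Bits 0-54 of a pagemap entry: the PFN, if the page is present. */
#define PAGEMAP_PFN_MASK	((1ULL << 55) - 1)

static uint64_t lookup_pfn(int pagemap_fd, struct kvm_vm *vm, uint64_t gva)
{
	uint64_t hva = (uint64_t) addr_gva2hva(vm, gva);
	uint64_t entry;

	entry = pread_uint64(pagemap_fd, "pagemap", hva / getpagesize());
	if (!(entry & PAGEMAP_PRESENT))
		return 0;

	return entry & PAGEMAP_PFN_MASK;
}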

> +}
> +
> +static bool is_page_idle(int page_idle_fd, uint64_t pfn)
> +{
> +       uint64_t bits = pread_uint64(page_idle_fd, "page_idle", pfn / 64);
> +
> +       return !!((bits >> (pfn % 64)) & 1);
> +}
> +
> +static void mark_page_idle(int page_idle_fd, uint64_t pfn)
> +{
> +       uint64_t bits = 1ULL << (pfn % 64);
> +
> +       TEST_ASSERT(pwrite(page_idle_fd, &bits, 8, 8 * (pfn / 64)) == 8,
> +                   "Set page_idle bits for PFN 0x%" PRIx64, pfn);
> +}
> +
> +static void mark_vcpu_memory_idle(struct kvm_vm *vm, int vcpu_id)
> +{
> +       uint64_t base_gva = perf_test_args.vcpu_args[vcpu_id].gva;
> +       uint64_t pages = perf_test_args.vcpu_args[vcpu_id].pages;
> +       uint64_t page;
> +       uint64_t still_idle = 0;
> +       uint64_t no_pfn = 0;
> +       int page_idle_fd;
> +       int pagemap_fd;
> +
> +       /* If vCPUs are using an overlapping region, let vCPU 0 mark it idle. */
> +       if (overlap_memory_access && vcpu_id)
> +               return;
> +
> +       page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
> +       TEST_ASSERT(page_idle_fd > 0, "Failed to open page_idle.");
> +
> +       pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
> +       TEST_ASSERT(pagemap_fd > 0, "Failed to open pagemap.");

Should this have an early skip-check too? I assume not all users of
this test will be running with the privileges required to access
pagemap.
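
A rough sketch of such a check, next to the existing page_idle one in
main(); note that an unprivileged open() of pagemap typically succeeds
with the PFN field read back as zero, so this uses euid 0 as a coarse
proxy for the required CAP_SYS_ADMIN:

	/*
	 * Reading PFNs out of /proc/self/pagemap requires CAP_SYS_ADMIN;
	 * without it the file opens fine but every PFN reads back as zero
	 * and the test would fail later with a confusing assert.
	 */
	if (geteuid()) {
		print_skip("Reading PFNs from /proc/self/pagemap requires root");
		exit(KSFT_SKIP);
	}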


> +
> +       for (page = 0; page < pages; page++) {
> +               uint64_t gva = base_gva + page * perf_test_args.guest_page_size;
> +               uint64_t pfn = lookup_pfn(pagemap_fd, vm, gva);
> +
> +               if (!pfn) {
> +                       no_pfn++;
> +                       continue;
> +               }
> +
> +               if (is_page_idle(page_idle_fd, pfn)) {
> +                       still_idle++;
> +                       continue;
> +               }
> +
> +               mark_page_idle(page_idle_fd, pfn);
> +       }
> +
> +       /*
> +        * Assumption: Less than 1% of pages are going to be swapped out from
> +        * under us during this test.
> +        */
> +       TEST_ASSERT(no_pfn < pages / 100,
> +                   "vCPU %d: No PFN for %" PRIu64 " out of %" PRIu64 " pages.",
> +                   vcpu_id, no_pfn, pages);
> +
> +       /*
> +        * Test that at least 90% of memory has been marked idle (the rest might
> +        * not be marked idle because the pages have not yet made it to an LRU
> +        * list or the translations are still cached in the TLB). 90% is
> +        * arbitrary; high enough to ensure most memory accesses went through
> +        * access tracking but low enough not to make the test too brittle
> +        * over time and across architectures.
> +        */
> +       TEST_ASSERT(still_idle < pages / 10,
> +                   "vCPU%d: Too many pages still idle (%"PRIu64 " out of %"
> +                   PRIu64 ").\n",
> +                   vcpu_id, still_idle, pages);
> +
> +       close(page_idle_fd);
> +       close(pagemap_fd);
> +}
> +
> +static void assert_ucall(struct kvm_vm *vm, uint32_t vcpu_id,
> +                        uint64_t expected_ucall)
> +{
> +       struct ucall uc;
> +       uint64_t actual_ucall = get_ucall(vm, vcpu_id, &uc);
> +
> +       TEST_ASSERT(expected_ucall == actual_ucall,
> +                   "Guest exited unexpectedly (expected ucall %" PRIu64
> +                   ", got %" PRIu64 ")",
> +                   expected_ucall, actual_ucall);
> +}
> +
> +static bool spin_wait_for_next_iteration(int *current_iteration)
> +{
> +       int last_iteration = *current_iteration;
> +
> +       do {
> +               if (READ_ONCE(done))
> +                       return false;
> +
> +               *current_iteration = READ_ONCE(iteration);
> +       } while (last_iteration == *current_iteration);
> +
> +       return true;
> +}
> +
> +static void *vcpu_thread_main(void *arg)
> +{
> +       struct perf_test_vcpu_args *vcpu_args = arg;
> +       struct kvm_vm *vm = perf_test_args.vm;
> +       int vcpu_id = vcpu_args->vcpu_id;
> +       int current_iteration = -1;
> +
> +       vcpu_args_set(vm, vcpu_id, 1, vcpu_id);
> +
> +       while (spin_wait_for_next_iteration(&current_iteration)) {
> +               switch (READ_ONCE(iteration_work)) {
> +               case ITERATION_ACCESS_MEMORY:
> +                       vcpu_run(vm, vcpu_id);
> +                       assert_ucall(vm, vcpu_id, UCALL_SYNC);
> +                       break;
> +               case ITERATION_MARK_IDLE:
> +                       mark_vcpu_memory_idle(vm, vcpu_id);
> +                       break;
> +               };
> +
> +               vcpu_last_completed_iteration[vcpu_id] = current_iteration;
> +       }
> +
> +       return NULL;
> +}
> +
> +static void spin_wait_for_vcpu(int vcpu_id, int target_iteration)
> +{
> +       while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) !=
> +              target_iteration) {
> +               continue;
> +       }
> +}
> +
> +/* The type of memory accesses to perform in the VM. */
> +enum access_type {
> +       ACCESS_READ,
> +       ACCESS_WRITE,
> +};
> +
> +static void run_iteration(struct kvm_vm *vm, int vcpus, const char *description)
> +{
> +       struct timespec ts_start;
> +       struct timespec ts_elapsed;
> +       int next_iteration;
> +       int vcpu_id;
> +
> +       /* Kick off the vCPUs by incrementing iteration. */
> +       next_iteration = ++iteration;
> +
> +       clock_gettime(CLOCK_MONOTONIC, &ts_start);
> +
> +       /* Wait for all vCPUs to finish the iteration. */
> +       for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++)
> +               spin_wait_for_vcpu(vcpu_id, next_iteration);
> +
> +       ts_elapsed = timespec_elapsed(ts_start);
> +       pr_info("%-30s: %ld.%09lds\n",
> +               description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
> +}
> +
> +static void access_memory(struct kvm_vm *vm, int vcpus, enum access_type access,
> +                         const char *description)
> +{
> +       perf_test_args.wr_fract = (access == ACCESS_READ) ? INT_MAX : 1;
> +       sync_global_to_guest(vm, perf_test_args);
> +       iteration_work = ITERATION_ACCESS_MEMORY;
> +       run_iteration(vm, vcpus, description);
> +}
> +
> +static void mark_memory_idle(struct kvm_vm *vm, int vcpus)
> +{
> +       /*
> +        * Even though this parallelizes the work across vCPUs, this is still a
> +        * very slow operation because page_idle forces the test to mark one pfn
> +        * at a time and the clear_young notifier serializes on the KVM MMU
> +        * lock.
> +        */
> +       pr_debug("Marking VM memory idle (slow)...\n");
> +       iteration_work = ITERATION_MARK_IDLE;
> +       run_iteration(vm, vcpus, "Mark memory idle");
> +}
> +
> +static pthread_t *create_vcpu_threads(int vcpus)
> +{
> +       pthread_t *vcpu_threads;
> +       int i;
> +
> +       vcpu_threads = malloc(vcpus * sizeof(vcpu_threads[0]));
> +       TEST_ASSERT(vcpu_threads, "Failed to allocate vcpu_threads.");
> +
> +       for (i = 0; i < vcpus; i++) {
> +               vcpu_last_completed_iteration[i] = iteration;
> +               pthread_create(&vcpu_threads[i], NULL, vcpu_thread_main,
> +                              &perf_test_args.vcpu_args[i]);
> +       }
> +
> +       return vcpu_threads;
> +}
> +
> +static void terminate_vcpu_threads(pthread_t *vcpu_threads, int vcpus)
> +{
> +       int i;
> +
> +       /* Set done to signal the vCPU threads to exit */
> +       done = true;
> +
> +       for (i = 0; i < vcpus; i++)
> +               pthread_join(vcpu_threads[i], NULL);
> +}
> +
> +static void run_test(enum vm_guest_mode mode, void *arg)
> +{
> +       struct test_params *params = arg;
> +       struct kvm_vm *vm;
> +       pthread_t *vcpu_threads;
> +       int vcpus = params->vcpus;
> +
> +       vm = perf_test_create_vm(mode, vcpus, params->vcpu_memory_bytes,
> +                                params->backing_src);
> +
> +       perf_test_setup_vcpus(vm, vcpus, params->vcpu_memory_bytes,
> +                             !overlap_memory_access);
> +
> +       vcpu_threads = create_vcpu_threads(vcpus);
> +
> +       pr_info("\n");
> +       access_memory(vm, vcpus, ACCESS_WRITE, "Populating memory");
> +
> +       /* As a control, read and write to the populated memory first. */
> +       access_memory(vm, vcpus, ACCESS_WRITE, "Writing to populated memory");
> +       access_memory(vm, vcpus, ACCESS_READ, "Reading from populated memory");
> +
> +       /* Repeat on memory that has been marked as idle. */
> +       mark_memory_idle(vm, vcpus);
> +       access_memory(vm, vcpus, ACCESS_WRITE, "Writing to idle memory");
> +       mark_memory_idle(vm, vcpus);
> +       access_memory(vm, vcpus, ACCESS_READ, "Reading from idle memory");
> +
> +       terminate_vcpu_threads(vcpu_threads, vcpus);
> +       free(vcpu_threads);
> +       perf_test_destroy_vm(vm);
> +}
> +
> +static void help(char *name)
> +{
> +       puts("");
> +       printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]  [-s mem_type]\n",
> +              name);
> +       puts("");
> +       printf(" -h: Display this help message.");
> +       guest_modes_help();
> +       printf(" -b: specify the size of the memory region which should be\n"
> +              "     dirtied by each vCPU. e.g. 10M or 3G.\n"
> +              "     (default: 1G)\n");
> +       printf(" -v: specify the number of vCPUs to run.\n");
> +       printf(" -o: Overlap guest memory accesses instead of partitioning\n"
> +              "     them into a separate region of memory for each vCPU.\n");
> +       printf(" -s: specify the type of memory that should be used to\n"
> +              "     back the guest data region.\n\n");
> +       backing_src_help();
> +       puts("");
> +       exit(0);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +       struct test_params params = {
> +               .backing_src = VM_MEM_SRC_ANONYMOUS,
> +               .vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
> +               .vcpus = 1,
> +       };
> +       int page_idle_fd;
> +       int opt;
> +
> +       guest_modes_append_default();
> +
> +       while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
> +               switch (opt) {
> +               case 'm':
> +                       guest_modes_cmdline(optarg);
> +                       break;
> +               case 'b':
> +                       params.vcpu_memory_bytes = parse_size(optarg);
> +                       break;
> +               case 'v':
> +                       params.vcpus = atoi(optarg);
> +                       break;
> +               case 'o':
> +                       overlap_memory_access = true;
> +                       break;
> +               case 's':
> +                       params.backing_src = parse_backing_src_type(optarg);
> +                       break;
> +               case 'h':
> +               default:
> +                       help(argv[0]);
> +                       break;
> +               }
> +       }
> +
> +       page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
> +       if (page_idle_fd < 0) {
> +               print_skip("CONFIG_IDLE_PAGE_TRACKING is not enabled");
> +               exit(KSFT_SKIP);
> +       }
> +       close(page_idle_fd);
> +
> +       for_each_guest_mode(run_test, &params);
> +
> +       return 0;
> +}
> --
> 2.32.0.272.g935e593368-goog
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/8] KVM: x86/mmu: Refactor is_tdp_mmu_root()
  2021-06-11 23:56 ` [PATCH 1/8] KVM: x86/mmu: Refactor is_tdp_mmu_root() David Matlack
  2021-06-14 17:56   ` Ben Gardon
@ 2021-06-14 19:07   ` Sean Christopherson
  2021-06-14 21:23     ` David Matlack
  1 sibling, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2021-06-14 19:07 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Paolo Bonzini, Junaid Shahid, Andrew Jones

On Fri, Jun 11, 2021, David Matlack wrote:
> Refactor is_tdp_mmu_root() into is_vcpu_using_tdp_mmu() to reduce
> duplicated code at call sites and make the code more readable.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     | 10 +++++-----
>  arch/x86/kvm/mmu/tdp_mmu.c |  2 +-
>  arch/x86/kvm/mmu/tdp_mmu.h |  8 +++++---
>  3 files changed, 11 insertions(+), 9 deletions(-)
> 

...

> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 5fdf63090451..c8cf12809fcf 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -91,16 +91,18 @@ static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { return false; }
>  static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
>  #endif
>  
> -static inline bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
> +static inline bool is_vcpu_using_tdp_mmu(struct kvm_vcpu *vcpu)

I normally like code reduction, but I think in this case we should stay with
something closer to the existing code.

If interpreted literally as "is the vCPU using the TDP MMU", the helper is flat
out broken if arch.mmu points at guest_mmu, in which case the function will
evaluate "false" even if the _vCPU_ is using the TDP MMU for the non-nested MMU.

We could dance around that by doing something like:

	if (is_current_mmu_tdp_mmu(vcpu))
		....

but IMO that's a net negative versus:

	if (is_tdp_mmu(vcpu, vcpu->arch.mmu))
		...

because it's specifically the MMU that is of interest.

>  {
> +	struct kvm *kvm = vcpu->kvm;
>  	struct kvm_mmu_page *sp;
> +	hpa_t root_hpa = vcpu->arch.mmu->root_hpa;
>  
>  	if (!is_tdp_mmu_enabled(kvm))

If we want to eliminate the second param from the prototype, I think we'd be
better off evaluating whether or not this check buys us anything.  Or even better,
eliminate redundant calls to minimize the benefits so that the "fast" check is
unnecessary.

The extra check in kvm_tdp_mmu_map() can simply be dropped; its sole caller
literally wraps it with an is_tdp_mmu... check.

>  		return false;
> -	if (WARN_ON(!VALID_PAGE(hpa)))
> +	if (WARN_ON(!VALID_PAGE(root_hpa)))

And hoist this check out of is_tdp_mmu().  I think it's entirely reasonable to
expect the caller to first test the validity of the root before caring about
whether it's a TDP MMU or legacy MMU.

The VALID_PAGE() checks here and in __direct_map() and FNAME(page_fault) are also
mostly worthless, but they're at least a marginally not-awful sanity check and
not fully redundant.

E.g. achieve this over 2-4 patches:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f96fc3cd5c18..527305c8cede 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2899,9 +2899,6 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
        gfn_t gfn = gpa >> PAGE_SHIFT;
        gfn_t base_gfn = gfn;

-       if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
-               return RET_PF_RETRY;
-
        level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
                                        huge_page_disallowed, &req_level);

@@ -3635,7 +3632,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
                return reserved;
        }

-       if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+       if (is_tdp_mmu(vcpu->arch.mmu))
                leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root);
        else
                leaf = get_walk(vcpu, addr, sptes, &root);
@@ -3801,6 +3798,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
                             bool prefault, int max_level, bool is_tdp)
 {
+       bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
        bool write = error_code & PFERR_WRITE_MASK;
        bool map_writable;

@@ -3810,10 +3808,13 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
        hva_t hva;
        int r;

+       if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
+               return RET_PF_RETRY;
+
        if (page_fault_handle_page_track(vcpu, error_code, gfn))
                return RET_PF_EMULATE;

-       if (!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)) {
+       if (!is_tdp_mmu_fault) {
                r = fast_page_fault(vcpu, gpa, error_code);
                if (r != RET_PF_INVALID)
                        return r;
@@ -3835,7 +3836,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,

        r = RET_PF_RETRY;

-       if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+       if (is_tdp_mmu_fault)
                read_lock(&vcpu->kvm->mmu_lock);
        else
                write_lock(&vcpu->kvm->mmu_lock);
@@ -3846,7 +3847,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
        if (r)
                goto out_unlock;

-       if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+       if (is_tdp_mmu_fault)
                r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
                                    pfn, prefault);
        else
@@ -3854,7 +3855,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
                                 prefault, is_tdp);

 out_unlock:
-       if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+       if (is_tdp_mmu_fault)
                read_unlock(&vcpu->kvm->mmu_lock);
        else
                write_unlock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cc13e001f3de..7ce90bc50774 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -979,11 +979,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
        int level;
        int req_level;

-       if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
-               return RET_PF_RETRY;
-       if (WARN_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)))
-               return RET_PF_RETRY;
-
        level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
                                        huge_page_disallowed, &req_level);

diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index f7a7990da11d..6ebe2331a641 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -92,16 +92,11 @@ static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { return false; }
 static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
 #endif

-static inline bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
+static inline bool is_tdp_mmu(struct kvm_mmu *mmu)
 {
        struct kvm_mmu_page *sp;

-       if (!is_tdp_mmu_enabled(kvm))
-               return false;
-       if (WARN_ON(!VALID_PAGE(hpa)))
-               return false;
-
-       sp = to_shadow_page(hpa);
+       sp = to_shadow_page(mmu->root_hpa);
        if (WARN_ON(!sp))
                return false;


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU
  2021-06-14  9:54 ` [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU Paolo Bonzini
@ 2021-06-14 21:08   ` David Matlack
  2021-06-15  7:16     ` Paolo Bonzini
  0 siblings, 1 reply; 32+ messages in thread
From: David Matlack @ 2021-06-14 21:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Junaid Shahid,
	Andrew Jones

On Mon, Jun 14, 2021 at 11:54:59AM +0200, Paolo Bonzini wrote:
> On 12/06/21 01:56, David Matlack wrote:
> > This patch series adds support for the TDP MMU in the fast_page_fault
> > path, which enables certain write-protection and access tracking faults
> > to be handled without taking the KVM MMU lock. This series brings the
> > performance of these faults up to par with the legacy MMU.
> 
> Hi David,
> 
> I have one very basic question: is the speedup due to lock contention, or to
> cacheline bouncing, or something else altogether? In other words, what do
> the profiles look like before vs. after these patches?

The speed up comes from a combination of:
 - Less time spent in kvm_vcpu_gfn_to_memslot.
 - Less lock contention on the MMU lock in read mode.

Before:

  Overhead  Symbol
-   45.59%  [k] kvm_vcpu_gfn_to_memslot
   - 45.57% kvm_vcpu_gfn_to_memslot
      - 29.25% kvm_page_track_is_active
         + 15.90% direct_page_fault
         + 13.35% mmu_need_write_protect
      + 9.10% kvm_mmu_hugepage_adjust
      + 7.20% try_async_pf
+   18.16%  [k] _raw_read_lock
+   10.57%  [k] direct_page_fault
+    8.77%  [k] handle_changed_spte_dirty_log
+    4.65%  [k] mark_page_dirty_in_slot
     1.62%  [.] run_test
+    1.35%  [k] x86_virt_spec_ctrl
+    1.18%  [k] try_grab_compound_head
[...]

After:

  Overhead  Symbol
+   26.23%  [k] x86_virt_spec_ctrl
+   15.93%  [k] vmx_vmexit
+    6.33%  [k] vmx_vcpu_run
+    4.31%  [k] vcpu_enter_guest
+    3.71%  [k] tdp_iter_next
+    3.47%  [k] __vmx_vcpu_run
+    2.92%  [k] kvm_vcpu_gfn_to_memslot
+    2.71%  [k] vcpu_run
+    2.71%  [k] fast_page_fault
+    2.51%  [k] kvm_vcpu_mark_page_dirty

(Both profiles were captured during "Iteration 2 dirty memory" of
dirty_log_perf_test.)

Related to the kvm_vcpu_gfn_to_memslot overhead: I actually have a set of
patches from Ben I am planning to send soon that will reduce the number of
redundant gfn-to-memslot lookups in the page fault path.
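
To give a rough idea of the direction (this is just an illustrative sketch,
not the actual patches): the fault path would resolve the memslot once up
front and hand it to the helpers that currently each do their own lookup,
e.g. something like

	struct kvm_memory_slot *slot;
	gfn_t gfn = gpa >> PAGE_SHIFT;

	/*
	 * Hypothetical: look up the memslot once in direct_page_fault() and
	 * pass "slot" down, instead of page tracking, hugepage adjustment,
	 * and try_async_pf() each calling kvm_vcpu_gfn_to_memslot() on
	 * their own.
	 */
	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);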

> 
> Thanks,
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/8] KVM: x86/mmu: Refactor is_tdp_mmu_root()
  2021-06-14 19:07   ` Sean Christopherson
@ 2021-06-14 21:23     ` David Matlack
  2021-06-14 21:39       ` Sean Christopherson
  0 siblings, 1 reply; 32+ messages in thread
From: David Matlack @ 2021-06-14 21:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Paolo Bonzini, Junaid Shahid, Andrew Jones

On Mon, Jun 14, 2021 at 07:07:56PM +0000, Sean Christopherson wrote:
> On Fri, Jun 11, 2021, David Matlack wrote:
> > Refactor is_tdp_mmu_root() into is_vcpu_using_tdp_mmu() to reduce
> > duplicated code at call sites and make the code more readable.
> > 
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c     | 10 +++++-----
> >  arch/x86/kvm/mmu/tdp_mmu.c |  2 +-
> >  arch/x86/kvm/mmu/tdp_mmu.h |  8 +++++---
> >  3 files changed, 11 insertions(+), 9 deletions(-)
> > 
> 
> ...

I'm not sure how to interpret this.

> 
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index 5fdf63090451..c8cf12809fcf 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -91,16 +91,18 @@ static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { return false; }
> >  static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
> >  #endif
> >  
> > -static inline bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
> > +static inline bool is_vcpu_using_tdp_mmu(struct kvm_vcpu *vcpu)
> 
> I normally like code reduction, but I think in this case we should stay with
> something closer to the existing code.
> 
> If interpreted literally as "is the vCPU using the TDP MMU", the helper is flat
> out broken if arch.mmu points at guest_mmu, in which case the function will
> evaluate "false" even if the _vCPU_ is using the TDP MMU for the non-nested MMU.

Good point. is_vcpu_using_tdp_mmu is ambiguous: It's not clear if it's
checking if the vcpu is using the TDP MMU "right now" versus "at all".

> 
> We could dance around that by doing something like:
> 
> 	if (is_current_mmu_tdp_mmu(vcpu))
> 		....
> 
> but IMO that's a net negative versus:
> 
> 	if (is_tdp_mmu(vcpu, vcpu->arch.mmu))
> 		...
> 
> because it's specifically the MMU that is of interest.

I agree this is more clear.

> 
> >  {
> > +	struct kvm *kvm = vcpu->kvm;
> >  	struct kvm_mmu_page *sp;
> > +	hpa_t root_hpa = vcpu->arch.mmu->root_hpa;
> >  
> >  	if (!is_tdp_mmu_enabled(kvm))
> 
> If we want to eliminate the second param from the prototype, I think we'd be
> better off evaluating whether or not this check buys us anything.  Or even better,
> eliminate the redundant calls so that the "fast" check buys so little that it
> becomes unnecessary.
> 
> The extra check in kvm_tdp_mmu_map() can simply be dropped; its sole caller
> literally wraps it with an is_tdp_mmu... check.
> 
> >  		return false;
> > -	if (WARN_ON(!VALID_PAGE(hpa)))
> > +	if (WARN_ON(!VALID_PAGE(root_hpa)))
> 
> And hoist this check out of is_tdp_mmu().  I think it's entirely reasonable to
> expect the caller to first test the validity of the root before caring about
> whether it's a TDP MMU or legacy MMU.
> 
> The VALID_PAGE() checks here and in __direct_map() and FNAME(page_fault) are also
> mostly worthless, but they're at least a marginally not-awful sanity check and
> not fully redundant.
> 
> E.g. achieve this over 2-4 patches:

Thanks for the suggestions, I'll take a look at cleaning that up. I am
thinking of making that a separate patch series (including removing this
patch from this series) as the interaction between the two is entirely
superficial. Let me know if that makes sense.

> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f96fc3cd5c18..527305c8cede 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2899,9 +2899,6 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>         gfn_t gfn = gpa >> PAGE_SHIFT;
>         gfn_t base_gfn = gfn;
> 
> -       if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
> -               return RET_PF_RETRY;
> -
>         level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
>                                         huge_page_disallowed, &req_level);
> 
> @@ -3635,7 +3632,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
>                 return reserved;
>         }
> 
> -       if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> +       if (is_tdp_mmu(vcpu->arch.mmu))
>                 leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root);
>         else
>                 leaf = get_walk(vcpu, addr, sptes, &root);
> @@ -3801,6 +3798,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>  static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>                              bool prefault, int max_level, bool is_tdp)
>  {
> +       bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
>         bool write = error_code & PFERR_WRITE_MASK;
>         bool map_writable;
> 
> @@ -3810,10 +3808,13 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>         hva_t hva;
>         int r;
> 
> +       if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
> +               return RET_PF_RETRY;
> +
>         if (page_fault_handle_page_track(vcpu, error_code, gfn))
>                 return RET_PF_EMULATE;
> 
> -       if (!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)) {
> +       if (!is_tdp_mmu_fault) {
>                 r = fast_page_fault(vcpu, gpa, error_code);
>                 if (r != RET_PF_INVALID)
>                         return r;
> @@ -3835,7 +3836,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> 
>         r = RET_PF_RETRY;
> 
> -       if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> +       if (is_tdp_mmu_fault)
>                 read_lock(&vcpu->kvm->mmu_lock);
>         else
>                 write_lock(&vcpu->kvm->mmu_lock);
> @@ -3846,7 +3847,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>         if (r)
>                 goto out_unlock;
> 
> -       if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> +       if (is_tdp_mmu_fault)
>                 r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
>                                     pfn, prefault);
>         else
> @@ -3854,7 +3855,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>                                  prefault, is_tdp);
> 
>  out_unlock:
> -       if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> +       if (is_tdp_mmu_fault)
>                 read_unlock(&vcpu->kvm->mmu_lock);
>         else
>                 write_unlock(&vcpu->kvm->mmu_lock);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index cc13e001f3de..7ce90bc50774 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -979,11 +979,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>         int level;
>         int req_level;
> 
> -       if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
> -               return RET_PF_RETRY;
> -       if (WARN_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)))
> -               return RET_PF_RETRY;
> -
>         level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
>                                         huge_page_disallowed, &req_level);
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index f7a7990da11d..6ebe2331a641 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -92,16 +92,11 @@ static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { return false; }
>  static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
>  #endif
> 
> -static inline bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
> +static inline bool is_tdp_mmu(struct kvm_mmu *mmu)
>  {
>         struct kvm_mmu_page *sp;
> 
> -       if (!is_tdp_mmu_enabled(kvm))
> -               return false;
> -       if (WARN_ON(!VALID_PAGE(hpa)))
> -               return false;
> -
> -       sp = to_shadow_page(hpa);
> +       sp = to_shadow_page(mmu->root_hpa);
>         if (WARN_ON(!sp))
>                 return false;
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/8] KVM: x86/mmu: Refactor is_tdp_mmu_root()
  2021-06-14 21:23     ` David Matlack
@ 2021-06-14 21:39       ` Sean Christopherson
  2021-06-14 22:01         ` David Matlack
  0 siblings, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2021-06-14 21:39 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Paolo Bonzini, Junaid Shahid, Andrew Jones

On Mon, Jun 14, 2021, David Matlack wrote:
> On Mon, Jun 14, 2021 at 07:07:56PM +0000, Sean Christopherson wrote:
> > On Fri, Jun 11, 2021, David Matlack wrote:
> > > Refactor is_tdp_mmu_root() into is_vcpu_using_tdp_mmu() to reduce
> > > duplicated code at call sites and make the code more readable.
> > > 
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c     | 10 +++++-----
> > >  arch/x86/kvm/mmu/tdp_mmu.c |  2 +-
> > >  arch/x86/kvm/mmu/tdp_mmu.h |  8 +++++---
> > >  3 files changed, 11 insertions(+), 9 deletions(-)
> > > 
> > 
> > ...
> 
> I'm not sure how to interpret this.

Ha!  It took me a second to figure out what "this" was.  The ellipsis is just a
way of saying that I trimmed a chunk of text.  I throw it in specifically when
I'm trimming part of a patch to try to make it clear that my response is jumping
to a new context.

> > E.g. achieve this over 2-4 patches:
> 
> Thanks for the suggestions, I'll take a look at cleaning that up. I am
> thinking of making that a separate patch series (including removing this
> patch from this series) as the interaction between the two is entirely
> superficial. Let me know if that makes sense.

Hmm, make 'em a separate series unless there are semantic conflicts, e.g. if you
want to cache the result of is_tdp_mmu() in get_mmio_spte() or something.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 8/8] KVM: selftests: Introduce access_tracking_perf_test
  2021-06-14 17:56   ` Ben Gardon
@ 2021-06-14 21:47     ` David Matlack
  0 siblings, 0 replies; 32+ messages in thread
From: David Matlack @ 2021-06-14 21:47 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, Junaid Shahid, Andrew Jones

On Mon, Jun 14, 2021 at 10:56:56AM -0700, Ben Gardon wrote:
> On Fri, Jun 11, 2021 at 4:57 PM David Matlack <dmatlack@google.com> wrote:
> >
> > This test measures the performance effects of KVM's access tracking.
> > Access tracking is driven by the MMU notifiers test_young, clear_young,
> > and clear_flush_young. These notifiers do not have a direct userspace
> > API, however the clear_young notifier can be triggered by marking
> > pages as idle in /sys/kernel/mm/page_idle/bitmap. This test leverages
> > that mechanism to enable access tracking on guest memory.
> >
> > To measure performance this test runs a VM with a configurable number of
> > vCPUs that each touch every page in disjoint regions of memory.
> > Performance is measured in the time it takes all vCPUs to finish
> > touching their predefined region.
> >
> > Example invocation:
> >
> >   $ ./access_tracking_perf_test -v 8
> >   Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
> >   guest physical test memory offset: 0xffdfffff000
> >
> >   Populating memory             : 1.337752570s
> >   Writing to populated memory   : 0.010177640s
> >   Reading from populated memory : 0.009548239s
> >   Mark memory idle              : 23.973131748s
> >   Writing to idle memory        : 0.063584496s
> >   Mark memory idle              : 24.924652964s
> >   Reading from idle memory      : 0.062042814s
> >
> > Breaking down the results:
> >
> >  * "Populating memory": The time it takes for all vCPUs to perform the
> >    first write to every page in their region.
> >
> >  * "Writing to populated memory" / "Reading from populated memory": The
> >    time it takes for all vCPUs to write and read to every page in their
> >    region after it has been populated. This serves as a control for the
> >    later results.
> >
> >  * "Mark memory idle": The time it takes for every vCPU to mark every
> >    page in their region as idle through page_idle.
> >
> >  * "Writing to idle memory" / "Reading from idle memory": The time it
> >    takes for all vCPUs to write and read to every page in their region
> >    after it has been marked idle.
> >
> > This test should be portable across architectures but it is only enabled
> > for x86_64 since that's all I have tested.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> 
> Reviewed-by: Ben Gardon <bgardon@google.com>
> 
> > ---
> >  tools/testing/selftests/kvm/.gitignore        |   1 +
> >  tools/testing/selftests/kvm/Makefile          |   1 +
> >  .../selftests/kvm/access_tracking_perf_test.c | 419 ++++++++++++++++++
> >  3 files changed, 421 insertions(+)
> >  create mode 100644 tools/testing/selftests/kvm/access_tracking_perf_test.c
> >
> > diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
> > index bd83158e0e0b..32a362d71e05 100644
> > --- a/tools/testing/selftests/kvm/.gitignore
> > +++ b/tools/testing/selftests/kvm/.gitignore
> > @@ -34,6 +34,7 @@
> >  /x86_64/xen_vmcall_test
> >  /x86_64/xss_msr_test
> >  /x86_64/vmx_pmu_msrs_test
> > +/access_tracking_perf_test
> >  /demand_paging_test
> >  /dirty_log_test
> >  /dirty_log_perf_test
> > diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> > index e439d027939d..9f1b478da92b 100644
> > --- a/tools/testing/selftests/kvm/Makefile
> > +++ b/tools/testing/selftests/kvm/Makefile
> > @@ -67,6 +67,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test
> >  TEST_GEN_PROGS_x86_64 += x86_64/vmx_pmu_msrs_test
> >  TEST_GEN_PROGS_x86_64 += x86_64/xen_shinfo_test
> >  TEST_GEN_PROGS_x86_64 += x86_64/xen_vmcall_test
> > +TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
> >  TEST_GEN_PROGS_x86_64 += demand_paging_test
> >  TEST_GEN_PROGS_x86_64 += dirty_log_test
> >  TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
> > diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> > new file mode 100644
> > index 000000000000..60828f2d780f
> > --- /dev/null
> > +++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
> > @@ -0,0 +1,419 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * access_tracking_test
> > + *
> > + * Copyright (C) 2021, Google, Inc.
> > + *
> > + * This test measures the performance effects of KVM's access tracking.
> > + * Access tracking is driven by the MMU notifiers test_young, clear_young, and
> > + * clear_flush_young. These notifiers do not have a direct userspace API,
> > + * however the clear_young notifier can be triggered by marking pages as idle
> > + * in /sys/kernel/mm/page_idle/bitmap. This test leverages that mechanism to
> > + * enable access tracking on guest memory.
> > + *
> > + * To measure performance this test runs a VM with a configurable number of
> > + * vCPUs that each touch every page in disjoint regions of memory. Performance
> > + * is measured in the time it takes all vCPUs to finish touching their
> > + * predefined region.
> > + *
> > + * Note that a deterministic correctness test of access tracking is not possible
> > + * by using page_idle as it exists today. This is for a few reasons:
> > + *
> > + * 1. page_idle only issues clear_young notifiers, which lack a TLB flush. This
> > + *    means subsequent guest accesses are not guaranteed to see page table
> > + *    updates made by KVM until some time in the future.
> > + *
> > + * 2. page_idle only operates on LRU pages. Newly allocated pages are not
> > + *    immediately allocated to LRU lists. Instead they are held in a "pagevec",
> > + *    which is drained to LRU lists some time in the future. There is no
> > + *    userspace API to force this drain to occur.
> > + *
> > + * These limitations are worked around in this test by using a large enough
> > + * region of memory for each vCPU such that the number of translations cached in
> > + * the TLB and the number of pages held in pagevecs are a small fraction of the
> > + * overall workload. And if either of those conditions are not true this test
> > + * will fail rather than silently passing.
> > + */
> > +#include <inttypes.h>
> > +#include <limits.h>
> > +#include <pthread.h>
> > +#include <sys/mman.h>
> > +#include <sys/types.h>
> > +#include <sys/stat.h>
> > +
> > +#include "kvm_util.h"
> > +#include "test_util.h"
> > +#include "perf_test_util.h"
> > +#include "guest_modes.h"
> > +
> > +/* Global variable used to synchronize all of the vCPU threads. */
> > +static int iteration = -1;
> > +
> > +/* Defines what vCPU threads should do during a given iteration. */
> > +static enum {
> > +       /* Run the vCPU to access all its memory. */
> > +       ITERATION_ACCESS_MEMORY,
> > +       /* Mark the vCPU's memory idle in page_idle. */
> > +       ITERATION_MARK_IDLE,
> > +} iteration_work;
> > +
> > +/* Set to true when vCPU threads should exit. */
> > +static bool done;
> > +
> > +/* The iteration that was last completed by each vCPU. */
> > +static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
> > +
> > +/* Whether to overlap the regions of memory vCPUs access. */
> > +static bool overlap_memory_access;
> > +
> > +struct test_params {
> > +       /* The backing source for the region of memory. */
> > +       enum vm_mem_backing_src_type backing_src;
> > +
> > +       /* The amount of memory to allocate for each vCPU. */
> > +       uint64_t vcpu_memory_bytes;
> > +
> > +       /* The number of vCPUs to create in the VM. */
> > +       int vcpus;
> > +};
> > +
> > +static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
> > +{
> > +       uint64_t value;
> > +       off_t offset = index * sizeof(value);
> > +
> > +       TEST_ASSERT(pread(fd, &value, sizeof(value), offset) == sizeof(value),
> > +                   "pread from %s offset 0x%" PRIx64 " failed!",
> > +                   filename, offset);
> > +
> > +       return value;
> > +
> > +}
> > +
> > +static uint64_t lookup_pfn(int pagemap_fd, struct kvm_vm *vm, uint64_t gva)
> > +{
> > +       uint64_t hva = (uint64_t) addr_gva2hva(vm, gva);
> > +       uint64_t entry;
> > +
> > +       entry = pread_uint64(pagemap_fd, "pagemap", hva / getpagesize());
> > +       if (!(entry & (1ULL << 63)))
> > +               return 0;
> > +
> > +       return (entry & ((1ULL << 55) - 1));
> 
> It might be helpful to document these shifts and other constants in this test.

Good suggestion. Will do in v2.
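
Roughly what I have in mind for v2 (names made up, bit layout per
Documentation/admin-guide/mm/pagemap.rst):

	/* Bits in a /proc/PID/pagemap entry. */
	#define PAGEMAP_PRESENT		(1ULL << 63)		/* Bit 63: page present. */
	#define PAGEMAP_PFN_MASK	((1ULL << 55) - 1)	/* Bits 0-54: PFN, if present. */

	...
	if (!(entry & PAGEMAP_PRESENT))
		return 0;

	return entry & PAGEMAP_PFN_MASK;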

> 
> > +}
> > +
> > +static bool is_page_idle(int page_idle_fd, uint64_t pfn)
> > +{
> > +       uint64_t bits = pread_uint64(page_idle_fd, "page_idle", pfn / 64);
> > +
> > +       return !!((bits >> (pfn % 64)) & 1);
> > +}
> > +
> > +static void mark_page_idle(int page_idle_fd, uint64_t pfn)
> > +{
> > +       uint64_t bits = 1ULL << (pfn % 64);
> > +
> > +       TEST_ASSERT(pwrite(page_idle_fd, &bits, 8, 8 * (pfn / 64)) == 8,
> > +                   "Set page_idle bits for PFN 0x%" PRIx64, pfn);
> > +}
> > +
> > +static void mark_vcpu_memory_idle(struct kvm_vm *vm, int vcpu_id)
> > +{
> > +       uint64_t base_gva = perf_test_args.vcpu_args[vcpu_id].gva;
> > +       uint64_t pages = perf_test_args.vcpu_args[vcpu_id].pages;
> > +       uint64_t page;
> > +       uint64_t still_idle = 0;
> > +       uint64_t no_pfn = 0;
> > +       int page_idle_fd;
> > +       int pagemap_fd;
> > +
> > +       /* If vCPUs are using an overlapping region, let vCPU 0 mark it idle. */
> > +       if (overlap_memory_access && vcpu_id)
> > +               return;
> > +
> > +       page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
> > +       TEST_ASSERT(page_idle_fd > 0, "Failed to open page_idle.");
> > +
> > +       pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
> > +       TEST_ASSERT(pagemap_fd > 0, "Failed to open pagemap.");
> 
> Should this have an early skip-check too? I assume not all users of
> this test will be running with the privileges required to access
> pagemap.

Yes good point. Will do in v2.
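
Something along these lines in main(), next to the existing page_idle
check (untested sketch):

	/*
	 * Reading PFNs from /proc/self/pagemap requires CAP_SYS_ADMIN;
	 * unprivileged readers just see zeroed PFNs, so checking that the
	 * open() succeeds isn't sufficient.  Checking for root is a rough
	 * approximation of the actual capability check.
	 */
	if (geteuid()) {
		print_skip("Reading PFNs from /proc/self/pagemap requires root");
		exit(KSFT_SKIP);
	}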

> 
> 
> > +
> > +       for (page = 0; page < pages; page++) {
> > +               uint64_t gva = base_gva + page * perf_test_args.guest_page_size;
> > +               uint64_t pfn = lookup_pfn(pagemap_fd, vm, gva);
> > +
> > +               if (!pfn) {
> > +                       no_pfn++;
> > +                       continue;
> > +               }
> > +
> > +               if (is_page_idle(page_idle_fd, pfn)) {
> > +                       still_idle++;
> > +                       continue;
> > +               }
> > +
> > +               mark_page_idle(page_idle_fd, pfn);
> > +       }
> > +
> > +       /*
> > +        * Assumption: Less than 1% of pages are going to be swapped out from
> > +        * under us during this test.
> > +        */
> > +       TEST_ASSERT(no_pfn < pages / 100,
> > +                   "vCPU %d: No PFN for %" PRIu64 " out of %" PRIu64 " pages.",
> > +                   vcpu_id, no_pfn, pages);
> > +
> > +       /*
> > +        * Test that at least 90% of memory has been marked idle (the rest might
> > +        * not be marked idle because the pages have not yet made it to an LRU
> > +        * list or the translations are still cached in the TLB). 90% is
> > +        * arbitrary; high enough that we ensure most memory access went through
> > +        * access tracking but low enough as to not make the test too brittle
> > +        * over time and across architectures.
> > +        */
> > +       TEST_ASSERT(still_idle < pages / 10,
> > +                   "vCPU%d: Too many pages still idle (%"PRIu64 " out of %"
> > +                   PRIu64 ").\n",
> > +                   vcpu_id, still_idle, pages);
> > +
> > +       close(page_idle_fd);
> > +       close(pagemap_fd);
> > +}
> > +
> > +static void assert_ucall(struct kvm_vm *vm, uint32_t vcpu_id,
> > +                        uint64_t expected_ucall)
> > +{
> > +       struct ucall uc;
> > +       uint64_t actual_ucall = get_ucall(vm, vcpu_id, &uc);
> > +
> > +       TEST_ASSERT(expected_ucall == actual_ucall,
> > +                   "Guest exited unexpectedly (expected ucall %" PRIu64
> > +                   ", got %" PRIu64 ")",
> > +                   expected_ucall, actual_ucall);
> > +}
> > +
> > +static bool spin_wait_for_next_iteration(int *current_iteration)
> > +{
> > +       int last_iteration = *current_iteration;
> > +
> > +       do {
> > +               if (READ_ONCE(done))
> > +                       return false;
> > +
> > +               *current_iteration = READ_ONCE(iteration);
> > +       } while (last_iteration == *current_iteration);
> > +
> > +       return true;
> > +}
> > +
> > +static void *vcpu_thread_main(void *arg)
> > +{
> > +       struct perf_test_vcpu_args *vcpu_args = arg;
> > +       struct kvm_vm *vm = perf_test_args.vm;
> > +       int vcpu_id = vcpu_args->vcpu_id;
> > +       int current_iteration = -1;
> > +
> > +       vcpu_args_set(vm, vcpu_id, 1, vcpu_id);
> > +
> > +       while (spin_wait_for_next_iteration(&current_iteration)) {
> > +               switch (READ_ONCE(iteration_work)) {
> > +               case ITERATION_ACCESS_MEMORY:
> > +                       vcpu_run(vm, vcpu_id);
> > +                       assert_ucall(vm, vcpu_id, UCALL_SYNC);
> > +                       break;
> > +               case ITERATION_MARK_IDLE:
> > +                       mark_vcpu_memory_idle(vm, vcpu_id);
> > +                       break;
> > +               };
> > +
> > +               vcpu_last_completed_iteration[vcpu_id] = current_iteration;
> > +       }
> > +
> > +       return NULL;
> > +}
> > +
> > +static void spin_wait_for_vcpu(int vcpu_id, int target_iteration)
> > +{
> > +       while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) !=
> > +              target_iteration) {
> > +               continue;
> > +       }
> > +}
> > +
> > +/* The type of memory accesses to perform in the VM. */
> > +enum access_type {
> > +       ACCESS_READ,
> > +       ACCESS_WRITE,
> > +};
> > +
> > +static void run_iteration(struct kvm_vm *vm, int vcpus, const char *description)
> > +{
> > +       struct timespec ts_start;
> > +       struct timespec ts_elapsed;
> > +       int next_iteration;
> > +       int vcpu_id;
> > +
> > +       /* Kick off the vCPUs by incrementing iteration. */
> > +       next_iteration = ++iteration;
> > +
> > +       clock_gettime(CLOCK_MONOTONIC, &ts_start);
> > +
> > +       /* Wait for all vCPUs to finish the iteration. */
> > +       for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++)
> > +               spin_wait_for_vcpu(vcpu_id, next_iteration);
> > +
> > +       ts_elapsed = timespec_elapsed(ts_start);
> > +       pr_info("%-30s: %ld.%09lds\n",
> > +               description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
> > +}
> > +
> > +static void access_memory(struct kvm_vm *vm, int vcpus, enum access_type access,
> > +                         const char *description)
> > +{
> > +       perf_test_args.wr_fract = (access == ACCESS_READ) ? INT_MAX : 1;
> > +       sync_global_to_guest(vm, perf_test_args);
> > +       iteration_work = ITERATION_ACCESS_MEMORY;
> > +       run_iteration(vm, vcpus, description);
> > +}
> > +
> > +static void mark_memory_idle(struct kvm_vm *vm, int vcpus)
> > +{
> > +       /*
> > +        * Even though this parallelizes the work across vCPUs, this is still a
> > +        * very slow operation because page_idle forces the test to mark one pfn
> > +        * at a time and the clear_young notifier serializes on the KVM MMU
> > +        * lock.
> > +        */
> > +       pr_debug("Marking VM memory idle (slow)...\n");
> > +       iteration_work = ITERATION_MARK_IDLE;
> > +       run_iteration(vm, vcpus, "Mark memory idle");
> > +}
> > +
> > +static pthread_t *create_vcpu_threads(int vcpus)
> > +{
> > +       pthread_t *vcpu_threads;
> > +       int i;
> > +
> > +       vcpu_threads = malloc(vcpus * sizeof(vcpu_threads[0]));
> > +       TEST_ASSERT(vcpu_threads, "Failed to allocate vcpu_threads.");
> > +
> > +       for (i = 0; i < vcpus; i++) {
> > +               vcpu_last_completed_iteration[i] = iteration;
> > +               pthread_create(&vcpu_threads[i], NULL, vcpu_thread_main,
> > +                              &perf_test_args.vcpu_args[i]);
> > +       }
> > +
> > +       return vcpu_threads;
> > +}
> > +
> > +static void terminate_vcpu_threads(pthread_t *vcpu_threads, int vcpus)
> > +{
> > +       int i;
> > +
> > +       /* Set done to signal the vCPU threads to exit */
> > +       done = true;
> > +
> > +       for (i = 0; i < vcpus; i++)
> > +               pthread_join(vcpu_threads[i], NULL);
> > +}
> > +
> > +static void run_test(enum vm_guest_mode mode, void *arg)
> > +{
> > +       struct test_params *params = arg;
> > +       struct kvm_vm *vm;
> > +       pthread_t *vcpu_threads;
> > +       int vcpus = params->vcpus;
> > +
> > +       vm = perf_test_create_vm(mode, vcpus, params->vcpu_memory_bytes,
> > +                                params->backing_src);
> > +
> > +       perf_test_setup_vcpus(vm, vcpus, params->vcpu_memory_bytes,
> > +                             !overlap_memory_access);
> > +
> > +       vcpu_threads = create_vcpu_threads(vcpus);
> > +
> > +       pr_info("\n");
> > +       access_memory(vm, vcpus, ACCESS_WRITE, "Populating memory");
> > +
> > +       /* As a control, read and write to the populated memory first. */
> > +       access_memory(vm, vcpus, ACCESS_WRITE, "Writing to populated memory");
> > +       access_memory(vm, vcpus, ACCESS_READ, "Reading from populated memory");
> > +
> > +       /* Repeat on memory that has been marked as idle. */
> > +       mark_memory_idle(vm, vcpus);
> > +       access_memory(vm, vcpus, ACCESS_WRITE, "Writing to idle memory");
> > +       mark_memory_idle(vm, vcpus);
> > +       access_memory(vm, vcpus, ACCESS_READ, "Reading from idle memory");
> > +
> > +       terminate_vcpu_threads(vcpu_threads, vcpus);
> > +       free(vcpu_threads);
> > +       perf_test_destroy_vm(vm);
> > +}
> > +
> > +static void help(char *name)
> > +{
> > +       puts("");
> > +       printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]  [-s mem_type]\n",
> > +              name);
> > +       puts("");
> > +       printf(" -h: Display this help message.");
> > +       guest_modes_help();
> > +       printf(" -b: specify the size of the memory region which should be\n"
> > +              "     dirtied by each vCPU. e.g. 10M or 3G.\n"
> > +              "     (default: 1G)\n");
> > +       printf(" -v: specify the number of vCPUs to run.\n");
> > +       printf(" -o: Overlap guest memory accesses instead of partitioning\n"
> > +              "     them into a separate region of memory for each vCPU.\n");
> > +       printf(" -s: specify the type of memory that should be used to\n"
> > +              "     back the guest data region.\n\n");
> > +       backing_src_help();
> > +       puts("");
> > +       exit(0);
> > +}
> > +
> > +int main(int argc, char *argv[])
> > +{
> > +       struct test_params params = {
> > +               .backing_src = VM_MEM_SRC_ANONYMOUS,
> > +               .vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
> > +               .vcpus = 1,
> > +       };
> > +       int page_idle_fd;
> > +       int opt;
> > +
> > +       guest_modes_append_default();
> > +
> > +       while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
> > +               switch (opt) {
> > +               case 'm':
> > +                       guest_modes_cmdline(optarg);
> > +                       break;
> > +               case 'b':
> > +                       params.vcpu_memory_bytes = parse_size(optarg);
> > +                       break;
> > +               case 'v':
> > +                       params.vcpus = atoi(optarg);
> > +                       break;
> > +               case 'o':
> > +                       overlap_memory_access = true;
> > +                       break;
> > +               case 's':
> > +                       params.backing_src = parse_backing_src_type(optarg);
> > +                       break;
> > +               case 'h':
> > +               default:
> > +                       help(argv[0]);
> > +                       break;
> > +               }
> > +       }
> > +
> > +       page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
> > +       if (page_idle_fd < 0) {
> > +               print_skip("CONFIG_IDLE_PAGE_TRACKING is not enabled");
> > +               exit(KSFT_SKIP);
> > +       }
> > +       close(page_idle_fd);
> > +
> > +       for_each_guest_mode(run_test, &params);
> > +
> > +       return 0;
> > +}
> > --
> > 2.32.0.272.g935e593368-goog
> >

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/8] KVM: x86/mmu: Refactor is_tdp_mmu_root()
  2021-06-14 21:39       ` Sean Christopherson
@ 2021-06-14 22:01         ` David Matlack
  0 siblings, 0 replies; 32+ messages in thread
From: David Matlack @ 2021-06-14 22:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Paolo Bonzini, Junaid Shahid, Andrew Jones

On Mon, Jun 14, 2021 at 09:39:33PM +0000, Sean Christopherson wrote:
> On Mon, Jun 14, 2021, David Matlack wrote:
> > On Mon, Jun 14, 2021 at 07:07:56PM +0000, Sean Christopherson wrote:
> > > On Fri, Jun 11, 2021, David Matlack wrote:
> > > > Refactor is_tdp_mmu_root() into is_vcpu_using_tdp_mmu() to reduce
> > > > duplicated code at call sites and make the code more readable.
> > > > 
> > > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > > ---
> > > >  arch/x86/kvm/mmu/mmu.c     | 10 +++++-----
> > > >  arch/x86/kvm/mmu/tdp_mmu.c |  2 +-
> > > >  arch/x86/kvm/mmu/tdp_mmu.h |  8 +++++---
> > > >  3 files changed, 11 insertions(+), 9 deletions(-)
> > > > 
> > > 
> > > ...
> > 
> > I'm not sure how to interpret this.
> 
> Ha!  It took me a second to figure out what "this" was.  The ellipsis is just a
> way of saying that I trimmed a chunk of text.  I throw it in specifically when
> I'm trimming part of a patch to try to make it clear that my response is jumping
> to a new context.

Ohhh, that makes sense. Thanks.

> 
> > > E.g. achieve this over 2-4 patches:
> > 
> > Thanks for the suggestions, I'll take a look at cleaning that up. I am
> > thinking of making that a separate patch series (including removing this
> > patch from this series) as the interaction between the two is entirely
> > superficial. Let me know if that makes sense.
> 
> Hmm, make 'em a separate series unless there are semantic conflicts, e.g. if you
> want to cache the result of is_tdp_mmu() in get_mmio_spte() or something.

Ack, will do.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/8] KVM: x86/mmu: Also record spteps in shadow_page_walk
  2021-06-11 23:56 ` [PATCH 5/8] KVM: x86/mmu: Also record spteps in shadow_page_walk David Matlack
  2021-06-14 17:56   ` Ben Gardon
@ 2021-06-14 22:27   ` David Matlack
  2021-06-14 22:59   ` Sean Christopherson
  2 siblings, 0 replies; 32+ messages in thread
From: David Matlack @ 2021-06-14 22:27 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones

On Fri, Jun 11, 2021 at 11:56:58PM +0000, David Matlack wrote:
> In order to use walk_shadow_page_lockless() in fast_page_fault() we need
> to also record the spteps.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 1 +
>  arch/x86/kvm/mmu/mmu_internal.h | 3 +++
>  arch/x86/kvm/mmu/tdp_mmu.c      | 1 +
>  3 files changed, 5 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8140c262f4d3..765f5b01768d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3538,6 +3538,7 @@ static bool walk_shadow_page_lockless(struct kvm_vcpu *vcpu, u64 addr,
>  		spte = mmu_spte_get_lockless(it.sptep);
>  		walk->last_level = it.level;
>  		walk->sptes[it.level] = spte;
> +		walk->spteps[it.level] = it.sptep;
>  
>  		if (!is_shadow_present_pte(spte))
>  			break;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 26da6ca30fbf..0fefbd5d6c95 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -178,6 +178,9 @@ struct shadow_page_walk {
>  
>  	/* The spte value at each level. */
>  	u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
> +
> +	/* The spte pointers at each level. */
> +	u64 *spteps[PT64_ROOT_MAX_LEVEL + 1];
>  };
>  
>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 36f4844a5f95..7279d17817a1 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1529,6 +1529,7 @@ bool kvm_tdp_mmu_walk_lockless(struct kvm_vcpu *vcpu, u64 addr,
>  
>  		walk->last_level = iter.level;
>  		walk->sptes[iter.level] = iter.old_spte;
> +		walk->spteps[iter.level] = iter.sptep;

I think this should technically be:

		walk->spteps[iter.level] = rcu_dereference(iter.sptep);

>  	}
>  
>  	return walk_ok;
> -- 
> 2.32.0.272.g935e593368-goog
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/8] KVM: x86/mmu: fast_page_fault support for the TDP MMU
  2021-06-14 17:56     ` Ben Gardon
@ 2021-06-14 22:34       ` David Matlack
  0 siblings, 0 replies; 32+ messages in thread
From: David Matlack @ 2021-06-14 22:34 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm list, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones

On Mon, Jun 14, 2021 at 10:56:47AM -0700, Ben Gardon wrote:
> On Fri, Jun 11, 2021 at 4:59 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Fri, Jun 11, 2021 at 4:57 PM David Matlack <dmatlack@google.com> wrote:
> > >
> > > This commit enables the fast_page_fault handler to work when the TDP MMU
> > > is enabled by leveraging the new walk_shadow_page_lockless* API to
> > > collect page walks independent of the TDP MMU.
> > >
> > > fast_page_fault was already using
> > > walk_shadow_page_lockless_{begin,end}(), we just have to change the
> > > actual walk to use walk_shadow_page_lockless() which does the right
> > > thing if the TDP MMU is in use.
> > >
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> >
> > Adding this feature was suggested by Ben Gardon:
> >
> > Suggested-by: Ben Gardon <bgardon@google.com>
> 
> Reviewed-by: Ben Gardon <bgardon@google.com>
> 
> >
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c | 52 +++++++++++++++++-------------------------
> > >  1 file changed, 21 insertions(+), 31 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 765f5b01768d..5562727c3699 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -657,6 +657,9 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> > >         local_irq_enable();
> > >  }
> > >
> > > +static bool walk_shadow_page_lockless(struct kvm_vcpu *vcpu, u64 addr,
> > > +                                     struct shadow_page_walk *walk);
> > > +
> > >  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > >  {
> > >         int r;
> > > @@ -2967,14 +2970,9 @@ static bool page_fault_can_be_fast(u32 error_code)
> > >   * Returns true if the SPTE was fixed successfully. Otherwise,
> > >   * someone else modified the SPTE from its original value.
> > >   */
> > > -static bool
> > > -fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> > > -                       u64 *sptep, u64 old_spte, u64 new_spte)
> > > +static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, gpa_t gpa,
> > > +                                   u64 *sptep, u64 old_spte, u64 new_spte)
> > >  {
> > > -       gfn_t gfn;
> > > -
> > > -       WARN_ON(!sp->role.direct);
> > > -
> > >         /*
> > >          * Theoretically we could also set dirty bit (and flush TLB) here in
> > >          * order to eliminate unnecessary PML logging. See comments in
> > > @@ -2990,14 +2988,8 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> > >         if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
> > >                 return false;
> > >
> > > -       if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) {
> > > -               /*
> > > -                * The gfn of direct spte is stable since it is
> > > -                * calculated by sp->gfn.
> > > -                */
> > > -               gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> > > -               kvm_vcpu_mark_page_dirty(vcpu, gfn);
> > > -       }
> > > +       if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
> > > +               kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> 
> I love how cleanly you've implemented TDP MMU fast PF support here
> with so little code duplication. I think this approach will work well.
> 
> My only reservation is that this links the way we locklessly change
> (and handle changes to) SPTEs between the TDP and legacy MMUs.
> 
> fast_pf_fix_direct_spte is certainly a much simpler function than
> tdp_mmu_set_spte_atomic, but it might be worth adding a comment
> explaining that the function can modify SPTEs in a TDP MMU paging
> structure. Alternatively, fast_page_fault could call
> tdp_mmu_set_spte_atomic, but I don't know if that would carry a
> performance penalty. On the other hand, the handling for this
> particular type of SPTE modification is unlikely to change, so it
> might not matter.

I'm open to delegating the cmpxchg64 to separate functions for the TDP
and legacy MMU, but didn't see any reason why it would be necessary.
That combined with the fact that it would require a bunch of new code
and helper functions in the TDP MMU (e.g. tdp_mmu_set_spte_atomic has to
be split up further so it doesn't do the lockdep check) led me to just
re-use fast_pf_fix_direct_spte.

I can add a comment though in v2 in fast_pf_fix_direct_spte and another
in tdp_mmu_set_spte_atomic explaining how the two interact.
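
Roughly the kind of comment I'm thinking of (wording TBD):

	/*
	 * Note: this can also modify TDP MMU SPTEs, outside of mmu_lock.  A
	 * bare cmpxchg64() is sufficient (versus going through
	 * tdp_mmu_set_spte_atomic()) because the fast path only makes the
	 * SPTE more permissive and never changes the PFN, so none of the
	 * extra bookkeeping done by the TDP MMU's SPTE-change handling is
	 * needed here.
	 */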

> 
> 
> 
> > >
> > >         return true;
> > >  }
> > > @@ -3019,10 +3011,9 @@ static bool is_access_allowed(u32 fault_err_code, u64 spte)
> > >   */
> > >  static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
> > >  {
> > > -       struct kvm_shadow_walk_iterator iterator;
> > > -       struct kvm_mmu_page *sp;
> > >         int ret = RET_PF_INVALID;
> > >         u64 spte = 0ull;
> > > +       u64 *sptep = NULL;
> > >         uint retry_count = 0;
> > >
> > >         if (!page_fault_can_be_fast(error_code))
> > > @@ -3031,17 +3022,19 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
> > >         walk_shadow_page_lockless_begin(vcpu);
> > >
> > >         do {
> > > +               struct shadow_page_walk walk;
> > >                 u64 new_spte;
> > >
> > > -               for_each_shadow_entry_lockless(vcpu, gpa, iterator, spte)
> > > -                       if (!is_shadow_present_pte(spte))
> > > -                               break;
> > > +               if (!walk_shadow_page_lockless(vcpu, gpa, &walk))
> > > +                       break;
> > > +
> > > +               spte = walk.sptes[walk.last_level];
> > > +               sptep = walk.spteps[walk.last_level];
> > >
> > >                 if (!is_shadow_present_pte(spte))
> > >                         break;
> > >
> > > -               sp = sptep_to_sp(iterator.sptep);
> > > -               if (!is_last_spte(spte, sp->role.level))
> > > +               if (!is_last_spte(spte, walk.last_level))
> > >                         break;
> > >
> > >                 /*
> > > @@ -3084,7 +3077,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
> > >                          *
> > >                          * See the comments in kvm_arch_commit_memory_region().
> > >                          */
> > > -                       if (sp->role.level > PG_LEVEL_4K)
> > > +                       if (walk.last_level > PG_LEVEL_4K)
> > >                                 break;
> > >                 }
> > >
> > > @@ -3098,8 +3091,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
> > >                  * since the gfn is not stable for indirect shadow page. See
> > >                  * Documentation/virt/kvm/locking.rst to get more detail.
> > >                  */
> > > -               if (fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte,
> > > -                                           new_spte)) {
> > > +               if (fast_pf_fix_direct_spte(vcpu, gpa, sptep, spte, new_spte)) {
> > >                         ret = RET_PF_FIXED;
> > >                         break;
> > >                 }
> > > @@ -3112,7 +3104,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
> > >
> > >         } while (true);
> > >
> > > -       trace_fast_page_fault(vcpu, gpa, error_code, iterator.sptep, spte, ret);
> > > +       trace_fast_page_fault(vcpu, gpa, error_code, sptep, spte, ret);
> > >         walk_shadow_page_lockless_end(vcpu);
> > >
> > >         return ret;
> > > @@ -3748,11 +3740,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> > >         if (page_fault_handle_page_track(vcpu, error_code, gfn))
> > >                 return RET_PF_EMULATE;
> > >
> > > -       if (!is_vcpu_using_tdp_mmu(vcpu)) {
> > > -               r = fast_page_fault(vcpu, gpa, error_code);
> > > -               if (r != RET_PF_INVALID)
> > > -                       return r;
> > > -       }
> > > +       r = fast_page_fault(vcpu, gpa, error_code);
> > > +       if (r != RET_PF_INVALID)
> > > +               return r;
> > >
> > >         r = mmu_topup_memory_caches(vcpu, false);
> > >         if (r)
> > > --
> > > 2.32.0.272.g935e593368-goog
> > >

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/8] KVM: x86/mmu: Also record spteps in shadow_page_walk
  2021-06-11 23:56 ` [PATCH 5/8] KVM: x86/mmu: Also record spteps in shadow_page_walk David Matlack
  2021-06-14 17:56   ` Ben Gardon
  2021-06-14 22:27   ` David Matlack
@ 2021-06-14 22:59   ` Sean Christopherson
  2021-06-14 23:39     ` David Matlack
  2 siblings, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2021-06-14 22:59 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Paolo Bonzini, Junaid Shahid, Andrew Jones

On Fri, Jun 11, 2021, David Matlack wrote:
> In order to use walk_shadow_page_lockless() in fast_page_fault() we need
> to also record the spteps.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 1 +
>  arch/x86/kvm/mmu/mmu_internal.h | 3 +++
>  arch/x86/kvm/mmu/tdp_mmu.c      | 1 +
>  3 files changed, 5 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8140c262f4d3..765f5b01768d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3538,6 +3538,7 @@ static bool walk_shadow_page_lockless(struct kvm_vcpu *vcpu, u64 addr,
>  		spte = mmu_spte_get_lockless(it.sptep);
>  		walk->last_level = it.level;
>  		walk->sptes[it.level] = spte;
> +		walk->spteps[it.level] = it.sptep;
>  
>  		if (!is_shadow_present_pte(spte))
>  			break;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 26da6ca30fbf..0fefbd5d6c95 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -178,6 +178,9 @@ struct shadow_page_walk {
>  
>  	/* The spte value at each level. */
>  	u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
> +
> +	/* The spte pointers at each level. */
> +	u64 *spteps[PT64_ROOT_MAX_LEVEL + 1];

Hrm.  I'm not sure how I feel about this patch, or about shadow_page_walk in
general.  On the one hand, I like having a common API.  On the other hand, I
really don't like mixing two different protection schemes, e.g. this really
should be

        tdp_ptep_t spteps;

in order to gain the RCU annotations for TDP, but those RCU annotations are
problematic because the legacy MMU doesn't use RCU to protect its page tables.

I also don't like forcing the caller to hold the "lock" for longer than is
necessary, e.g. get_mmio_spte() doesn't require access to the page tables after
the initial walk, but the common spteps[] kinda sorta forces its hand.

The two use cases (and the only common use cases I can see) have fairly different
requirements.  The MMIO check wants the SPTEs at _all_ levels, whereas the fast
page fault handler wants the SPTE _and_ its pointer at a single level.  So I
wonder if by providing a super generic API we'd actually increase complexity.

I.e. rather than provide a completely generic API, maybe it would be better to
have two distinct API.  That wouldn't fix the tdp_ptep_t issue, but it would at
least bound it to some degree and make the code more obvious.  I suspect it would
also reduce the code churn, though that's not necessarily a goal in and of itself.

E.g. for fast_page_fault():

        walk_shadow_page_lockless_begin(vcpu);

        do {
                sptep = get_spte_lockless(..., &spte);
                if (!is_shadow_present_pte(spte))
                        break;

                sp = sptep_to_sp(sptep);
                if (!is_last_spte(spte, sp->role.level))
                        break;

                ...
        } while(true);

        walk_shadow_page_lockless_end(vcpu);


and for get_mmio_spte():
        walk_shadow_page_lockless_begin(vcpu);
        leaf = get_sptes_lockless(vcpu, addr, sptes, &root);
        if (unlikely(leaf < 0)) {
                *sptep = 0ull;
                return reserved;
        }

        walk_shadow_page_lockless_end(vcpu);


And if we look at the TDP MMU implementations, they aren't sharing _that_ much
code, and the code that is shared isn't all that interesting, e.g. if we really
wanted to we could macro-magic away the boilerplate, but I think even I would
balk at the result :-)

int kvm_tdp_mmu_get_sptes_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
				   int *root_level)
{
	struct tdp_iter iter;
	struct kvm_mmu *mmu = vcpu->arch.mmu;
	gfn_t gfn = addr >> PAGE_SHIFT;
	int leaf = -1;

	*root_level = vcpu->arch.mmu->shadow_root_level;

	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
		leaf = iter.level;
		sptes[leaf] = iter.old_spte;
	}

	return leaf;
}

u64 *kvm_tdp_mmu_get_spte_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *spte)
{
	struct kvm_mmu *mmu = vcpu->arch.mmu;
	gfn_t gfn = addr >> PAGE_SHIFT;
	struct tdp_iter iter;
	u64 *sptep = NULL;

	*spte = 0ull;

	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
		/*
		 * Here be a comment about the unfortunate differences between
		 * the TDP MMU and the legacy MMU.
		 */
		sptep = (u64 * __force)iter.sptep;
		*spte = iter.old_spte;
	}
	return sptep;
}


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/8] KVM: x86/mmu: Also record spteps in shadow_page_walk
  2021-06-14 22:59   ` Sean Christopherson
@ 2021-06-14 23:39     ` David Matlack
  2021-06-15  0:22       ` Sean Christopherson
  0 siblings, 1 reply; 32+ messages in thread
From: David Matlack @ 2021-06-14 23:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Paolo Bonzini, Junaid Shahid, Andrew Jones

On Mon, Jun 14, 2021 at 10:59:14PM +0000, Sean Christopherson wrote:
> On Fri, Jun 11, 2021, David Matlack wrote:
> > In order to use walk_shadow_page_lockless() in fast_page_fault() we need
> > to also record the spteps.
> > 
> > No functional change intended.
> > 
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c          | 1 +
> >  arch/x86/kvm/mmu/mmu_internal.h | 3 +++
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 1 +
> >  3 files changed, 5 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 8140c262f4d3..765f5b01768d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3538,6 +3538,7 @@ static bool walk_shadow_page_lockless(struct kvm_vcpu *vcpu, u64 addr,
> >  		spte = mmu_spte_get_lockless(it.sptep);
> >  		walk->last_level = it.level;
> >  		walk->sptes[it.level] = spte;
> > +		walk->spteps[it.level] = it.sptep;
> >  
> >  		if (!is_shadow_present_pte(spte))
> >  			break;
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 26da6ca30fbf..0fefbd5d6c95 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -178,6 +178,9 @@ struct shadow_page_walk {
> >  
> >  	/* The spte value at each level. */
> >  	u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
> > +
> > +	/* The spte pointers at each level. */
> > +	u64 *spteps[PT64_ROOT_MAX_LEVEL + 1];
> 
> Hrm.  I'm not sure how I feel about this patch, or about shadow_page_walk in
> general.  On the one hand, I like having a common API.  On the other hand, I
> really don't like mixing two different protection schemes, e.g. this really
> should be
> 
>         tdp_ptep_t spteps;
> 
> in order to gain the RCU annotations for TDP, but those RCU annotations are
> problematic because the legacy MMU doesn't use RCU to protect its page tables.
> 
> I also don't like forcing the caller to hold the "lock" for longer than is
> necessary, e.g. get_mmio_spte() doesn't require access to the page tables after
> the initial walk, but the common spteps[] kinda sorta forces its hand.

Yeah, this felt gross to implement. I like your idea of creating
separate APIs instead.

> 
> The two use cases (and the only common use cases I can see) have fairly different
> requirements.  The MMIO check wants the SPTEs at _all_ levels, whereas the fast
> page fault handler wants the SPTE _and_ its pointer at a single level.  So I
> wonder if by providing a super generic API we'd actually increase complexity.
> 
> I.e. rather than provide a completely generic API, maybe it would be better to
> have two distinct API.  That wouldn't fix the tdp_ptep_t issue, but it would at
> least bound it to some degree and make the code more obvious.

Does the tdp_ptep_t issue go away if kvm_tdp_mmu_get_spte_lockless
returns an rcu_dereference'd version of the pointer? See below.

> I suspect it would
> also reduce the code churn, though that's not necessarily a goal in and of itself.

Makes sense. So keep walk_shadow_page_lockless_{begin,end}() as a
generic API, but provide separate helper functions for get_mmio_spte()
and fast_page_fault() so that each gets exactly what it needs.
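
Something like the rough sketch below is what I have in mind.  The TDP
branch is an assumption on my part; the legacy branch is just what the
existing helpers already do (publish the walk via vcpu->mode with IRQs
off so a page-table free-er stuck in the kvm_flush_remote_tlbs() IPI has
to wait for the walk to finish).

static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
{
	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)) {
		/* TDP MMU page tables are protected by RCU. */
		rcu_read_lock();
	} else {
		/*
		 * Prevent page table teardown by making any free-er wait on
		 * the kvm_flush_remote_tlbs() IPI, and make sure subsequent
		 * spte reads are not reordered ahead of the vcpu->mode write.
		 */
		local_irq_disable();
		smp_store_mb(vcpu->mode, READING_SHADOW_PAGE_TABLES);
	}
}

static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
{
	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)) {
		rcu_read_unlock();
	} else {
		/* Make sure the spte reads complete before vcpu->mode is reset. */
		smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE);
		local_irq_enable();
	}
}

That way fast_page_fault() and get_mmio_spte() never have to care which
MMU backs the walk.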

> 
> E.g. for fast_page_fault():
> 
>         walk_shadow_page_lockless_begin(vcpu);
> 
>         do {
>                 sptep = get_spte_lockless(..., &spte);
>                 if (!is_shadow_present_pte(spte))
>                         break;
> 
>                 sp = sptep_to_sp(sptep);
>                 if (!is_last_spte(spte, sp->role.level))
>                         break;
> 
>                 ...
>         } while(true);
> 
>         walk_shadow_page_lockless_end(vcpu);
> 
> 
> and for get_mmio_spte():
>         walk_shadow_page_lockless_begin(vcpu);
>         leaf = get_sptes_lockless(vcpu, addr, sptes, &root);
>         walk_shadow_page_lockless_end(vcpu);
> 
>         if (unlikely(leaf < 0)) {
>                 *sptep = 0ull;
>                 return reserved;
>         }
> 
> 
> And if we look at the TDP MMU implementations, they aren't sharing _that_ much
> code, and the code that is shared isn't all that interesting, e.g. if we really
> wanted to we could macro-magic away the boilerplate, but I think even I would
> balk at the result :-)

Agreed.

> 
> int kvm_tdp_mmu_get_sptes_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> 				   int *root_level)
> {
> 	struct tdp_iter iter;
> 	struct kvm_mmu *mmu = vcpu->arch.mmu;
> 	gfn_t gfn = addr >> PAGE_SHIFT;
> 	int leaf = -1;
> 
> 	*root_level = vcpu->arch.mmu->shadow_root_level;
> 
> 	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> 		leaf = iter.level;
> 		sptes[leaf] = iter.old_spte;
> 	}
> 
> 	return leaf;
> }
> 
> u64 *kvm_tdp_mmu_get_spte_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *spte)
> {
> 	struct kvm_mmu *mmu = vcpu->arch.mmu;
> 	gfn_t gfn = addr >> PAGE_SHIFT;
> 	struct tdp_iter iter;
> 	u64 *sptep = NULL;
> 
> 	*spte = 0ull;
> 
> 	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> 		/*
> 		 * Here be a comment about the unfortunate differences between
> 		 * the TDP MMU and the legacy MMU.
> 		 */
> 		sptep = (u64 * __force)iter.sptep;

Instead, should this be:

		sptep = rcu_dereference(iter.sptep);

?

> 		*spte = iter.old_spte;
> 	}
> 	return sptep;
> }
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/8] KVM: x86/mmu: Also record spteps in shadow_page_walk
  2021-06-14 23:39     ` David Matlack
@ 2021-06-15  0:22       ` Sean Christopherson
  0 siblings, 0 replies; 32+ messages in thread
From: Sean Christopherson @ 2021-06-15  0:22 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Paolo Bonzini, Junaid Shahid, Andrew Jones

On Mon, Jun 14, 2021, David Matlack wrote:
> On Mon, Jun 14, 2021 at 10:59:14PM +0000, Sean Christopherson wrote:
> > The two use cases (and the only common use cases I can see) have fairly different
> > requirements.  The MMIO check wants the SPTEs at _all_ levels, whereas the fast
> > page fault handler wants the SPTE _and_ its pointer at a single level.  So I
> > wonder if by providing a super generic API we'd actually increase complexity.
> > 
> > I.e. rather than provide a completely generic API, maybe it would be better to
> > have two distinct API.  That wouldn't fix the tdp_ptep_t issue, but it would at
> > least bound it to some degree and make the code more obvious.
> 
> Does the tdp_ptep_t issue go away if kvm_tdp_mmu_get_spte_lockless
> returns an rcu_dereference'd version of the pointer? See below.

Sort of?

> > u64 *kvm_tdp_mmu_get_spte_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *spte)
> > {
> > 	struct kvm_mmu *mmu = vcpu->arch.mmu;
> > 	gfn_t gfn = addr >> PAGE_SHIFT;
> > 	struct tdp_iter iter;
> > 	u64 *sptep = NULL;
> > 
> > 	*spte = 0ull;
> > 
> > 	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> > 		/*
> > 		 * Here be a comment about the unfortunate differences between
> > 		 * the TDP MMU and the legacy MMU.
> > 		 */
> > 		sptep = (u64 * __force)iter.sptep;
> 
> Instead, should this be:
> 
> 		sptep = rcu_dereference(iter.sptep);
> 
> ?

It's not wrong per se, but it's cheating in some sense.

The problem is that it's not the pointer itself that's RCU-protected, but
rather what it points at (the page tables).  E.g. if this were reading the
_value_, it would be a-ok to do:

	spte = READ_ONCE(*rcu_dereference(iter.sptep));

and return the value because the caller gets a copy of the RCU-protected data.

Reading multiple times from a single dereference is also ok, e.g. see
kvm_bitmap_or_dest_vcpus() and several other APIC helpers, but note how they all
take and release the RCU lock in a single block and don't return the protected
pointer to the caller.

Stripping the __rcu annotation without actually being able to guarantee that
the caller is going to do the right thing with the pointer is why I say it's
"cheating".  E.g. if the caller were to do:

  u64 get_spte_lockless_broken(...)
  {
	u64 *sptep;

	rcu_read_lock();
	sptep = kvm_tdp_mmu_get_spte_lockless(...);
	rcu_read_unlock();

	return READ_ONCE(*sptep);
  }

it would violate RCU but Sparse would be none the wiser.

The __force ugliness is also cheating, but it's also loudly stating that we're
intentionally cheating, hence the request for a comment.
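
E.g. the quoted helper with the comment actually written out might look
something like this (the wording is mine, just to show the shape of the
comment I'm asking for):

u64 *kvm_tdp_mmu_get_spte_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *spte)
{
	struct kvm_mmu *mmu = vcpu->arch.mmu;
	gfn_t gfn = addr >> PAGE_SHIFT;
	struct tdp_iter iter;
	u64 *sptep = NULL;

	*spte = 0ull;

	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
		/*
		 * Strip the __rcu annotation: the legacy MMU doesn't use RCU
		 * to protect its page tables, so the common caller can't
		 * carry it.  The returned pointer is valid only between
		 * walk_shadow_page_lockless_begin() and _end(), which is
		 * what actually keeps the walk safe for both MMUs.
		 */
		sptep = (u64 * __force)iter.sptep;
		*spte = iter.old_spte;
	}
	return sptep;
}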

> > 		*spte = iter.old_spte;
> > 	}
> > 	return sptep;
> > }
> > 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU
  2021-06-14 21:08   ` David Matlack
@ 2021-06-15  7:16     ` Paolo Bonzini
  2021-06-16 19:27       ` David Matlack
  0 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2021-06-15  7:16 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Junaid Shahid,
	Andrew Jones

On 14/06/21 23:08, David Matlack wrote:
> I actually have a set of
> patches from Ben I am planning to send soon that will reduce the number of
> redundant gfn-to-memslot lookups in the page fault path.

That seems to be a possible 5.14 candidate, while this series is 
probably a bit too much for now.

Paolo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU
  2021-06-15  7:16     ` Paolo Bonzini
@ 2021-06-16 19:27       ` David Matlack
  2021-06-16 19:31         ` Paolo Bonzini
  0 siblings, 1 reply; 32+ messages in thread
From: David Matlack @ 2021-06-16 19:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Junaid Shahid,
	Andrew Jones

On Tue, Jun 15, 2021 at 09:16:00AM +0200, Paolo Bonzini wrote:
> On 14/06/21 23:08, David Matlack wrote:
> > I actually have a set of
> > patches from Ben I am planning to send soon that will reduce the number of
> > redundant gfn-to-memslot lookups in the page fault path.
> 
> That seems to be a possible 5.14 candidate, while this series is probably a
> bit too much for now.

Thanks for the feedback. I am not in a rush to get either series into
5.14 so that sounds fine with me. Here is how I am planning to proceed:

1. Send a new series with the cleanups to is_tdp_mmu_root Sean suggested
   in patch 1/8 [1].
2. Send v2 of the TDP MMU Fast Page Fault series without patch 1/8.
3. Send out the memslot lookup optimization series.

Does that sound reasonable to you? Do you have any reservations about
taking (2) before (3)?

[1] https://lore.kernel.org/kvm/YMepDK40DLkD4DSy@google.com/

> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU
  2021-06-16 19:27       ` David Matlack
@ 2021-06-16 19:31         ` Paolo Bonzini
  0 siblings, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2021-06-16 19:31 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Junaid Shahid,
	Andrew Jones

On 16/06/21 21:27, David Matlack wrote:
>>> I actually have a set of
>>> patches from Ben I am planning to send soon that will reduce the number of
>>> redundant gfn-to-memslot lookups in the page fault path.
>> That seems to be a possible 5.14 candidate, while this series is probably a
>> bit too much for now.
> Thanks for the feedback. I am not in a rush to get either series into
> 5.14 so that sounds fine with me. Here is how I am planning to proceed:
> 
> 1. Send a new series with the cleanups to is_tdp_mmu_root Sean suggested
>     in patch 1/8 [1].
> 2. Send v2 of the TDP MMU Fast Page Fault series without patch 1/8.
> 3. Send out the memslot lookup optimization series.
> 
> Does that sound reasonable to you? Do you have any reservations with
> taking (2) before (3)?
> 
> [1]https://lore.kernel.org/kvm/YMepDK40DLkD4DSy@google.com/

They all seem reasonably independent, so use the order that is easier 
for you.

Paolo


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-06-16 19:31 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-11 23:56 [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
2021-06-11 23:56 ` [PATCH 1/8] KVM: x86/mmu: Refactor is_tdp_mmu_root() David Matlack
2021-06-14 17:56   ` Ben Gardon
2021-06-14 19:07   ` Sean Christopherson
2021-06-14 21:23     ` David Matlack
2021-06-14 21:39       ` Sean Christopherson
2021-06-14 22:01         ` David Matlack
2021-06-11 23:56 ` [PATCH 2/8] KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault David Matlack
2021-06-14 17:56   ` Ben Gardon
2021-06-11 23:56 ` [PATCH 3/8] KVM: x86/mmu: Fix use of enums in trace_fast_page_fault David Matlack
2021-06-11 23:56 ` [PATCH 4/8] KVM: x86/mmu: Common API for lockless shadow page walks David Matlack
2021-06-14 17:56   ` Ben Gardon
2021-06-11 23:56 ` [PATCH 5/8] KVM: x86/mmu: Also record spteps in shadow_page_walk David Matlack
2021-06-14 17:56   ` Ben Gardon
2021-06-14 22:27   ` David Matlack
2021-06-14 22:59   ` Sean Christopherson
2021-06-14 23:39     ` David Matlack
2021-06-15  0:22       ` Sean Christopherson
2021-06-11 23:56 ` [PATCH 6/8] KVM: x86/mmu: fast_page_fault support for the TDP MMU David Matlack
2021-06-11 23:59   ` David Matlack
2021-06-14 17:56     ` Ben Gardon
2021-06-14 22:34       ` David Matlack
2021-06-11 23:57 ` [PATCH 7/8] KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing David Matlack
2021-06-14 17:56   ` Ben Gardon
2021-06-11 23:57 ` [PATCH 8/8] KVM: selftests: Introduce access_tracking_perf_test David Matlack
2021-06-14 17:56   ` Ben Gardon
2021-06-14 21:47     ` David Matlack
2021-06-14  9:54 ` [PATCH 0/8] KVM: x86/mmu: Fast page fault support for the TDP MMU Paolo Bonzini
2021-06-14 21:08   ` David Matlack
2021-06-15  7:16     ` Paolo Bonzini
2021-06-16 19:27       ` David Matlack
2021-06-16 19:31         ` Paolo Bonzini
