All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU
@ 2021-06-30 21:47 David Matlack
  2021-06-30 21:47 ` [PATCH v2 1/6] KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault David Matlack
                   ` (7 more replies)
  0 siblings, 8 replies; 33+ messages in thread
From: David Matlack @ 2021-06-30 21:47 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, Matthew Wilcox, Yu Zhao,
	David Hildenbrand, Andrew Morton, David Matlack

This patch series adds support for the TDP MMU in the fast_page_fault
path, which enables certain write-protection and access tracking faults
to be handled without taking the KVM MMU lock. This series brings the
performance of these faults up to par with the legacy MMU.

Since there is not currently any KVM test coverage for access tracking
faults, this series introduces a new KVM selftest,
access_tracking_perf_test. Note that this test relies on page_idle to
enable access tracking from userspace (since it is the only available
usersapce API to do so) and page_idle is being considered for removal
from Linux
(https://lore.kernel.org/linux-mm/20210612000714.775825-1-willy@infradead.org/).

Design
------

This series enables the existing fast_page_fault handler to operate
independent of whether the TDP MMU is enabled or not by abstracting out
the details behind a new lockless page walk API.

An alternative design considered was to add a separate fast_page_fault
handler to the TDP MMU. The code that inspects the spte and genereates
the new spte can be shared with the legacy MMU. However with this
design the retry loop has to be duplicated, there are many calls back
and forth between mmu.c and tdp_mmu.c, and passing around the RET_PF_*
values gets complicated.

Testing
-------

This series was tested on an Intel Cascade Lake machine. The kvm_intel
parameters eptad and pml were disabled to force access and dirty
tracking to go through fast_page_fault. All tests were run with the
TDP MMU enabled and then again disabled.

Tests ran:
 - All KVM selftests with default arguments
 - All x86_64 kvm-unit-tests.
 - ./access_tracking_perf_test -v 4
 - ./access_tracking_perf_test -v 4 -o
 - ./access_tracking_perf_test -v 4 -s anonymous_thp
 - ./access_tracking_perf_test -v 4 -s anonymous_thp -o
 - ./access_tracking_perf_test -v 64
 - ./dirty_log_perf_test -v 4
 - ./dirty_log_perf_test -v 4 -o
 - ./dirty_log_perf_test -v 4 -s anonymous_thp
 - ./dirty_log_perf_test -v 4 -s anonymous_thp -o
 - ./dirty_log_perf_test -v 64

For certain tests I also collected the fast_page_fault tracepoint to
manually make sure it was getting triggered properly:

  perf record -e kvmmmu:fast_page_fault --filter "old_spte != 0" -- <test>

Performance Results
-------------------

To measure performance I ran dirty_log_perf_test and
access_tracking_perf_test with 64 vCPUs. For dirty_log_perf_test
performance is measured by "Iteration 2 dirty memory time", the time it
takes for all vCPUs to write to their memory after it has been
write-protected. For access_tracking_perf_test performance is measured
by "Writing to idle memory", the time it takes for all vCPUs to write to
their memory after it has been access-protected.

Metric                            | tdp_mmu=Y before   | tdp_mmu=Y after
--------------------------------- | ------------------ | --------------------
Iteration 2 dirty memory time     | 3.545234984s       | 0.315209625s
Writing to idle memory            | 3.249645416s       | 0.301676189s

The performance improvement comes from less time spent acquiring the
mmu lock in read mode and less time looking up the memslot for the
faulting gpa.

The TDP MMU is now on par with the legacy MMU:

Metric                            | tdp_mmu=N          | tdp_mmu=Y
--------------------------------- | ------------------ | --------------------
Iteration 2 dirty memory time     | 0.303452990s       | 0.315209625s
Writing to idle memory            | 0.291742127s       | 0.301676189s

v2:
 * Split is_tdp_mmu_root cleanup into a separate series. [Sean]
   https://lore.kernel.org/kvm/20210617231948.2591431-1-dmatlack@google.com/
 * Split walk_shadow_page_lockless into 2 APIs. [Sean]
 * Perform rcu_dereference on TDP MMU sptep.
 * Add comment to tdp_mmu_set_spte_atomic explaining new interaction
 * with fast_pf_fix_direct_spte. [Ben]
 * Document pagemap shifts in access_tracking_perf_test. [Ben]
 * Skip test if lacking pagemap permissions (present pfn is 0). [Ben]
 * Add Ben's Reviewed-by tags.

v1: https://lore.kernel.org/kvm/20210611235701.3941724-1-dmatlack@google.com/

David Matlack (6):
  KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault
  KVM: x86/mmu: Fix use of enums in trace_fast_page_fault
  KVM: x86/mmu: Make walk_shadow_page_lockless_{begin,end} interoperate
    with the TDP MMU
  KVM: x86/mmu: fast_page_fault support for the TDP MMU
  KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing
  KVM: selftests: Introduce access_tracking_perf_test

 arch/x86/kvm/mmu/mmu.c                        |  83 +++-
 arch/x86/kvm/mmu/mmutrace.h                   |   3 +
 arch/x86/kvm/mmu/tdp_mmu.c                    |  56 ++-
 arch/x86/kvm/mmu/tdp_mmu.h                    |   8 +-
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/access_tracking_perf_test.c | 429 ++++++++++++++++++
 .../selftests/kvm/dirty_log_perf_test.c       |   1 +
 8 files changed, 550 insertions(+), 32 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/access_tracking_perf_test.c

-- 
2.32.0.93.g670b81a890-goog


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v2 1/6] KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault
  2021-06-30 21:47 [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
@ 2021-06-30 21:47 ` David Matlack
  2021-07-12 20:01   ` Sean Christopherson
  2021-06-30 21:47 ` [PATCH v2 2/6] KVM: x86/mmu: Fix use of enums in trace_fast_page_fault David Matlack
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 33+ messages in thread
From: David Matlack @ 2021-06-30 21:47 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, Matthew Wilcox, Yu Zhao,
	David Hildenbrand, Andrew Morton, David Matlack

fast_page_fault is only called from direct_page_fault where we know the
address is a gpa.

Fixes: 736c291c9f36 ("KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM")
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b888385d1933..45274436d3c0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3098,8 +3098,7 @@ static bool is_access_allowed(u32 fault_err_code, u64 spte)
 /*
  * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS.
  */
-static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
-			   u32 error_code)
+static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 {
 	struct kvm_shadow_walk_iterator iterator;
 	struct kvm_mmu_page *sp;
@@ -3115,7 +3114,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	do {
 		u64 new_spte;
 
-		for_each_shadow_entry_lockless(vcpu, cr2_or_gpa, iterator, spte)
+		for_each_shadow_entry_lockless(vcpu, gpa, iterator, spte)
 			if (!is_shadow_present_pte(spte))
 				break;
 
@@ -3194,8 +3193,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
 	} while (true);
 
-	trace_fast_page_fault(vcpu, cr2_or_gpa, error_code, iterator.sptep,
-			      spte, ret);
+	trace_fast_page_fault(vcpu, gpa, error_code, iterator.sptep, spte, ret);
 	walk_shadow_page_lockless_end(vcpu);
 
 	return ret;
-- 
2.32.0.93.g670b81a890-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 2/6] KVM: x86/mmu: Fix use of enums in trace_fast_page_fault
  2021-06-30 21:47 [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
  2021-06-30 21:47 ` [PATCH v2 1/6] KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault David Matlack
@ 2021-06-30 21:47 ` David Matlack
  2021-07-12 16:14   ` Ben Gardon
  2021-06-30 21:47 ` [PATCH v2 3/6] KVM: x86/mmu: Make walk_shadow_page_lockless_{begin,end} interoperate with the TDP MMU David Matlack
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 33+ messages in thread
From: David Matlack @ 2021-06-30 21:47 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, Matthew Wilcox, Yu Zhao,
	David Hildenbrand, Andrew Morton, David Matlack

Enum values have to be exported to userspace since the formatting is not
done in the kernel. Without doing this perf maps RET_PF_FIXED and
RET_PF_SPURIOUS to 0, which results in incorrect output:

  $ perf record -a -e kvmmmu:fast_page_fault --filter "ret==3" -- ./access_tracking_perf_test
  $ perf script | head -1
   [...] new 610006048d25877 spurious 0 fixed 0  <------ should be 1

Fix this by exporting the enum values to userspace with TRACE_DEFINE_ENUM.

Fixes: c4371c2a682e ("KVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed")
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmutrace.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index efbad33a0645..55c7e0fcda52 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -244,6 +244,9 @@ TRACE_EVENT(
 		  __entry->access)
 );
 
+TRACE_DEFINE_ENUM(RET_PF_FIXED);
+TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
+
 TRACE_EVENT(
 	fast_page_fault,
 	TP_PROTO(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u32 error_code,
-- 
2.32.0.93.g670b81a890-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 3/6] KVM: x86/mmu: Make walk_shadow_page_lockless_{begin,end} interoperate with the TDP MMU
  2021-06-30 21:47 [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
  2021-06-30 21:47 ` [PATCH v2 1/6] KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault David Matlack
  2021-06-30 21:47 ` [PATCH v2 2/6] KVM: x86/mmu: Fix use of enums in trace_fast_page_fault David Matlack
@ 2021-06-30 21:47 ` David Matlack
  2021-07-12 17:02   ` Ben Gardon
  2021-07-12 20:23   ` Sean Christopherson
  2021-06-30 21:48 ` [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for " David Matlack
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 33+ messages in thread
From: David Matlack @ 2021-06-30 21:47 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, Matthew Wilcox, Yu Zhao,
	David Hildenbrand, Andrew Morton, David Matlack

Acquire the RCU read lock in walk_shadow_page_lockless_begin and release
it in walk_shadow_page_lockless_end when the TDP MMU is enabled.  This
should not introduce any functional changes but is used in the following
commit to make fast_page_fault interoperate with the TDP MMU.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 24 ++++++++++++++++++------
 arch/x86/kvm/mmu/tdp_mmu.c | 20 ++++++++++++++------
 arch/x86/kvm/mmu/tdp_mmu.h |  6 ++++--
 3 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 45274436d3c0..88c71a8a55f1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -686,6 +686,11 @@ static bool mmu_spte_age(u64 *sptep)
 
 static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
 {
+	if (is_tdp_mmu(vcpu->arch.mmu)) {
+		kvm_tdp_mmu_walk_lockless_begin();
+		return;
+	}
+
 	/*
 	 * Prevent page table teardown by making any free-er wait during
 	 * kvm_flush_remote_tlbs() IPI to all active vcpus.
@@ -701,6 +706,11 @@ static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
 
 static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 {
+	if (is_tdp_mmu(vcpu->arch.mmu)) {
+		kvm_tdp_mmu_walk_lockless_end();
+		return;
+	}
+
 	/*
 	 * Make sure the write to vcpu->mode is not reordered in front of
 	 * reads to sptes.  If it does, kvm_mmu_commit_zap_page() can see us
@@ -3621,6 +3631,12 @@ static int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level
 
 	walk_shadow_page_lockless_begin(vcpu);
 
+	if (is_tdp_mmu(vcpu->arch.mmu)) {
+		leaf = kvm_tdp_mmu_get_walk_lockless(vcpu, addr, sptes,
+						     root_level);
+		goto out;
+	}
+
 	for (shadow_walk_init(&iterator, vcpu, addr),
 	     *root_level = iterator.level;
 	     shadow_walk_okay(&iterator);
@@ -3634,8 +3650,8 @@ static int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level
 			break;
 	}
 
+out:
 	walk_shadow_page_lockless_end(vcpu);
-
 	return leaf;
 }
 
@@ -3647,11 +3663,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
 	int root, leaf, level;
 	bool reserved = false;
 
-	if (is_tdp_mmu(vcpu->arch.mmu))
-		leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root);
-	else
-		leaf = get_walk(vcpu, addr, sptes, &root);
-
+	leaf = get_walk(vcpu, addr, sptes, &root);
 	if (unlikely(leaf < 0)) {
 		*sptep = 0ull;
 		return reserved;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index caac4ddb46df..c6fa8d00bf9f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1513,12 +1513,24 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 	return spte_set;
 }
 
+void kvm_tdp_mmu_walk_lockless_begin(void)
+{
+	rcu_read_lock();
+}
+
+void kvm_tdp_mmu_walk_lockless_end(void)
+{
+	rcu_read_unlock();
+}
+
 /*
  * Return the level of the lowest level SPTE added to sptes.
  * That SPTE may be non-present.
+ *
+ * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
  */
-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
-			 int *root_level)
+int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+				  int *root_level)
 {
 	struct tdp_iter iter;
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
@@ -1527,14 +1539,10 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 
 	*root_level = vcpu->arch.mmu->shadow_root_level;
 
-	rcu_read_lock();
-
 	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
 		leaf = iter.level;
 		sptes[leaf] = iter.old_spte;
 	}
 
-	rcu_read_unlock();
-
 	return leaf;
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 1cae4485b3bc..e9dde5f9c0ef 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -77,8 +77,10 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
 				   int min_level);
 
-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
-			 int *root_level);
+void kvm_tdp_mmu_walk_lockless_begin(void);
+void kvm_tdp_mmu_walk_lockless_end(void);
+int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+				  int *root_level);
 
 #ifdef CONFIG_X86_64
 bool kvm_mmu_init_tdp_mmu(struct kvm *kvm);
-- 
2.32.0.93.g670b81a890-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for the TDP MMU
  2021-06-30 21:47 [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
                   ` (2 preceding siblings ...)
  2021-06-30 21:47 ` [PATCH v2 3/6] KVM: x86/mmu: Make walk_shadow_page_lockless_{begin,end} interoperate with the TDP MMU David Matlack
@ 2021-06-30 21:48 ` David Matlack
  2021-07-01  2:54     ` kernel test robot
                     ` (3 more replies)
  2021-06-30 21:48 ` [PATCH v2 5/6] KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing David Matlack
                   ` (3 subsequent siblings)
  7 siblings, 4 replies; 33+ messages in thread
From: David Matlack @ 2021-06-30 21:48 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, Matthew Wilcox, Yu Zhao,
	David Hildenbrand, Andrew Morton, David Matlack

Make fast_page_fault interoperate with the TDP MMU by leveraging
walk_shadow_page_lockless_{begin,end} to acquire the RCU read lock and
introducing a new helper function kvm_tdp_mmu_get_last_sptep_lockless to
grab the lowest level sptep.

Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 55 +++++++++++++++++++++++++++-----------
 arch/x86/kvm/mmu/tdp_mmu.c | 36 +++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h |  2 ++
 3 files changed, 78 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 88c71a8a55f1..1d410278a4cc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3105,15 +3105,45 @@ static bool is_access_allowed(u32 fault_err_code, u64 spte)
 	return spte & PT_PRESENT_MASK;
 }
 
+/*
+ * Returns the last level spte pointer of the shadow page walk for the given
+ * gpa, and sets *spte to the spte value. This spte may be non-preset.
+ *
+ * If no walk could be performed, returns NULL and *spte does not contain valid
+ * data.
+ *
+ * Constraints:
+ *  - Must be called between walk_shadow_page_lockless_{begin,end}.
+ *  - The returned sptep must not be used after walk_shadow_page_lockless_end.
+ */
+u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
+{
+	struct kvm_shadow_walk_iterator iterator;
+	u64 old_spte;
+	u64 *sptep = NULL;
+
+	if (is_tdp_mmu(vcpu->arch.mmu))
+		return kvm_tdp_mmu_get_last_sptep_lockless(vcpu, gpa, spte);
+
+	for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) {
+		sptep = iterator.sptep;
+		*spte = old_spte;
+
+		if (!is_shadow_present_pte(old_spte))
+			break;
+	}
+
+	return sptep;
+}
+
 /*
  * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS.
  */
 static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 {
-	struct kvm_shadow_walk_iterator iterator;
-	struct kvm_mmu_page *sp;
 	int ret = RET_PF_INVALID;
 	u64 spte = 0ull;
+	u64 *sptep = NULL;
 	uint retry_count = 0;
 
 	if (!page_fault_can_be_fast(error_code))
@@ -3122,16 +3152,14 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 	walk_shadow_page_lockless_begin(vcpu);
 
 	do {
+		struct kvm_mmu_page *sp;
 		u64 new_spte;
 
-		for_each_shadow_entry_lockless(vcpu, gpa, iterator, spte)
-			if (!is_shadow_present_pte(spte))
-				break;
-
+		sptep = get_last_sptep_lockless(vcpu, gpa, &spte);
 		if (!is_shadow_present_pte(spte))
 			break;
 
-		sp = sptep_to_sp(iterator.sptep);
+		sp = sptep_to_sp(sptep);
 		if (!is_last_spte(spte, sp->role.level))
 			break;
 
@@ -3189,8 +3217,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 		 * since the gfn is not stable for indirect shadow page. See
 		 * Documentation/virt/kvm/locking.rst to get more detail.
 		 */
-		if (fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte,
-					    new_spte)) {
+		if (fast_pf_fix_direct_spte(vcpu, sp, sptep, spte, new_spte)) {
 			ret = RET_PF_FIXED;
 			break;
 		}
@@ -3203,7 +3230,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
 
 	} while (true);
 
-	trace_fast_page_fault(vcpu, gpa, error_code, iterator.sptep, spte, ret);
+	trace_fast_page_fault(vcpu, gpa, error_code, sptep, spte, ret);
 	walk_shadow_page_lockless_end(vcpu);
 
 	return ret;
@@ -3838,11 +3865,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	if (page_fault_handle_page_track(vcpu, error_code, gfn))
 		return RET_PF_EMULATE;
 
-	if (!is_tdp_mmu_fault) {
-		r = fast_page_fault(vcpu, gpa, error_code);
-		if (r != RET_PF_INVALID)
-			return r;
-	}
+	r = fast_page_fault(vcpu, gpa, error_code);
+	if (r != RET_PF_INVALID)
+		return r;
 
 	r = mmu_topup_memory_caches(vcpu, false);
 	if (r)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c6fa8d00bf9f..2c9e0ed71fa0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -527,6 +527,10 @@ static inline bool tdp_mmu_set_spte_atomic_no_dirty_log(struct kvm *kvm,
 	if (is_removed_spte(iter->old_spte))
 		return false;
 
+	/*
+	 * TDP MMU sptes can also be concurrently cmpxchg'd in
+	 * fast_pf_fix_direct_spte as part of fast_page_fault.
+	 */
 	if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
 		      new_spte) != iter->old_spte)
 		return false;
@@ -1546,3 +1550,35 @@ int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 
 	return leaf;
 }
+
+/*
+ * Must be called between kvm_tdp_mmu_walk_shadow_page_lockless_{begin,end}.
+ *
+ * The returned sptep must not be used after
+ * kvm_tdp_mmu_walk_shadow_page_lockless_end.
+ */
+u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
+					 u64 *spte)
+{
+	struct tdp_iter iter;
+	struct kvm_mmu *mmu = vcpu->arch.mmu;
+	gfn_t gfn = addr >> PAGE_SHIFT;
+	tdp_ptep_t sptep = NULL;
+
+	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+		*spte = iter.old_spte;
+		sptep = iter.sptep;
+	}
+
+	if (sptep)
+		/*
+		 * Perform the rcu dereference here since we are passing the
+		 * sptep up to the generic MMU code which does not know the
+		 * synchronization details of the TDP MMU. This is safe as long
+		 * as the caller obeys the contract that the sptep is not used
+		 * after kvm_tdp_mmu_walk_shadow_page_lockless_end.
+		 */
+		return rcu_dereference(sptep);
+
+	return NULL;
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index e9dde5f9c0ef..508a23bdf7da 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -81,6 +81,8 @@ void kvm_tdp_mmu_walk_lockless_begin(void);
 void kvm_tdp_mmu_walk_lockless_end(void);
 int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 				  int *root_level);
+u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
+					 u64 *spte);
 
 #ifdef CONFIG_X86_64
 bool kvm_mmu_init_tdp_mmu(struct kvm *kvm);
-- 
2.32.0.93.g670b81a890-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 5/6] KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing
  2021-06-30 21:47 [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
                   ` (3 preceding siblings ...)
  2021-06-30 21:48 ` [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for " David Matlack
@ 2021-06-30 21:48 ` David Matlack
  2021-06-30 21:48 ` [PATCH v2 6/6] KVM: selftests: Introduce access_tracking_perf_test David Matlack
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: David Matlack @ 2021-06-30 21:48 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, Matthew Wilcox, Yu Zhao,
	David Hildenbrand, Andrew Morton, David Matlack

There is a missing break statement which causes a fallthrough to the
next statement where optarg will be null and a segmentation fault will
be generated.

Fixes: 9e965bb75aae ("KVM: selftests: Add backing src parameter to dirty_log_perf_test")
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 tools/testing/selftests/kvm/dirty_log_perf_test.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index 04a2641261be..80cbd3a748c0 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -312,6 +312,7 @@ int main(int argc, char *argv[])
 			break;
 		case 'o':
 			p.partition_vcpu_memory_access = false;
+			break;
 		case 's':
 			p.backing_src = parse_backing_src_type(optarg);
 			break;
-- 
2.32.0.93.g670b81a890-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 6/6] KVM: selftests: Introduce access_tracking_perf_test
  2021-06-30 21:47 [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
                   ` (4 preceding siblings ...)
  2021-06-30 21:48 ` [PATCH v2 5/6] KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing David Matlack
@ 2021-06-30 21:48 ` David Matlack
  2021-07-01  1:15 ` [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU Matthew Wilcox
  2021-07-01 17:00 ` David Hildenbrand
  7 siblings, 0 replies; 33+ messages in thread
From: David Matlack @ 2021-06-30 21:48 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, Matthew Wilcox, Yu Zhao,
	David Hildenbrand, Andrew Morton, David Matlack

This test measures the performance effects of KVM's access tracking.
Access tracking is driven by the MMU notifiers test_young, clear_young,
and clear_flush_young. These notifiers do not have a direct userspace
API, however the clear_young notifier can be triggered by marking a
pages as idle in /sys/kernel/mm/page_idle/bitmap. This test leverages
that mechanism to enable access tracking on guest memory.

To measure performance this test runs a VM with a configurable number of
vCPUs that each touch every page in disjoint regions of memory.
Performance is measured in the time it takes all vCPUs to finish
touching their predefined region.

Example invocation:

  $ ./access_tracking_perf_test -v 8
  Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
  guest physical test memory offset: 0xffdfffff000

  Populating memory             : 1.337752570s
  Writing to populated memory   : 0.010177640s
  Reading from populated memory : 0.009548239s
  Mark memory idle              : 23.973131748s
  Writing to idle memory        : 0.063584496s
  Mark memory idle              : 24.924652964s
  Reading from idle memory      : 0.062042814s

Breaking down the results:

 * "Populating memory": The time it takes for all vCPUs to perform the
   first write to every page in their region.

 * "Writing to populated memory" / "Reading from populated memory": The
   time it takes for all vCPUs to write and read to every page in their
   region after it has been populated. This serves as a control for the
   later results.

 * "Mark memory idle": The time it takes for every vCPU to mark every
   page in their region as idle through page_idle.

 * "Writing to idle memory" / "Reading from idle memory": The time it
   takes for all vCPUs to write and read to every page in their region
   after it has been marked idle.

This test should be portable across architectures but it is only enabled
for x86_64 since that's all I have tested.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/access_tracking_perf_test.c | 429 ++++++++++++++++++
 3 files changed, 431 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/access_tracking_perf_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index d5bc9bf3b528..e16cbe7b9ce3 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -38,6 +38,7 @@
 /x86_64/xen_vmcall_test
 /x86_64/xss_msr_test
 /x86_64/vmx_pmu_msrs_test
+/access_tracking_perf_test
 /demand_paging_test
 /dirty_log_test
 /dirty_log_perf_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index bc65c57ae40d..e1ce404f3c67 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -71,6 +71,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_pmu_msrs_test
 TEST_GEN_PROGS_x86_64 += x86_64/xen_shinfo_test
 TEST_GEN_PROGS_x86_64 += x86_64/xen_vmcall_test
+TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
new file mode 100644
index 000000000000..e2baa187a21e
--- /dev/null
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -0,0 +1,429 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * access_tracking_perf_test
+ *
+ * Copyright (C) 2021, Google, Inc.
+ *
+ * This test measures the performance effects of KVM's access tracking.
+ * Access tracking is driven by the MMU notifiers test_young, clear_young, and
+ * clear_flush_young. These notifiers do not have a direct userspace API,
+ * however the clear_young notifier can be triggered by marking a pages as idle
+ * in /sys/kernel/mm/page_idle/bitmap. This test leverages that mechanism to
+ * enable access tracking on guest memory.
+ *
+ * To measure performance this test runs a VM with a configurable number of
+ * vCPUs that each touch every page in disjoint regions of memory. Performance
+ * is measured in the time it takes all vCPUs to finish touching their
+ * predefined region.
+ *
+ * Note that a deterministic correctness test of access tracking is not possible
+ * by using page_idle as it exists today. This is for a few reasons:
+ *
+ * 1. page_idle only issues clear_young notifiers, which lack a TLB flush. This
+ *    means subsequent guest accesses are not guaranteed to see page table
+ *    updates made by KVM until some time in the future.
+ *
+ * 2. page_idle only operates on LRU pages. Newly allocated pages are not
+ *    immediately allocated to LRU lists. Instead they are held in a "pagevec",
+ *    which is drained to LRU lists some time in the future. There is no
+ *    userspace API to force this drain to occur.
+ *
+ * These limitations are worked around in this test by using a large enough
+ * region of memory for each vCPU such that the number of translations cached in
+ * the TLB and the number of pages held in pagevecs are a small fraction of the
+ * overall workload. And if either of those conditions are not true this test
+ * will fail rather than silently passing.
+ */
+#include <inttypes.h>
+#include <limits.h>
+#include <pthread.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include "kvm_util.h"
+#include "test_util.h"
+#include "perf_test_util.h"
+#include "guest_modes.h"
+
+/* Global variable used to synchronize all of the vCPU threads. */
+static int iteration = -1;
+
+/* Defines what vCPU threads should do during a given iteration. */
+static enum {
+	/* Run the vCPU to access all its memory. */
+	ITERATION_ACCESS_MEMORY,
+	/* Mark the vCPU's memory idle in page_idle. */
+	ITERATION_MARK_IDLE,
+} iteration_work;
+
+/* Set to true when vCPU threads should exit. */
+static bool done;
+
+/* The iteration that was last completed by each vCPU. */
+static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
+
+/* Whether to overlap the regions of memory vCPUs access. */
+static bool overlap_memory_access;
+
+struct test_params {
+	/* The backing source for the region of memory. */
+	enum vm_mem_backing_src_type backing_src;
+
+	/* The amount of memory to allocate for each vCPU. */
+	uint64_t vcpu_memory_bytes;
+
+	/* The number of vCPUs to create in the VM. */
+	int vcpus;
+};
+
+static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
+{
+	uint64_t value;
+	off_t offset = index * sizeof(value);
+
+	TEST_ASSERT(pread(fd, &value, sizeof(value), offset) == sizeof(value),
+		    "pread from %s offset 0x%" PRIx64 " failed!",
+		    filename, offset);
+
+	return value;
+
+}
+
+#define PAGEMAP_PRESENT (1ULL << 63)
+#define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)
+
+static uint64_t lookup_pfn(int pagemap_fd, struct kvm_vm *vm, uint64_t gva)
+{
+	uint64_t hva = (uint64_t) addr_gva2hva(vm, gva);
+	uint64_t entry;
+	uint64_t pfn;
+
+	entry = pread_uint64(pagemap_fd, "pagemap", hva / getpagesize());
+	if (!(entry & PAGEMAP_PRESENT))
+		return 0;
+
+	pfn = entry & PAGEMAP_PFN_MASK;
+	if (!pfn) {
+		print_skip("Looking up PFNs requires CAP_SYS_ADMIN");
+		exit(KSFT_SKIP);
+	}
+
+	return pfn;
+}
+
+static bool is_page_idle(int page_idle_fd, uint64_t pfn)
+{
+	uint64_t bits = pread_uint64(page_idle_fd, "page_idle", pfn / 64);
+
+	return !!((bits >> (pfn % 64)) & 1);
+}
+
+static void mark_page_idle(int page_idle_fd, uint64_t pfn)
+{
+	uint64_t bits = 1ULL << (pfn % 64);
+
+	TEST_ASSERT(pwrite(page_idle_fd, &bits, 8, 8 * (pfn / 64)) == 8,
+		    "Set page_idle bits for PFN 0x%" PRIx64, pfn);
+}
+
+static void mark_vcpu_memory_idle(struct kvm_vm *vm, int vcpu_id)
+{
+	uint64_t base_gva = perf_test_args.vcpu_args[vcpu_id].gva;
+	uint64_t pages = perf_test_args.vcpu_args[vcpu_id].pages;
+	uint64_t page;
+	uint64_t still_idle = 0;
+	uint64_t no_pfn = 0;
+	int page_idle_fd;
+	int pagemap_fd;
+
+	/* If vCPUs are using an overlapping region, let vCPU 0 mark it idle. */
+	if (overlap_memory_access && vcpu_id)
+		return;
+
+	page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
+	TEST_ASSERT(page_idle_fd > 0, "Failed to open page_idle.");
+
+	pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+	TEST_ASSERT(pagemap_fd > 0, "Failed to open pagemap.");
+
+	for (page = 0; page < pages; page++) {
+		uint64_t gva = base_gva + page * perf_test_args.guest_page_size;
+		uint64_t pfn = lookup_pfn(pagemap_fd, vm, gva);
+
+		if (!pfn) {
+			no_pfn++;
+			continue;
+		}
+
+		if (is_page_idle(page_idle_fd, pfn)) {
+			still_idle++;
+			continue;
+		}
+
+		mark_page_idle(page_idle_fd, pfn);
+	}
+
+	/*
+	 * Assumption: Less than 1% of pages are going to be swapped out from
+	 * under us during this test.
+	 */
+	TEST_ASSERT(no_pfn < pages / 100,
+		    "vCPU %d: No PFN for %" PRIu64 " out of %" PRIu64 " pages.",
+		    vcpu_id, no_pfn, pages);
+
+	/*
+	 * Test that at least 90% of memory has been marked idle (the rest might
+	 * not be marked idle because the pages have not yet made it to an LRU
+	 * list or the translations are still cached in the TLB). 90% is
+	 * arbitrary; high enough that we ensure most memory access went through
+	 * access tracking but low enough as to not make the test too brittle
+	 * over time and across architectures.
+	 */
+	TEST_ASSERT(still_idle < pages / 10,
+		    "vCPU%d: Too many pages still idle (%"PRIu64 " out of %"
+		    PRIu64 ").\n",
+		    vcpu_id, still_idle, pages);
+
+	close(page_idle_fd);
+	close(pagemap_fd);
+}
+
+static void assert_ucall(struct kvm_vm *vm, uint32_t vcpu_id,
+			 uint64_t expected_ucall)
+{
+	struct ucall uc;
+	uint64_t actual_ucall = get_ucall(vm, vcpu_id, &uc);
+
+	TEST_ASSERT(expected_ucall == actual_ucall,
+		    "Guest exited unexpectedly (expected ucall %" PRIu64
+		    ", got %" PRIu64 ")",
+		    expected_ucall, actual_ucall);
+}
+
+static bool spin_wait_for_next_iteration(int *current_iteration)
+{
+	int last_iteration = *current_iteration;
+
+	do {
+		if (READ_ONCE(done))
+			return false;
+
+		*current_iteration = READ_ONCE(iteration);
+	} while (last_iteration == *current_iteration);
+
+	return true;
+}
+
+static void *vcpu_thread_main(void *arg)
+{
+	struct perf_test_vcpu_args *vcpu_args = arg;
+	struct kvm_vm *vm = perf_test_args.vm;
+	int vcpu_id = vcpu_args->vcpu_id;
+	int current_iteration = -1;
+
+	vcpu_args_set(vm, vcpu_id, 1, vcpu_id);
+
+	while (spin_wait_for_next_iteration(&current_iteration)) {
+		switch (READ_ONCE(iteration_work)) {
+		case ITERATION_ACCESS_MEMORY:
+			vcpu_run(vm, vcpu_id);
+			assert_ucall(vm, vcpu_id, UCALL_SYNC);
+			break;
+		case ITERATION_MARK_IDLE:
+			mark_vcpu_memory_idle(vm, vcpu_id);
+			break;
+		};
+
+		vcpu_last_completed_iteration[vcpu_id] = current_iteration;
+	}
+
+	return NULL;
+}
+
+static void spin_wait_for_vcpu(int vcpu_id, int target_iteration)
+{
+	while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) !=
+	       target_iteration) {
+		continue;
+	}
+}
+
+/* The type of memory accesses to perform in the VM. */
+enum access_type {
+	ACCESS_READ,
+	ACCESS_WRITE,
+};
+
+static void run_iteration(struct kvm_vm *vm, int vcpus, const char *description)
+{
+	struct timespec ts_start;
+	struct timespec ts_elapsed;
+	int next_iteration;
+	int vcpu_id;
+
+	/* Kick off the vCPUs by incrementing iteration. */
+	next_iteration = ++iteration;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+	/* Wait for all vCPUs to finish the iteration. */
+	for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++)
+		spin_wait_for_vcpu(vcpu_id, next_iteration);
+
+	ts_elapsed = timespec_elapsed(ts_start);
+	pr_info("%-30s: %ld.%09lds\n",
+		description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+}
+
+static void access_memory(struct kvm_vm *vm, int vcpus, enum access_type access,
+			  const char *description)
+{
+	perf_test_args.wr_fract = (access == ACCESS_READ) ? INT_MAX : 1;
+	sync_global_to_guest(vm, perf_test_args);
+	iteration_work = ITERATION_ACCESS_MEMORY;
+	run_iteration(vm, vcpus, description);
+}
+
+static void mark_memory_idle(struct kvm_vm *vm, int vcpus)
+{
+	/*
+	 * Even though this parallelizes the work across vCPUs, this is still a
+	 * very slow operation because page_idle forces the test to mark one pfn
+	 * at a time and the clear_young notifier serializes on the KVM MMU
+	 * lock.
+	 */
+	pr_debug("Marking VM memory idle (slow)...\n");
+	iteration_work = ITERATION_MARK_IDLE;
+	run_iteration(vm, vcpus, "Mark memory idle");
+}
+
+static pthread_t *create_vcpu_threads(int vcpus)
+{
+	pthread_t *vcpu_threads;
+	int i;
+
+	vcpu_threads = malloc(vcpus * sizeof(vcpu_threads[0]));
+	TEST_ASSERT(vcpu_threads, "Failed to allocate vcpu_threads.");
+
+	for (i = 0; i < vcpus; i++) {
+		vcpu_last_completed_iteration[i] = iteration;
+		pthread_create(&vcpu_threads[i], NULL, vcpu_thread_main,
+			       &perf_test_args.vcpu_args[i]);
+	}
+
+	return vcpu_threads;
+}
+
+static void terminate_vcpu_threads(pthread_t *vcpu_threads, int vcpus)
+{
+	int i;
+
+	/* Set done to signal the vCPU threads to exit */
+	done = true;
+
+	for (i = 0; i < vcpus; i++)
+		pthread_join(vcpu_threads[i], NULL);
+}
+
+static void run_test(enum vm_guest_mode mode, void *arg)
+{
+	struct test_params *params = arg;
+	struct kvm_vm *vm;
+	pthread_t *vcpu_threads;
+	int vcpus = params->vcpus;
+
+	vm = perf_test_create_vm(mode, vcpus, params->vcpu_memory_bytes,
+				 params->backing_src);
+
+	perf_test_setup_vcpus(vm, vcpus, params->vcpu_memory_bytes,
+			      !overlap_memory_access);
+
+	vcpu_threads = create_vcpu_threads(vcpus);
+
+	pr_info("\n");
+	access_memory(vm, vcpus, ACCESS_WRITE, "Populating memory");
+
+	/* As a control, read and write to the populated memory first. */
+	access_memory(vm, vcpus, ACCESS_WRITE, "Writing to populated memory");
+	access_memory(vm, vcpus, ACCESS_READ, "Reading from populated memory");
+
+	/* Repeat on memory that has been marked as idle. */
+	mark_memory_idle(vm, vcpus);
+	access_memory(vm, vcpus, ACCESS_WRITE, "Writing to idle memory");
+	mark_memory_idle(vm, vcpus);
+	access_memory(vm, vcpus, ACCESS_READ, "Reading from idle memory");
+
+	terminate_vcpu_threads(vcpu_threads, vcpus);
+	free(vcpu_threads);
+	perf_test_destroy_vm(vm);
+}
+
+static void help(char *name)
+{
+	puts("");
+	printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]  [-s mem_type]\n",
+	       name);
+	puts("");
+	printf(" -h: Display this help message.");
+	guest_modes_help();
+	printf(" -b: specify the size of the memory region which should be\n"
+	       "     dirtied by each vCPU. e.g. 10M or 3G.\n"
+	       "     (default: 1G)\n");
+	printf(" -v: specify the number of vCPUs to run.\n");
+	printf(" -o: Overlap guest memory accesses instead of partitioning\n"
+	       "     them into a separate region of memory for each vCPU.\n");
+	printf(" -s: specify the type of memory that should be used to\n"
+	       "     back the guest data region.\n\n");
+	backing_src_help();
+	puts("");
+	exit(0);
+}
+
+int main(int argc, char *argv[])
+{
+	struct test_params params = {
+		.backing_src = VM_MEM_SRC_ANONYMOUS,
+		.vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
+		.vcpus = 1,
+	};
+	int page_idle_fd;
+	int opt;
+
+	guest_modes_append_default();
+
+	while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
+		switch (opt) {
+		case 'm':
+			guest_modes_cmdline(optarg);
+			break;
+		case 'b':
+			params.vcpu_memory_bytes = parse_size(optarg);
+			break;
+		case 'v':
+			params.vcpus = atoi(optarg);
+			break;
+		case 'o':
+			overlap_memory_access = true;
+			break;
+		case 's':
+			params.backing_src = parse_backing_src_type(optarg);
+			break;
+		case 'h':
+		default:
+			help(argv[0]);
+			break;
+		}
+	}
+
+	page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
+	if (page_idle_fd < 0) {
+		print_skip("CONFIG_IDLE_PAGE_TRACKING is not enabled");
+		exit(KSFT_SKIP);
+	}
+	close(page_idle_fd);
+
+	for_each_guest_mode(run_test, &params);
+
+	return 0;
+}
-- 
2.32.0.93.g670b81a890-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU
  2021-06-30 21:47 [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
                   ` (5 preceding siblings ...)
  2021-06-30 21:48 ` [PATCH v2 6/6] KVM: selftests: Introduce access_tracking_perf_test David Matlack
@ 2021-07-01  1:15 ` Matthew Wilcox
       [not found]   ` <CABgObfZUFWCAvKoxDzGjmksFnwZgbnpX9GuC+nhiVLa-Fhwj6A@mail.gmail.com>
  2021-07-01 17:00 ` David Hildenbrand
  7 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2021-07-01  1:15 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, Yu Zhao, David Hildenbrand,
	Andrew Morton

On Wed, Jun 30, 2021 at 09:47:56PM +0000, David Matlack wrote:
> This patch series adds support for the TDP MMU in the fast_page_fault

Nowhere in this message do you explain what the TDP MMU is.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for the TDP MMU
  2021-06-30 21:48 ` [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for " David Matlack
@ 2021-07-01  2:54     ` kernel test robot
  2021-07-01  4:27     ` kernel test robot
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 33+ messages in thread
From: kernel test robot @ 2021-07-01  2:54 UTC (permalink / raw)
  To: David Matlack, kvm
  Cc: kbuild-all, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones

[-- Attachment #1: Type: text/plain, Size: 2782 bytes --]

Hi David,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kvm/queue]
[also build test WARNING on linus/master next-20210630]
[cannot apply to vhost/linux-next v5.13]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/David-Matlack/KVM-x86-mmu-Fast-page-fault-support-for-the-TDP-MMU/20210701-055009
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
config: x86_64-allyesconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/7709823e6135aaf4aeac8235973f37e679064356
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review David-Matlack/KVM-x86-mmu-Fast-page-fault-support-for-the-TDP-MMU/20210701-055009
        git checkout 7709823e6135aaf4aeac8235973f37e679064356
        # save the attached .config to linux build tree
        make W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> arch/x86/kvm/mmu/mmu.c:3119:6: warning: no previous prototype for 'get_last_sptep_lockless' [-Wmissing-prototypes]
    3119 | u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
         |      ^~~~~~~~~~~~~~~~~~~~~~~


vim +/get_last_sptep_lockless +3119 arch/x86/kvm/mmu/mmu.c

  3107	
  3108	/*
  3109	 * Returns the last level spte pointer of the shadow page walk for the given
  3110	 * gpa, and sets *spte to the spte value. This spte may be non-preset.
  3111	 *
  3112	 * If no walk could be performed, returns NULL and *spte does not contain valid
  3113	 * data.
  3114	 *
  3115	 * Constraints:
  3116	 *  - Must be called between walk_shadow_page_lockless_{begin,end}.
  3117	 *  - The returned sptep must not be used after walk_shadow_page_lockless_end.
  3118	 */
> 3119	u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
  3120	{
  3121		struct kvm_shadow_walk_iterator iterator;
  3122		u64 old_spte;
  3123		u64 *sptep = NULL;
  3124	
  3125		if (is_tdp_mmu(vcpu->arch.mmu))
  3126			return kvm_tdp_mmu_get_last_sptep_lockless(vcpu, gpa, spte);
  3127	
  3128		for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) {
  3129			sptep = iterator.sptep;
  3130			*spte = old_spte;
  3131	
  3132			if (!is_shadow_present_pte(old_spte))
  3133				break;
  3134		}
  3135	
  3136		return sptep;
  3137	}
  3138	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 65272 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for the TDP MMU
@ 2021-07-01  2:54     ` kernel test robot
  0 siblings, 0 replies; 33+ messages in thread
From: kernel test robot @ 2021-07-01  2:54 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2855 bytes --]

Hi David,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kvm/queue]
[also build test WARNING on linus/master next-20210630]
[cannot apply to vhost/linux-next v5.13]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/David-Matlack/KVM-x86-mmu-Fast-page-fault-support-for-the-TDP-MMU/20210701-055009
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
config: x86_64-allyesconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/7709823e6135aaf4aeac8235973f37e679064356
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review David-Matlack/KVM-x86-mmu-Fast-page-fault-support-for-the-TDP-MMU/20210701-055009
        git checkout 7709823e6135aaf4aeac8235973f37e679064356
        # save the attached .config to linux build tree
        make W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> arch/x86/kvm/mmu/mmu.c:3119:6: warning: no previous prototype for 'get_last_sptep_lockless' [-Wmissing-prototypes]
    3119 | u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
         |      ^~~~~~~~~~~~~~~~~~~~~~~


vim +/get_last_sptep_lockless +3119 arch/x86/kvm/mmu/mmu.c

  3107	
  3108	/*
  3109	 * Returns the last level spte pointer of the shadow page walk for the given
  3110	 * gpa, and sets *spte to the spte value. This spte may be non-preset.
  3111	 *
  3112	 * If no walk could be performed, returns NULL and *spte does not contain valid
  3113	 * data.
  3114	 *
  3115	 * Constraints:
  3116	 *  - Must be called between walk_shadow_page_lockless_{begin,end}.
  3117	 *  - The returned sptep must not be used after walk_shadow_page_lockless_end.
  3118	 */
> 3119	u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
  3120	{
  3121		struct kvm_shadow_walk_iterator iterator;
  3122		u64 old_spte;
  3123		u64 *sptep = NULL;
  3124	
  3125		if (is_tdp_mmu(vcpu->arch.mmu))
  3126			return kvm_tdp_mmu_get_last_sptep_lockless(vcpu, gpa, spte);
  3127	
  3128		for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) {
  3129			sptep = iterator.sptep;
  3130			*spte = old_spte;
  3131	
  3132			if (!is_shadow_present_pte(old_spte))
  3133				break;
  3134		}
  3135	
  3136		return sptep;
  3137	}
  3138	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 65272 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for the TDP MMU
  2021-06-30 21:48 ` [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for " David Matlack
@ 2021-07-01  4:27     ` kernel test robot
  2021-07-01  4:27     ` kernel test robot
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 33+ messages in thread
From: kernel test robot @ 2021-07-01  4:27 UTC (permalink / raw)
  To: David Matlack, kvm
  Cc: kbuild-all, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones

[-- Attachment #1: Type: text/plain, Size: 2839 bytes --]

Hi David,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kvm/queue]
[also build test ERROR on linus/master next-20210630]
[cannot apply to vhost/linux-next v5.13]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/David-Matlack/KVM-x86-mmu-Fast-page-fault-support-for-the-TDP-MMU/20210701-055009
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
config: i386-buildonly-randconfig-r001-20210630 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/7709823e6135aaf4aeac8235973f37e679064356
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review David-Matlack/KVM-x86-mmu-Fast-page-fault-support-for-the-TDP-MMU/20210701-055009
        git checkout 7709823e6135aaf4aeac8235973f37e679064356
        # save the attached .config to linux build tree
        make W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> arch/x86/kvm/mmu/mmu.c:3119:6: error: no previous prototype for 'get_last_sptep_lockless' [-Werror=missing-prototypes]
    3119 | u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
         |      ^~~~~~~~~~~~~~~~~~~~~~~
   cc1: all warnings being treated as errors


vim +/get_last_sptep_lockless +3119 arch/x86/kvm/mmu/mmu.c

  3107	
  3108	/*
  3109	 * Returns the last level spte pointer of the shadow page walk for the given
  3110	 * gpa, and sets *spte to the spte value. This spte may be non-preset.
  3111	 *
  3112	 * If no walk could be performed, returns NULL and *spte does not contain valid
  3113	 * data.
  3114	 *
  3115	 * Constraints:
  3116	 *  - Must be called between walk_shadow_page_lockless_{begin,end}.
  3117	 *  - The returned sptep must not be used after walk_shadow_page_lockless_end.
  3118	 */
> 3119	u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
  3120	{
  3121		struct kvm_shadow_walk_iterator iterator;
  3122		u64 old_spte;
  3123		u64 *sptep = NULL;
  3124	
  3125		if (is_tdp_mmu(vcpu->arch.mmu))
  3126			return kvm_tdp_mmu_get_last_sptep_lockless(vcpu, gpa, spte);
  3127	
  3128		for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) {
  3129			sptep = iterator.sptep;
  3130			*spte = old_spte;
  3131	
  3132			if (!is_shadow_present_pte(old_spte))
  3133				break;
  3134		}
  3135	
  3136		return sptep;
  3137	}
  3138	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26683 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for the TDP MMU
@ 2021-07-01  4:27     ` kernel test robot
  0 siblings, 0 replies; 33+ messages in thread
From: kernel test robot @ 2021-07-01  4:27 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2913 bytes --]

Hi David,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kvm/queue]
[also build test ERROR on linus/master next-20210630]
[cannot apply to vhost/linux-next v5.13]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/David-Matlack/KVM-x86-mmu-Fast-page-fault-support-for-the-TDP-MMU/20210701-055009
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
config: i386-buildonly-randconfig-r001-20210630 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/7709823e6135aaf4aeac8235973f37e679064356
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review David-Matlack/KVM-x86-mmu-Fast-page-fault-support-for-the-TDP-MMU/20210701-055009
        git checkout 7709823e6135aaf4aeac8235973f37e679064356
        # save the attached .config to linux build tree
        make W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> arch/x86/kvm/mmu/mmu.c:3119:6: error: no previous prototype for 'get_last_sptep_lockless' [-Werror=missing-prototypes]
    3119 | u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
         |      ^~~~~~~~~~~~~~~~~~~~~~~
   cc1: all warnings being treated as errors


vim +/get_last_sptep_lockless +3119 arch/x86/kvm/mmu/mmu.c

  3107	
  3108	/*
  3109	 * Returns the last level spte pointer of the shadow page walk for the given
  3110	 * gpa, and sets *spte to the spte value. This spte may be non-preset.
  3111	 *
  3112	 * If no walk could be performed, returns NULL and *spte does not contain valid
  3113	 * data.
  3114	 *
  3115	 * Constraints:
  3116	 *  - Must be called between walk_shadow_page_lockless_{begin,end}.
  3117	 *  - The returned sptep must not be used after walk_shadow_page_lockless_end.
  3118	 */
> 3119	u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
  3120	{
  3121		struct kvm_shadow_walk_iterator iterator;
  3122		u64 old_spte;
  3123		u64 *sptep = NULL;
  3124	
  3125		if (is_tdp_mmu(vcpu->arch.mmu))
  3126			return kvm_tdp_mmu_get_last_sptep_lockless(vcpu, gpa, spte);
  3127	
  3128		for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) {
  3129			sptep = iterator.sptep;
  3130			*spte = old_spte;
  3131	
  3132			if (!is_shadow_present_pte(old_spte))
  3133				break;
  3134		}
  3135	
  3136		return sptep;
  3137	}
  3138	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 26683 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU
       [not found]   ` <CABgObfZUFWCAvKoxDzGjmksFnwZgbnpX9GuC+nhiVLa-Fhwj6A@mail.gmail.com>
@ 2021-07-01 12:08     ` Matthew Wilcox
  2021-07-01 16:50       ` David Matlack
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2021-07-01 12:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Matlack, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson, Junaid Shahid,
	Andrew Jones, Yu Zhao, David Hildenbrand, Andrew Morton

On Thu, Jul 01, 2021 at 08:47:06AM +0200, Paolo Bonzini wrote:
> Il gio 1 lug 2021, 03:17 Matthew Wilcox <willy@infradead.org> ha scritto:
> 
> > On Wed, Jun 30, 2021 at 09:47:56PM +0000, David Matlack wrote:
> > > This patch series adds support for the TDP MMU in the fast_page_fault
> >
> > Nowhere in this message do you explain what the TDP MMU is.
> >
> 
> I don't know if you rolled a 1 too much in your last D&D session or what,
> but this seems like a slightly petty remark given the wealth of information
> that was in the commit message.

Yes, there was a wealth of information, but I couldn't understand any of
it because I have no idea what a TDP MMU is.  Actually, the length of
the intro message annoyed me, because I had to scan the whole thing to
try to figure out whether he explained what a TDP MMU is anywhere in it.

> The patches only touch kvm code, therefore David probably expected only kvm
> reviewers to be interested---in which case, anyone who hasn't lived under a
> rock for a year or two would know.

He cc'd me.  That's usually a request for a review.  So I had to try
to understand why I would care.  At this point, I don't think I care.
Somebody explain to me why I should care?  That's the point of a cover
letter.

> Anyway the acronym is fairly
> Google-friendly:
> https://lore.kernel.org/kvm/20201014182700.2888246-1-bgardon@google.com/
> http://lkml.iu.edu/hypermail/linux/kernel/2010.2/07021.html

Now I'm really confused, and I don't understand why I was cc'd at all.
get_maintainer.pl going wild?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU
  2021-07-01 12:08     ` Matthew Wilcox
@ 2021-07-01 16:50       ` David Matlack
  0 siblings, 0 replies; 33+ messages in thread
From: David Matlack @ 2021-07-01 16:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson, Junaid Shahid,
	Andrew Jones, Yu Zhao, David Hildenbrand, Andrew Morton

On Thu, Jul 1, 2021 at 5:08 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> Now I'm really confused, and I don't understand why I was cc'd at all.
> get_maintainer.pl going wild?

Apologies for the confusion. I could have made it more clear why I
cc'd you in the cover letter.

I cc'd you, Yu, David, and Andrew (Morton) since this series
introduces a new selftest that uses on page_idle and you recently
discussed removing page_idle. See the second paragraph of the cover
letter with the link to your page_idle patch, and patch 6 which
introduces the selftest.

I will make it more explicit in the future when cc'ing non-KVM
developers on my KVM patches.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU
  2021-06-30 21:47 [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
                   ` (6 preceding siblings ...)
  2021-07-01  1:15 ` [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU Matthew Wilcox
@ 2021-07-01 17:00 ` David Hildenbrand
  2021-07-01 22:11   ` David Matlack
  7 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2021-07-01 17:00 UTC (permalink / raw)
  To: David Matlack, kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, Matthew Wilcox, Yu Zhao,
	Andrew Morton

On 30.06.21 23:47, David Matlack wrote:
> This patch series adds support for the TDP MMU in the fast_page_fault
> path, which enables certain write-protection and access tracking faults
> to be handled without taking the KVM MMU lock. This series brings the
> performance of these faults up to par with the legacy MMU.
> 
> Since there is not currently any KVM test coverage for access tracking
> faults, this series introduces a new KVM selftest,
> access_tracking_perf_test. Note that this test relies on page_idle to
> enable access tracking from userspace (since it is the only available
> usersapce API to do so) and page_idle is being considered for removal
> from Linux
> (https://lore.kernel.org/linux-mm/20210612000714.775825-1-willy@infradead.org/).

Well, at least a new selftest that implicitly tests a part of page_idle 
-- nice :)

Haven't looked into the details, but if you can live with page tables 
starting unpopulated and only monitoring what gets populated on r/w 
access, you might be able to achieve something similar using 
/proc/self/pagemap and softdirty handling.

Unpopulated page (e.g., via MADV_DISCARD) -> trigger read or write 
access -> sense if page populated in pagemap
Populated page-> clear all softdirty bits -> trigger write access -> 
sense if page is softdirty in pagemap

See https://lkml.kernel.org/r/20210419135443.12822-6-david@redhat.com 
for an example.

But I'm actually fairly happy to see page_idel getting used. Maybe you 
could extend that test using pagemap, if it's applicable to your test setup.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for the TDP MMU
  2021-07-01  4:27     ` kernel test robot
@ 2021-07-01 18:27       ` David Matlack
  -1 siblings, 0 replies; 33+ messages in thread
From: David Matlack @ 2021-07-01 18:27 UTC (permalink / raw)
  To: kernel test robot
  Cc: kvm, kbuild-all, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones

On Thu, Jul 01, 2021 at 12:27:58PM +0800, kernel test robot wrote:
> 
> >> arch/x86/kvm/mmu/mmu.c:3119:6: error: no previous prototype for 'get_last_sptep_lockless' [-Werror=missing-prototypes]
>     3119 | u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
>          |      ^~~~~~~~~~~~~~~~~~~~~~~

get_last_sptep_lockless should be static. I will include a fix in the
next version of this series.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for the TDP MMU
@ 2021-07-01 18:27       ` David Matlack
  0 siblings, 0 replies; 33+ messages in thread
From: David Matlack @ 2021-07-01 18:27 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 429 bytes --]

On Thu, Jul 01, 2021 at 12:27:58PM +0800, kernel test robot wrote:
> 
> >> arch/x86/kvm/mmu/mmu.c:3119:6: error: no previous prototype for 'get_last_sptep_lockless' [-Werror=missing-prototypes]
>     3119 | u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
>          |      ^~~~~~~~~~~~~~~~~~~~~~~

get_last_sptep_lockless should be static. I will include a fix in the
next version of this series.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU
  2021-07-01 17:00 ` David Hildenbrand
@ 2021-07-01 22:11   ` David Matlack
  2021-07-02  7:53     ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: David Matlack @ 2021-07-01 22:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, Matthew Wilcox, Yu Zhao,
	Andrew Morton

On Thu, Jul 01, 2021 at 07:00:51PM +0200, David Hildenbrand wrote:
> On 30.06.21 23:47, David Matlack wrote:
> > This patch series adds support for the TDP MMU in the fast_page_fault
> > path, which enables certain write-protection and access tracking faults
> > to be handled without taking the KVM MMU lock. This series brings the
> > performance of these faults up to par with the legacy MMU.
> > 
> > Since there is not currently any KVM test coverage for access tracking
> > faults, this series introduces a new KVM selftest,
> > access_tracking_perf_test. Note that this test relies on page_idle to
> > enable access tracking from userspace (since it is the only available
> > usersapce API to do so) and page_idle is being considered for removal
> > from Linux
> > (https://lore.kernel.org/linux-mm/20210612000714.775825-1-willy@infradead.org/).
> 
> Well, at least a new selftest that implicitly tests a part of page_idle --
> nice :)
> 
> Haven't looked into the details, but if you can live with page tables
> starting unpopulated and only monitoring what gets populated on r/w access,
> you might be able to achieve something similar using /proc/self/pagemap and
> softdirty handling.
> 
> Unpopulated page (e.g., via MADV_DISCARD) -> trigger read or write access ->
> sense if page populated in pagemap
> Populated page-> clear all softdirty bits -> trigger write access -> sense
> if page is softdirty in pagemap

Thanks for the suggestion. I modified by test to write 4 to
/proc/self/clear_refs rather than marking pages in page_idle. However,
by doing so I was no longer able to exercise KVM's fast_page_fault
handler [1].

It looks like the reason why is that clear_refs issues the
invalidate_range mmu notifiers, which will cause KVM to fully refault
the page from the host MM upon subsequent guest memory accesses.  In
contrast, page_idle uses clear_young which KVM can handle with
fast_page_fault.

Let me know if I misunderstood your suggestion though.

[1] https://www.kernel.org/doc/html/latest/virt/kvm/locking.html#exception

> 
> See https://lkml.kernel.org/r/20210419135443.12822-6-david@redhat.com for an
> example.
> 
> But I'm actually fairly happy to see page_idel getting used. Maybe you could
> extend that test using pagemap, if it's applicable to your test setup.
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU
  2021-07-01 22:11   ` David Matlack
@ 2021-07-02  7:53     ` David Hildenbrand
  0 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2021-07-02  7:53 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, Matthew Wilcox, Yu Zhao,
	Andrew Morton

On 02.07.21 00:11, David Matlack wrote:
> On Thu, Jul 01, 2021 at 07:00:51PM +0200, David Hildenbrand wrote:
>> On 30.06.21 23:47, David Matlack wrote:
>>> This patch series adds support for the TDP MMU in the fast_page_fault
>>> path, which enables certain write-protection and access tracking faults
>>> to be handled without taking the KVM MMU lock. This series brings the
>>> performance of these faults up to par with the legacy MMU.
>>>
>>> Since there is not currently any KVM test coverage for access tracking
>>> faults, this series introduces a new KVM selftest,
>>> access_tracking_perf_test. Note that this test relies on page_idle to
>>> enable access tracking from userspace (since it is the only available
>>> usersapce API to do so) and page_idle is being considered for removal
>>> from Linux
>>> (https://lore.kernel.org/linux-mm/20210612000714.775825-1-willy@infradead.org/).
>>
>> Well, at least a new selftest that implicitly tests a part of page_idle --
>> nice :)
>>
>> Haven't looked into the details, but if you can live with page tables
>> starting unpopulated and only monitoring what gets populated on r/w access,
>> you might be able to achieve something similar using /proc/self/pagemap and
>> softdirty handling.
>>
>> Unpopulated page (e.g., via MADV_DISCARD) -> trigger read or write access ->
>> sense if page populated in pagemap
>> Populated page-> clear all softdirty bits -> trigger write access -> sense
>> if page is softdirty in pagemap
> 
> Thanks for the suggestion. I modified by test to write 4 to
> /proc/self/clear_refs rather than marking pages in page_idle. However,
> by doing so I was no longer able to exercise KVM's fast_page_fault
> handler [1].
> 
> It looks like the reason why is that clear_refs issues the
> invalidate_range mmu notifiers, which will cause KVM to fully refault
> the page from the host MM upon subsequent guest memory accesses.  In
> contrast, page_idle uses clear_young which KVM can handle with
> fast_page_fault.

Right, the only thing you could provoke may be (again, did not look into 
the details):

1. 4 > /proc/self/clear_refs to clear all softdirty bits
2. Trigger a read fault. This will populate the shared zeropage on 
private anonymous memory and populate an actual page on e.g., shmem.
3. Trigger a write fault. This should result in a COW fault on private 
anonymous memory and (IIRC) a WP-fault on shmem.

The COW on private anonymous memory would invalidate the secondary MMU 
again via MMU notifiers I guess, so it's not what we want. It *might* 
work on shmem, I might be wrong, though.

But again, I haven't looked into the details/checked the code and we 
might actually not be able to test with softdirty at all. IMHO, page 
idle tracking might be good enough for a test.

Consider it rather a brain dump than a suggestion ;)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for the TDP MMU
  2021-06-30 21:48 ` [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for " David Matlack
  2021-07-01  2:54     ` kernel test robot
  2021-07-01  4:27     ` kernel test robot
@ 2021-07-09 18:45   ` David Matlack
  2021-07-12 17:49   ` Ben Gardon
  3 siblings, 0 replies; 33+ messages in thread
From: David Matlack @ 2021-07-09 18:45 UTC (permalink / raw)
  To: kvm
  Cc: Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
	Junaid Shahid, Andrew Jones, Matthew Wilcox, Yu Zhao,
	David Hildenbrand, Andrew Morton

On Wed, Jun 30, 2021 at 09:48:00PM +0000, David Matlack wrote:
> Make fast_page_fault interoperate with the TDP MMU by leveraging
> walk_shadow_page_lockless_{begin,end} to acquire the RCU read lock and
> introducing a new helper function kvm_tdp_mmu_get_last_sptep_lockless to
> grab the lowest level sptep.
> 
> Suggested-by: Ben Gardon <bgardon@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     | 55 +++++++++++++++++++++++++++-----------
>  arch/x86/kvm/mmu/tdp_mmu.c | 36 +++++++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.h |  2 ++
>  3 files changed, 78 insertions(+), 15 deletions(-)
> 

...

> +/*
> + * Must be called between kvm_tdp_mmu_walk_shadow_page_lockless_{begin,end}.
> + *
> + * The returned sptep must not be used after
> + * kvm_tdp_mmu_walk_shadow_page_lockless_end.
> + */

The function names in the comment are spelled wrong and should be:

   /*
    * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
    *
    * The returned sptep must not be used after kvm_tdp_mmu_walk_lockless_end.
    */

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/6] KVM: x86/mmu: Fix use of enums in trace_fast_page_fault
  2021-06-30 21:47 ` [PATCH v2 2/6] KVM: x86/mmu: Fix use of enums in trace_fast_page_fault David Matlack
@ 2021-07-12 16:14   ` Ben Gardon
  2021-07-12 18:11     ` David Matlack
  0 siblings, 1 reply; 33+ messages in thread
From: Ben Gardon @ 2021-07-12 16:14 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, Junaid Shahid, Andrew Jones,
	Matthew Wilcox, Yu Zhao, David Hildenbrand, Andrew Morton

On Wed, Jun 30, 2021 at 2:48 PM David Matlack <dmatlack@google.com> wrote:
>
> Enum values have to be exported to userspace since the formatting is not
> done in the kernel. Without doing this perf maps RET_PF_FIXED and
> RET_PF_SPURIOUS to 0, which results in incorrect output:
>
>   $ perf record -a -e kvmmmu:fast_page_fault --filter "ret==3" -- ./access_tracking_perf_test
>   $ perf script | head -1
>    [...] new 610006048d25877 spurious 0 fixed 0  <------ should be 1
>
> Fix this by exporting the enum values to userspace with TRACE_DEFINE_ENUM.
>
> Fixes: c4371c2a682e ("KVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed")
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmutrace.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> index efbad33a0645..55c7e0fcda52 100644
> --- a/arch/x86/kvm/mmu/mmutrace.h
> +++ b/arch/x86/kvm/mmu/mmutrace.h
> @@ -244,6 +244,9 @@ TRACE_EVENT(
>                   __entry->access)
>  );
>
> +TRACE_DEFINE_ENUM(RET_PF_FIXED);
> +TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> +

If you're planning to send out a v3 anyway, it might be worth adding
all the PF return code enums:

enum {
RET_PF_RETRY = 0,
RET_PF_EMULATE,
RET_PF_INVALID,
RET_PF_FIXED,
RET_PF_SPURIOUS,
};

Just so that no one has to worry about this in the future.

>  TRACE_EVENT(
>         fast_page_fault,
>         TP_PROTO(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u32 error_code,
> --
> 2.32.0.93.g670b81a890-goog
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 3/6] KVM: x86/mmu: Make walk_shadow_page_lockless_{begin,end} interoperate with the TDP MMU
  2021-06-30 21:47 ` [PATCH v2 3/6] KVM: x86/mmu: Make walk_shadow_page_lockless_{begin,end} interoperate with the TDP MMU David Matlack
@ 2021-07-12 17:02   ` Ben Gardon
  2021-07-12 18:11     ` David Matlack
  2021-07-12 20:23   ` Sean Christopherson
  1 sibling, 1 reply; 33+ messages in thread
From: Ben Gardon @ 2021-07-12 17:02 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, Junaid Shahid, Andrew Jones,
	Matthew Wilcox, Yu Zhao, David Hildenbrand, Andrew Morton

On Wed, Jun 30, 2021 at 2:48 PM David Matlack <dmatlack@google.com> wrote:
>
> Acquire the RCU read lock in walk_shadow_page_lockless_begin and release
> it in walk_shadow_page_lockless_end when the TDP MMU is enabled.  This
> should not introduce any functional changes but is used in the following
> commit to make fast_page_fault interoperate with the TDP MMU.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>

This as I understand this, we're just lifting the rcu_lock/unlock out
of kvm_tdp_mmu_get_walk, and then putting all the TDP MMU specific
stuff down a level under walk_shadow_page_lockless_begin/end and
get_walk.

Instead of moving kvm_tdp_mmu_get_walk_lockless into get_walk, it
could also be open-coded as:

walk_shadow_page_lockless_begin
 if (is_tdp_mmu(vcpu->arch.mmu))
               leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root);
 else
               leaf = get_walk(vcpu, addr, sptes, &root);
walk_shadow_page_lockless_end

in get_mmio_spte, since get_walk isn't used anywhere else. Then
walk_shadow_page_lockless_begin/end could also be moved up out of
get_walk instead of having to add a goto to that function.
I don't have a strong preference either way, but the above feels like
a slightly simpler refactor.


> ---
>  arch/x86/kvm/mmu/mmu.c     | 24 ++++++++++++++++++------
>  arch/x86/kvm/mmu/tdp_mmu.c | 20 ++++++++++++++------
>  arch/x86/kvm/mmu/tdp_mmu.h |  6 ++++--
>  3 files changed, 36 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 45274436d3c0..88c71a8a55f1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -686,6 +686,11 @@ static bool mmu_spte_age(u64 *sptep)
>
>  static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
>  {
> +       if (is_tdp_mmu(vcpu->arch.mmu)) {
> +               kvm_tdp_mmu_walk_lockless_begin();
> +               return;
> +       }
> +
>         /*
>          * Prevent page table teardown by making any free-er wait during
>          * kvm_flush_remote_tlbs() IPI to all active vcpus.
> @@ -701,6 +706,11 @@ static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
>
>  static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
>  {
> +       if (is_tdp_mmu(vcpu->arch.mmu)) {
> +               kvm_tdp_mmu_walk_lockless_end();
> +               return;
> +       }
> +
>         /*
>          * Make sure the write to vcpu->mode is not reordered in front of
>          * reads to sptes.  If it does, kvm_mmu_commit_zap_page() can see us
> @@ -3621,6 +3631,12 @@ static int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level
>
>         walk_shadow_page_lockless_begin(vcpu);
>
> +       if (is_tdp_mmu(vcpu->arch.mmu)) {
> +               leaf = kvm_tdp_mmu_get_walk_lockless(vcpu, addr, sptes,
> +                                                    root_level);
> +               goto out;
> +       }
> +
>         for (shadow_walk_init(&iterator, vcpu, addr),
>              *root_level = iterator.level;
>              shadow_walk_okay(&iterator);
> @@ -3634,8 +3650,8 @@ static int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level
>                         break;
>         }
>
> +out:
>         walk_shadow_page_lockless_end(vcpu);
> -
>         return leaf;
>  }
>
> @@ -3647,11 +3663,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
>         int root, leaf, level;
>         bool reserved = false;
>
> -       if (is_tdp_mmu(vcpu->arch.mmu))
> -               leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root);
> -       else
> -               leaf = get_walk(vcpu, addr, sptes, &root);
> -
> +       leaf = get_walk(vcpu, addr, sptes, &root);
>         if (unlikely(leaf < 0)) {
>                 *sptep = 0ull;
>                 return reserved;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index caac4ddb46df..c6fa8d00bf9f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1513,12 +1513,24 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>         return spte_set;
>  }
>
> +void kvm_tdp_mmu_walk_lockless_begin(void)
> +{
> +       rcu_read_lock();
> +}
> +
> +void kvm_tdp_mmu_walk_lockless_end(void)
> +{
> +       rcu_read_unlock();
> +}
> +
>  /*
>   * Return the level of the lowest level SPTE added to sptes.
>   * That SPTE may be non-present.
> + *
> + * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
>   */
> -int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> -                        int *root_level)
> +int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> +                                 int *root_level)
>  {
>         struct tdp_iter iter;
>         struct kvm_mmu *mmu = vcpu->arch.mmu;
> @@ -1527,14 +1539,10 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
>
>         *root_level = vcpu->arch.mmu->shadow_root_level;
>
> -       rcu_read_lock();
> -
>         tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
>                 leaf = iter.level;
>                 sptes[leaf] = iter.old_spte;
>         }
>
> -       rcu_read_unlock();
> -
>         return leaf;
>  }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 1cae4485b3bc..e9dde5f9c0ef 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -77,8 +77,10 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>                                    struct kvm_memory_slot *slot, gfn_t gfn,
>                                    int min_level);
>
> -int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> -                        int *root_level);
> +void kvm_tdp_mmu_walk_lockless_begin(void);
> +void kvm_tdp_mmu_walk_lockless_end(void);
> +int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> +                                 int *root_level);
>
>  #ifdef CONFIG_X86_64
>  bool kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> --
> 2.32.0.93.g670b81a890-goog
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for the TDP MMU
  2021-06-30 21:48 ` [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for " David Matlack
                     ` (2 preceding siblings ...)
  2021-07-09 18:45   ` David Matlack
@ 2021-07-12 17:49   ` Ben Gardon
  2021-07-12 18:20     ` David Matlack
  2021-07-12 21:03     ` Sean Christopherson
  3 siblings, 2 replies; 33+ messages in thread
From: Ben Gardon @ 2021-07-12 17:49 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, Junaid Shahid, Andrew Jones,
	Matthew Wilcox, Yu Zhao, David Hildenbrand, Andrew Morton

On Wed, Jun 30, 2021 at 2:48 PM David Matlack <dmatlack@google.com> wrote:
>
> Make fast_page_fault interoperate with the TDP MMU by leveraging
> walk_shadow_page_lockless_{begin,end} to acquire the RCU read lock and
> introducing a new helper function kvm_tdp_mmu_get_last_sptep_lockless to
> grab the lowest level sptep.
>
> Suggested-by: Ben Gardon <bgardon@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     | 55 +++++++++++++++++++++++++++-----------
>  arch/x86/kvm/mmu/tdp_mmu.c | 36 +++++++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.h |  2 ++
>  3 files changed, 78 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 88c71a8a55f1..1d410278a4cc 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3105,15 +3105,45 @@ static bool is_access_allowed(u32 fault_err_code, u64 spte)
>         return spte & PT_PRESENT_MASK;
>  }
>
> +/*
> + * Returns the last level spte pointer of the shadow page walk for the given
> + * gpa, and sets *spte to the spte value. This spte may be non-preset.
> + *
> + * If no walk could be performed, returns NULL and *spte does not contain valid
> + * data.
> + *
> + * Constraints:
> + *  - Must be called between walk_shadow_page_lockless_{begin,end}.
> + *  - The returned sptep must not be used after walk_shadow_page_lockless_end.
> + */
> +u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
> +{
> +       struct kvm_shadow_walk_iterator iterator;
> +       u64 old_spte;
> +       u64 *sptep = NULL;
> +
> +       if (is_tdp_mmu(vcpu->arch.mmu))
> +               return kvm_tdp_mmu_get_last_sptep_lockless(vcpu, gpa, spte);
> +
> +       for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) {
> +               sptep = iterator.sptep;
> +               *spte = old_spte;
> +
> +               if (!is_shadow_present_pte(old_spte))
> +                       break;
> +       }
> +
> +       return sptep;
> +}
> +
>  /*
>   * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS.
>   */
>  static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
>  {
> -       struct kvm_shadow_walk_iterator iterator;
> -       struct kvm_mmu_page *sp;
>         int ret = RET_PF_INVALID;
>         u64 spte = 0ull;
> +       u64 *sptep = NULL;
>         uint retry_count = 0;
>
>         if (!page_fault_can_be_fast(error_code))
> @@ -3122,16 +3152,14 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
>         walk_shadow_page_lockless_begin(vcpu);
>
>         do {
> +               struct kvm_mmu_page *sp;
>                 u64 new_spte;
>
> -               for_each_shadow_entry_lockless(vcpu, gpa, iterator, spte)
> -                       if (!is_shadow_present_pte(spte))
> -                               break;
> -
> +               sptep = get_last_sptep_lockless(vcpu, gpa, &spte);
>                 if (!is_shadow_present_pte(spte))
>                         break;
>
> -               sp = sptep_to_sp(iterator.sptep);
> +               sp = sptep_to_sp(sptep);
>                 if (!is_last_spte(spte, sp->role.level))
>                         break;
>
> @@ -3189,8 +3217,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
>                  * since the gfn is not stable for indirect shadow page. See
>                  * Documentation/virt/kvm/locking.rst to get more detail.
>                  */
> -               if (fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte,
> -                                           new_spte)) {
> +               if (fast_pf_fix_direct_spte(vcpu, sp, sptep, spte, new_spte)) {
>                         ret = RET_PF_FIXED;
>                         break;
>                 }
> @@ -3203,7 +3230,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
>
>         } while (true);
>
> -       trace_fast_page_fault(vcpu, gpa, error_code, iterator.sptep, spte, ret);
> +       trace_fast_page_fault(vcpu, gpa, error_code, sptep, spte, ret);
>         walk_shadow_page_lockless_end(vcpu);
>
>         return ret;
> @@ -3838,11 +3865,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>         if (page_fault_handle_page_track(vcpu, error_code, gfn))
>                 return RET_PF_EMULATE;
>
> -       if (!is_tdp_mmu_fault) {
> -               r = fast_page_fault(vcpu, gpa, error_code);
> -               if (r != RET_PF_INVALID)
> -                       return r;
> -       }
> +       r = fast_page_fault(vcpu, gpa, error_code);
> +       if (r != RET_PF_INVALID)
> +               return r;
>
>         r = mmu_topup_memory_caches(vcpu, false);
>         if (r)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index c6fa8d00bf9f..2c9e0ed71fa0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -527,6 +527,10 @@ static inline bool tdp_mmu_set_spte_atomic_no_dirty_log(struct kvm *kvm,
>         if (is_removed_spte(iter->old_spte))
>                 return false;
>
> +       /*
> +        * TDP MMU sptes can also be concurrently cmpxchg'd in
> +        * fast_pf_fix_direct_spte as part of fast_page_fault.
> +        */
>         if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
>                       new_spte) != iter->old_spte)
>                 return false;

I'm a little nervous about not going through the handle_changed_spte
flow for the TDP MMU, but as things are now, I think it's safe.

> @@ -1546,3 +1550,35 @@ int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
>
>         return leaf;
>  }
> +
> +/*
> + * Must be called between kvm_tdp_mmu_walk_shadow_page_lockless_{begin,end}.
> + *
> + * The returned sptep must not be used after
> + * kvm_tdp_mmu_walk_shadow_page_lockless_end.
> + */
> +u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
> +                                        u64 *spte)
> +{
> +       struct tdp_iter iter;
> +       struct kvm_mmu *mmu = vcpu->arch.mmu;
> +       gfn_t gfn = addr >> PAGE_SHIFT;
> +       tdp_ptep_t sptep = NULL;
> +
> +       tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> +               *spte = iter.old_spte;
> +               sptep = iter.sptep;
> +       }
> +
> +       if (sptep)
> +               /*
> +                * Perform the rcu dereference here since we are passing the
> +                * sptep up to the generic MMU code which does not know the
> +                * synchronization details of the TDP MMU. This is safe as long
> +                * as the caller obeys the contract that the sptep is not used
> +                * after kvm_tdp_mmu_walk_shadow_page_lockless_end.
> +                */

There's a little more to this contract:
1. The caller should only modify the SPTE using an atomic cmpxchg with
the returned spte value.
2. The caller should not modify the mapped PFN or present <-> not
present state of the SPTE.
3. There are other bits the caller can't modify too. (lpage, mt, etc.)

If the comments on this function don't document all the constraints on
how the returned sptep can be used, it might be safer to specify that
this is only meant to be used as part of the fast page fault handler.

> +               return rcu_dereference(sptep);
> +
> +       return NULL;
> +}
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index e9dde5f9c0ef..508a23bdf7da 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -81,6 +81,8 @@ void kvm_tdp_mmu_walk_lockless_begin(void);
>  void kvm_tdp_mmu_walk_lockless_end(void);
>  int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
>                                   int *root_level);
> +u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
> +                                        u64 *spte);
>
>  #ifdef CONFIG_X86_64
>  bool kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> --
> 2.32.0.93.g670b81a890-goog
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 3/6] KVM: x86/mmu: Make walk_shadow_page_lockless_{begin,end} interoperate with the TDP MMU
  2021-07-12 17:02   ` Ben Gardon
@ 2021-07-12 18:11     ` David Matlack
  2021-07-12 20:20       ` Sean Christopherson
  0 siblings, 1 reply; 33+ messages in thread
From: David Matlack @ 2021-07-12 18:11 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, Junaid Shahid, Andrew Jones,
	Matthew Wilcox, Yu Zhao, David Hildenbrand, Andrew Morton

On Mon, Jul 12, 2021 at 10:02:31AM -0700, Ben Gardon wrote:
> On Wed, Jun 30, 2021 at 2:48 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Acquire the RCU read lock in walk_shadow_page_lockless_begin and release
> > it in walk_shadow_page_lockless_end when the TDP MMU is enabled.  This
> > should not introduce any functional changes but is used in the following
> > commit to make fast_page_fault interoperate with the TDP MMU.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> 
> Reviewed-by: Ben Gardon <bgardon@google.com>
> 
> This as I understand this, we're just lifting the rcu_lock/unlock out
> of kvm_tdp_mmu_get_walk, and then putting all the TDP MMU specific
> stuff down a level under walk_shadow_page_lockless_begin/end and
> get_walk.
> 
> Instead of moving kvm_tdp_mmu_get_walk_lockless into get_walk, it
> could also be open-coded as:
> 
> walk_shadow_page_lockless_begin
>  if (is_tdp_mmu(vcpu->arch.mmu))
>                leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root);
>  else
>                leaf = get_walk(vcpu, addr, sptes, &root);
> walk_shadow_page_lockless_end
> 
> in get_mmio_spte, since get_walk isn't used anywhere else. Then
> walk_shadow_page_lockless_begin/end could also be moved up out of
> get_walk instead of having to add a goto to that function.
> I don't have a strong preference either way, but the above feels like
> a slightly simpler refactor.

I don't have a strong preference either way as well. I'd be happy to
switch to your suggestion in v3.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/6] KVM: x86/mmu: Fix use of enums in trace_fast_page_fault
  2021-07-12 16:14   ` Ben Gardon
@ 2021-07-12 18:11     ` David Matlack
  2021-07-12 19:53       ` Sean Christopherson
  0 siblings, 1 reply; 33+ messages in thread
From: David Matlack @ 2021-07-12 18:11 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, Junaid Shahid, Andrew Jones,
	Matthew Wilcox, Yu Zhao, David Hildenbrand, Andrew Morton

On Mon, Jul 12, 2021 at 09:14:01AM -0700, Ben Gardon wrote:
> On Wed, Jun 30, 2021 at 2:48 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Enum values have to be exported to userspace since the formatting is not
> > done in the kernel. Without doing this perf maps RET_PF_FIXED and
> > RET_PF_SPURIOUS to 0, which results in incorrect output:
> >
> >   $ perf record -a -e kvmmmu:fast_page_fault --filter "ret==3" -- ./access_tracking_perf_test
> >   $ perf script | head -1
> >    [...] new 610006048d25877 spurious 0 fixed 0  <------ should be 1
> >
> > Fix this by exporting the enum values to userspace with TRACE_DEFINE_ENUM.
> >
> > Fixes: c4371c2a682e ("KVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed")
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmutrace.h | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > index efbad33a0645..55c7e0fcda52 100644
> > --- a/arch/x86/kvm/mmu/mmutrace.h
> > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > @@ -244,6 +244,9 @@ TRACE_EVENT(
> >                   __entry->access)
> >  );
> >
> > +TRACE_DEFINE_ENUM(RET_PF_FIXED);
> > +TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> > +
> 
> If you're planning to send out a v3 anyway, it might be worth adding
> all the PF return code enums:
> 
> enum {
> RET_PF_RETRY = 0,
> RET_PF_EMULATE,
> RET_PF_INVALID,
> RET_PF_FIXED,
> RET_PF_SPURIOUS,
> };
> 
> Just so that no one has to worry about this in the future.

Will do in v3. Thanks.

> 
> >  TRACE_EVENT(
> >         fast_page_fault,
> >         TP_PROTO(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u32 error_code,
> > --
> > 2.32.0.93.g670b81a890-goog
> >

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for the TDP MMU
  2021-07-12 17:49   ` Ben Gardon
@ 2021-07-12 18:20     ` David Matlack
  2021-07-12 21:03     ` Sean Christopherson
  1 sibling, 0 replies; 33+ messages in thread
From: David Matlack @ 2021-07-12 18:20 UTC (permalink / raw)
  To: Ben Gardon
  Cc: kvm, Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Sean Christopherson, Paolo Bonzini, Junaid Shahid, Andrew Jones,
	Matthew Wilcox, Yu Zhao, David Hildenbrand, Andrew Morton

On Mon, Jul 12, 2021 at 10:49:55AM -0700, Ben Gardon wrote:
> On Wed, Jun 30, 2021 at 2:48 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Make fast_page_fault interoperate with the TDP MMU by leveraging
> > walk_shadow_page_lockless_{begin,end} to acquire the RCU read lock and
> > introducing a new helper function kvm_tdp_mmu_get_last_sptep_lockless to
> > grab the lowest level sptep.
> >
> > Suggested-by: Ben Gardon <bgardon@google.com>
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c     | 55 +++++++++++++++++++++++++++-----------
> >  arch/x86/kvm/mmu/tdp_mmu.c | 36 +++++++++++++++++++++++++
> >  arch/x86/kvm/mmu/tdp_mmu.h |  2 ++
> >  3 files changed, 78 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 88c71a8a55f1..1d410278a4cc 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3105,15 +3105,45 @@ static bool is_access_allowed(u32 fault_err_code, u64 spte)
> >         return spte & PT_PRESENT_MASK;
> >  }
> >
> > +/*
> > + * Returns the last level spte pointer of the shadow page walk for the given
> > + * gpa, and sets *spte to the spte value. This spte may be non-preset.
> > + *
> > + * If no walk could be performed, returns NULL and *spte does not contain valid
> > + * data.
> > + *
> > + * Constraints:
> > + *  - Must be called between walk_shadow_page_lockless_{begin,end}.
> > + *  - The returned sptep must not be used after walk_shadow_page_lockless_end.
> > + */
> > +u64 *get_last_sptep_lockless(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
> > +{
> > +       struct kvm_shadow_walk_iterator iterator;
> > +       u64 old_spte;
> > +       u64 *sptep = NULL;
> > +
> > +       if (is_tdp_mmu(vcpu->arch.mmu))
> > +               return kvm_tdp_mmu_get_last_sptep_lockless(vcpu, gpa, spte);
> > +
> > +       for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) {
> > +               sptep = iterator.sptep;
> > +               *spte = old_spte;
> > +
> > +               if (!is_shadow_present_pte(old_spte))
> > +                       break;
> > +       }
> > +
> > +       return sptep;
> > +}
> > +
> >  /*
> >   * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS.
> >   */
> >  static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
> >  {
> > -       struct kvm_shadow_walk_iterator iterator;
> > -       struct kvm_mmu_page *sp;
> >         int ret = RET_PF_INVALID;
> >         u64 spte = 0ull;
> > +       u64 *sptep = NULL;
> >         uint retry_count = 0;
> >
> >         if (!page_fault_can_be_fast(error_code))
> > @@ -3122,16 +3152,14 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
> >         walk_shadow_page_lockless_begin(vcpu);
> >
> >         do {
> > +               struct kvm_mmu_page *sp;
> >                 u64 new_spte;
> >
> > -               for_each_shadow_entry_lockless(vcpu, gpa, iterator, spte)
> > -                       if (!is_shadow_present_pte(spte))
> > -                               break;
> > -
> > +               sptep = get_last_sptep_lockless(vcpu, gpa, &spte);
> >                 if (!is_shadow_present_pte(spte))
> >                         break;
> >
> > -               sp = sptep_to_sp(iterator.sptep);
> > +               sp = sptep_to_sp(sptep);
> >                 if (!is_last_spte(spte, sp->role.level))
> >                         break;
> >
> > @@ -3189,8 +3217,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
> >                  * since the gfn is not stable for indirect shadow page. See
> >                  * Documentation/virt/kvm/locking.rst to get more detail.
> >                  */
> > -               if (fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte,
> > -                                           new_spte)) {
> > +               if (fast_pf_fix_direct_spte(vcpu, sp, sptep, spte, new_spte)) {
> >                         ret = RET_PF_FIXED;
> >                         break;
> >                 }
> > @@ -3203,7 +3230,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
> >
> >         } while (true);
> >
> > -       trace_fast_page_fault(vcpu, gpa, error_code, iterator.sptep, spte, ret);
> > +       trace_fast_page_fault(vcpu, gpa, error_code, sptep, spte, ret);
> >         walk_shadow_page_lockless_end(vcpu);
> >
> >         return ret;
> > @@ -3838,11 +3865,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> >         if (page_fault_handle_page_track(vcpu, error_code, gfn))
> >                 return RET_PF_EMULATE;
> >
> > -       if (!is_tdp_mmu_fault) {
> > -               r = fast_page_fault(vcpu, gpa, error_code);
> > -               if (r != RET_PF_INVALID)
> > -                       return r;
> > -       }
> > +       r = fast_page_fault(vcpu, gpa, error_code);
> > +       if (r != RET_PF_INVALID)
> > +               return r;
> >
> >         r = mmu_topup_memory_caches(vcpu, false);
> >         if (r)
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index c6fa8d00bf9f..2c9e0ed71fa0 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -527,6 +527,10 @@ static inline bool tdp_mmu_set_spte_atomic_no_dirty_log(struct kvm *kvm,
> >         if (is_removed_spte(iter->old_spte))
> >                 return false;
> >
> > +       /*
> > +        * TDP MMU sptes can also be concurrently cmpxchg'd in
> > +        * fast_pf_fix_direct_spte as part of fast_page_fault.
> > +        */
> >         if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
> >                       new_spte) != iter->old_spte)
> >                 return false;
> 
> I'm a little nervous about not going through the handle_changed_spte
> flow for the TDP MMU, but as things are now, I think it's safe.
> 
> > @@ -1546,3 +1550,35 @@ int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> >
> >         return leaf;
> >  }
> > +
> > +/*
> > + * Must be called between kvm_tdp_mmu_walk_shadow_page_lockless_{begin,end}.
> > + *
> > + * The returned sptep must not be used after
> > + * kvm_tdp_mmu_walk_shadow_page_lockless_end.
> > + */
> > +u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
> > +                                        u64 *spte)
> > +{
> > +       struct tdp_iter iter;
> > +       struct kvm_mmu *mmu = vcpu->arch.mmu;
> > +       gfn_t gfn = addr >> PAGE_SHIFT;
> > +       tdp_ptep_t sptep = NULL;
> > +
> > +       tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> > +               *spte = iter.old_spte;
> > +               sptep = iter.sptep;
> > +       }
> > +
> > +       if (sptep)
> > +               /*
> > +                * Perform the rcu dereference here since we are passing the
> > +                * sptep up to the generic MMU code which does not know the
> > +                * synchronization details of the TDP MMU. This is safe as long
> > +                * as the caller obeys the contract that the sptep is not used
> > +                * after kvm_tdp_mmu_walk_shadow_page_lockless_end.
> > +                */
> 
> There's a little more to this contract:
> 1. The caller should only modify the SPTE using an atomic cmpxchg with
> the returned spte value.
> 2. The caller should not modify the mapped PFN or present <-> not
> present state of the SPTE.
> 3. There are other bits the caller can't modify too. (lpage, mt, etc.)
> 
> If the comments on this function don't document all the constraints on
> how the returned sptep can be used, it might be safer to specify that
> this is only meant to be used as part of the fast page fault handler.

I think documenting that this is only be meant to used as part of the
fast page fault handler is a simpler and less brittle approach. I can
also change the function names so there is no ambiguity that it is meant
for fast page fault handling. For example:
kvm_tdp_mmu_fast_pf_get_last_sptep().

> 
> > +               return rcu_dereference(sptep);
> > +
> > +       return NULL;
> > +}
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index e9dde5f9c0ef..508a23bdf7da 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -81,6 +81,8 @@ void kvm_tdp_mmu_walk_lockless_begin(void);
> >  void kvm_tdp_mmu_walk_lockless_end(void);
> >  int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> >                                   int *root_level);
> > +u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
> > +                                        u64 *spte);
> >
> >  #ifdef CONFIG_X86_64
> >  bool kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> > --
> > 2.32.0.93.g670b81a890-goog
> >

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/6] KVM: x86/mmu: Fix use of enums in trace_fast_page_fault
  2021-07-12 18:11     ` David Matlack
@ 2021-07-12 19:53       ` Sean Christopherson
  2021-07-12 20:38         ` David Matlack
  0 siblings, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2021-07-12 19:53 UTC (permalink / raw)
  To: David Matlack
  Cc: Ben Gardon, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Paolo Bonzini, Junaid Shahid, Andrew Jones,
	Matthew Wilcox, Yu Zhao, David Hildenbrand, Andrew Morton

On Mon, Jul 12, 2021, David Matlack wrote:
> On Mon, Jul 12, 2021 at 09:14:01AM -0700, Ben Gardon wrote:
> > On Wed, Jun 30, 2021 at 2:48 PM David Matlack <dmatlack@google.com> wrote:
> > >
> > > Enum values have to be exported to userspace since the formatting is not
> > > done in the kernel. Without doing this perf maps RET_PF_FIXED and
> > > RET_PF_SPURIOUS to 0, which results in incorrect output:

Oof, that's brutal.

> > >   $ perf record -a -e kvmmmu:fast_page_fault --filter "ret==3" -- ./access_tracking_perf_test
> > >   $ perf script | head -1
> > >    [...] new 610006048d25877 spurious 0 fixed 0  <------ should be 1
> > >
> > > Fix this by exporting the enum values to userspace with TRACE_DEFINE_ENUM.
> > >
> > > Fixes: c4371c2a682e ("KVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed")
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmutrace.h | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > > index efbad33a0645..55c7e0fcda52 100644
> > > --- a/arch/x86/kvm/mmu/mmutrace.h
> > > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > > @@ -244,6 +244,9 @@ TRACE_EVENT(
> > >                   __entry->access)
> > >  );
> > >
> > > +TRACE_DEFINE_ENUM(RET_PF_FIXED);
> > > +TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> > > +
> > 
> > If you're planning to send out a v3 anyway, it might be worth adding
> > all the PF return code enums:
> > 
> > enum {
> > RET_PF_RETRY = 0,
> > RET_PF_EMULATE,
> > RET_PF_INVALID,
> > RET_PF_FIXED,
> > RET_PF_SPURIOUS,
> > };
> > 
> > Just so that no one has to worry about this in the future.

Until someone adds a new enum :-/

> Will do in v3. Thanks.

What about converting the enums to #defines, with a blurb in the comment
explaining that the values are arbitrary but aren't enums purely to avoid this
tracepoint issue?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/6] KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault
  2021-06-30 21:47 ` [PATCH v2 1/6] KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault David Matlack
@ 2021-07-12 20:01   ` Sean Christopherson
  0 siblings, 0 replies; 33+ messages in thread
From: Sean Christopherson @ 2021-07-12 20:01 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Paolo Bonzini, Junaid Shahid, Andrew Jones,
	Matthew Wilcox, Yu Zhao, David Hildenbrand, Andrew Morton

On Wed, Jun 30, 2021, David Matlack wrote:
> fast_page_fault is only called from direct_page_fault where we know the
> address is a gpa.
> 
> Fixes: 736c291c9f36 ("KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM")
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 3/6] KVM: x86/mmu: Make walk_shadow_page_lockless_{begin,end} interoperate with the TDP MMU
  2021-07-12 18:11     ` David Matlack
@ 2021-07-12 20:20       ` Sean Christopherson
  0 siblings, 0 replies; 33+ messages in thread
From: Sean Christopherson @ 2021-07-12 20:20 UTC (permalink / raw)
  To: David Matlack
  Cc: Ben Gardon, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Paolo Bonzini, Junaid Shahid, Andrew Jones,
	Matthew Wilcox, Yu Zhao, David Hildenbrand, Andrew Morton

On Mon, Jul 12, 2021, David Matlack wrote:
> On Mon, Jul 12, 2021 at 10:02:31AM -0700, Ben Gardon wrote:
> > On Wed, Jun 30, 2021 at 2:48 PM David Matlack <dmatlack@google.com> wrote:
> > >
> > > Acquire the RCU read lock in walk_shadow_page_lockless_begin and release
> > > it in walk_shadow_page_lockless_end when the TDP MMU is enabled.  This
> > > should not introduce any functional changes but is used in the following
> > > commit to make fast_page_fault interoperate with the TDP MMU.
> > >
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > 
> > Reviewed-by: Ben Gardon <bgardon@google.com>
> > 
> > This as I understand this, we're just lifting the rcu_lock/unlock out
> > of kvm_tdp_mmu_get_walk, and then putting all the TDP MMU specific
> > stuff down a level under walk_shadow_page_lockless_begin/end and
> > get_walk.
> > 
> > Instead of moving kvm_tdp_mmu_get_walk_lockless into get_walk, it
> > could also be open-coded as:
> > 
> > walk_shadow_page_lockless_begin
> >  if (is_tdp_mmu(vcpu->arch.mmu))
> >                leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root);
> >  else
> >                leaf = get_walk(vcpu, addr, sptes, &root);
> > walk_shadow_page_lockless_end
> > 
> > in get_mmio_spte, since get_walk isn't used anywhere else. Then
> > walk_shadow_page_lockless_begin/end could also be moved up out of
> > get_walk instead of having to add a goto to that function.
> > I don't have a strong preference either way, but the above feels like
> > a slightly simpler refactor.
> 
> I don't have a strong preference either way as well. I'd be happy to
> switch to your suggestion in v3.

I vote for Ben's suggestion.  As is, I think it would be too easy to overlook the
TDP MMU path in get_walk() and focus only on the for-loop.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 3/6] KVM: x86/mmu: Make walk_shadow_page_lockless_{begin,end} interoperate with the TDP MMU
  2021-06-30 21:47 ` [PATCH v2 3/6] KVM: x86/mmu: Make walk_shadow_page_lockless_{begin,end} interoperate with the TDP MMU David Matlack
  2021-07-12 17:02   ` Ben Gardon
@ 2021-07-12 20:23   ` Sean Christopherson
  1 sibling, 0 replies; 33+ messages in thread
From: Sean Christopherson @ 2021-07-12 20:23 UTC (permalink / raw)
  To: David Matlack
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Paolo Bonzini, Junaid Shahid, Andrew Jones,
	Matthew Wilcox, Yu Zhao, David Hildenbrand, Andrew Morton

On Wed, Jun 30, 2021, David Matlack wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index caac4ddb46df..c6fa8d00bf9f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1513,12 +1513,24 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>  	return spte_set;
>  }
>  
> +void kvm_tdp_mmu_walk_lockless_begin(void)
> +{
> +	rcu_read_lock();
> +}
> +
> +void kvm_tdp_mmu_walk_lockless_end(void)
> +{
> +	rcu_read_unlock();
> +}

I vote to make these static inlines.  They're nops in a non-debug build, and it's
not like it's a secret that the TDP MMU relies on RCU to protect its page tables.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/6] KVM: x86/mmu: Fix use of enums in trace_fast_page_fault
  2021-07-12 19:53       ` Sean Christopherson
@ 2021-07-12 20:38         ` David Matlack
  0 siblings, 0 replies; 33+ messages in thread
From: David Matlack @ 2021-07-12 20:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ben Gardon, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Paolo Bonzini, Junaid Shahid, Andrew Jones,
	Matthew Wilcox, Yu Zhao, David Hildenbrand, Andrew Morton

On Mon, Jul 12, 2021 at 07:53:55PM +0000, Sean Christopherson wrote:
> On Mon, Jul 12, 2021, David Matlack wrote:
> > On Mon, Jul 12, 2021 at 09:14:01AM -0700, Ben Gardon wrote:
> > > On Wed, Jun 30, 2021 at 2:48 PM David Matlack <dmatlack@google.com> wrote:
> > > >
> > > > Enum values have to be exported to userspace since the formatting is not
> > > > done in the kernel. Without doing this perf maps RET_PF_FIXED and
> > > > RET_PF_SPURIOUS to 0, which results in incorrect output:
> 
> Oof, that's brutal.
> 
> > > >   $ perf record -a -e kvmmmu:fast_page_fault --filter "ret==3" -- ./access_tracking_perf_test
> > > >   $ perf script | head -1
> > > >    [...] new 610006048d25877 spurious 0 fixed 0  <------ should be 1
> > > >
> > > > Fix this by exporting the enum values to userspace with TRACE_DEFINE_ENUM.
> > > >
> > > > Fixes: c4371c2a682e ("KVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed")
> > > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > > ---
> > > >  arch/x86/kvm/mmu/mmutrace.h | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > > > index efbad33a0645..55c7e0fcda52 100644
> > > > --- a/arch/x86/kvm/mmu/mmutrace.h
> > > > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > > > @@ -244,6 +244,9 @@ TRACE_EVENT(
> > > >                   __entry->access)
> > > >  );
> > > >
> > > > +TRACE_DEFINE_ENUM(RET_PF_FIXED);
> > > > +TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> > > > +
> > > 
> > > If you're planning to send out a v3 anyway, it might be worth adding
> > > all the PF return code enums:
> > > 
> > > enum {
> > > RET_PF_RETRY = 0,
> > > RET_PF_EMULATE,
> > > RET_PF_INVALID,
> > > RET_PF_FIXED,
> > > RET_PF_SPURIOUS,
> > > };
> > > 
> > > Just so that no one has to worry about this in the future.
> 
> Until someone adds a new enum :-/
> 
> > Will do in v3. Thanks.
> 
> What about converting the enums to #defines, with a blurb in the comment
> explaining that the values are arbitrary but aren't enums purely to avoid this
> tracepoint issue?

That will make it possible for someone to accidentally introduce a new
RET_PF symbol with a duplicate value which will result in incorrect
behavior. I am leaning towards keeping it as an enum but adding a
comment that any new enums should be reexported in mmutrace.h.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for the TDP MMU
  2021-07-12 17:49   ` Ben Gardon
  2021-07-12 18:20     ` David Matlack
@ 2021-07-12 21:03     ` Sean Christopherson
  2021-07-12 21:24       ` David Matlack
  1 sibling, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2021-07-12 21:03 UTC (permalink / raw)
  To: Ben Gardon
  Cc: David Matlack, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Paolo Bonzini, Junaid Shahid, Andrew Jones,
	Matthew Wilcox, Yu Zhao, David Hildenbrand, Andrew Morton

On Mon, Jul 12, 2021, Ben Gardon wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index c6fa8d00bf9f..2c9e0ed71fa0 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -527,6 +527,10 @@ static inline bool tdp_mmu_set_spte_atomic_no_dirty_log(struct kvm *kvm,
> >         if (is_removed_spte(iter->old_spte))
> >                 return false;
> >
> > +       /*
> > +        * TDP MMU sptes can also be concurrently cmpxchg'd in
> > +        * fast_pf_fix_direct_spte as part of fast_page_fault.
> > +        */

The cmpxchg64 part isn't what's interesting, it's just the means to the end.
Maybe reword slightly to focus on modifying SPTEs without holding mmu_lock, e.g.

	/*
	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs outside
	 * of mmu_lock.
	 */

> >         if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
> >                       new_spte) != iter->old_spte)
> >                 return false;
> 
> I'm a little nervous about not going through the handle_changed_spte
> flow for the TDP MMU, but as things are now, I think it's safe.

Ya, it would be nice to flow through the TDP MMU proper as we could also "restore"
__rcu.  That said, the fast #PF fix flow is unique and specific enough that I don't
think it's worth going out of our way to force the issue.

> > @@ -1546,3 +1550,35 @@ int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> >
> >         return leaf;
> >  }
> > +
> > +/*
> > + * Must be called between kvm_tdp_mmu_walk_shadow_page_lockless_{begin,end}.
> > + *
> > + * The returned sptep must not be used after
> > + * kvm_tdp_mmu_walk_shadow_page_lockless_end.
> > + */
> > +u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
> > +                                        u64 *spte)
> > +{
> > +       struct tdp_iter iter;
> > +       struct kvm_mmu *mmu = vcpu->arch.mmu;
> > +       gfn_t gfn = addr >> PAGE_SHIFT;
> > +       tdp_ptep_t sptep = NULL;
> > +
> > +       tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> > +               *spte = iter.old_spte;
> > +               sptep = iter.sptep;
> > +       }
> > +
> > +       if (sptep)

This check is unnecessary, even when using rcu_dereference.

> > +               /*
> > +                * Perform the rcu dereference here since we are passing the
> > +                * sptep up to the generic MMU code which does not know the
> > +                * synchronization details of the TDP MMU. This is safe as long
> > +                * as the caller obeys the contract that the sptep is not used
> > +                * after kvm_tdp_mmu_walk_shadow_page_lockless_end.
> > +                */
> 
> There's a little more to this contract:
> 1. The caller should only modify the SPTE using an atomic cmpxchg with
> the returned spte value.
> 2. The caller should not modify the mapped PFN or present <-> not
> present state of the SPTE.
> 3. There are other bits the caller can't modify too. (lpage, mt, etc.)
> 
> If the comments on this function don't document all the constraints on
> how the returned sptep can be used, it might be safer to specify that
> this is only meant to be used as part of the fast page fault handler.

Or maybe a less specific, but more scary comment?

> 
> > +               return rcu_dereference(sptep);

I still vote to use "(__force u64 *)" instead of rcu_dereference() to make it
clear we're cheating in order to share code with the legacy MMU.

	/*
	 * Squash the __rcu annotation, the legacy MMU doesn't rely on RCU to
	 * protect its page tables and so the common MMU code doesn't preserve
	 * the annotation.
	 *
	 * It goes without saying, but the caller must honor all TDP MMU
	 * contracts for accessing/modifying SPTEs outside of mmu_lock.
	 */
	return (__force u64 *)sptep;
	
> > +       return NULL;
> > +}
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index e9dde5f9c0ef..508a23bdf7da 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -81,6 +81,8 @@ void kvm_tdp_mmu_walk_lockless_begin(void);
> >  void kvm_tdp_mmu_walk_lockless_end(void);
> >  int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> >                                   int *root_level);
> > +u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
> > +                                        u64 *spte);
> >
> >  #ifdef CONFIG_X86_64
> >  bool kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> > --
> > 2.32.0.93.g670b81a890-goog
> >

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for the TDP MMU
  2021-07-12 21:03     ` Sean Christopherson
@ 2021-07-12 21:24       ` David Matlack
  0 siblings, 0 replies; 33+ messages in thread
From: David Matlack @ 2021-07-12 21:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ben Gardon, kvm, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Paolo Bonzini, Junaid Shahid, Andrew Jones,
	Matthew Wilcox, Yu Zhao, David Hildenbrand, Andrew Morton

On Mon, Jul 12, 2021 at 09:03:11PM +0000, Sean Christopherson wrote:
> On Mon, Jul 12, 2021, Ben Gardon wrote:
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index c6fa8d00bf9f..2c9e0ed71fa0 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -527,6 +527,10 @@ static inline bool tdp_mmu_set_spte_atomic_no_dirty_log(struct kvm *kvm,
> > >         if (is_removed_spte(iter->old_spte))
> > >                 return false;
> > >
> > > +       /*
> > > +        * TDP MMU sptes can also be concurrently cmpxchg'd in
> > > +        * fast_pf_fix_direct_spte as part of fast_page_fault.
> > > +        */
> 
> The cmpxchg64 part isn't what's interesting, it's just the means to the end.
> Maybe reword slightly to focus on modifying SPTEs without holding mmu_lock, e.g.
> 
> 	/*
> 	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs outside
> 	 * of mmu_lock.
> 	 */

Good point about cmpxchg. I'll use your comment in v3.

> 
> > >         if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
> > >                       new_spte) != iter->old_spte)
> > >                 return false;
> > 
> > I'm a little nervous about not going through the handle_changed_spte
> > flow for the TDP MMU, but as things are now, I think it's safe.
> 
> Ya, it would be nice to flow through the TDP MMU proper as we could also "restore"
> __rcu.  That said, the fast #PF fix flow is unique and specific enough that I don't
> think it's worth going out of our way to force the issue.
> 
> > > @@ -1546,3 +1550,35 @@ int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> > >
> > >         return leaf;
> > >  }
> > > +
> > > +/*
> > > + * Must be called between kvm_tdp_mmu_walk_shadow_page_lockless_{begin,end}.
> > > + *
> > > + * The returned sptep must not be used after
> > > + * kvm_tdp_mmu_walk_shadow_page_lockless_end.
> > > + */
> > > +u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
> > > +                                        u64 *spte)
> > > +{
> > > +       struct tdp_iter iter;
> > > +       struct kvm_mmu *mmu = vcpu->arch.mmu;
> > > +       gfn_t gfn = addr >> PAGE_SHIFT;
> > > +       tdp_ptep_t sptep = NULL;
> > > +
> > > +       tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> > > +               *spte = iter.old_spte;
> > > +               sptep = iter.sptep;
> > > +       }
> > > +
> > > +       if (sptep)
> 
> This check is unnecessary, even when using rcu_dereference.

Ack. Will fix.

> 
> > > +               /*
> > > +                * Perform the rcu dereference here since we are passing the
> > > +                * sptep up to the generic MMU code which does not know the
> > > +                * synchronization details of the TDP MMU. This is safe as long
> > > +                * as the caller obeys the contract that the sptep is not used
> > > +                * after kvm_tdp_mmu_walk_shadow_page_lockless_end.
> > > +                */
> > 
> > There's a little more to this contract:
> > 1. The caller should only modify the SPTE using an atomic cmpxchg with
> > the returned spte value.
> > 2. The caller should not modify the mapped PFN or present <-> not
> > present state of the SPTE.
> > 3. There are other bits the caller can't modify too. (lpage, mt, etc.)
> > 
> > If the comments on this function don't document all the constraints on
> > how the returned sptep can be used, it might be safer to specify that
> > this is only meant to be used as part of the fast page fault handler.
> 
> Or maybe a less specific, but more scary comment?
> 
> > 
> > > +               return rcu_dereference(sptep);
> 
> I still vote to use "(__force u64 *)" instead of rcu_dereference() to make it
> clear we're cheating in order to share code with the legacy MMU.

Some downsides I see of using __force is:

 - The implementation of rcu_dereference() is non-trivial. I'm not sure
   how much of it we have to re-implement here. For example, should we
   us READ_ONCE() in addition to the type cast?

 - rcu_dereference() checks if the rcu read lock is held and also calls
   rcu_check_sparse, which seem like useful debugging checks we'd miss
   out on.

I think a big comment should be sufficient to draw the readers eyes and
explain [the extent to which :)] we are cheating.

> 
> 	/*
> 	 * Squash the __rcu annotation, the legacy MMU doesn't rely on RCU to
> 	 * protect its page tables and so the common MMU code doesn't preserve
> 	 * the annotation.
> 	 *
> 	 * It goes without saying, but the caller must honor all TDP MMU
> 	 * contracts for accessing/modifying SPTEs outside of mmu_lock.
> 	 */
> 	return (__force u64 *)sptep;
> 	
> > > +       return NULL;
> > > +}
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > > index e9dde5f9c0ef..508a23bdf7da 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > > @@ -81,6 +81,8 @@ void kvm_tdp_mmu_walk_lockless_begin(void);
> > >  void kvm_tdp_mmu_walk_lockless_end(void);
> > >  int kvm_tdp_mmu_get_walk_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> > >                                   int *root_level);
> > > +u64 *kvm_tdp_mmu_get_last_sptep_lockless(struct kvm_vcpu *vcpu, u64 addr,
> > > +                                        u64 *spte);
> > >
> > >  #ifdef CONFIG_X86_64
> > >  bool kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> > > --
> > > 2.32.0.93.g670b81a890-goog
> > >

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2021-07-12 21:24 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-30 21:47 [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU David Matlack
2021-06-30 21:47 ` [PATCH v2 1/6] KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault David Matlack
2021-07-12 20:01   ` Sean Christopherson
2021-06-30 21:47 ` [PATCH v2 2/6] KVM: x86/mmu: Fix use of enums in trace_fast_page_fault David Matlack
2021-07-12 16:14   ` Ben Gardon
2021-07-12 18:11     ` David Matlack
2021-07-12 19:53       ` Sean Christopherson
2021-07-12 20:38         ` David Matlack
2021-06-30 21:47 ` [PATCH v2 3/6] KVM: x86/mmu: Make walk_shadow_page_lockless_{begin,end} interoperate with the TDP MMU David Matlack
2021-07-12 17:02   ` Ben Gardon
2021-07-12 18:11     ` David Matlack
2021-07-12 20:20       ` Sean Christopherson
2021-07-12 20:23   ` Sean Christopherson
2021-06-30 21:48 ` [PATCH v2 4/6] KVM: x86/mmu: fast_page_fault support for " David Matlack
2021-07-01  2:54   ` kernel test robot
2021-07-01  2:54     ` kernel test robot
2021-07-01  4:27   ` kernel test robot
2021-07-01  4:27     ` kernel test robot
2021-07-01 18:27     ` David Matlack
2021-07-01 18:27       ` David Matlack
2021-07-09 18:45   ` David Matlack
2021-07-12 17:49   ` Ben Gardon
2021-07-12 18:20     ` David Matlack
2021-07-12 21:03     ` Sean Christopherson
2021-07-12 21:24       ` David Matlack
2021-06-30 21:48 ` [PATCH v2 5/6] KVM: selftests: Fix missing break in dirty_log_perf_test arg parsing David Matlack
2021-06-30 21:48 ` [PATCH v2 6/6] KVM: selftests: Introduce access_tracking_perf_test David Matlack
2021-07-01  1:15 ` [PATCH v2 0/6] KVM: x86/mmu: Fast page fault support for the TDP MMU Matthew Wilcox
     [not found]   ` <CABgObfZUFWCAvKoxDzGjmksFnwZgbnpX9GuC+nhiVLa-Fhwj6A@mail.gmail.com>
2021-07-01 12:08     ` Matthew Wilcox
2021-07-01 16:50       ` David Matlack
2021-07-01 17:00 ` David Hildenbrand
2021-07-01 22:11   ` David Matlack
2021-07-02  7:53     ` David Hildenbrand

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.