linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU
@ 2021-02-02 18:57 Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 01/28] KVM: x86/mmu: change TDP MMU yield function returns to match cond_resched Ben Gardon
                   ` (28 more replies)
  0 siblings, 29 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

The TDP MMU was implemented to simplify and improve the performance of
KVM's memory management on modern hardware with TDP (EPT / NPT). To build
on the existing performance improvements of the TDP MMU, add the ability
to handle vCPU page faults, enable and disable dirty logging, and remove
mappings, all in parallel. In the current implementation,
vCPU page faults (actually EPT/NPT violations/misconfigurations) are the
largest source of MMU lock contention on VMs with many vCPUs. This
contention, and the resulting page fault latency, can soft-lock guests
and degrade performance. Handling page faults in parallel is especially
useful when booting VMs, enabling dirty logging, and handling demand
paging. In all these cases vCPUs are constantly incurring page faults on
each new page accessed.

Broadly, the following changes were required to allow parallel page
faults (and other MMU operations):
-- Contention detection and yielding added to rwlocks to bring them up to
   feature parity with spin locks, at least as far as the use of the MMU
   lock is concerned.
-- TDP MMU page table memory is protected with RCU and freed in RCU
   callbacks to allow multiple threads to operate on that memory
   concurrently.
-- The MMU lock was changed to an rwlock on x86. This allows the page
   fault handlers to acquire the MMU lock in read mode and handle page
   faults in parallel, and other operations to maintain exclusive use of
   the lock by acquiring it in write mode.
-- An additional lock is added to protect some data structures needed by
   the page fault handlers, for relatively infrequent operations.
-- The page fault handler is modified to use atomic cmpxchgs to set SPTEs
   and some page fault handler operations are modified slightly to work
   concurrently with other threads (see the sketch after this list).
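
As an illustration of the last item, here is a minimal sketch of the
cmpxchg-based SPTE update pattern (not the series' actual code; the
function name below is hypothetical):

static bool tdp_mmu_set_spte_atomic_sketch(u64 *sptep, u64 old_spte,
					   u64 new_spte)
{
	/*
	 * Install new_spte only if the SPTE still holds the value this
	 * thread last read. A concurrent writer makes the cmpxchg fail,
	 * in which case the caller must re-read the SPTE and retry.
	 */
	return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
}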

This series also contains a few bug fixes and optimizations related to
the above, but not strictly part of enabling parallel page fault handling.

Correctness testing:
The following tests were performed with an SMP kernel and DBX kernel on an
Intel Skylake machine. The tests were run both with and without the TDP
MMU enabled.
-- This series introduces no new failures in kvm-unit-tests
SMP + no TDP MMU: no new failures
SMP + TDP MMU: no new failures
DBX + no TDP MMU: no new failures
DBX + TDP MMU: no new failures
-- All KVM selftests behave as expected
SMP + no TDP MMU: all pass except ./x86_64/vmx_preemption_timer_test
SMP + TDP MMU: all pass except ./x86_64/vmx_preemption_timer_test
(./x86_64/vmx_preemption_timer_test also fails without this patch set,
both with the TDP MMU on and off.)
DBX + no TDP MMU: all pass
DBX + TDP MMU: all pass
-- A VM can be booted running Debian 9 and all of its memory accessed
SMP + no TDP MMU: works
SMP + TDP MMU: works
DBX + no TDP MMU: works
DBX + TDP MMU: works

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/7172

Changelog v1 -> v2:
- Removed the MMU lock union and the use of a spinlock when the TDP MMU is disabled
- Merged RCU commits
- Extended additional MMU operations to operate in parallel
- Amended dirty log perf test to cover newly parallelized code paths
- Misc refactorings (see changelogs for individual commits)
- Big thanks to Sean and Paolo for their thorough review of v1

Ben Gardon (28):
  KVM: x86/mmu: change TDP MMU yield function returns to match
    cond_resched
  KVM: x86/mmu: Add comment on __tdp_mmu_set_spte
  KVM: x86/mmu: Add lockdep when setting a TDP MMU SPTE
  KVM: x86/mmu: Don't redundantly clear TDP MMU pt memory
  KVM: x86/mmu: Factor out handling of removed page tables
  locking/rwlocks: Add contention detection for rwlocks
  sched: Add needbreak for rwlocks
  sched: Add cond_resched_rwlock
  KVM: x86/mmu: Fix braces in kvm_recover_nx_lpages
  KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEs
  KVM: x86/mmu: Merge flush and non-flush tdp_mmu_iter_cond_resched
  KVM: x86/mmu: Rename goal_gfn to next_last_level_gfn
  KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter
  KVM: x86/mmu: Yield in TDP MMU iter even if no SPTEs changed
  KVM: x86/mmu: Skip no-op changes in TDP MMU functions
  KVM: x86/mmu: Clear dirtied pages mask bit before early break
  KVM: x86/mmu: Protect TDP MMU page table memory with RCU
  KVM: x86/mmu: Use an rwlock for the x86 MMU
  KVM: x86/mmu: Factor out functions to add/remove TDP MMU pages
  KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
  KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler
  KVM: x86/mmu: Mark SPTEs in disconnected pages as removed
  KVM: x86/mmu: Allow parallel page faults for the TDP MMU
  KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock
  KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU read lock
  KVM: x86/mmu: Allow enabling / disabling dirty logging under MMU read
    lock
  KVM: selftests: Add backing src parameter to dirty_log_perf_test
  KVM: selftests: Disable dirty logging with vCPUs running

 arch/x86/include/asm/kvm_host.h               |  15 +
 arch/x86/kvm/mmu/mmu.c                        | 120 +--
 arch/x86/kvm/mmu/mmu_internal.h               |   9 +-
 arch/x86/kvm/mmu/page_track.c                 |   8 +-
 arch/x86/kvm/mmu/paging_tmpl.h                |   8 +-
 arch/x86/kvm/mmu/spte.h                       |  21 +-
 arch/x86/kvm/mmu/tdp_iter.c                   |  46 +-
 arch/x86/kvm/mmu/tdp_iter.h                   |  21 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    | 741 ++++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h                    |   5 +-
 arch/x86/kvm/x86.c                            |   4 +-
 include/asm-generic/qrwlock.h                 |  24 +-
 include/linux/kvm_host.h                      |   5 +
 include/linux/rwlock.h                        |   7 +
 include/linux/sched.h                         |  29 +
 kernel/sched/core.c                           |  40 +
 .../selftests/kvm/demand_paging_test.c        |   3 +-
 .../selftests/kvm/dirty_log_perf_test.c       |  25 +-
 .../testing/selftests/kvm/include/kvm_util.h  |   6 -
 .../selftests/kvm/include/perf_test_util.h    |   3 +-
 .../testing/selftests/kvm/include/test_util.h |  14 +
 .../selftests/kvm/lib/perf_test_util.c        |   6 +-
 tools/testing/selftests/kvm/lib/test_util.c   |  29 +
 virt/kvm/dirty_ring.c                         |  10 +
 virt/kvm/kvm_main.c                           |  46 +-
 25 files changed, 963 insertions(+), 282 deletions(-)

-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 01/28] KVM: x86/mmu: change TDP MMU yield function returns to match cond_resched
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 02/28] KVM: x86/mmu: Add comment on __tdp_mmu_set_spte Ben Gardon
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Currently the TDP MMU yield / cond_resched functions either return
nothing or return true if the TLBs were not flushed. These are confusing
semantics, especially when making control flow decisions in calling
functions.

To clean things up, change both functions to have the same
return value semantics as cond_resched: true if the thread yielded,
false if it did not. If the _flush_ version of the function yielded,
then the TLBs will have been flushed.
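
As a sketch of the caller pattern these semantics enable (illustrative,
not code from this patch; later patches in this series adopt this shape):

tdp_root_for_each_pte(iter, root, start, end) {
	/* Restart this iteration of the loop if the helper yielded. */
	if (tdp_mmu_iter_cond_resched(kvm, &iter))
		continue;

	/* ... process the SPTE ... */
}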

Reviewed-by: Peter Feiner <pfeiner@google.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 39 ++++++++++++++++++++++++++++----------
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2ef8615f9dba..e9f9ff81a38e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -413,8 +413,15 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 			 _mmu->shadow_root_level, _start, _end)
 
 /*
- * Flush the TLB if the process should drop kvm->mmu_lock.
- * Return whether the caller still needs to flush the tlb.
+ * Flush the TLB and yield if the MMU lock is contended or this thread needs to
+ * return control to the scheduler.
+ *
+ * If this function yields, it will also reset the tdp_iter's walk over the
+ * paging structure and the calling function should allow the iterator to
+ * continue its traversal from the paging structure root.
+ *
+ * Return true if this function yielded, the TLBs were flushed, and the
+ * iterator's traversal was reset. Return false if a yield was not needed.
  */
 static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
 {
@@ -422,18 +429,32 @@ static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *it
 		kvm_flush_remote_tlbs(kvm);
 		cond_resched_lock(&kvm->mmu_lock);
 		tdp_iter_refresh_walk(iter);
-		return false;
-	} else {
 		return true;
 	}
+
+	return false;
 }
 
-static void tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
+/*
+ * Yield if the MMU lock is contended or this thread needs to return control
+ * to the scheduler.
+ *
+ * If this function yields, it will also reset the tdp_iter's walk over the
+ * paging structure and the calling function should allow the iterator to
+ * continue its traversal from the paging structure root.
+ *
+ * Return true if this function yielded and the iterator's traversal was reset.
+ * Return false if a yield was not needed.
+ */
+static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
 {
 	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
 		cond_resched_lock(&kvm->mmu_lock);
 		tdp_iter_refresh_walk(iter);
+		return true;
 	}
+
+	return false;
 }
 
 /*
@@ -469,10 +490,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		tdp_mmu_set_spte(kvm, &iter, 0);
 
-		if (can_yield)
-			flush_needed = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
-		else
-			flush_needed = true;
+		flush_needed = !can_yield ||
+			       !tdp_mmu_iter_flush_cond_resched(kvm, &iter);
 	}
 	return flush_needed;
 }
@@ -1072,7 +1091,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 
 		tdp_mmu_set_spte(kvm, &iter, 0);
 
-		spte_set = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
+		spte_set = !tdp_mmu_iter_flush_cond_resched(kvm, &iter);
 	}
 
 	if (spte_set)
-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 02/28] KVM: x86/mmu: Add comment on __tdp_mmu_set_spte
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 01/28] KVM: x86/mmu: change TDP MMU yield function returns to match cond_resched Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 03/28] KVM: x86/mmu: Add lockdep when setting a TDP MMU SPTE Ben Gardon
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

__tdp_mmu_set_spte is a very important function in the TDP MMU which
already accepts several arguments and will take more in future commits.
To offset this complexity, add a comment to the function describing each
of its arguments.

No functional change intended.

Reviewed-by: Peter Feiner <pfeiner@google.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e9f9ff81a38e..3d8cca238eba 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -357,6 +357,22 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 				      new_spte, level);
 }
 
+/*
+ * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
+ * @kvm: kvm instance
+ * @iter: a tdp_iter instance currently on the SPTE that should be set
+ * @new_spte: The value the SPTE should be set to
+ * @record_acc_track: Notify the MM subsystem of changes to the accessed state
+ *		      of the page. Should be set unless handling an MMU
+ *		      notifier for access tracking. Leaving record_acc_track
+ *		      unset in that case prevents page accesses from being
+ *		      double counted.
+ * @record_dirty_log: Record the page as dirty in the dirty bitmap if
+ *		      appropriate for the change being made. Should be set
+ *		      unless performing certain dirty logging operations.
+ *		      Leaving record_dirty_log unset in that case prevents page
+ *		      writes from being double counted.
+ */
 static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 				      u64 new_spte, bool record_acc_track,
 				      bool record_dirty_log)
-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 03/28] KVM: x86/mmu: Add lockdep when setting a TDP MMU SPTE
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 01/28] KVM: x86/mmu: change TDP MMU yield function returns to match cond_resched Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 02/28] KVM: x86/mmu: Add comment on __tdp_mmu_set_spte Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 04/28] KVM: x86/mmu: Don't redundantly clear TDP MMU pt memory Ben Gardon
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Add lockdep to __tdp_mmu_set_spte to ensure that SPTEs are only modified
under the MMU lock.

No functional change intended.
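
For reference, lockdep_assert_held() produces a one-time WARN splat when
the current context does not hold the given lock (with lock debugging
enabled). A minimal sketch of the pattern, with a hypothetical helper
name:

static void modify_mmu_state_sketch(struct kvm *kvm)
{
	/* Splats under lockdep if the caller forgot to take mmu_lock. */
	lockdep_assert_held(&kvm->mmu_lock);

	/* ... safe to mutate MMU state here ... */
}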

Reviewed-by: Peter Feiner <pfeiner@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3d8cca238eba..b83a6a3ad29c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -381,6 +381,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
 	int as_id = kvm_mmu_page_as_id(root);
 
+	lockdep_assert_held(&kvm->mmu_lock);
+
 	WRITE_ONCE(*iter->sptep, new_spte);
 
 	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 04/28] KVM: x86/mmu: Don't redundantly clear TDP MMU pt memory
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (2 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 03/28] KVM: x86/mmu: Add lockdep when setting a TDP MMU SPTE Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 05/28] KVM: x86/mmu: Factor out handling of removed page tables Ben Gardon
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

The KVM MMU caches already guarantee that shadow page table memory will
be zeroed, so there is no reason to re-zero the page in the TDP MMU page
fault handler.

No functional change intended.

Reviewed-by: Peter Feiner <pfeiner@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b83a6a3ad29c..3828c0e83466 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -655,7 +655,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
 			list_add(&sp->link, &vcpu->kvm->arch.tdp_mmu_pages);
 			child_pt = sp->spt;
-			clear_page(child_pt);
 			new_spte = make_nonleaf_spte(child_pt,
 						     !shadow_accessed_mask);
 
-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 05/28] KVM: x86/mmu: Factor out handling of removed page tables
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (3 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 04/28] KVM: x86/mmu: Don't redundantly clear TDP MMU pt memory Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks Ben Gardon
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Factor out the code to handle a disconnected subtree of the TDP paging
structure from the code to handle the change to an individual SPTE.
Future commits will build on this to allow asynchronous page freeing.

No functional change intended.
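
To sketch the direction those future commits take (an assumption based
on the cover letter, not code from this patch; the rcu_head field and
the callback name are hypothetical here): rather than freeing a detached
page table immediately, the free can be deferred to an RCU callback so
that concurrent lock-free walkers finish first.

static void tdp_mmu_free_sp_rcu_sketch(struct rcu_head *head)
{
	struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
					       rcu_head);

	free_page((unsigned long)sp->spt);
	kmem_cache_free(mmu_page_header_cache, sp);
}

/* The caller would then queue the free instead of performing it:
 * call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_sketch);
 */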

Reviewed-by: Peter Feiner <pfeiner@google.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>

---

v1 -> v2
- Replaced "disconnected" with "removed" updated derivative
  comments and code

 arch/x86/kvm/mmu/tdp_mmu.c | 71 ++++++++++++++++++++++----------------
 1 file changed, 42 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3828c0e83466..c3075fb568eb 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -234,6 +234,45 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
 	}
 }
 
+/**
+ * handle_removed_tdp_mmu_page - handle a pt removed from the TDP structure
+ *
+ * @kvm: kvm instance
+ * @pt: the page removed from the paging structure
+ *
+ * Given a page table that has been removed from the TDP paging structure,
+ * iterates through the page table to clear SPTEs and free child page tables.
+ */
+static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt)
+{
+	struct kvm_mmu_page *sp = sptep_to_sp(pt);
+	int level = sp->role.level;
+	gfn_t gfn = sp->gfn;
+	u64 old_child_spte;
+	int i;
+
+	trace_kvm_mmu_prepare_zap_page(sp);
+
+	list_del(&sp->link);
+
+	if (sp->lpage_disallowed)
+		unaccount_huge_nx_page(kvm, sp);
+
+	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
+		old_child_spte = READ_ONCE(*(pt + i));
+		WRITE_ONCE(*(pt + i), 0);
+		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp),
+			gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
+			old_child_spte, 0, level - 1);
+	}
+
+	kvm_flush_remote_tlbs_with_address(kvm, gfn,
+					   KVM_PAGES_PER_HPAGE(level));
+
+	free_page((unsigned long)pt);
+	kmem_cache_free(mmu_page_header_cache, sp);
+}
+
 /**
  * handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
@@ -254,10 +293,6 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
-	u64 *pt;
-	struct kvm_mmu_page *sp;
-	u64 old_child_spte;
-	int i;
 
 	WARN_ON(level > PT64_ROOT_MAX_LEVEL);
 	WARN_ON(level < PG_LEVEL_4K);
@@ -321,31 +356,9 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	 * Recursively handle child PTs if the change removed a subtree from
 	 * the paging structure.
 	 */
-	if (was_present && !was_leaf && (pfn_changed || !is_present)) {
-		pt = spte_to_child_pt(old_spte, level);
-		sp = sptep_to_sp(pt);
-
-		trace_kvm_mmu_prepare_zap_page(sp);
-
-		list_del(&sp->link);
-
-		if (sp->lpage_disallowed)
-			unaccount_huge_nx_page(kvm, sp);
-
-		for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
-			old_child_spte = READ_ONCE(*(pt + i));
-			WRITE_ONCE(*(pt + i), 0);
-			handle_changed_spte(kvm, as_id,
-				gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
-				old_child_spte, 0, level - 1);
-		}
-
-		kvm_flush_remote_tlbs_with_address(kvm, gfn,
-						   KVM_PAGES_PER_HPAGE(level));
-
-		free_page((unsigned long)pt);
-		kmem_cache_free(mmu_page_header_cache, sp);
-	}
+	if (was_present && !was_leaf && (pfn_changed || !is_present))
+		handle_removed_tdp_mmu_page(kvm,
+				spte_to_child_pt(old_spte, level));
 }
 
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (4 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 05/28] KVM: x86/mmu: Factor out handling of removed page tables Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-09 20:39   ` Guenter Roeck
  2021-02-10  3:32   ` Waiman Long
  2021-02-02 18:57 ` [PATCH v2 07/28] sched: Add needbreak " Ben Gardon
                   ` (22 subsequent siblings)
  28 siblings, 2 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon,
	Ingo Molnar, Will Deacon, Peter Zijlstra, Davidlohr Bueso,
	Waiman Long

rwlocks do not currently have any facility to detect contention
like spinlocks do. In order to allow users of rwlocks to better manage
latency, add contention detection for queued rwlocks.
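
A sketch of the intended usage (the loop and work functions below are
placeholders; rwlock_is_contended() is the interface this patch wires
up): a long-running writer can poll for waiters and briefly drop the
lock.

write_lock(&lock);
while (work_remains()) {
	do_unit_of_work();
	if (rwlock_is_contended(&lock)) {
		write_unlock(&lock);
		cpu_relax();	/* give a waiter a chance to run */
		write_lock(&lock);
	}
}
write_unlock(&lock);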

CC: Ingo Molnar <mingo@redhat.com>
CC: Will Deacon <will@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 include/asm-generic/qrwlock.h | 24 ++++++++++++++++++------
 include/linux/rwlock.h        |  7 +++++++
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 84ce841ce735..0020d3b820a7 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -14,6 +14,7 @@
 #include <asm/processor.h>
 
 #include <asm-generic/qrwlock_types.h>
+#include <asm-generic/qspinlock.h>
 
 /*
  * Writer states & reader shift and bias.
@@ -116,15 +117,26 @@ static inline void queued_write_unlock(struct qrwlock *lock)
 	smp_store_release(&lock->wlocked, 0);
 }
 
+/**
+ * queued_rwlock_is_contended - check if the lock is contended
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static inline int queued_rwlock_is_contended(struct qrwlock *lock)
+{
+	return arch_spin_is_locked(&lock->wait_lock);
+}
+
 /*
  * Remapping rwlock architecture specific functions to the corresponding
  * queue rwlock functions.
  */
-#define arch_read_lock(l)	queued_read_lock(l)
-#define arch_write_lock(l)	queued_write_lock(l)
-#define arch_read_trylock(l)	queued_read_trylock(l)
-#define arch_write_trylock(l)	queued_write_trylock(l)
-#define arch_read_unlock(l)	queued_read_unlock(l)
-#define arch_write_unlock(l)	queued_write_unlock(l)
+#define arch_read_lock(l)		queued_read_lock(l)
+#define arch_write_lock(l)		queued_write_lock(l)
+#define arch_read_trylock(l)		queued_read_trylock(l)
+#define arch_write_trylock(l)		queued_write_trylock(l)
+#define arch_read_unlock(l)		queued_read_unlock(l)
+#define arch_write_unlock(l)		queued_write_unlock(l)
+#define arch_rwlock_is_contended(l)	queued_rwlock_is_contended(l)
 
 #endif /* __ASM_GENERIC_QRWLOCK_H */
diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h
index 3dcd617e65ae..7ce9a51ae5c0 100644
--- a/include/linux/rwlock.h
+++ b/include/linux/rwlock.h
@@ -128,4 +128,11 @@ do {								\
 	1 : ({ local_irq_restore(flags); 0; }); \
 })
 
+#ifdef arch_rwlock_is_contended
+#define rwlock_is_contended(lock) \
+	 arch_rwlock_is_contended(&(lock)->raw_lock)
+#else
+#define rwlock_is_contended(lock)	((void)(lock), 0)
+#endif /* arch_rwlock_is_contended */
+
 #endif /* __LINUX_RWLOCK_H */
-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 07/28] sched: Add needbreak for rwlocks
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (5 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 08/28] sched: Add cond_resched_rwlock Ben Gardon
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon,
	Ingo Molnar, Will Deacon, Peter Zijlstra, Davidlohr Bueso,
	Waiman Long

Contention awareness while holding a spin lock is essential for reducing
latency when long-running kernel operations can hold that lock. Add the
same contention detection interface for read/write spin locks.

CC: Ingo Molnar <mingo@redhat.com>
CC: Will Deacon <will@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 include/linux/sched.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..5d1378e5a040 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1912,6 +1912,23 @@ static inline int spin_needbreak(spinlock_t *lock)
 #endif
 }
 
+/*
+ * Check if a rwlock is contended.
+ * Returns non-zero if there is another task waiting on the rwlock.
+ * Returns zero if the lock is not contended or the system / underlying
+ * rwlock implementation does not support contention detection.
+ * Technically does not depend on CONFIG_PREEMPTION, but a general need
+ * for low latency.
+ */
+static inline int rwlock_needbreak(rwlock_t *lock)
+{
+#ifdef CONFIG_PREEMPTION
+	return rwlock_is_contended(lock);
+#else
+	return 0;
+#endif
+}
+
 static __always_inline bool need_resched(void)
 {
 	return unlikely(tif_need_resched());
-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 08/28] sched: Add cond_resched_rwlock
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (6 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 07/28] sched: Add needbreak " Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 09/28] KVM: x86/mmu: Fix braces in kvm_recover_nx_lpages Ben Gardon
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon,
	Ingo Molnar, Will Deacon, Peter Zijlstra, Davidlohr Bueso,
	Waiman Long

Safely rescheduling while holding a spin lock is essential for keeping
long-running kernel operations running smoothly. Add the facility to
cond_resched rwlocks.
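
A sketch of how a long-running read-side critical section would use the
new interface (more_entries() and process_entry() are placeholders):

read_lock(&lock);
while (more_entries()) {
	process_entry();
	/* Drops the lock, reschedules if needed, then reacquires it. */
	cond_resched_rwlock_read(&lock);
}
read_unlock(&lock);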

CC: Ingo Molnar <mingo@redhat.com>
CC: Will Deacon <will@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 include/linux/sched.h | 12 ++++++++++++
 kernel/sched/core.c   | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5d1378e5a040..3052d16da3cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1883,12 +1883,24 @@ static inline int _cond_resched(void) { return 0; }
 })
 
 extern int __cond_resched_lock(spinlock_t *lock);
+extern int __cond_resched_rwlock_read(rwlock_t *lock);
+extern int __cond_resched_rwlock_write(rwlock_t *lock);
 
 #define cond_resched_lock(lock) ({				\
 	___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\
 	__cond_resched_lock(lock);				\
 })
 
+#define cond_resched_rwlock_read(lock) ({			\
+	__might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);	\
+	__cond_resched_rwlock_read(lock);			\
+})
+
+#define cond_resched_rwlock_write(lock) ({			\
+	__might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);	\
+	__cond_resched_rwlock_write(lock);			\
+})
+
 static inline void cond_resched_rcu(void)
 {
 #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ff74fca39ed2..efed1bf202d1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6709,6 +6709,46 @@ int __cond_resched_lock(spinlock_t *lock)
 }
 EXPORT_SYMBOL(__cond_resched_lock);
 
+int __cond_resched_rwlock_read(rwlock_t *lock)
+{
+	int resched = should_resched(PREEMPT_LOCK_OFFSET);
+	int ret = 0;
+
+	lockdep_assert_held_read(lock);
+
+	if (rwlock_needbreak(lock) || resched) {
+		read_unlock(lock);
+		if (resched)
+			preempt_schedule_common();
+		else
+			cpu_relax();
+		ret = 1;
+		read_lock(lock);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(__cond_resched_rwlock_read);
+
+int __cond_resched_rwlock_write(rwlock_t *lock)
+{
+	int resched = should_resched(PREEMPT_LOCK_OFFSET);
+	int ret = 0;
+
+	lockdep_assert_held_write(lock);
+
+	if (rwlock_needbreak(lock) || resched) {
+		write_unlock(lock);
+		if (resched)
+			preempt_schedule_common();
+		else
+			cpu_relax();
+		ret = 1;
+		write_lock(lock);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(__cond_resched_rwlock_write);
+
 /**
  * yield - yield the current processor to other threads.
  *
-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 09/28] KVM: x86/mmu: Fix braces in kvm_recover_nx_lpages
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (7 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 08/28] sched: Add cond_resched_rwlock Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 10/28] KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEs Ben Gardon
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

No functional change intended.

Fixes: 29cf0f5007a2 ("kvm: x86/mmu: NX largepage recovery for TDP MMU")
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d16481aa29d..60ff6837655a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6005,10 +6005,10 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 				      struct kvm_mmu_page,
 				      lpage_disallowed_link);
 		WARN_ON_ONCE(!sp->lpage_disallowed);
-		if (sp->tdp_mmu_page)
+		if (sp->tdp_mmu_page) {
 			kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn,
 				sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level));
-		else {
+		} else {
 			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 			WARN_ON_ONCE(sp->lpage_disallowed);
 		}
-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 10/28] KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEs
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (8 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 09/28] KVM: x86/mmu: Fix braces in kvm_recover_nx_lpages Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-03  9:43   ` Paolo Bonzini
  2021-02-02 18:57 ` [PATCH v2 11/28] KVM: x86/mmu: Merge flush and non-flush tdp_mmu_iter_cond_resched Ben Gardon
                   ` (18 subsequent siblings)
  28 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

There is a bug in the TDP MMU function to zap SPTEs which could be
replaced with a larger mapping: the function's check skips exactly the
last level SPTEs it should zap, so it does nothing. Fix this by
correctly zapping the last level SPTEs.

Fixes: 14881998566d ("kvm: x86/mmu: Support disabling dirty logging for the tdp MMU")
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c3075fb568eb..e3066d08c1dc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1098,8 +1098,8 @@ bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
 }
 
 /*
- * Clear non-leaf entries (and free associated page tables) which could
- * be replaced by large mappings, for GFNs within the slot.
+ * Clear leaf entries which could be replaced by large mappings, for
+ * GFNs within the slot.
  */
 static void zap_collapsible_spte_range(struct kvm *kvm,
 				       struct kvm_mmu_page *root,
@@ -1111,7 +1111,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 
 	tdp_root_for_each_pte(iter, root, start, end) {
 		if (!is_shadow_present_pte(iter.old_spte) ||
-		    is_last_spte(iter.old_spte, iter.level))
+		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
 		pfn = spte_to_pfn(iter.old_spte);
-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 11/28] KVM: x86/mmu: Merge flush and non-flush tdp_mmu_iter_cond_resched
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (9 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 10/28] KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEs Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 12/28] KVM: x86/mmu: Rename goal_gfn to next_last_level_gfn Ben Gardon
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

The flushing and non-flushing variants of tdp_mmu_iter_cond_resched have
almost identical implementations. Merge the two functions and add a
flush parameter.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 42 ++++++++++++--------------------------
 1 file changed, 13 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e3066d08c1dc..8f7b120597f3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -443,33 +443,13 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 	for_each_tdp_pte(_iter, __va(_mmu->root_hpa),		\
 			 _mmu->shadow_root_level, _start, _end)
 
-/*
- * Flush the TLB and yield if the MMU lock is contended or this thread needs to
- * return control to the scheduler.
- *
- * If this function yields, it will also reset the tdp_iter's walk over the
- * paging structure and the calling function should allow the iterator to
- * continue its traversal from the paging structure root.
- *
- * Return true if this function yielded, the TLBs were flushed, and the
- * iterator's traversal was reset. Return false if a yield was not needed.
- */
-static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
-{
-	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
-		kvm_flush_remote_tlbs(kvm);
-		cond_resched_lock(&kvm->mmu_lock);
-		tdp_iter_refresh_walk(iter);
-		return true;
-	}
-
-	return false;
-}
-
 /*
  * Yield if the MMU lock is contended or this thread needs to return control
  * to the scheduler.
  *
+ * If this function should yield and flush is set, it will perform a remote
+ * TLB flush before yielding.
+ *
  * If this function yields, it will also reset the tdp_iter's walk over the
  * paging structure and the calling function should allow the iterator to
  * continue its traversal from the paging structure root.
@@ -477,9 +457,13 @@ static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *it
  * Return true if this function yielded and the iterator's traversal was reset.
  * Return false if a yield was not needed.
  */
-static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
+static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
+					     struct tdp_iter *iter, bool flush)
 {
 	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		if (flush)
+			kvm_flush_remote_tlbs(kvm);
+
 		cond_resched_lock(&kvm->mmu_lock);
 		tdp_iter_refresh_walk(iter);
 		return true;
@@ -522,7 +506,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		tdp_mmu_set_spte(kvm, &iter, 0);
 
 		flush_needed = !can_yield ||
-			       !tdp_mmu_iter_flush_cond_resched(kvm, &iter);
+			       !tdp_mmu_iter_cond_resched(kvm, &iter, true);
 	}
 	return flush_needed;
 }
@@ -894,7 +878,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
 		spte_set = true;
 
-		tdp_mmu_iter_cond_resched(kvm, &iter);
+		tdp_mmu_iter_cond_resched(kvm, &iter, false);
 	}
 	return spte_set;
 }
@@ -953,7 +937,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
 		spte_set = true;
 
-		tdp_mmu_iter_cond_resched(kvm, &iter);
+		tdp_mmu_iter_cond_resched(kvm, &iter, false);
 	}
 	return spte_set;
 }
@@ -1069,7 +1053,7 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		tdp_mmu_set_spte(kvm, &iter, new_spte);
 		spte_set = true;
 
-		tdp_mmu_iter_cond_resched(kvm, &iter);
+		tdp_mmu_iter_cond_resched(kvm, &iter, false);
 	}
 
 	return spte_set;
@@ -1121,7 +1105,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 
 		tdp_mmu_set_spte(kvm, &iter, 0);
 
-		spte_set = !tdp_mmu_iter_flush_cond_resched(kvm, &iter);
+		spte_set = !tdp_mmu_iter_cond_resched(kvm, &iter, true);
 	}
 
 	if (spte_set)
-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 12/28] KVM: x86/mmu: Rename goal_gfn to next_last_level_gfn
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (10 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 11/28] KVM: x86/mmu: Merge flush and non-flush tdp_mmu_iter_cond_resched Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 13/28] KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter Ben Gardon
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

The goal_gfn field in tdp_iter can be misleading as it implies that it
is the iterator's final goal. It is really the target for the lowest GFN
mapped by the leaf level SPTE the iterator will traverse towards next.
Change the field's name to be more precise.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_iter.c | 20 ++++++++++----------
 arch/x86/kvm/mmu/tdp_iter.h |  4 ++--
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 87b7e16911db..9917c55b7d24 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -22,21 +22,21 @@ static gfn_t round_gfn_for_level(gfn_t gfn, int level)
 
 /*
  * Sets a TDP iterator to walk a pre-order traversal of the paging structure
- * rooted at root_pt, starting with the walk to translate goal_gfn.
+ * rooted at root_pt, starting with the walk to translate next_last_level_gfn.
  */
 void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
-		    int min_level, gfn_t goal_gfn)
+		    int min_level, gfn_t next_last_level_gfn)
 {
 	WARN_ON(root_level < 1);
 	WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
 
-	iter->goal_gfn = goal_gfn;
+	iter->next_last_level_gfn = next_last_level_gfn;
 	iter->root_level = root_level;
 	iter->min_level = min_level;
 	iter->level = root_level;
 	iter->pt_path[iter->level - 1] = root_pt;
 
-	iter->gfn = round_gfn_for_level(iter->goal_gfn, iter->level);
+	iter->gfn = round_gfn_for_level(iter->next_last_level_gfn, iter->level);
 	tdp_iter_refresh_sptep(iter);
 
 	iter->valid = true;
@@ -82,7 +82,7 @@ static bool try_step_down(struct tdp_iter *iter)
 
 	iter->level--;
 	iter->pt_path[iter->level - 1] = child_pt;
-	iter->gfn = round_gfn_for_level(iter->goal_gfn, iter->level);
+	iter->gfn = round_gfn_for_level(iter->next_last_level_gfn, iter->level);
 	tdp_iter_refresh_sptep(iter);
 
 	return true;
@@ -106,7 +106,7 @@ static bool try_step_side(struct tdp_iter *iter)
 		return false;
 
 	iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
-	iter->goal_gfn = iter->gfn;
+	iter->next_last_level_gfn = iter->gfn;
 	iter->sptep++;
 	iter->old_spte = READ_ONCE(*iter->sptep);
 
@@ -166,13 +166,13 @@ void tdp_iter_next(struct tdp_iter *iter)
  */
 void tdp_iter_refresh_walk(struct tdp_iter *iter)
 {
-	gfn_t goal_gfn = iter->goal_gfn;
+	gfn_t next_last_level_gfn = iter->next_last_level_gfn;
 
-	if (iter->gfn > goal_gfn)
-		goal_gfn = iter->gfn;
+	if (iter->gfn > next_last_level_gfn)
+		next_last_level_gfn = iter->gfn;
 
 	tdp_iter_start(iter, iter->pt_path[iter->root_level - 1],
-		       iter->root_level, iter->min_level, goal_gfn);
+		       iter->root_level, iter->min_level, next_last_level_gfn);
 }
 
 u64 *tdp_iter_root_pt(struct tdp_iter *iter)
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 47170d0dc98e..b2dd269c631f 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -15,7 +15,7 @@ struct tdp_iter {
 	 * The iterator will traverse the paging structure towards the mapping
 	 * for this GFN.
 	 */
-	gfn_t goal_gfn;
+	gfn_t next_last_level_gfn;
 	/* Pointers to the page tables traversed to reach the current SPTE */
 	u64 *pt_path[PT64_ROOT_MAX_LEVEL];
 	/* A pointer to the current SPTE */
@@ -52,7 +52,7 @@ struct tdp_iter {
 u64 *spte_to_child_pt(u64 pte, int level);
 
 void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
-		    int min_level, gfn_t goal_gfn);
+		    int min_level, gfn_t next_last_level_gfn);
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_refresh_walk(struct tdp_iter *iter);
 u64 *tdp_iter_root_pt(struct tdp_iter *iter);
-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 13/28] KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (11 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 12/28] KVM: x86/mmu: Rename goal_gfn to next_last_level_gfn Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-05 23:42   ` Sean Christopherson
  2021-02-02 18:57 ` [PATCH v2 14/28] KVM: x86/mmu: Yield in TDP MMU iter even if no SPTEs changed Ben Gardon
                   ` (15 subsequent siblings)
  28 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

In some functions the TDP iter risks not making forward progress if two
threads livelock by yielding to one another. This is possible if two threads
are trying to execute wrprot_gfn_range. Each could write protect an entry
and then yield. This would reset the tdp_iter's walk over the paging
structure and the loop would end up repeating the same entry over and
over, preventing either thread from making forward progress.

Fix this issue by only yielding if the loop has made forward progress
since the last yield.

Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>

---

v1 -> v2
- Moved forward progress check into tdp_mmu_iter_cond_resched
- Folded tdp_iter_refresh_walk into tdp_mmu_iter_cond_resched
- Split patch into three and renamed all

 arch/x86/kvm/mmu/tdp_iter.c | 18 +-----------------
 arch/x86/kvm/mmu/tdp_iter.h |  7 ++++++-
 arch/x86/kvm/mmu/tdp_mmu.c  | 21 ++++++++++++++++-----
 3 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 9917c55b7d24..1a09d212186b 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -31,6 +31,7 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
 	WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
 
 	iter->next_last_level_gfn = next_last_level_gfn;
+	iter->yielded_gfn = iter->next_last_level_gfn;
 	iter->root_level = root_level;
 	iter->min_level = min_level;
 	iter->level = root_level;
@@ -158,23 +159,6 @@ void tdp_iter_next(struct tdp_iter *iter)
 	iter->valid = false;
 }
 
-/*
- * Restart the walk over the paging structure from the root, starting from the
- * highest gfn the iterator had previously reached. Assumes that the entire
- * paging structure, except the root page, may have been completely torn down
- * and rebuilt.
- */
-void tdp_iter_refresh_walk(struct tdp_iter *iter)
-{
-	gfn_t next_last_level_gfn = iter->next_last_level_gfn;
-
-	if (iter->gfn > next_last_level_gfn)
-		next_last_level_gfn = iter->gfn;
-
-	tdp_iter_start(iter, iter->pt_path[iter->root_level - 1],
-		       iter->root_level, iter->min_level, next_last_level_gfn);
-}
-
 u64 *tdp_iter_root_pt(struct tdp_iter *iter)
 {
 	return iter->pt_path[iter->root_level - 1];
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index b2dd269c631f..d480c540ee27 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -16,6 +16,12 @@ struct tdp_iter {
 	 * for this GFN.
 	 */
 	gfn_t next_last_level_gfn;
+	/*
+	 * The next_last_level_gfn at the time when the thread last
+	 * yielded. Only yielding when the next_last_level_gfn !=
+	 * yielded_gfn helps ensure forward progress.
+	 */
+	gfn_t yielded_gfn;
 	/* Pointers to the page tables traversed to reach the current SPTE */
 	u64 *pt_path[PT64_ROOT_MAX_LEVEL];
 	/* A pointer to the current SPTE */
@@ -54,7 +60,6 @@ u64 *spte_to_child_pt(u64 pte, int level);
 void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
 		    int min_level, gfn_t next_last_level_gfn);
 void tdp_iter_next(struct tdp_iter *iter);
-void tdp_iter_refresh_walk(struct tdp_iter *iter);
 u64 *tdp_iter_root_pt(struct tdp_iter *iter);
 
 #endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8f7b120597f3..7cfc0639b1ef 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -451,8 +451,9 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
  * TLB flush before yielding.
  *
  * If this function yields, it will also reset the tdp_iter's walk over the
- * paging structure and the calling function should allow the iterator to
- * continue its traversal from the paging structure root.
+ * paging structure and the calling function should skip to the next
+ * iteration to allow the iterator to continue its traversal from the
+ * paging structure root.
  *
  * Return true if this function yielded and the iterator's traversal was reset.
  * Return false if a yield was not needed.
@@ -460,12 +461,22 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
 					     struct tdp_iter *iter, bool flush)
 {
+	/* Ensure forward progress has been made before yielding. */
+	if (iter->next_last_level_gfn == iter->yielded_gfn)
+		return false;
+
 	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
 		if (flush)
 			kvm_flush_remote_tlbs(kvm);
 
 		cond_resched_lock(&kvm->mmu_lock);
-		tdp_iter_refresh_walk(iter);
+
+		WARN_ON(iter->gfn > iter->next_last_level_gfn);
+
+		tdp_iter_start(iter, iter->pt_path[iter->root_level - 1],
+			       iter->root_level, iter->min_level,
+			       iter->next_last_level_gfn);
+
 		return true;
 	}
 
@@ -505,8 +516,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		tdp_mmu_set_spte(kvm, &iter, 0);
 
-		flush_needed = !can_yield ||
-			       !tdp_mmu_iter_cond_resched(kvm, &iter, true);
+		flush_needed = !(can_yield &&
+				 tdp_mmu_iter_cond_resched(kvm, &iter, true));
 	}
 	return flush_needed;
 }
-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 14/28] KVM: x86/mmu: Yield in TDP MMU iter even if no SPTEs changed
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (12 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 13/28] KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 15/28] KVM: x86/mmu: Skip no-op changes in TDP MMU functions Ben Gardon
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Given certain conditions, some TDP MMU functions may not yield
reliably / frequently enough. For example, if a paging structure was
very large but had few, if any, writable entries, wrprot_gfn_range
could traverse many entries before finding a writable entry and yielding
because the check for yielding only happens after an SPTE is modified.

Fix this issue by moving the yield to the beginning of the loop.

Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>

---

v1 -> v2
- Split patch into three

 arch/x86/kvm/mmu/tdp_mmu.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7cfc0639b1ef..c8a1149cb229 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -501,6 +501,12 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	bool flush_needed = false;
 
 	tdp_root_for_each_pte(iter, root, start, end) {
+		if (can_yield &&
+		    tdp_mmu_iter_cond_resched(kvm, &iter, flush_needed)) {
+			flush_needed = false;
+			continue;
+		}
+
 		if (!is_shadow_present_pte(iter.old_spte))
 			continue;
 
@@ -515,9 +521,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			continue;
 
 		tdp_mmu_set_spte(kvm, &iter, 0);
-
-		flush_needed = !(can_yield &&
-				 tdp_mmu_iter_cond_resched(kvm, &iter, true));
+		flush_needed = true;
 	}
 	return flush_needed;
 }
@@ -880,6 +884,9 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
 				   min_level, start, end) {
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false))
+			continue;
+
 		if (!is_shadow_present_pte(iter.old_spte) ||
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
@@ -888,8 +895,6 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
 		spte_set = true;
-
-		tdp_mmu_iter_cond_resched(kvm, &iter, false);
 	}
 	return spte_set;
 }
@@ -933,6 +938,9 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	bool spte_set = false;
 
 	tdp_root_for_each_leaf_pte(iter, root, start, end) {
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false))
+			continue;
+
 		if (spte_ad_need_write_protect(iter.old_spte)) {
 			if (is_writable_pte(iter.old_spte))
 				new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
@@ -947,8 +955,6 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
 		spte_set = true;
-
-		tdp_mmu_iter_cond_resched(kvm, &iter, false);
 	}
 	return spte_set;
 }
@@ -1056,6 +1062,9 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	bool spte_set = false;
 
 	tdp_root_for_each_pte(iter, root, start, end) {
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false))
+			continue;
+
 		if (!is_shadow_present_pte(iter.old_spte))
 			continue;
 
@@ -1063,8 +1072,6 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		tdp_mmu_set_spte(kvm, &iter, new_spte);
 		spte_set = true;
-
-		tdp_mmu_iter_cond_resched(kvm, &iter, false);
 	}
 
 	return spte_set;
@@ -1105,6 +1112,11 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 	bool spte_set = false;
 
 	tdp_root_for_each_pte(iter, root, start, end) {
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, spte_set)) {
+			spte_set = false;
+			continue;
+		}
+
 		if (!is_shadow_present_pte(iter.old_spte) ||
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
@@ -1116,7 +1128,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 
 		tdp_mmu_set_spte(kvm, &iter, 0);
 
-		spte_set = !tdp_mmu_iter_cond_resched(kvm, &iter, true);
+		spte_set = true;
 	}
 
 	if (spte_set)
-- 
2.30.0.365.g02bc693789-goog



* [PATCH v2 15/28] KVM: x86/mmu: Skip no-op changes in TDP MMU functions
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (13 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 14/28] KVM: x86/mmu: Yield in TDP MMU iter even if no SPTEs changed Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 16/28] KVM: x86/mmu: Clear dirtied pages mask bit before early break Ben Gardon
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Skip setting SPTEs if no change is expected.

Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>

---

v1 -> v2
- Merged no-op checks into existing old_spte check

 arch/x86/kvm/mmu/tdp_mmu.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c8a1149cb229..aeb05f626b55 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -888,7 +888,8 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			continue;
 
 		if (!is_shadow_present_pte(iter.old_spte) ||
-		    !is_last_spte(iter.old_spte, iter.level))
+		    !is_last_spte(iter.old_spte, iter.level) ||
+		    !(iter.old_spte & PT_WRITABLE_MASK))
 			continue;
 
 		new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
@@ -1065,7 +1066,8 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false))
 			continue;
 
-		if (!is_shadow_present_pte(iter.old_spte))
+		if (!is_shadow_present_pte(iter.old_spte) ||
+		    iter.old_spte & shadow_dirty_mask)
 			continue;
 
 		new_spte = iter.old_spte | shadow_dirty_mask;
-- 
2.30.0.365.g02bc693789-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 16/28] KVM: x86/mmu: Clear dirtied pages mask bit before early break
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (14 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 15/28] KVM: x86/mmu: Skip no-op changes in TDP MMU functions Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 17/28] KVM: x86/mmu: Protect TDP MMU page table memory with RCU Ben Gardon
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

In clear_dirty_pt_masked, the loop is intended to exit early once all
of the GFNs with corresponding bits set in mask have been processed.
This does not work as intended if another thread has already cleared
the dirty bit or writable bit on the SPTE: in that case the loop takes
the no-op path to the next iteration without clearing the bit in mask,
so mask never reaches zero, the early exit is never taken, and the
loop uselessly walks the remainder of the range. Move the unsetting of
the mask bit before the check for a no-op SPTE change.
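
In outline, the corrected ordering looks like this (a toy model, not
the KVM code; spte_change_is_noop() and update_spte() are hypothetical
stand-ins for the real SPTE handling):

#include <linux/bits.h>

static bool spte_change_is_noop(unsigned long gfn);
static void update_spte(unsigned long gfn);

static void clear_range(unsigned long *mask, unsigned long base_gfn)
{
	unsigned long gfn;

	for (gfn = base_gfn; gfn < base_gfn + BITS_PER_LONG; gfn++) {
		if (!*mask)
			break;	/* every requested GFN handled: exit early */

		if (!(*mask & (1UL << (gfn - base_gfn))))
			continue;

		/*
		 * Clear the bit before the no-op check, so a GFN whose
		 * SPTE was already updated by another thread still
		 * counts as handled.
		 */
		*mask &= ~(1UL << (gfn - base_gfn));

		if (spte_change_is_noop(gfn))
			continue;

		update_spte(gfn);
	}
}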

Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index aeb05f626b55..a75e92164a8b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1007,6 +1007,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 		    !(mask & (1UL << (iter.gfn - gfn))))
 			continue;
 
+		mask &= ~(1UL << (iter.gfn - gfn));
+
 		if (wrprot || spte_ad_need_write_protect(iter.old_spte)) {
 			if (is_writable_pte(iter.old_spte))
 				new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
@@ -1020,8 +1022,6 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 		}
 
 		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
-
-		mask &= ~(1UL << (iter.gfn - gfn));
 	}
 }
 
-- 
2.30.0.365.g02bc693789-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 17/28] KVM: x86/mmu: Protect TDP MMU page table memory with RCU
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (15 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 16/28] KVM: x86/mmu: Clear dirtied pages mask bit before early break Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 18/28] KVM: x86/mmu: Use an rwlock for the x86 MMU Ben Gardon
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

In order to enable concurrent modifications to the paging structures in
the TDP MMU, threads must be able to safely remove pages of page table
memory while other threads are traversing the same memory. To ensure
threads do not access PT memory after it is freed, protect PT memory
with RCU.

Protecting concurrent accesses to page table memory from use-after-free
bugs could also have been accomplished using
walk_shadow_page_lockless_begin/end() and READING_SHADOW_PAGE_TABLES,
coupled with the barriers in a TLB flush. The use of RCU for this case
has several distinct advantages over that approach:
1. Disabling interrupts for long running operations is not desirable.
   Future commits will allow operations besides page faults to operate
   without the exclusive protection of the MMU lock, and those
   operations are too long to disable interrupts for their duration.
2. The use of RCU here avoids long blocking / spinning operations in
   performance critical paths. By freeing memory with an asynchronous
   RCU API we avoid the longer wait times TLB flushes experience when
   overlapping with a thread in walk_shadow_page_lockless_begin/end().
3. RCU provides a separation of concerns when removing memory from the
   paging structure. Because the RCU callback to free memory can be
   scheduled immediately after a TLB flush, there's no need for the
   thread to manually free a queue of pages later, as commit_zap_pages
   does. The resulting usage pattern is sketched below.
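
Stripped to its essentials, that pattern is the standard RCU one (a
minimal, self-contained sketch, not the KVM code; struct pt_page and
the helper names here are hypothetical):

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/types.h>

struct pt_page {
	u64 sptes[512];
	struct rcu_head rcu_head;
};

/*
 * Reader side: walkers only touch page table memory inside an RCU
 * read-side critical section.
 */
static u64 read_spte(struct pt_page __rcu *pt, int idx)
{
	u64 spte;

	rcu_read_lock();
	spte = READ_ONCE(rcu_dereference(pt)->sptes[idx]);
	rcu_read_unlock();

	return spte;
}

static void free_pt_rcu(struct rcu_head *head)
{
	/*
	 * Runs after a grace period: no walker can still hold a
	 * pointer into this page.
	 */
	kfree(container_of(head, struct pt_page, rcu_head));
}

/*
 * Writer side: after unlinking the page from the paging structure,
 * schedule the free rather than freeing synchronously.
 */
static void remove_pt(struct pt_page *pt)
{
	call_rcu(&pt->rcu_head, free_pt_rcu);
}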

Fixes: 95fb5b0258b7 ("kvm: x86/mmu: Support MMIO in the TDP MMU")
Reviewed-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>

---

v1 -> v2
- Moved RCU read unlock before the TLB flush
- Merged the RCU commits from v1 into a single commit
- Changed the way accesses to page table memory are annotated with RCU
  in the TDP iterator

 arch/x86/kvm/mmu/mmu_internal.h |  3 ++
 arch/x86/kvm/mmu/tdp_iter.c     | 16 +++---
 arch/x86/kvm/mmu/tdp_iter.h     | 10 ++--
 arch/x86/kvm/mmu/tdp_mmu.c      | 95 +++++++++++++++++++++++++++++----
 4 files changed, 103 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index bfc6389edc28..7f599cc64178 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -57,6 +57,9 @@ struct kvm_mmu_page {
 	atomic_t write_flooding_count;
 
 	bool tdp_mmu_page;
+
+	/* Used for freeing the page asynchronously if it is a TDP MMU page. */
+	struct rcu_head rcu_head;
 };
 
 extern struct kmem_cache *mmu_page_header_cache;
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 1a09d212186b..e5f148106e20 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -12,7 +12,7 @@ static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
 {
 	iter->sptep = iter->pt_path[iter->level - 1] +
 		SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
-	iter->old_spte = READ_ONCE(*iter->sptep);
+	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
 }
 
 static gfn_t round_gfn_for_level(gfn_t gfn, int level)
@@ -35,7 +35,7 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
 	iter->root_level = root_level;
 	iter->min_level = min_level;
 	iter->level = root_level;
-	iter->pt_path[iter->level - 1] = root_pt;
+	iter->pt_path[iter->level - 1] = (tdp_ptep_t)root_pt;
 
 	iter->gfn = round_gfn_for_level(iter->next_last_level_gfn, iter->level);
 	tdp_iter_refresh_sptep(iter);
@@ -48,7 +48,7 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
  * address of the child page table referenced by the SPTE. Returns null if
  * there is no such entry.
  */
-u64 *spte_to_child_pt(u64 spte, int level)
+tdp_ptep_t spte_to_child_pt(u64 spte, int level)
 {
 	/*
 	 * There's no child entry if this entry isn't present or is a
@@ -57,7 +57,7 @@ u64 *spte_to_child_pt(u64 spte, int level)
 	if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
 		return NULL;
 
-	return __va(spte_to_pfn(spte) << PAGE_SHIFT);
+	return (tdp_ptep_t)__va(spte_to_pfn(spte) << PAGE_SHIFT);
 }
 
 /*
@@ -66,7 +66,7 @@ u64 *spte_to_child_pt(u64 spte, int level)
  */
 static bool try_step_down(struct tdp_iter *iter)
 {
-	u64 *child_pt;
+	tdp_ptep_t child_pt;
 
 	if (iter->level == iter->min_level)
 		return false;
@@ -75,7 +75,7 @@ static bool try_step_down(struct tdp_iter *iter)
 	 * Reread the SPTE before stepping down to avoid traversing into page
 	 * tables that are no longer linked from this entry.
 	 */
-	iter->old_spte = READ_ONCE(*iter->sptep);
+	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
 
 	child_pt = spte_to_child_pt(iter->old_spte, iter->level);
 	if (!child_pt)
@@ -109,7 +109,7 @@ static bool try_step_side(struct tdp_iter *iter)
 	iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
 	iter->next_last_level_gfn = iter->gfn;
 	iter->sptep++;
-	iter->old_spte = READ_ONCE(*iter->sptep);
+	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
 
 	return true;
 }
@@ -159,7 +159,7 @@ void tdp_iter_next(struct tdp_iter *iter)
 	iter->valid = false;
 }
 
-u64 *tdp_iter_root_pt(struct tdp_iter *iter)
+tdp_ptep_t tdp_iter_root_pt(struct tdp_iter *iter)
 {
 	return iter->pt_path[iter->root_level - 1];
 }
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index d480c540ee27..4cc177d75c4a 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -7,6 +7,8 @@
 
 #include "mmu.h"
 
+typedef u64 __rcu *tdp_ptep_t;
+
 /*
  * A TDP iterator performs a pre-order walk over a TDP paging structure.
  */
@@ -23,9 +25,9 @@ struct tdp_iter {
 	 */
 	gfn_t yielded_gfn;
 	/* Pointers to the page tables traversed to reach the current SPTE */
-	u64 *pt_path[PT64_ROOT_MAX_LEVEL];
+	tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
 	/* A pointer to the current SPTE */
-	u64 *sptep;
+	tdp_ptep_t sptep;
 	/* The lowest GFN mapped by the current SPTE */
 	gfn_t gfn;
 	/* The level of the root page given to the iterator */
@@ -55,11 +57,11 @@ struct tdp_iter {
 #define for_each_tdp_pte(iter, root, root_level, start, end) \
 	for_each_tdp_pte_min_level(iter, root, root_level, PG_LEVEL_4K, start, end)
 
-u64 *spte_to_child_pt(u64 pte, int level);
+tdp_ptep_t spte_to_child_pt(u64 pte, int level);
 
 void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
 		    int min_level, gfn_t next_last_level_gfn);
 void tdp_iter_next(struct tdp_iter *iter);
-u64 *tdp_iter_root_pt(struct tdp_iter *iter);
+tdp_ptep_t tdp_iter_root_pt(struct tdp_iter *iter);
 
 #endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a75e92164a8b..9e4009068920 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -42,6 +42,12 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 		return;
 
 	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
+
+	/*
+	 * Ensure that all the outstanding RCU callbacks to free shadow pages
+	 * can run before the VM is torn down.
+	 */
+	rcu_barrier();
 }
 
 static void tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
@@ -196,6 +202,28 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 	return __pa(root->spt);
 }
 
+static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
+{
+	free_page((unsigned long)sp->spt);
+	kmem_cache_free(mmu_page_header_cache, sp);
+}
+
+/*
+ * This is called through call_rcu in order to free TDP page table memory
+ * safely with respect to other kernel threads that may be operating on
+ * the memory.
+ * By only accessing TDP MMU page table memory in an RCU read critical
+ * section, and freeing it after a grace period, lockless access to that
+ * memory won't use it after it is freed.
+ */
+static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
+{
+	struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
+					       rcu_head);
+
+	tdp_mmu_free_sp(sp);
+}
+
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 				u64 old_spte, u64 new_spte, int level);
 
@@ -269,8 +297,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt)
 	kvm_flush_remote_tlbs_with_address(kvm, gfn,
 					   KVM_PAGES_PER_HPAGE(level));
 
-	free_page((unsigned long)pt);
-	kmem_cache_free(mmu_page_header_cache, sp);
+	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
 
 /**
@@ -390,13 +417,13 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 				      u64 new_spte, bool record_acc_track,
 				      bool record_dirty_log)
 {
-	u64 *root_pt = tdp_iter_root_pt(iter);
+	tdp_ptep_t root_pt = tdp_iter_root_pt(iter);
 	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
 	int as_id = kvm_mmu_page_as_id(root);
 
 	lockdep_assert_held(&kvm->mmu_lock);
 
-	WRITE_ONCE(*iter->sptep, new_spte);
+	WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
 
 	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
 			      iter->level);
@@ -466,10 +493,13 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
 		return false;
 
 	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		rcu_read_unlock();
+
 		if (flush)
 			kvm_flush_remote_tlbs(kvm);
 
 		cond_resched_lock(&kvm->mmu_lock);
+		rcu_read_lock();
 
 		WARN_ON(iter->gfn > iter->next_last_level_gfn);
 
@@ -500,6 +530,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	struct tdp_iter iter;
 	bool flush_needed = false;
 
+	rcu_read_lock();
+
 	tdp_root_for_each_pte(iter, root, start, end) {
 		if (can_yield &&
 		    tdp_mmu_iter_cond_resched(kvm, &iter, flush_needed)) {
@@ -523,6 +555,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		tdp_mmu_set_spte(kvm, &iter, 0);
 		flush_needed = true;
 	}
+
+	rcu_read_unlock();
 	return flush_needed;
 }
 
@@ -568,13 +602,15 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
 
 	if (unlikely(is_noslot_pfn(pfn))) {
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
-		trace_mark_mmio_spte(iter->sptep, iter->gfn, new_spte);
+		trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
+				     new_spte);
 	} else {
 		make_spte_ret = make_spte(vcpu, ACC_ALL, iter->level, iter->gfn,
 					 pfn, iter->old_spte, prefault, true,
 					 map_writable, !shadow_accessed_mask,
 					 &new_spte);
-		trace_kvm_mmu_set_spte(iter->level, iter->gfn, iter->sptep);
+		trace_kvm_mmu_set_spte(iter->level, iter->gfn,
+				       rcu_dereference(iter->sptep));
 	}
 
 	if (new_spte == iter->old_spte)
@@ -597,7 +633,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
 	if (unlikely(is_mmio_spte(new_spte)))
 		ret = RET_PF_EMULATE;
 
-	trace_kvm_mmu_set_spte(iter->level, iter->gfn, iter->sptep);
+	trace_kvm_mmu_set_spte(iter->level, iter->gfn,
+			       rcu_dereference(iter->sptep));
 	if (!prefault)
 		vcpu->stat.pf_fixed++;
 
@@ -635,6 +672,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 					huge_page_disallowed, &req_level);
 
 	trace_kvm_mmu_spte_requested(gpa, level, pfn);
+
+	rcu_read_lock();
+
 	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
 		if (nx_huge_page_workaround_enabled)
 			disallowed_hugepage_adjust(iter.old_spte, gfn,
@@ -660,7 +700,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 			 * because the new value informs the !present
 			 * path below.
 			 */
-			iter.old_spte = READ_ONCE(*iter.sptep);
+			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
 		}
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
@@ -678,11 +718,14 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		}
 	}
 
-	if (WARN_ON(iter.level != level))
+	if (WARN_ON(iter.level != level)) {
+		rcu_read_unlock();
 		return RET_PF_RETRY;
+	}
 
 	ret = tdp_mmu_map_handle_target_level(vcpu, write, map_writable, &iter,
 					      pfn, prefault);
+	rcu_read_unlock();
 
 	return ret;
 }
@@ -753,6 +796,8 @@ static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
 	int young = 0;
 	u64 new_spte = 0;
 
+	rcu_read_lock();
+
 	tdp_root_for_each_leaf_pte(iter, root, start, end) {
 		/*
 		 * If we have a non-accessed entry we don't need to change the
@@ -784,6 +829,8 @@ static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
 		trace_kvm_age_page(iter.gfn, iter.level, slot, young);
 	}
 
+	rcu_read_unlock();
+
 	return young;
 }
 
@@ -829,6 +876,8 @@ static int set_tdp_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
 	u64 new_spte;
 	int need_flush = 0;
 
+	rcu_read_lock();
+
 	WARN_ON(pte_huge(*ptep));
 
 	new_pfn = pte_pfn(*ptep);
@@ -857,6 +906,8 @@ static int set_tdp_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
 	if (need_flush)
 		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
 
+	rcu_read_unlock();
+
 	return 0;
 }
 
@@ -880,6 +931,8 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	u64 new_spte;
 	bool spte_set = false;
 
+	rcu_read_lock();
+
 	BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
 
 	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
@@ -897,6 +950,8 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
 		spte_set = true;
 	}
+
+	rcu_read_unlock();
 	return spte_set;
 }
 
@@ -938,6 +993,8 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	u64 new_spte;
 	bool spte_set = false;
 
+	rcu_read_lock();
+
 	tdp_root_for_each_leaf_pte(iter, root, start, end) {
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false))
 			continue;
@@ -957,6 +1014,8 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
 		spte_set = true;
 	}
+
+	rcu_read_unlock();
 	return spte_set;
 }
 
@@ -998,6 +1057,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 	struct tdp_iter iter;
 	u64 new_spte;
 
+	rcu_read_lock();
+
 	tdp_root_for_each_leaf_pte(iter, root, gfn + __ffs(mask),
 				    gfn + BITS_PER_LONG) {
 		if (!mask)
@@ -1023,6 +1084,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
 	}
+
+	rcu_read_unlock();
 }
 
 /*
@@ -1062,6 +1125,8 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	u64 new_spte;
 	bool spte_set = false;
 
+	rcu_read_lock();
+
 	tdp_root_for_each_pte(iter, root, start, end) {
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false))
 			continue;
@@ -1076,6 +1141,7 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		spte_set = true;
 	}
 
+	rcu_read_unlock();
 	return spte_set;
 }
 
@@ -1113,6 +1179,8 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 	kvm_pfn_t pfn;
 	bool spte_set = false;
 
+	rcu_read_lock();
+
 	tdp_root_for_each_pte(iter, root, start, end) {
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, spte_set)) {
 			spte_set = false;
@@ -1133,6 +1201,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 		spte_set = true;
 	}
 
+	rcu_read_unlock();
 	if (spte_set)
 		kvm_flush_remote_tlbs(kvm);
 }
@@ -1169,6 +1238,8 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
 	u64 new_spte;
 	bool spte_set = false;
 
+	rcu_read_lock();
+
 	tdp_root_for_each_leaf_pte(iter, root, gfn, gfn + 1) {
 		if (!is_writable_pte(iter.old_spte))
 			break;
@@ -1180,6 +1251,8 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
 		spte_set = true;
 	}
 
+	rcu_read_unlock();
+
 	return spte_set;
 }
 
@@ -1220,10 +1293,14 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 
 	*root_level = vcpu->arch.mmu->shadow_root_level;
 
+	rcu_read_lock();
+
 	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
 		leaf = iter.level;
 		sptes[leaf] = iter.old_spte;
 	}
 
+	rcu_read_unlock();
+
 	return leaf;
 }
-- 
2.30.0.365.g02bc693789-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 18/28] KVM: x86/mmu: Use an rwlock for the x86 MMU
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (16 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 17/28] KVM: x86/mmu: Protect TDP MMU page table memory with RCU Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 19/28] KVM: x86/mmu: Factor out functions to add/remove TDP MMU pages Ben Gardon
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Add a read/write lock to be used in place of the MMU spinlock on x86.
The rwlock will enable the TDP MMU to handle page faults and other
operations in parallel in future commits.
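
The intended division of labor is the usual one for an rwlock: page
fault handlers will eventually take the lock in read mode so that many
vCPUs can fault in parallel, while structural operations keep exclusive
access by taking it in write mode. A rough sketch with hypothetical
helpers (the actual call-site conversions are in the diff below):

#include <linux/spinlock.h>

static DEFINE_RWLOCK(mmu_lock);	/* stand-in for kvm->mmu_lock */

/*
 * Later commits let fault handling run under the lock in read mode,
 * relying on atomic SPTE updates to resolve races between vCPUs.
 */
static void handle_page_fault_shared(void)
{
	read_lock(&mmu_lock);
	/* ... resolve the fault with atomic SPTE updates ... */
	read_unlock(&mmu_lock);
}

/* Zapping, memslot changes, etc. still take the lock exclusively. */
static void zap_range_exclusive(void)
{
	write_lock(&mmu_lock);
	/* ... modify paging structures without interference ... */
	write_unlock(&mmu_lock);
}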

Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>

---

v1 -> v2
- Removed MMU lock wrappers
- Completely replaced the MMU spinlock with an rwlock for x86

 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/kvm/mmu/mmu.c          | 90 ++++++++++++++++-----------------
 arch/x86/kvm/mmu/page_track.c   |  8 +--
 arch/x86/kvm/mmu/paging_tmpl.h  |  8 +--
 arch/x86/kvm/mmu/tdp_mmu.c      | 20 ++++----
 arch/x86/kvm/x86.c              |  4 +-
 include/linux/kvm_host.h        |  5 ++
 virt/kvm/dirty_ring.c           | 10 ++++
 virt/kvm/kvm_main.c             | 46 +++++++++++------
 9 files changed, 112 insertions(+), 81 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3d6616f6f6ef..b6ebf2558386 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -337,6 +337,8 @@ struct kvm_mmu_root_info {
 
 #define KVM_MMU_NUM_PREV_ROOTS 3
 
+#define KVM_HAVE_MMU_RWLOCK
+
 struct kvm_mmu_page;
 
 /*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 60ff6837655a..b4d6709c240e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2016,9 +2016,9 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 			flush |= kvm_sync_page(vcpu, sp, &invalid_list);
 			mmu_pages_clear_parents(&parents);
 		}
-		if (need_resched() || spin_needbreak(&vcpu->kvm->mmu_lock)) {
+		if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) {
 			kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
-			cond_resched_lock(&vcpu->kvm->mmu_lock);
+			cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
 			flush = false;
 		}
 	}
@@ -2470,7 +2470,7 @@ static int make_mmu_pages_available(struct kvm_vcpu *vcpu)
  */
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages)
 {
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 
 	if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) {
 		kvm_mmu_zap_oldest_mmu_pages(kvm, kvm->arch.n_used_mmu_pages -
@@ -2481,7 +2481,7 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages)
 
 	kvm->arch.n_max_mmu_pages = goal_nr_mmu_pages;
 
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 }
 
 int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
@@ -2492,7 +2492,7 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
 
 	pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
 	r = 0;
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
 		pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
 			 sp->role.word);
@@ -2500,7 +2500,7 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
 		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 	}
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	return r;
 }
@@ -3192,7 +3192,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			return;
 	}
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 		if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i))
@@ -3215,7 +3215,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	}
 
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_free_roots);
 
@@ -3236,16 +3236,16 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 {
 	struct kvm_mmu_page *sp;
 
-	spin_lock(&vcpu->kvm->mmu_lock);
+	write_lock(&vcpu->kvm->mmu_lock);
 
 	if (make_mmu_pages_available(vcpu)) {
-		spin_unlock(&vcpu->kvm->mmu_lock);
+		write_unlock(&vcpu->kvm->mmu_lock);
 		return INVALID_PAGE;
 	}
 	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
 	++sp->root_count;
 
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	write_unlock(&vcpu->kvm->mmu_lock);
 	return __pa(sp->spt);
 }
 
@@ -3416,17 +3416,17 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 		    !smp_load_acquire(&sp->unsync_children))
 			return;
 
-		spin_lock(&vcpu->kvm->mmu_lock);
+		write_lock(&vcpu->kvm->mmu_lock);
 		kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
 
 		mmu_sync_children(vcpu, sp);
 
 		kvm_mmu_audit(vcpu, AUDIT_POST_SYNC);
-		spin_unlock(&vcpu->kvm->mmu_lock);
+		write_unlock(&vcpu->kvm->mmu_lock);
 		return;
 	}
 
-	spin_lock(&vcpu->kvm->mmu_lock);
+	write_lock(&vcpu->kvm->mmu_lock);
 	kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
 
 	for (i = 0; i < 4; ++i) {
@@ -3440,7 +3440,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 	}
 
 	kvm_mmu_audit(vcpu, AUDIT_POST_SYNC);
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	write_unlock(&vcpu->kvm->mmu_lock);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_sync_roots);
 
@@ -3724,7 +3724,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		return r;
 
 	r = RET_PF_RETRY;
-	spin_lock(&vcpu->kvm->mmu_lock);
+	write_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
 		goto out_unlock;
 	r = make_mmu_pages_available(vcpu);
@@ -3739,7 +3739,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 				 prefault, is_tdp);
 
 out_unlock:
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	write_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(pfn);
 	return r;
 }
@@ -4999,7 +4999,7 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	 */
 	mmu_topup_memory_caches(vcpu, true);
 
-	spin_lock(&vcpu->kvm->mmu_lock);
+	write_lock(&vcpu->kvm->mmu_lock);
 
 	gentry = mmu_pte_write_fetch_gpte(vcpu, &gpa, &bytes);
 
@@ -5035,7 +5035,7 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	}
 	kvm_mmu_flush_or_zap(vcpu, &invalid_list, remote_flush, local_flush);
 	kvm_mmu_audit(vcpu, AUDIT_POST_PTE_WRITE);
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	write_unlock(&vcpu->kvm->mmu_lock);
 }
 
 int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
@@ -5233,14 +5233,14 @@ slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		if (iterator.rmap)
 			flush |= fn(kvm, iterator.rmap);
 
-		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
 			if (flush && lock_flush_tlb) {
 				kvm_flush_remote_tlbs_with_address(kvm,
 						start_gfn,
 						iterator.gfn - start_gfn + 1);
 				flush = false;
 			}
-			cond_resched_lock(&kvm->mmu_lock);
+			cond_resched_rwlock_write(&kvm->mmu_lock);
 		}
 	}
 
@@ -5390,7 +5390,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
 		 * be in active use by the guest.
 		 */
 		if (batch >= BATCH_ZAP_PAGES &&
-		    cond_resched_lock(&kvm->mmu_lock)) {
+		    cond_resched_rwlock_write(&kvm->mmu_lock)) {
 			batch = 0;
 			goto restart;
 		}
@@ -5423,7 +5423,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 {
 	lockdep_assert_held(&kvm->slots_lock);
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	trace_kvm_mmu_zap_all_fast(kvm);
 
 	/*
@@ -5450,7 +5450,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 	if (kvm->arch.tdp_mmu_enabled)
 		kvm_tdp_mmu_zap_all(kvm);
 
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 }
 
 static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
@@ -5492,7 +5492,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 	int i;
 	bool flush;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 		slots = __kvm_memslots(kvm, i);
 		kvm_for_each_memslot(memslot, slots) {
@@ -5516,7 +5516,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 			kvm_flush_remote_tlbs(kvm);
 	}
 
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 }
 
 static bool slot_rmap_write_protect(struct kvm *kvm,
@@ -5531,12 +5531,12 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 {
 	bool flush;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
 				start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
 	if (kvm->arch.tdp_mmu_enabled)
 		flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, PG_LEVEL_4K);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	/*
 	 * We can flush all the TLBs out of the mmu lock without TLB
@@ -5596,13 +5596,13 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot)
 {
 	/* FIXME: const-ify all uses of struct kvm_memory_slot.  */
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
 			 kvm_mmu_zap_collapsible_spte, true);
 
 	if (kvm->arch.tdp_mmu_enabled)
 		kvm_tdp_mmu_zap_collapsible_sptes(kvm, memslot);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 }
 
 void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
@@ -5625,11 +5625,11 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 {
 	bool flush;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false);
 	if (kvm->arch.tdp_mmu_enabled)
 		flush |= kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	/*
 	 * It's also safe to flush TLBs out of mmu lock here as currently this
@@ -5647,12 +5647,12 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
 {
 	bool flush;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	flush = slot_handle_large_level(kvm, memslot, slot_rmap_write_protect,
 					false);
 	if (kvm->arch.tdp_mmu_enabled)
 		flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, PG_LEVEL_2M);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	if (flush)
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
@@ -5664,11 +5664,11 @@ void kvm_mmu_slot_set_dirty(struct kvm *kvm,
 {
 	bool flush;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	flush = slot_handle_all_level(kvm, memslot, __rmap_set_dirty, false);
 	if (kvm->arch.tdp_mmu_enabled)
 		flush |= kvm_tdp_mmu_slot_set_dirty(kvm, memslot);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	if (flush)
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
@@ -5681,14 +5681,14 @@ void kvm_mmu_zap_all(struct kvm *kvm)
 	LIST_HEAD(invalid_list);
 	int ign;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 restart:
 	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
 		if (WARN_ON(sp->role.invalid))
 			continue;
 		if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
 			goto restart;
-		if (cond_resched_lock(&kvm->mmu_lock))
+		if (cond_resched_rwlock_write(&kvm->mmu_lock))
 			goto restart;
 	}
 
@@ -5697,7 +5697,7 @@ void kvm_mmu_zap_all(struct kvm *kvm)
 	if (kvm->arch.tdp_mmu_enabled)
 		kvm_tdp_mmu_zap_all(kvm);
 
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 }
 
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
@@ -5757,7 +5757,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 			continue;
 
 		idx = srcu_read_lock(&kvm->srcu);
-		spin_lock(&kvm->mmu_lock);
+		write_lock(&kvm->mmu_lock);
 
 		if (kvm_has_zapped_obsolete_pages(kvm)) {
 			kvm_mmu_commit_zap_page(kvm,
@@ -5768,7 +5768,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 		freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
 
 unlock:
-		spin_unlock(&kvm->mmu_lock);
+		write_unlock(&kvm->mmu_lock);
 		srcu_read_unlock(&kvm->srcu, idx);
 
 		/*
@@ -5988,7 +5988,7 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 	ulong to_zap;
 
 	rcu_idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 
 	ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
 	to_zap = ratio ? DIV_ROUND_UP(kvm->stat.nx_lpage_splits, ratio) : 0;
@@ -6013,14 +6013,14 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 			WARN_ON_ONCE(sp->lpage_disallowed);
 		}
 
-		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
 			kvm_mmu_commit_zap_page(kvm, &invalid_list);
-			cond_resched_lock(&kvm->mmu_lock);
+			cond_resched_rwlock_write(&kvm->mmu_lock);
 		}
 	}
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
 
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 8443a675715b..34bb0ec69bd8 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -184,9 +184,9 @@ kvm_page_track_register_notifier(struct kvm *kvm,
 
 	head = &kvm->arch.track_notifier_head;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	hlist_add_head_rcu(&n->node, &head->track_notifier_list);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 }
 EXPORT_SYMBOL_GPL(kvm_page_track_register_notifier);
 
@@ -202,9 +202,9 @@ kvm_page_track_unregister_notifier(struct kvm *kvm,
 
 	head = &kvm->arch.track_notifier_head;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	hlist_del_rcu(&n->node);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 	synchronize_srcu(&head->track_srcu);
 }
 EXPORT_SYMBOL_GPL(kvm_page_track_unregister_notifier);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 50e268eb8e1a..d9f66cc459e8 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -868,7 +868,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	}
 
 	r = RET_PF_RETRY;
-	spin_lock(&vcpu->kvm->mmu_lock);
+	write_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
 		goto out_unlock;
 
@@ -881,7 +881,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 
 out_unlock:
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	write_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(pfn);
 	return r;
 }
@@ -919,7 +919,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
 		return;
 	}
 
-	spin_lock(&vcpu->kvm->mmu_lock);
+	write_lock(&vcpu->kvm->mmu_lock);
 	for_each_shadow_entry_using_root(vcpu, root_hpa, gva, iterator) {
 		level = iterator.level;
 		sptep = iterator.sptep;
@@ -954,7 +954,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
 		if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
 			break;
 	}
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	write_unlock(&vcpu->kvm->mmu_lock);
 }
 
 /* Note, @addr is a GPA when gva_to_gpa() translates an L2 GPA to an L1 GPA. */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9e4009068920..f1fbed72e149 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -59,7 +59,7 @@ static void tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
 static inline bool tdp_mmu_next_root_valid(struct kvm *kvm,
 					   struct kvm_mmu_page *root)
 {
-	lockdep_assert_held(&kvm->mmu_lock);
+	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	if (list_entry_is_head(root, &kvm->arch.tdp_mmu_roots, link))
 		return false;
@@ -117,7 +117,7 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
 {
 	gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
 
-	lockdep_assert_held(&kvm->mmu_lock);
+	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	WARN_ON(root->root_count);
 	WARN_ON(!root->tdp_mmu_page);
@@ -170,13 +170,13 @@ static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
 
 	role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 
 	/* Check for an existing root before allocating a new one. */
 	for_each_tdp_mmu_root(kvm, root) {
 		if (root->role.word == role.word) {
 			kvm_mmu_get_root(kvm, root);
-			spin_unlock(&kvm->mmu_lock);
+			write_unlock(&kvm->mmu_lock);
 			return root;
 		}
 	}
@@ -186,7 +186,7 @@ static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
 
 	list_add(&root->link, &kvm->arch.tdp_mmu_roots);
 
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	return root;
 }
@@ -421,7 +421,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
 	int as_id = kvm_mmu_page_as_id(root);
 
-	lockdep_assert_held(&kvm->mmu_lock);
+	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
 
@@ -492,13 +492,13 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
 	if (iter->next_last_level_gfn == iter->yielded_gfn)
 		return false;
 
-	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
 		rcu_read_unlock();
 
 		if (flush)
 			kvm_flush_remote_tlbs(kvm);
 
-		cond_resched_lock(&kvm->mmu_lock);
+		cond_resched_rwlock_write(&kvm->mmu_lock);
 		rcu_read_lock();
 
 		WARN_ON(iter->gfn > iter->next_last_level_gfn);
@@ -1103,7 +1103,7 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 	struct kvm_mmu_page *root;
 	int root_as_id;
 
-	lockdep_assert_held(&kvm->mmu_lock);
+	lockdep_assert_held_write(&kvm->mmu_lock);
 	for_each_tdp_mmu_root(kvm, root) {
 		root_as_id = kvm_mmu_page_as_id(root);
 		if (root_as_id != slot->as_id)
@@ -1268,7 +1268,7 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 	int root_as_id;
 	bool spte_set = false;
 
-	lockdep_assert_held(&kvm->mmu_lock);
+	lockdep_assert_held_write(&kvm->mmu_lock);
 	for_each_tdp_mmu_root(kvm, root) {
 		root_as_id = kvm_mmu_page_as_id(root);
 		if (root_as_id != slot->as_id)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 76bce832cade..b544f59b6952 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7092,9 +7092,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	if (vcpu->arch.mmu->direct_map) {
 		unsigned int indirect_shadow_pages;
 
-		spin_lock(&vcpu->kvm->mmu_lock);
+		write_lock(&vcpu->kvm->mmu_lock);
 		indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
-		spin_unlock(&vcpu->kvm->mmu_lock);
+		write_unlock(&vcpu->kvm->mmu_lock);
 
 		if (indirect_shadow_pages)
 			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f3b1013fb22c..f417447129b9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -451,7 +451,12 @@ struct kvm_memslots {
 };
 
 struct kvm {
+#ifdef KVM_HAVE_MMU_RWLOCK
+	rwlock_t mmu_lock;
+#else
 	spinlock_t mmu_lock;
+#endif /* KVM_HAVE_MMU_RWLOCK */
+
 	struct mutex slots_lock;
 	struct mm_struct *mm; /* userspace tied to this vm */
 	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 9d01299563ee..dc7052a6e033 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -60,9 +60,19 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
 	if (!memslot || (offset + __fls(mask)) >= memslot->npages)
 		return;
 
+#ifdef KVM_HAVE_MMU_RWLOCK
+	write_lock(&kvm->mmu_lock);
+#else
 	spin_lock(&kvm->mmu_lock);
+#endif /* KVM_HAVE_MMU_RWLOCK */
+
 	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
+
+#ifdef KVM_HAVE_MMU_RWLOCK
+	write_unlock(&kvm->mmu_lock);
+#else
 	spin_unlock(&kvm->mmu_lock);
+#endif /* KVM_HAVE_MMU_RWLOCK */
 }
 
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8367d88ce39b..44b55f9387c4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -450,6 +450,14 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
+#ifdef KVM_HAVE_MMU_RWLOCK
+#define KVM_MMU_LOCK(kvm) write_lock(&kvm->mmu_lock)
+#define KVM_MMU_UNLOCK(kvm) write_unlock(&kvm->mmu_lock)
+#else
+#define KVM_MMU_LOCK(kvm) spin_lock(&kvm->mmu_lock)
+#define KVM_MMU_UNLOCK(kvm) spin_unlock(&kvm->mmu_lock)
+#endif /* KVM_HAVE_MMU_RWLOCK */
+
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
@@ -459,13 +467,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	int idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+
+	KVM_MMU_LOCK(kvm);
+
 	kvm->mmu_notifier_seq++;
 
 	if (kvm_set_spte_hva(kvm, address, pte))
 		kvm_flush_remote_tlbs(kvm);
 
-	spin_unlock(&kvm->mmu_lock);
+	KVM_MMU_UNLOCK(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
@@ -476,7 +486,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	int need_tlb_flush = 0, idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	KVM_MMU_LOCK(kvm);
 	/*
 	 * The count increase must become visible at unlock time as no
 	 * spte can be established without taking the mmu_lock and
@@ -489,7 +499,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	if (need_tlb_flush || kvm->tlbs_dirty)
 		kvm_flush_remote_tlbs(kvm);
 
-	spin_unlock(&kvm->mmu_lock);
+	KVM_MMU_UNLOCK(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return 0;
@@ -500,7 +510,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-	spin_lock(&kvm->mmu_lock);
+	KVM_MMU_LOCK(kvm);
 	/*
 	 * This sequence increase will notify the kvm page fault that
 	 * the page that is going to be mapped in the spte could have
@@ -514,7 +524,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	 * in conjunction with the smp_rmb in mmu_notifier_retry().
 	 */
 	kvm->mmu_notifier_count--;
-	spin_unlock(&kvm->mmu_lock);
+	KVM_MMU_UNLOCK(kvm);
 
 	BUG_ON(kvm->mmu_notifier_count < 0);
 }
@@ -528,13 +538,13 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 	int young, idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	KVM_MMU_LOCK(kvm);
 
 	young = kvm_age_hva(kvm, start, end);
 	if (young)
 		kvm_flush_remote_tlbs(kvm);
 
-	spin_unlock(&kvm->mmu_lock);
+	KVM_MMU_UNLOCK(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
@@ -549,7 +559,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	int young, idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	KVM_MMU_LOCK(kvm);
 	/*
 	 * Even though we do not flush TLB, this will still adversely
 	 * affect performance on pre-Haswell Intel EPT, where there is
@@ -564,7 +574,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	 * more sophisticated heuristic later.
 	 */
 	young = kvm_age_hva(kvm, start, end);
-	spin_unlock(&kvm->mmu_lock);
+	KVM_MMU_UNLOCK(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
@@ -578,9 +588,9 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 	int young, idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	KVM_MMU_LOCK(kvm);
 	young = kvm_test_age_hva(kvm, address);
-	spin_unlock(&kvm->mmu_lock);
+	KVM_MMU_UNLOCK(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
@@ -745,7 +755,11 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	if (!kvm)
 		return ERR_PTR(-ENOMEM);
 
+#ifdef KVM_HAVE_MMU_RWLOCK
+	rwlock_init(&kvm->mmu_lock);
+#else
 	spin_lock_init(&kvm->mmu_lock);
+#endif /* KVM_HAVE_MMU_RWLOCK */
 	mmgrab(current->mm);
 	kvm->mm = current->mm;
 	kvm_eventfd_init(kvm);
@@ -1525,7 +1539,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 		dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
 		memset(dirty_bitmap_buffer, 0, n);
 
-		spin_lock(&kvm->mmu_lock);
+		KVM_MMU_LOCK(kvm);
 		for (i = 0; i < n / sizeof(long); i++) {
 			unsigned long mask;
 			gfn_t offset;
@@ -1541,7 +1555,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
 								offset, mask);
 		}
-		spin_unlock(&kvm->mmu_lock);
+		KVM_MMU_UNLOCK(kvm);
 	}
 
 	if (flush)
@@ -1636,7 +1650,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
 		return -EFAULT;
 
-	spin_lock(&kvm->mmu_lock);
+	KVM_MMU_LOCK(kvm);
 	for (offset = log->first_page, i = offset / BITS_PER_LONG,
 		 n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
 	     i++, offset += BITS_PER_LONG) {
@@ -1659,7 +1673,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 								offset, mask);
 		}
 	}
-	spin_unlock(&kvm->mmu_lock);
+	KVM_MMU_UNLOCK(kvm);
 
 	if (flush)
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
-- 
2.30.0.365.g02bc693789-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 19/28] KVM: x86/mmu: Factor out functions to add/remove TDP MMU pages
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (17 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 18/28] KVM: x86/mmu: Use an rwlock for the x86 MMU Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map Ben Gardon
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Move the work of adding TDP MMU pages to, and removing them from, the
"secondary" data structures into helper functions. These functions will
be built on in future commits to enable MMU operations to proceed
(mostly) in parallel.

No functional change expected.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 47 +++++++++++++++++++++++++++++++-------
 1 file changed, 39 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f1fbed72e149..5a9e964e0178 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -262,6 +262,39 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
 	}
 }
 
+/**
+ * tdp_mmu_link_page - Add a new page to the list of pages used by the TDP MMU
+ *
+ * @kvm: kvm instance
+ * @sp: the new page
+ * @account_nx: This page replaces a NX large page and should be marked for
+ *		eventual reclaim.
+ */
+static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+			      bool account_nx)
+{
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
+	if (account_nx)
+		account_huge_nx_page(kvm, sp);
+}
+
+/**
+ * tdp_mmu_unlink_page - Remove page from the list of pages used by the TDP MMU
+ *
+ * @kvm: kvm instance
+ * @sp: the page to be removed
+ */
+static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	list_del(&sp->link);
+	if (sp->lpage_disallowed)
+		unaccount_huge_nx_page(kvm, sp);
+}
+
 /**
  * handle_removed_tdp_mmu_page - handle a pt removed from the TDP structure
  *
@@ -281,10 +314,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt)
 
 	trace_kvm_mmu_prepare_zap_page(sp);
 
-	list_del(&sp->link);
-
-	if (sp->lpage_disallowed)
-		unaccount_huge_nx_page(kvm, sp);
+	tdp_mmu_unlink_page(kvm, sp);
 
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
 		old_child_spte = READ_ONCE(*(pt + i));
@@ -705,15 +735,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
 			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
-			list_add(&sp->link, &vcpu->kvm->arch.tdp_mmu_pages);
 			child_pt = sp->spt;
+
+			tdp_mmu_link_page(vcpu->kvm, sp,
+					  huge_page_disallowed &&
+					  req_level >= iter.level);
+
 			new_spte = make_nonleaf_spte(child_pt,
 						     !shadow_accessed_mask);
 
 			trace_kvm_mmu_get_page(sp, true);
-			if (huge_page_disallowed && req_level >= iter.level)
-				account_huge_nx_page(vcpu->kvm, sp);
-
 			tdp_mmu_set_spte(vcpu->kvm, &iter, new_spte);
 		}
 	}
-- 
2.30.0.365.g02bc693789-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (18 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 19/28] KVM: x86/mmu: Factor out functions to add/remove TDP MMU pages Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-03  2:48   ` kernel test robot
                     ` (2 more replies)
  2021-02-02 18:57 ` [PATCH v2 21/28] KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler Ben Gardon
                   ` (8 subsequent siblings)
  28 siblings, 3 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

To prepare for handling page faults in parallel, change the TDP MMU
page fault handler to use atomic operations to set SPTEs so that changes
are not lost if multiple threads attempt to modify the same SPTE.
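
The core idiom is a compare-and-exchange against the value the iterator
last read; if another thread changed the SPTE first, the update is
abandoned and the caller retries. A minimal sketch (the real helper,
tdp_mmu_set_spte_atomic, appears in the diff below):

#include <linux/types.h>
#include <asm/cmpxchg.h>

/*
 * Attempt to install new_spte, but only if the SPTE still holds the
 * value this thread last read. Returns false if another thread raced
 * in first, in which case the caller re-reads the SPTE and retries.
 */
static bool set_spte_atomic(u64 *sptep, u64 old_spte, u64 new_spte)
{
	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
		return false;

	/*
	 * Success: exactly one thread observes this old -> new
	 * transition, so the associated bookkeeping runs exactly once.
	 */
	return true;
}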

Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>

---

v1 -> v2
- Rename "atomic" arg to "shared" in multiple functions
- Merged the commit that protects the lists of TDP MMU pages with a new
  lock
- Merged the commits to add an atomic option for setting SPTEs and to
  use that option in the TDP MMU page fault handler

 arch/x86/include/asm/kvm_host.h |  13 +++
 arch/x86/kvm/mmu/tdp_mmu.c      | 142 ++++++++++++++++++++++++--------
 2 files changed, 122 insertions(+), 33 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b6ebf2558386..78ebf56f2b37 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1028,6 +1028,19 @@ struct kvm_arch {
 	 * tdp_mmu_page set and a root_count of 0.
 	 */
 	struct list_head tdp_mmu_pages;
+
+	/*
+	 * Protects accesses to the following fields when the MMU lock
+	 * is held in read mode:
+	 *  - tdp_mmu_pages (above)
+	 *  - the link field of struct kvm_mmu_pages used by the TDP MMU
+	 *  - lpage_disallowed_mmu_pages
+	 *  - the lpage_disallowed_link field of struct kvm_mmu_pages used
+	 *    by the TDP MMU
+	 * It is acceptable, but not necessary, to acquire this lock when
+	 * the thread holds the MMU lock in write mode.
+	 */
+	spinlock_t tdp_mmu_pages_lock;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5a9e964e0178..0b5a9339ac55 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -7,6 +7,7 @@
 #include "tdp_mmu.h"
 #include "spte.h"
 
+#include <asm/cmpxchg.h>
 #include <trace/events/kvm.h>
 
 #ifdef CONFIG_X86_64
@@ -33,6 +34,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 	kvm->arch.tdp_mmu_enabled = true;
 
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
+	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
 }
 
@@ -225,7 +227,8 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
 }
 
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level);
+				u64 old_spte, u64 new_spte, int level,
+				bool shared);
 
 static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
 {
@@ -267,17 +270,26 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
  *
  * @kvm: kvm instance
  * @sp: the new page
+ * @shared: This operation may not be running under the exclusive use of
+ *	    the MMU lock and the operation must synchronize with other
+ *	    threads that might be adding or removing pages.
  * @account_nx: This page replaces a NX large page and should be marked for
  *		eventual reclaim.
  */
 static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
-			      bool account_nx)
+			      bool shared, bool account_nx)
 {
-	lockdep_assert_held_write(&kvm->mmu_lock);
+	if (shared)
+		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+	else
+		lockdep_assert_held_write(&kvm->mmu_lock);
 
 	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
 	if (account_nx)
 		account_huge_nx_page(kvm, sp);
+
+	if (shared)
+		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 }
 
 /**
@@ -285,14 +297,24 @@ static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
  *
  * @kvm: kvm instance
  * @sp: the page to be removed
+ * @shared: This operation may not be running under the exclusive use of
+ *	    the MMU lock and the operation must synchronize with other
+ *	    threads that might be adding or removing pages.
  */
-static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+				bool shared)
 {
-	lockdep_assert_held_write(&kvm->mmu_lock);
+	if (shared)
+		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+	else
+		lockdep_assert_held_write(&kvm->mmu_lock);
 
 	list_del(&sp->link);
 	if (sp->lpage_disallowed)
 		unaccount_huge_nx_page(kvm, sp);
+
+	if (shared)
+		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 }
 
 /**
@@ -300,28 +322,39 @@ static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp)
  *
  * @kvm: kvm instance
  * @pt: the page removed from the paging structure
+ * @shared: This operation may not be running under the exclusive use
+ *	    of the MMU lock and the operation must synchronize with other
+ *	    threads that might be modifying SPTEs.
  *
  * Given a page table that has been removed from the TDP paging structure,
  * iterates through the page table to clear SPTEs and free child page tables.
  */
-static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt)
+static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt,
+					bool shared)
 {
 	struct kvm_mmu_page *sp = sptep_to_sp(pt);
 	int level = sp->role.level;
 	gfn_t gfn = sp->gfn;
 	u64 old_child_spte;
+	u64 *sptep;
 	int i;
 
 	trace_kvm_mmu_prepare_zap_page(sp);
 
-	tdp_mmu_unlink_page(kvm, sp);
+	tdp_mmu_unlink_page(kvm, sp, shared);
 
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
-		old_child_spte = READ_ONCE(*(pt + i));
-		WRITE_ONCE(*(pt + i), 0);
+		sptep = pt + i;
+
+		if (shared) {
+			old_child_spte = xchg(sptep, 0);
+		} else {
+			old_child_spte = READ_ONCE(*sptep);
+			WRITE_ONCE(*sptep, 0);
+		}
 		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp),
 			gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
-			old_child_spte, 0, level - 1);
+			old_child_spte, 0, level - 1, shared);
 	}
 
 	kvm_flush_remote_tlbs_with_address(kvm, gfn,
@@ -338,12 +371,16 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt)
  * @old_spte: The value of the SPTE before the change
  * @new_spte: The value of the SPTE after the change
  * @level: the level of the PT the SPTE is part of in the paging structure
+ * @shared: This operation may not be running under the exclusive use of
+ *	    the MMU lock and the operation must synchronize with other
+ *	    threads that might be modifying SPTEs.
  *
  * Handle bookkeeping that might result from the modification of a SPTE.
  * This function must be called for all TDP SPTE modifications.
  */
 static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level)
+				  u64 old_spte, u64 new_spte, int level,
+				  bool shared)
 {
 	bool was_present = is_shadow_present_pte(old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
@@ -415,18 +452,51 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	 */
 	if (was_present && !was_leaf && (pfn_changed || !is_present))
 		handle_removed_tdp_mmu_page(kvm,
-				spte_to_child_pt(old_spte, level));
+				spte_to_child_pt(old_spte, level), shared);
 }
 
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level)
+				u64 old_spte, u64 new_spte, int level,
+				bool shared)
 {
-	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
+	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
+			      shared);
 	handle_changed_spte_acc_track(old_spte, new_spte, level);
 	handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
 				      new_spte, level);
 }
 
+/*
+ * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically and handle the
+ * associated bookkeeping
+ *
+ * @kvm: kvm instance
+ * @iter: a tdp_iter instance currently on the SPTE that should be set
+ * @new_spte: The value the SPTE should be set to
+ * Returns: true if the SPTE was set, false if it was not. If false is returned,
+ *	    this function will have no side-effects.
+ */
+static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
+					   struct tdp_iter *iter,
+					   u64 new_spte)
+{
+	u64 *root_pt = tdp_iter_root_pt(iter);
+	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
+	int as_id = kvm_mmu_page_as_id(root);
+
+	lockdep_assert_held_read(&kvm->mmu_lock);
+
+	if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
+		      new_spte) != iter->old_spte)
+		return false;
+
+	handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
+			    iter->level, true);
+
+	return true;
+}
+
 /*
  * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
  * @kvm: kvm instance
@@ -456,7 +526,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 	WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
 
 	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
-			      iter->level);
+			      iter->level, false);
 	if (record_acc_track)
 		handle_changed_spte_acc_track(iter->old_spte, new_spte,
 					      iter->level);
@@ -630,23 +700,18 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
 	int ret = 0;
 	int make_spte_ret = 0;
 
-	if (unlikely(is_noslot_pfn(pfn))) {
+	if (unlikely(is_noslot_pfn(pfn)))
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
-		trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
-				     new_spte);
-	} else {
+	else
 		make_spte_ret = make_spte(vcpu, ACC_ALL, iter->level, iter->gfn,
 					 pfn, iter->old_spte, prefault, true,
 					 map_writable, !shadow_accessed_mask,
 					 &new_spte);
-		trace_kvm_mmu_set_spte(iter->level, iter->gfn,
-				       rcu_dereference(iter->sptep));
-	}
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
-	else
-		tdp_mmu_set_spte(vcpu->kvm, iter, new_spte);
+	else if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
+		return RET_PF_RETRY;
 
 	/*
 	 * If the page fault was caused by a write but the page is write
@@ -660,8 +725,13 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
 	}
 
 	/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
-	if (unlikely(is_mmio_spte(new_spte)))
+	if (unlikely(is_mmio_spte(new_spte))) {
+		trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
+				     new_spte);
 		ret = RET_PF_EMULATE;
+	} else
+		trace_kvm_mmu_set_spte(iter->level, iter->gfn,
+				       rcu_dereference(iter->sptep));
 
-	trace_kvm_mmu_set_spte(iter->level, iter->gfn,
-			       rcu_dereference(iter->sptep));
@@ -720,7 +790,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		 */
 		if (is_shadow_present_pte(iter.old_spte) &&
 		    is_large_pte(iter.old_spte)) {
-			tdp_mmu_set_spte(vcpu->kvm, &iter, 0);
+			if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, 0))
+				break;
 
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn,
 					KVM_PAGES_PER_HPAGE(iter.level));
@@ -737,19 +808,24 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
 			child_pt = sp->spt;
 
-			tdp_mmu_link_page(vcpu->kvm, sp,
-					  huge_page_disallowed &&
-					  req_level >= iter.level);
-
 			new_spte = make_nonleaf_spte(child_pt,
 						     !shadow_accessed_mask);
 
-			trace_kvm_mmu_get_page(sp, true);
-			tdp_mmu_set_spte(vcpu->kvm, &iter, new_spte);
+			if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter,
+						    new_spte)) {
+				tdp_mmu_link_page(vcpu->kvm, sp, true,
+						  huge_page_disallowed &&
+						  req_level >= iter.level);
+
+				trace_kvm_mmu_get_page(sp, true);
+			} else {
+				tdp_mmu_free_sp(sp);
+				break;
+			}
 		}
 	}
 
-	if (WARN_ON(iter.level != level)) {
+	if (iter.level != level) {
 		rcu_read_unlock();
 		return RET_PF_RETRY;
 	}
-- 
2.30.0.365.g02bc693789-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 21/28] KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (19 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-06  0:29   ` Sean Christopherson
  2021-02-02 18:57 ` [PATCH v2 22/28] KVM: x86/mmu: Mark SPTEs in disconnected pages as removed Ben Gardon
                   ` (7 subsequent siblings)
  28 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

When the TDP MMU is allowed to handle page faults in parallel there is
the possibility of a race where an SPTE is cleared and then immediately
replaced with a present SPTE pointing to a different PFN, before the
TLBs can be flushed. This race would violate the architectural spec.
Ensure that the TLBs are flushed properly before other threads are
allowed to install any present value for the SPTE.
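
As an illustration, this is the interleaving being guarded against (a
sketch only, not code from this patch; the thread labels are invented):

	/*
	 * Zapping thread                  Faulting thread
	 * --------------                  ---------------
	 * cmpxchg(sptep, old_spte, 0)
	 *                                 cmpxchg(sptep, 0, new_spte)
	 *                                 guest can now use the new PFN,
	 *                                 while remote TLBs may still
	 *                                 hold the old translation
	 * kvm_flush_remote_tlbs(...)
	 */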

Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>

---

v1 -> v2
- Renamed "FROZEN_SPTE" to "REMOVED_SPTE" and updated derivative
  comments and code

 arch/x86/kvm/mmu/spte.h    | 21 ++++++++++++-
 arch/x86/kvm/mmu/tdp_mmu.c | 63 ++++++++++++++++++++++++++++++++------
 2 files changed, 74 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 2b3a30bd38b0..3f974006cfb6 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -130,6 +130,25 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
 					  PT64_EPT_EXECUTABLE_MASK)
 #define SHADOW_ACC_TRACK_SAVED_BITS_SHIFT PT64_SECOND_AVAIL_BITS_SHIFT
 
+/*
+ * If a thread running without exclusive control of the MMU lock must perform a
+ * multi-part operation on an SPTE, it can set the SPTE to REMOVED_SPTE as a
+ * non-present intermediate value. Other threads which encounter this value
+ * should not modify the SPTE.
+ *
+ * This constant works because it is considered non-present on both AMD and
+ * Intel CPUs and does not create an L1TF vulnerability because the pfn section
+ * is zeroed out.
+ *
+ * Only used by the TDP MMU.
+ */
+#define REMOVED_SPTE (1ull << 59)
+
+static inline bool is_removed_spte(u64 spte)
+{
+	return spte == REMOVED_SPTE;
+}
+
 /*
  * In some cases, we need to preserve the GFN of a non-present or reserved
  * SPTE when we usurp the upper five bits of the physical address space to
@@ -187,7 +206,7 @@ static inline bool is_access_track_spte(u64 spte)
 
 static inline int is_shadow_present_pte(u64 pte)
 {
-	return (pte != 0) && !is_mmio_spte(pte);
+	return (pte != 0) && !is_mmio_spte(pte) && !is_removed_spte(pte);
 }
 
 static inline int is_large_pte(u64 pte)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0b5a9339ac55..7a2cdfeac4d2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -427,15 +427,19 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	 */
 	if (!was_present && !is_present) {
 		/*
-		 * If this change does not involve a MMIO SPTE, it is
-		 * unexpected. Log the change, though it should not impact the
-		 * guest since both the former and current SPTEs are nonpresent.
+		 * If this change does not involve a MMIO SPTE or removed SPTE,
+		 * it is unexpected. Log the change, though it should not
+		 * impact the guest since both the former and current SPTEs
+		 * are nonpresent.
 		 */
-		if (WARN_ON(!is_mmio_spte(old_spte) && !is_mmio_spte(new_spte)))
+		if (WARN_ON(!is_mmio_spte(old_spte) &&
+			    !is_mmio_spte(new_spte) &&
+			    !is_removed_spte(new_spte)))
 			pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
 			       "should not be replaced with another,\n"
 			       "different nonpresent SPTE, unless one or both\n"
-			       "are MMIO SPTEs.\n"
+			       "are MMIO SPTEs, or the new SPTE is\n"
+			       "a temporary removed SPTE.\n"
 			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
 			       as_id, gfn, old_spte, new_spte, level);
 		return;
@@ -486,6 +490,13 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 
 	lockdep_assert_held_read(&kvm->mmu_lock);
 
+	/*
+	 * Do not change removed SPTEs. Only the thread that froze the SPTE
+	 * may modify it.
+	 */
+	if (iter->old_spte == REMOVED_SPTE)
+		return false;
+
 	if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
 		      new_spte) != iter->old_spte)
 		return false;
@@ -496,6 +507,34 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	return true;
 }
 
+static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
+					   struct tdp_iter *iter)
+{
+	/*
+	 * Freeze the SPTE by setting it to a special,
+	 * non-present value. This will stop other threads from
+	 * immediately installing a present entry in its place
+	 * before the TLBs are flushed.
+	 */
+	if (!tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE))
+		return false;
+
+	kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
+					   KVM_PAGES_PER_HPAGE(iter->level));
+
+	/*
+	 * No other thread can overwrite the removed SPTE as they
+	 * must either wait on the MMU lock or use
+	 * tdp_mmu_set_spte_atomic which will not overrite the
+	 * special removed SPTE value. No bookkeeping is needed
+	 * here since the SPTE is going from non-present
+	 * to non-present.
+	 */
+	WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
+
+	return true;
+}
+
 
 /*
  * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
@@ -523,6 +562,15 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
+	/*
+	 * No thread should be using this function to set SPTEs to the
+	 * temporary removed SPTE value.
+	 * If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
+	 * should be used. If operating under the MMU lock in write mode, the
+	 * use of the removed SPTE should not be necessary.
+	 */
+	WARN_ON(iter->old_spte == REMOVED_SPTE);
+
 	WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
 
 	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
@@ -790,12 +838,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		 */
 		if (is_shadow_present_pte(iter.old_spte) &&
 		    is_large_pte(iter.old_spte)) {
-			if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, 0))
+			if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
 				break;
 
-			kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn,
-					KVM_PAGES_PER_HPAGE(iter.level));
-
 			/*
 			 * The iter must explicitly re-read the spte here
 			 * because the new value informs the !present
-- 
2.30.0.365.g02bc693789-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 22/28] KVM: x86/mmu: Mark SPTEs in disconnected pages as removed
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (20 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 21/28] KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-03 11:17   ` Paolo Bonzini
  2021-02-02 18:57 ` [PATCH v2 23/28] KVM: x86/mmu: Allow parallel page faults for the TDP MMU Ben Gardon
                   ` (6 subsequent siblings)
  28 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

When clearing TDP MMU pages that have been disconnected from the paging
structure root, set the SPTEs to a special non-present value which will
not be overwritten by other threads. This is needed to prevent races in
which one thread is clearing a disconnected page table while another
thread, which already holds a pointer to that memory, installs a mapping
in an already-cleared entry. Such a race can lead to memory leaks and
accounting errors.
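
As a sketch (illustrative interleaving, not code from this patch), the
race being closed looks like:

	/*
	 * Thread A (zap)                  Thread B (page fault)
	 * --------------                  ---------------------
	 * reads parent SPTE, starts       reads the same parent SPTE,
	 * disconnecting the child         saves a pointer to the child
	 * page table                      page table
	 * xchg(child sptep, 0)
	 *                                 cmpxchg(child sptep, 0, new_spte)
	 *                                 the mapping lands in a table that
	 *                                 is no longer reachable, so the
	 *                                 installed page is leaked
	 */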

Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 36 ++++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7a2cdfeac4d2..0dd27e000dd0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -334,9 +334,10 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt,
 {
 	struct kvm_mmu_page *sp = sptep_to_sp(pt);
 	int level = sp->role.level;
-	gfn_t gfn = sp->gfn;
+	gfn_t base_gfn = sp->gfn;
 	u64 old_child_spte;
 	u64 *sptep;
+	gfn_t gfn;
 	int i;
 
 	trace_kvm_mmu_prepare_zap_page(sp);
@@ -345,16 +346,39 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt,
 
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
 		sptep = pt + i;
+		gfn = base_gfn + (i * KVM_PAGES_PER_HPAGE(level - 1));
 
 		if (shared) {
-			old_child_spte = xchg(sptep, 0);
+			/*
+			 * Set the SPTE to a nonpresent value that other
+			 * threads will not overwrite. If the SPTE was
+			 * already marked as removed then another thread
+			 * handling a page fault could overwrite it, so
+			 * retry the exchange until this thread observes
+			 * the SPTE change from a non-removed value to
+			 * the removed SPTE value.
+			 */
+			for (;;) {
+				old_child_spte = xchg(sptep, REMOVED_SPTE);
+				if (!is_removed_spte(old_child_spte))
+					break;
+				cpu_relax();
+			}
 		} else {
 			old_child_spte = READ_ONCE(*sptep);
-			WRITE_ONCE(*sptep, 0);
+
+			/*
+			 * Marking the SPTE as a removed SPTE is not
+			 * strictly necessary here as the MMU lock should
+			 * stop other threads from concurrently modifying
+			 * this SPTE. Using the removed SPTE value keeps
+			 * the shared and non-atomic cases consistent and
+			 * simplifies the function.
+			 */
+			WRITE_ONCE(*sptep, REMOVED_SPTE);
 		}
-		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp),
-			gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
-			old_child_spte, 0, level - 1, shared);
+		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
+				    old_child_spte, REMOVED_SPTE, level - 1,
+				    shared);
 	}
 
 	kvm_flush_remote_tlbs_with_address(kvm, gfn,
-- 
2.30.0.365.g02bc693789-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 23/28] KVM: x86/mmu: Allow parallel page faults for the TDP MMU
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (21 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 22/28] KVM: x86/mmu: Mark SPTEs in disconnected pages as removed Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-03 12:39   ` Paolo Bonzini
  2021-02-02 18:57 ` [PATCH v2 24/28] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock Ben Gardon
                   ` (5 subsequent siblings)
  28 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Make the last few changes necessary to enable the TDP MMU to handle page
faults in parallel while holding the mmu_lock in read mode.
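
The fault path tolerates running concurrently because its SPTE updates
all go through the atomic helpers added earlier in the series; a racing
update simply makes the fault retry (pattern from patch 20, shown here
for context, not new code):

	if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
		return RET_PF_RETRY;	/* lost the race; vCPU re-faults */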

Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b4d6709c240e..3d181a2b2485 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3724,7 +3724,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		return r;
 
 	r = RET_PF_RETRY;
-	write_lock(&vcpu->kvm->mmu_lock);
+
+	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+		read_lock(&vcpu->kvm->mmu_lock);
+	else
+		write_lock(&vcpu->kvm->mmu_lock);
+
 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
 		goto out_unlock;
 	r = make_mmu_pages_available(vcpu);
@@ -3739,7 +3744,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 				 prefault, is_tdp);
 
 out_unlock:
-	write_unlock(&vcpu->kvm->mmu_lock);
+	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+		read_unlock(&vcpu->kvm->mmu_lock);
+	else
+		write_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(pfn);
 	return r;
 }
-- 
2.30.0.365.g02bc693789-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 24/28] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (22 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 23/28] KVM: x86/mmu: Allow parallel page faults for the TDP MMU Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-03 11:25   ` Paolo Bonzini
  2021-02-03 11:26   ` Paolo Bonzini
  2021-02-02 18:57 ` [PATCH v2 25/28] KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU " Ben Gardon
                   ` (4 subsequent siblings)
  28 siblings, 2 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

To reduce lock contention and interference with page fault handlers,
allow the TDP MMU function which zaps a GFN range to operate under the
MMU read lock.
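
The subtlest piece is dropping a root reference while holding the MMU
lock only in read mode. A simplified pseudocode sketch of the protocol
implemented below (names abbreviated, cmpxchg retry elided):

	/* called with mmu_lock held for read */
	if (refcount > 1 && cmpxchg(refs, n, n - 1) == n)
		return;		/* fast path: not the last reference */

	/* possibly the last reference: retake the lock exclusively */
	read_unlock(&kvm->mmu_lock);
	write_lock(&kvm->mmu_lock);
	if (refcount_dec_and_test(&root->tdp_mmu_root_count))
		kvm_tdp_mmu_free_root(kvm, root);
	write_unlock(&kvm->mmu_lock);
	read_lock(&kvm->mmu_lock);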

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c          |  13 ++-
 arch/x86/kvm/mmu/mmu_internal.h |   6 +-
 arch/x86/kvm/mmu/tdp_mmu.c      | 165 +++++++++++++++++++++++++-------
 arch/x86/kvm/mmu/tdp_mmu.h      |   3 +-
 4 files changed, 145 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3d181a2b2485..254ff87d2a61 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5518,13 +5518,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 		}
 	}
 
+	kvm_mmu_unlock(kvm);
+
 	if (kvm->arch.tdp_mmu_enabled) {
-		flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end);
+		read_lock(&kvm->mmu_lock);
+		flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end,
+						  true);
 		if (flush)
 			kvm_flush_remote_tlbs(kvm);
-	}
 
-	write_unlock(&kvm->mmu_lock);
+		read_unlock(&kvm->mmu_lock);
+	}
 }
 
 static bool slot_rmap_write_protect(struct kvm *kvm,
@@ -6015,7 +6019,8 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 		WARN_ON_ONCE(!sp->lpage_disallowed);
 		if (sp->tdp_mmu_page) {
 			kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn,
-				sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level));
+				sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level),
+				false);
 		} else {
 			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 			WARN_ON_ONCE(sp->lpage_disallowed);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 7f599cc64178..7df209fb8051 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -40,7 +40,11 @@ struct kvm_mmu_page {
 	u64 *spt;
 	/* hold the gfn of each spte inside spt */
 	gfn_t *gfns;
-	int root_count;          /* Currently serving as active root */
+	/* Currently serving as active root */
+	union {
+		int root_count;
+		refcount_t tdp_mmu_root_count;
+	};
 	unsigned int unsync_children;
 	struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
 	DECLARE_BITMAP(unsync_child_bitmap, 512);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0dd27e000dd0..de26762433ea 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -52,46 +52,104 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 	rcu_barrier();
 }
 
-static void tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
+static __always_inline __must_check bool tdp_mmu_get_root(struct kvm *kvm,
+						struct kvm_mmu_page *root)
 {
-	if (kvm_mmu_put_root(kvm, root))
-		kvm_tdp_mmu_free_root(kvm, root);
+	return refcount_inc_not_zero(&root->tdp_mmu_root_count);
 }
 
-static inline bool tdp_mmu_next_root_valid(struct kvm *kvm,
-					   struct kvm_mmu_page *root)
+static __always_inline void tdp_mmu_put_root(struct kvm *kvm,
+					     struct kvm_mmu_page *root,
+					     bool shared)
 {
-	lockdep_assert_held_write(&kvm->mmu_lock);
+	int root_count;
+	int r;
 
-	if (list_entry_is_head(root, &kvm->arch.tdp_mmu_roots, link))
-		return false;
+	if (shared) {
+		lockdep_assert_held_read(&kvm->mmu_lock);
 
-	kvm_mmu_get_root(kvm, root);
-	return true;
+		root_count = atomic_read(&root->tdp_mmu_root_count.refs);
+
+		/*
+		 * If this is not the last reference on the root, it can be
+		 * dropped under the MMU read lock.
+		 */
+		if (root_count > 1) {
+			r = atomic_cmpxchg(&root->tdp_mmu_root_count.refs,
+					   root_count, root_count - 1);
+			if (r == root_count)
+				return;
+		}
+
+		/*
+		 * If the cmpxchg failed because of a race or this is the
+		 * last reference on the root, drop the read lock, and
+		 * reacquire the MMU lock in write mode.
+		 */
+		read_unlock(&kvm->mmu_lock);
+		write_lock(&kvm->mmu_lock);
+	} else {
+		lockdep_assert_held_write(&kvm->mmu_lock);
+	}
+
+	/*
+	 * No other thread can modify the root count since this thread holds
+	 * the MMU lock in write mode.
+	 */
+	BUG_ON(!atomic_read(&root->tdp_mmu_root_count.refs));
 
+	if (refcount_dec_and_test(&root->tdp_mmu_root_count))
+		kvm_tdp_mmu_free_root(kvm, root);
+
+	if (shared) {
+		write_unlock(&kvm->mmu_lock);
+		read_lock(&kvm->mmu_lock);
+	}
 }
 
 static inline struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
-						     struct kvm_mmu_page *root)
+						     struct kvm_mmu_page *root,
+						     bool shared)
 {
 	struct kvm_mmu_page *next_root;
 
 	next_root = list_next_entry(root, link);
-	tdp_mmu_put_root(kvm, root);
+	tdp_mmu_put_root(kvm, root, shared);
 	return next_root;
 }
 
+static inline bool tdp_mmu_next_root_valid(struct kvm *kvm,
+					   struct kvm_mmu_page *root)
+{
+	for (;;) {
+		if (list_entry_is_head(root, &kvm->arch.tdp_mmu_roots, link))
+			return false;
+
+		if (tdp_mmu_get_root(kvm, root))
+			return true;
+
+		root = list_next_entry(root, link);
+	}
+}
+
 /*
  * Note: this iterator gets and puts references to the roots it iterates over.
  * This makes it safe to release the MMU lock and yield within the loop, but
  * if exiting the loop early, the caller must drop the reference to the most
  * recent root. (Unless keeping a live reference is desirable.)
+ *
+ * If shared is set, this function is operating under the MMU lock in read
+ * mode. In the unlikely event that this thread must free a root, the lock
+ * will be temporarily dropped and reacquired in write mode.
  */
-#define for_each_tdp_mmu_root_yield_safe(_kvm, _root)				\
+#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _shared)				\
 	for (_root = list_first_entry(&_kvm->arch.tdp_mmu_roots,	\
 				      typeof(*_root), link);		\
 	     tdp_mmu_next_root_valid(_kvm, _root);			\
-	     _root = tdp_mmu_next_root(_kvm, _root))
+	     _root = tdp_mmu_next_root(_kvm, _root, _shared))
 
 #define for_each_tdp_mmu_root(_kvm, _root)				\
 	list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)
@@ -113,7 +171,7 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
 }
 
 static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
-			  gfn_t start, gfn_t end, bool can_yield);
+			  gfn_t start, gfn_t end, bool can_yield, bool shared);
 
 void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
 {
@@ -126,7 +184,7 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
 
 	list_del(&root->link);
 
-	zap_gfn_range(kvm, root, 0, max_gfn, false);
+	zap_gfn_range(kvm, root, 0, max_gfn, false, false);
 
 	free_page((unsigned long)root->spt);
 	kmem_cache_free(mmu_page_header_cache, root);
@@ -658,7 +716,8 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
  * Return false if a yield was not needed.
  */
 static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
-					     struct tdp_iter *iter, bool flush)
+					     struct tdp_iter *iter, bool flush,
+					     bool shared)
 {
 	/* Ensure forward progress has been made before yielding. */
 	if (iter->next_last_level_gfn == iter->yielded_gfn)
@@ -670,7 +729,11 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
 		if (flush)
 			kvm_flush_remote_tlbs(kvm);
 
-		cond_resched_rwlock_write(&kvm->mmu_lock);
+		if (shared)
+			cond_resched_rwlock_read(&kvm->mmu_lock);
+		else
+			cond_resched_rwlock_write(&kvm->mmu_lock);
+
 		rcu_read_lock();
 
 		WARN_ON(iter->gfn > iter->next_last_level_gfn);
@@ -690,23 +753,38 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
  * non-root pages mapping GFNs strictly within that range. Returns true if
  * SPTEs have been cleared and a TLB flush is needed before releasing the
  * MMU lock.
+ *
  * If can_yield is true, will release the MMU lock and reschedule if the
  * scheduler needs the CPU or there is contention on the MMU lock. If this
  * function cannot yield, it will not release the MMU lock or reschedule and
  * the caller must ensure it does not supply too large a GFN range, or the
  * operation can cause a soft lockup.
+ *
+ * If shared is true, this thread holds the MMU lock in read mode and must
+ * account for the possibility that other threads are modifying the paging
+ * structures concurrently. If shared is false, this thread should hold the
+ * MMU lock in write mode.
  */
 static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
-			  gfn_t start, gfn_t end, bool can_yield)
+			  gfn_t start, gfn_t end, bool can_yield, bool shared)
 {
 	struct tdp_iter iter;
 	bool flush_needed = false;
 
+#ifdef CONFIG_LOCKDEP
+	if (shared)
+		lockdep_assert_held_read(&kvm->mmu_lock);
+	else
+		lockdep_assert_held_write(&kvm->mmu_lock);
+#endif /* CONFIG_LOCKDEP */
+
 	rcu_read_lock();
 
 	tdp_root_for_each_pte(iter, root, start, end) {
+retry:
 		if (can_yield &&
-		    tdp_mmu_iter_cond_resched(kvm, &iter, flush_needed)) {
+		    tdp_mmu_iter_cond_resched(kvm, &iter, flush_needed,
+					      shared)) {
 			flush_needed = false;
 			continue;
 		}
@@ -724,8 +802,17 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
-		tdp_mmu_set_spte(kvm, &iter, 0);
-		flush_needed = true;
+		if (!shared) {
+			tdp_mmu_set_spte(kvm, &iter, 0);
+			flush_needed = true;
+		} else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
+			/*
+			 * The iter must explicitly re-read the SPTE because
+			 * the atomic cmpxchg failed.
+			 */
+			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+			goto retry;
+		}
 	}
 
 	rcu_read_unlock();
@@ -737,14 +824,20 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
  * non-root pages mapping GFNs strictly within that range. Returns true if
  * SPTEs have been cleared and a TLB flush is needed before releasing the
  * MMU lock.
+ *
+ * If shared is true, this thread holds the MMU lock in read mode and must
+ * account for the possibility that other threads are modifying the paging
+ * structures concurrently. If shared is false, this thread should hold the
+ * MMU lock in write mode.
  */
-bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
+bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end,
+			       bool shared)
 {
 	struct kvm_mmu_page *root;
 	bool flush = false;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root)
-		flush |= zap_gfn_range(kvm, root, start, end, true);
+	for_each_tdp_mmu_root_yield_safe(kvm, root, shared)
+		flush |= zap_gfn_range(kvm, root, start, end, true, shared);
 
 	return flush;
 }
@@ -754,7 +847,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 	gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
 	bool flush;
 
-	flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn);
+	flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn, false);
 	if (flush)
 		kvm_flush_remote_tlbs(kvm);
 }
@@ -918,7 +1011,7 @@ static int kvm_tdp_mmu_handle_hva_range(struct kvm *kvm, unsigned long start,
 	int ret = 0;
 	int as_id;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root) {
+	for_each_tdp_mmu_root_yield_safe(kvm, root, false) {
 		as_id = kvm_mmu_page_as_id(root);
 		slots = __kvm_memslots(kvm, as_id);
 		kvm_for_each_memslot(memslot, slots) {
@@ -950,7 +1043,7 @@ static int zap_gfn_range_hva_wrapper(struct kvm *kvm,
 				     struct kvm_mmu_page *root, gfn_t start,
 				     gfn_t end, unsigned long unused)
 {
-	return zap_gfn_range(kvm, root, start, end, false);
+	return zap_gfn_range(kvm, root, start, end, false, false);
 }
 
 int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
@@ -1113,7 +1206,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
 				   min_level, start, end) {
-		if (tdp_mmu_iter_cond_resched(kvm, &iter, false))
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false))
 			continue;
 
 		if (!is_shadow_present_pte(iter.old_spte) ||
@@ -1143,7 +1236,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
 	int root_as_id;
 	bool spte_set = false;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root) {
+	for_each_tdp_mmu_root_yield_safe(kvm, root, false) {
 		root_as_id = kvm_mmu_page_as_id(root);
 		if (root_as_id != slot->as_id)
 			continue;
@@ -1172,7 +1265,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	rcu_read_lock();
 
 	tdp_root_for_each_leaf_pte(iter, root, start, end) {
-		if (tdp_mmu_iter_cond_resched(kvm, &iter, false))
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false))
 			continue;
 
 		if (spte_ad_need_write_protect(iter.old_spte)) {
@@ -1208,7 +1301,7 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
 	int root_as_id;
 	bool spte_set = false;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root) {
+	for_each_tdp_mmu_root_yield_safe(kvm, root, false) {
 		root_as_id = kvm_mmu_page_as_id(root);
 		if (root_as_id != slot->as_id)
 			continue;
@@ -1304,7 +1397,7 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	rcu_read_lock();
 
 	tdp_root_for_each_pte(iter, root, start, end) {
-		if (tdp_mmu_iter_cond_resched(kvm, &iter, false))
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false))
 			continue;
 
 		if (!is_shadow_present_pte(iter.old_spte) ||
@@ -1332,7 +1425,7 @@ bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
 	int root_as_id;
 	bool spte_set = false;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root) {
+	for_each_tdp_mmu_root_yield_safe(kvm, root, false) {
 		root_as_id = kvm_mmu_page_as_id(root);
 		if (root_as_id != slot->as_id)
 			continue;
@@ -1358,7 +1451,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 	rcu_read_lock();
 
 	tdp_root_for_each_pte(iter, root, start, end) {
-		if (tdp_mmu_iter_cond_resched(kvm, &iter, spte_set)) {
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false)) {
 			spte_set = false;
 			continue;
 		}
@@ -1392,7 +1485,7 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 	struct kvm_mmu_page *root;
 	int root_as_id;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root) {
+	for_each_tdp_mmu_root_yield_safe(kvm, root, false) {
 		root_as_id = kvm_mmu_page_as_id(root);
 		if (root_as_id != slot->as_id)
 			continue;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index cbbdbadd1526..10ada884270b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -12,7 +12,8 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
 void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);
 
-bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
+bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end,
+			       bool shared);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
-- 
2.30.0.365.g02bc693789-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 25/28] KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU read lock
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (23 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 24/28] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-03 11:34   ` Paolo Bonzini
  2021-02-02 18:57 ` [PATCH v2 26/28] KVM: x86/mmu: Allow enabling / disabling dirty logging under " Ben Gardon
                   ` (3 subsequent siblings)
  28 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

To speed the process of disabling dirty logging, change the TDP MMU
function which zaps collapsible SPTEs to run under the MMU read lock.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c     |  5 ++---
 arch/x86/kvm/mmu/tdp_mmu.c | 22 +++++++++++++++-------
 2 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 254ff87d2a61..e3cf868be6bd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5517,8 +5517,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 						start, end - 1, true);
 		}
 	}
-
-	kvm_mmu_unlock(kvm);
+	write_unlock(&kvm->mmu_lock);
 
 	if (kvm->arch.tdp_mmu_enabled) {
 		read_lock(&kvm->mmu_lock);
@@ -5611,10 +5610,10 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 	write_lock(&kvm->mmu_lock);
 	slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
 			 kvm_mmu_zap_collapsible_spte, true);
+	write_unlock(&kvm->mmu_lock);
 
 	if (kvm->arch.tdp_mmu_enabled)
 		kvm_tdp_mmu_zap_collapsible_sptes(kvm, memslot);
-	write_unlock(&kvm->mmu_lock);
 }
 
 void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index de26762433ea..cfe66b8d39fa 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1451,10 +1451,9 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 	rcu_read_lock();
 
 	tdp_root_for_each_pte(iter, root, start, end) {
-		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false)) {
-			spte_set = false;
+retry:
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
-		}
 
 		if (!is_shadow_present_pte(iter.old_spte) ||
 		    !is_last_spte(iter.old_spte, iter.level))
@@ -1465,9 +1464,14 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 		    !PageTransCompoundMap(pfn_to_page(pfn)))
 			continue;
 
-		tdp_mmu_set_spte(kvm, &iter, 0);
-
-		spte_set = true;
+		if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
+			/*
+			 * The iter must explicitly re-read the SPTE because
+			 * the atomic cmpxchg failed.
+			 */
+			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+			goto retry;
+		}
 	}
 
 	rcu_read_unlock();
@@ -1485,7 +1489,9 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 	struct kvm_mmu_page *root;
 	int root_as_id;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root, false) {
+	read_lock(&kvm->mmu_lock);
+
+	for_each_tdp_mmu_root_yield_safe(kvm, root, true) {
 		root_as_id = kvm_mmu_page_as_id(root);
 		if (root_as_id != slot->as_id)
 			continue;
@@ -1493,6 +1499,8 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 		zap_collapsible_spte_range(kvm, root, slot->base_gfn,
 					   slot->base_gfn + slot->npages);
 	}
+
+	read_unlock(&kvm->mmu_lock);
 }
 
 /*
-- 
2.30.0.365.g02bc693789-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 26/28] KVM: x86/mmu: Allow enabling / disabling dirty logging under MMU read lock
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (24 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 25/28] KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU " Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-03 11:38   ` Paolo Bonzini
  2021-02-02 18:57 ` [PATCH v2 27/28] KVM: selftests: Add backing src parameter to dirty_log_perf_test Ben Gardon
                   ` (2 subsequent siblings)
  28 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

To reduce lock contention and interference with page fault handlers,
allow the TDP MMU functions which enable and disable dirty logging
to operate under the MMU read lock.

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 14 +++---
 arch/x86/kvm/mmu/tdp_mmu.c | 93 ++++++++++++++++++++++++++++++--------
 arch/x86/kvm/mmu/tdp_mmu.h |  2 +-
 3 files changed, 84 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e3cf868be6bd..6ba2a72d4330 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5638,9 +5638,10 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 
 	write_lock(&kvm->mmu_lock);
 	flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false);
+	write_unlock(&kvm->mmu_lock);
+
 	if (kvm->arch.tdp_mmu_enabled)
 		flush |= kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
-	write_unlock(&kvm->mmu_lock);
 
 	/*
 	 * It's also safe to flush TLBs out of mmu lock here as currently this
@@ -5661,9 +5662,10 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
 	write_lock(&kvm->mmu_lock);
 	flush = slot_handle_large_level(kvm, memslot, slot_rmap_write_protect,
 					false);
+	write_unlock(&kvm->mmu_lock);
+
 	if (kvm->arch.tdp_mmu_enabled)
 		flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, PG_LEVEL_2M);
-	write_unlock(&kvm->mmu_lock);
 
 	if (flush)
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
@@ -5677,12 +5679,12 @@ void kvm_mmu_slot_set_dirty(struct kvm *kvm,
 
 	write_lock(&kvm->mmu_lock);
 	flush = slot_handle_all_level(kvm, memslot, __rmap_set_dirty, false);
-	if (kvm->arch.tdp_mmu_enabled)
-		flush |= kvm_tdp_mmu_slot_set_dirty(kvm, memslot);
-	write_unlock(&kvm->mmu_lock);
-
 	if (flush)
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
+	write_unlock(&kvm->mmu_lock);
+
+	if (kvm->arch.tdp_mmu_enabled)
+		kvm_tdp_mmu_slot_set_dirty(kvm, memslot);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_slot_set_dirty);
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cfe66b8d39fa..6093926a6bc5 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -553,18 +553,22 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 }
 
 /*
- * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically and handle the
+ * __tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically and handle the
  * associated bookkeeping
  *
  * @kvm: kvm instance
  * @iter: a tdp_iter instance currently on the SPTE that should be set
  * @new_spte: The value the SPTE should be set to
+ * @record_dirty_log: Record the page as dirty in the dirty bitmap if
+ *		      appropriate for the change being made. Should be set
+ *		      unless performing certain dirty logging operations.
+ *		      Leaving record_dirty_log unset in that case prevents page
+ *		      writes from being double counted.
  * Returns: true if the SPTE was set, false if it was not. If false is returned,
  *	    this function will have no side-effects.
  */
-static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
-					   struct tdp_iter *iter,
-					   u64 new_spte)
+static inline bool __tdp_mmu_set_spte_atomic(struct kvm *kvm,
+		struct tdp_iter *iter, u64 new_spte, bool record_dirty_log)
 {
 	u64 *root_pt = tdp_iter_root_pt(iter);
 	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
@@ -583,12 +587,31 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 		      new_spte) != iter->old_spte)
 		return false;
 
-	handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
-			    iter->level, true);
+	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
+			      iter->level, true);
+	handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
+	if (record_dirty_log)
+		handle_changed_spte_dirty_log(kvm, as_id, iter->gfn,
+					      iter->old_spte, new_spte,
+					      iter->level);
 
 	return true;
 }
 
+static inline bool tdp_mmu_set_spte_atomic_no_dirty_log(struct kvm *kvm,
+							struct tdp_iter *iter,
+							u64 new_spte)
+{
+	return __tdp_mmu_set_spte_atomic(kvm, iter, new_spte, false);
+}
+
+static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
+					   struct tdp_iter *iter,
+					   u64 new_spte)
+{
+	return __tdp_mmu_set_spte_atomic(kvm, iter, new_spte, true);
+}
+
 static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 					   struct tdp_iter *iter)
 {
@@ -1206,7 +1229,8 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
 				   min_level, start, end) {
-		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false))
+retry:
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
 
 		if (!is_shadow_present_pte(iter.old_spte) ||
@@ -1216,7 +1240,15 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
 
-		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
+		if (!tdp_mmu_set_spte_atomic_no_dirty_log(kvm, &iter,
+							  new_spte)) {
+			/*
+			 * The iter must explicitly re-read the SPTE because
+			 * the atomic cmpxchg failed.
+			 */
+			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+			goto retry;
+		}
 		spte_set = true;
 	}
 
@@ -1236,7 +1268,8 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
 	int root_as_id;
 	bool spte_set = false;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root, false) {
+	read_lock(&kvm->mmu_lock);
+	for_each_tdp_mmu_root_yield_safe(kvm, root, true) {
 		root_as_id = kvm_mmu_page_as_id(root);
 		if (root_as_id != slot->as_id)
 			continue;
@@ -1244,6 +1277,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
 		spte_set |= wrprot_gfn_range(kvm, root, slot->base_gfn,
 			     slot->base_gfn + slot->npages, min_level);
 	}
+	read_unlock(&kvm->mmu_lock);
 
 	return spte_set;
 }
@@ -1265,7 +1299,8 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	rcu_read_lock();
 
 	tdp_root_for_each_leaf_pte(iter, root, start, end) {
-		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false))
+retry:
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
 
 		if (spte_ad_need_write_protect(iter.old_spte)) {
@@ -1280,7 +1315,15 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 				continue;
 		}
 
-		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
+		if (!tdp_mmu_set_spte_atomic_no_dirty_log(kvm, &iter,
+							  new_spte)) {
+			/*
+			 * The iter must explicitly re-read the SPTE because
+			 * the atomic cmpxchg failed.
+			 */
+			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+			goto retry;
+		}
 		spte_set = true;
 	}
 
@@ -1301,7 +1344,8 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
 	int root_as_id;
 	bool spte_set = false;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root, false) {
+	read_lock(&kvm->mmu_lock);
+	for_each_tdp_mmu_root_yield_safe(kvm, root, true) {
 		root_as_id = kvm_mmu_page_as_id(root);
 		if (root_as_id != slot->as_id)
 			continue;
@@ -1309,6 +1353,7 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
 		spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn,
 				slot->base_gfn + slot->npages);
 	}
+	read_unlock(&kvm->mmu_lock);
 
 	return spte_set;
 }
@@ -1397,7 +1442,8 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	rcu_read_lock();
 
 	tdp_root_for_each_pte(iter, root, start, end) {
-		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false))
+retry:
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
 
 		if (!is_shadow_present_pte(iter.old_spte) ||
@@ -1406,7 +1452,14 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		new_spte = iter.old_spte | shadow_dirty_mask;
 
-		tdp_mmu_set_spte(kvm, &iter, new_spte);
+		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte)) {
+			/*
+			 * The iter must explicitly re-read the SPTE because
+			 * the atomic cmpxchg failed.
+			 */
+			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+			goto retry;
+		}
 		spte_set = true;
 	}
 
@@ -1417,15 +1470,15 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 /*
  * Set the dirty status of all the SPTEs mapping GFNs in the memslot. This is
  * only used for PML, and so will involve setting the dirty bit on each SPTE.
- * Returns true if an SPTE has been changed and the TLBs need to be flushed.
  */
-bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
+void kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
 	struct kvm_mmu_page *root;
 	int root_as_id;
 	bool spte_set = false;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root, false) {
+	read_lock(&kvm->mmu_lock);
+	for_each_tdp_mmu_root_yield_safe(kvm, root, true) {
 		root_as_id = kvm_mmu_page_as_id(root);
 		if (root_as_id != slot->as_id)
 			continue;
@@ -1433,7 +1486,11 @@ bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
 		spte_set |= set_dirty_gfn_range(kvm, root, slot->base_gfn,
 				slot->base_gfn + slot->npages);
 	}
-	return spte_set;
+
+	if (spte_set)
+		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+
+	read_unlock(&kvm->mmu_lock);
 }
 
 /*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 10ada884270b..848b41b20985 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -38,7 +38,7 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 				       struct kvm_memory_slot *slot,
 				       gfn_t gfn, unsigned long mask,
 				       bool wrprot);
-bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot);
+void kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				       const struct kvm_memory_slot *slot);
 
-- 
2.30.0.365.g02bc693789-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 27/28] KVM: selftests: Add backing src parameter to dirty_log_perf_test
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (25 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 26/28] KVM: x86/mmu: Allow enabling / disabling dirty logging under " Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-02 18:57 ` [PATCH v2 28/28] KVM: selftests: Disable dirty logging with vCPUs running Ben Gardon
  2021-02-03 11:00 ` [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Paolo Bonzini
  28 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon,
	Andrew Jones, Thomas Huth

Add a parameter to control the backing memory type for
dirty_log_perf_test so that the test can be run with hugepages.
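
For example (an illustrative invocation; the -s flag and backing source
names come from the patch below, the vCPU count is arbitrary):

	# back the guest test memory with explicit hugetlb pages
	./dirty_log_perf_test -v 8 -s anonymous_hugetlb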

To: linux-kselftest@vger.kernel.org
CC: Peter Xu <peterx@redhat.com>
CC: Andrew Jones <drjones@redhat.com>
CC: Thomas Huth <thuth@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 .../selftests/kvm/demand_paging_test.c        |  3 +-
 .../selftests/kvm/dirty_log_perf_test.c       | 15 ++++++++--
 .../testing/selftests/kvm/include/kvm_util.h  |  6 ----
 .../selftests/kvm/include/perf_test_util.h    |  3 +-
 .../testing/selftests/kvm/include/test_util.h | 14 +++++++++
 .../selftests/kvm/lib/perf_test_util.c        |  6 ++--
 tools/testing/selftests/kvm/lib/test_util.c   | 29 +++++++++++++++++++
 7 files changed, 62 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index cdad1eca72f7..9e3254ff0821 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -265,7 +265,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	int vcpu_id;
 	int r;
 
-	vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size);
+	vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
+				 VM_MEM_SRC_ANONYMOUS);
 
 	perf_test_args.wr_fract = 1;
 
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index 2283a0ec74a9..604ccefd6e76 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -92,6 +92,7 @@ struct test_params {
 	unsigned long iterations;
 	uint64_t phys_offset;
 	int wr_fract;
+	enum vm_mem_backing_src_type backing_src;
 };
 
 static void run_test(enum vm_guest_mode mode, void *arg)
@@ -111,7 +112,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct kvm_enable_cap cap = {};
 	struct timespec clear_dirty_log_total = (struct timespec){0};
 
-	vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size);
+	vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
+				 p->backing_src);
 
 	perf_test_args.wr_fract = p->wr_fract;
 
@@ -236,7 +238,7 @@ static void help(char *name)
 {
 	puts("");
 	printf("usage: %s [-h] [-i iterations] [-p offset] "
-	       "[-m mode] [-b vcpu bytes] [-v vcpus]\n", name);
+	       "[-m mode] [-b vcpu bytes] [-v vcpus] [-s mem type]\n", name);
 	puts("");
 	printf(" -i: specify iteration counts (default: %"PRIu64")\n",
 	       TEST_HOST_LOOP_N);
@@ -251,6 +253,9 @@ static void help(char *name)
 	       "     1/<fraction of pages to write>.\n"
 	       "     (default: 1 i.e. all pages are written to.)\n");
 	printf(" -v: specify the number of vCPUs to run.\n");
+	printf(" -s: specify the type of memory that should be used to\n"
+	       "     back the guest data region.\n");
+	backing_src_help();
 	puts("");
 	exit(0);
 }
@@ -261,6 +266,7 @@ int main(int argc, char *argv[])
 	struct test_params p = {
 		.iterations = TEST_HOST_LOOP_N,
 		.wr_fract = 1,
+		.backing_src = VM_MEM_SRC_ANONYMOUS,
 	};
 	int opt;
 
@@ -271,7 +277,7 @@ int main(int argc, char *argv[])
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "hi:p:m:b:f:v:")) != -1) {
+	while ((opt = getopt(argc, argv, "hi:p:m:b:f:v:s:")) != -1) {
 		switch (opt) {
 		case 'i':
 			p.iterations = strtol(optarg, NULL, 10);
@@ -295,6 +301,9 @@ int main(int argc, char *argv[])
 			TEST_ASSERT(nr_vcpus > 0 && nr_vcpus <= max_vcpus,
 				    "Invalid number of vcpus, must be between 1 and %d", max_vcpus);
 			break;
+		case 's':
+			p.backing_src = parse_backing_src_type(optarg);
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 5cbb861525ed..2d7eb6989e83 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -79,12 +79,6 @@ struct vm_guest_mode_params {
 };
 extern const struct vm_guest_mode_params vm_guest_mode_params[];
 
-enum vm_mem_backing_src_type {
-	VM_MEM_SRC_ANONYMOUS,
-	VM_MEM_SRC_ANONYMOUS_THP,
-	VM_MEM_SRC_ANONYMOUS_HUGETLB,
-};
-
 int kvm_check_cap(long cap);
 int vm_enable_cap(struct kvm_vm *vm, struct kvm_enable_cap *cap);
 int vcpu_enable_cap(struct kvm_vm *vm, uint32_t vcpu_id,
diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h
index b1188823c31b..8b66ab300175 100644
--- a/tools/testing/selftests/kvm/include/perf_test_util.h
+++ b/tools/testing/selftests/kvm/include/perf_test_util.h
@@ -44,7 +44,8 @@ extern struct perf_test_args perf_test_args;
 extern uint64_t guest_test_phys_mem;
 
 struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus,
-				uint64_t vcpu_memory_bytes);
+				   uint64_t vcpu_memory_bytes,
+				   enum vm_mem_backing_src_type backing_src);
 void perf_test_destroy_vm(struct kvm_vm *vm);
 void perf_test_setup_vcpus(struct kvm_vm *vm, int vcpus, uint64_t vcpu_memory_bytes);
 
diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index ffffa560436b..749b24a239a1 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -67,4 +67,18 @@ struct timespec timespec_sub(struct timespec ts1, struct timespec ts2);
 struct timespec timespec_diff_now(struct timespec start);
 struct timespec timespec_div(struct timespec ts, int divisor);
 
+enum vm_mem_backing_src_type {
+	VM_MEM_SRC_ANONYMOUS,
+	VM_MEM_SRC_ANONYMOUS_THP,
+	VM_MEM_SRC_ANONYMOUS_HUGETLB,
+};
+
+struct vm_mem_backing_src_alias {
+	const char *name;
+	enum vm_mem_backing_src_type type;
+};
+
+void backing_src_help(void);
+enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
+
 #endif /* SELFTEST_KVM_TEST_UTIL_H */
diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
index 9be1944c2d1c..7f1571924347 100644
--- a/tools/testing/selftests/kvm/lib/perf_test_util.c
+++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
@@ -49,7 +49,8 @@ static void guest_code(uint32_t vcpu_id)
 }
 
 struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus,
-				   uint64_t vcpu_memory_bytes)
+				   uint64_t vcpu_memory_bytes,
+				   enum vm_mem_backing_src_type backing_src)
 {
 	struct kvm_vm *vm;
 	uint64_t guest_num_pages;
@@ -93,8 +94,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus,
 	pr_info("guest physical test memory offset: 0x%lx\n", guest_test_phys_mem);
 
 	/* Add an extra memory slot for testing */
-	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
-				    guest_test_phys_mem,
+	vm_userspace_mem_region_add(vm, backing_src, guest_test_phys_mem,
 				    PERF_TEST_MEM_SLOT_INDEX,
 				    guest_num_pages, 0);
 
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index 8e04c0b1608e..9fd60b142c23 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -10,6 +10,7 @@
 #include <limits.h>
 #include <stdlib.h>
 #include <time.h>
+#include "linux/kernel.h"
 
 #include "test_util.h"
 
@@ -109,3 +110,31 @@ void print_skip(const char *fmt, ...)
 	va_end(ap);
 	puts(", skipping test");
 }
+
+const struct vm_mem_backing_src_alias backing_src_aliases[] = {
+	{"anonymous", VM_MEM_SRC_ANONYMOUS,},
+	{"anonymous_thp", VM_MEM_SRC_ANONYMOUS_THP,},
+	{"anonymous_hugetlb", VM_MEM_SRC_ANONYMOUS_HUGETLB,},
+};
+
+void backing_src_help(void)
+{
+	int i;
+
+	printf("Available backing src types:\n");
+	for (i = 0; i < ARRAY_SIZE(backing_src_aliases); i++)
+		printf("\t%s\n", backing_src_aliases[i].name);
+}
+
+enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(backing_src_aliases); i++)
+		if (!strcmp(type_name, backing_src_aliases[i].name))
+			return backing_src_aliases[i].type;
+
+	backing_src_help();
+	TEST_FAIL("Unknown backing src type: %s", type_name);
+	return -1;
+}
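
(For context: callers of perf_test_create_vm() now have to pass the
backing type explicitly.  A hypothetical caller that does not expose a
command-line option would just pass the anonymous default, e.g.:

	vm = perf_test_create_vm(mode, nr_vcpus, vcpu_memory_bytes,
				 VM_MEM_SRC_ANONYMOUS);

where "vcpu_memory_bytes" stands in for whatever per-vCPU size the test
already computes; this sketch is not part of the patch itself.)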
-- 
2.30.0.365.g02bc693789-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v2 28/28] KVM: selftests: Disable dirty logging with vCPUs running
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (26 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 27/28] KVM: selftests: Add backing src parameter to dirty_log_perf_test Ben Gardon
@ 2021-02-02 18:57 ` Ben Gardon
  2021-02-03 10:07   ` Paolo Bonzini
  2021-02-03 11:00 ` [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Paolo Bonzini
  28 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2021-02-02 18:57 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon,
	Andrew Jones, Thomas Huth

Disabling dirty logging is much more interesting from a testing
perspective if the vCPUs are still running. This also exercises the
code-path in which collapsible SPTEs must be faulted back in at a higher
level after disabling dirty logging.

To: linux-kselftest@vger.kernel.org
CC: Peter Xu <peterx@redhat.com>
CC: Andrew Jones <drjones@redhat.com>
CC: Thomas Huth <thuth@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 tools/testing/selftests/kvm/dirty_log_perf_test.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index 604ccefd6e76..d44a5b8ef232 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -205,11 +205,6 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		}
 	}
 
-	/* Tell the vcpu thread to quit */
-	host_quit = true;
-	for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
-		pthread_join(vcpu_threads[vcpu_id], NULL);
-
 	/* Disable dirty logging */
 	clock_gettime(CLOCK_MONOTONIC, &start);
 	vm_mem_region_set_flags(vm, PERF_TEST_MEM_SLOT_INDEX, 0);
@@ -217,6 +212,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	pr_info("Disabling dirty logging time: %ld.%.9lds\n",
 		ts_diff.tv_sec, ts_diff.tv_nsec);
 
+	/* Tell the vcpu thread to quit */
+	host_quit = true;
+	for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
+		pthread_join(vcpu_threads[vcpu_id], NULL);
+
 	avg = timespec_div(get_dirty_log_total, p->iterations);
 	pr_info("Get dirty log over %lu iterations took %ld.%.9lds. (Avg %ld.%.9lds/iteration)\n",
 		p->iterations, get_dirty_log_total.tv_sec,
-- 
2.30.0.365.g02bc693789-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
  2021-02-02 18:57 ` [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map Ben Gardon
@ 2021-02-03  2:48   ` kernel test robot
  2021-02-03 11:14   ` Paolo Bonzini
  2021-04-01 10:32   ` Paolo Bonzini
  2 siblings, 0 replies; 65+ messages in thread
From: kernel test robot @ 2021-02-03  2:48 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: kbuild-all, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang

[-- Attachment #1: Type: text/plain, Size: 6088 bytes --]

Hi Ben,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/master]
[also build test ERROR on linux/master linus/master v5.11-rc6 next-20210125]
[cannot apply to kvm/linux-next tip/sched/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Ben-Gardon/Allow-parallel-MMU-operations-with-TDP-MMU/20210203-032259
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git a7e0bdf1b07ea6169930ec42b0bdb17e1c1e3bb0
config: i386-allyesconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/54f2f26ad4d34bc74287a904d2eebc011974147c
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Ben-Gardon/Allow-parallel-MMU-operations-with-TDP-MMU/20210203-032259
        git checkout 54f2f26ad4d34bc74287a904d2eebc011974147c
        # save the attached .config to linux build tree
        make W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from arch/x86/include/asm/atomic.h:8,
                    from include/linux/atomic.h:7,
                    from include/linux/cpumask.h:13,
                    from arch/x86/include/asm/cpumask.h:5,
                    from arch/x86/include/asm/msr.h:11,
                    from arch/x86/include/asm/processor.h:22,
                    from arch/x86/include/asm/cpufeature.h:5,
                    from arch/x86/include/asm/thread_info.h:53,
                    from include/linux/thread_info.h:56,
                    from arch/x86/include/asm/preempt.h:7,
                    from include/linux/preempt.h:78,
                    from include/linux/percpu.h:6,
                    from include/linux/context_tracking_state.h:5,
                    from include/linux/hardirq.h:5,
                    from include/linux/kvm_host.h:7,
                    from arch/x86/kvm/mmu.h:5,
                    from arch/x86/kvm/mmu/tdp_mmu.c:3:
   In function 'handle_removed_tdp_mmu_page',
       inlined from '__handle_changed_spte' at arch/x86/kvm/mmu/tdp_mmu.c:454:3:
>> arch/x86/include/asm/cmpxchg.h:67:4: error: call to '__xchg_wrong_size' declared with attribute error: Bad argument size for xchg
      67 |    __ ## op ## _wrong_size();   \
         |    ^~~~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/include/asm/cmpxchg.h:78:27: note: in expansion of macro '__xchg_op'
      78 | #define arch_xchg(ptr, v) __xchg_op((ptr), (v), xchg, "")
         |                           ^~~~~~~~~
   include/asm-generic/atomic-instrumented.h:1649:2: note: in expansion of macro 'arch_xchg'
    1649 |  arch_xchg(__ai_ptr, __VA_ARGS__); \
         |  ^~~~~~~~~
   arch/x86/kvm/mmu/tdp_mmu.c:350:21: note: in expansion of macro 'xchg'
     350 |    old_child_spte = xchg(sptep, 0);
         |                     ^~~~


vim +/__xchg_wrong_size +67 arch/x86/include/asm/cmpxchg.h

e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  37  
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  38  /* 
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  39   * An exchange-type operation, which takes a value and a pointer, and
7f5281ae8a8e7f Li Zhong            2013-04-25  40   * returns the old value.
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  41   */
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  42  #define __xchg_op(ptr, arg, op, lock)					\
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  43  	({								\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  44  	        __typeof__ (*(ptr)) __ret = (arg);			\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  45  		switch (sizeof(*(ptr))) {				\
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  46  		case __X86_CASE_B:					\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  47  			asm volatile (lock #op "b %b0, %1\n"		\
2ca052a3710fac Jeremy Fitzhardinge 2012-04-02  48  				      : "+q" (__ret), "+m" (*(ptr))	\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  49  				      : : "memory", "cc");		\
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  50  			break;						\
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  51  		case __X86_CASE_W:					\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  52  			asm volatile (lock #op "w %w0, %1\n"		\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  53  				      : "+r" (__ret), "+m" (*(ptr))	\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  54  				      : : "memory", "cc");		\
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  55  			break;						\
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  56  		case __X86_CASE_L:					\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  57  			asm volatile (lock #op "l %0, %1\n"		\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  58  				      : "+r" (__ret), "+m" (*(ptr))	\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  59  				      : : "memory", "cc");		\
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  60  			break;						\
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  61  		case __X86_CASE_Q:					\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  62  			asm volatile (lock #op "q %q0, %1\n"		\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  63  				      : "+r" (__ret), "+m" (*(ptr))	\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  64  				      : : "memory", "cc");		\
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  65  			break;						\
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  66  		default:						\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30 @67  			__ ## op ## _wrong_size();			\
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  68  		}							\
31a8394e069e47 Jeremy Fitzhardinge 2011-09-30  69  		__ret;							\
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  70  	})
e9826380d83d1b Jeremy Fitzhardinge 2011-08-18  71  
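
In short: TDP MMU SPTEs are u64, so on this i386 build sizeof(*sptep)
is 8 and the switch above falls through to the default case, because
32-bit x86 has no 8-byte xchg instruction (only cmpxchg8b).  A
hypothetical 32-bit-safe way to write the same exchange would be a
cmpxchg64() loop, for example:

	/* illustrative sketch only, not the code in the patch */
	do {
		old_child_spte = READ_ONCE(*sptep);
	} while (cmpxchg64(sptep, old_child_spte, 0) != old_child_spte);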

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 64197 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 10/28] KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEs
  2021-02-02 18:57 ` [PATCH v2 10/28] KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEs Ben Gardon
@ 2021-02-03  9:43   ` Paolo Bonzini
  0 siblings, 0 replies; 65+ messages in thread
From: Paolo Bonzini @ 2021-02-03  9:43 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 02/02/21 19:57, Ben Gardon wrote:
> There is a bug in the TDP MMU function to zap SPTEs which could be
> replaced with a larger mapping which prevents the function from doing
> anything. Fix this by correctly zapping the last level SPTEs.
> 
> Fixes: 14881998566d ("kvm: x86/mmu: Support disabling dirty logging for the tdp MMU")
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>   arch/x86/kvm/mmu/tdp_mmu.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index c3075fb568eb..e3066d08c1dc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1098,8 +1098,8 @@ bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
>   }
>   
>   /*
> - * Clear non-leaf entries (and free associated page tables) which could
> - * be replaced by large mappings, for GFNs within the slot.
> + * Clear leaf entries which could be replaced by large mappings, for
> + * GFNs within the slot.
>    */
>   static void zap_collapsible_spte_range(struct kvm *kvm,
>   				       struct kvm_mmu_page *root,
> @@ -1111,7 +1111,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>   
>   	tdp_root_for_each_pte(iter, root, start, end) {
>   		if (!is_shadow_present_pte(iter.old_spte) ||
> -		    is_last_spte(iter.old_spte, iter.level))
> +		    !is_last_spte(iter.old_spte, iter.level))
>   			continue;
>   
>   		pfn = spte_to_pfn(iter.old_spte);
> 

Queued for 5.11-rc, thanks.

Paolo


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 28/28] KVM: selftests: Disable dirty logging with vCPUs running
  2021-02-02 18:57 ` [PATCH v2 28/28] KVM: selftests: Disable dirty logging with vCPUs running Ben Gardon
@ 2021-02-03 10:07   ` Paolo Bonzini
  0 siblings, 0 replies; 65+ messages in thread
From: Paolo Bonzini @ 2021-02-03 10:07 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong, Andrew Jones, Thomas Huth

On 02/02/21 19:57, Ben Gardon wrote:
> Disabling dirty logging is much more interesting from a testing
> perspective if the vCPUs are still running. This also exercises the
> code-path in which collapsible SPTEs must be faulted back in at a higher
> level after disabling dirty logging.
> 
> To: linux-kselftest@vger.kernel.org
> CC: Peter Xu <peterx@redhat.com>
> CC: Andrew Jones <drjones@redhat.com>
> CC: Thomas Huth <thuth@redhat.com>
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>   tools/testing/selftests/kvm/dirty_log_perf_test.c | 10 +++++-----
>   1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> index 604ccefd6e76..d44a5b8ef232 100644
> --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> @@ -205,11 +205,6 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   		}
>   	}
>   
> -	/* Tell the vcpu thread to quit */
> -	host_quit = true;
> -	for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
> -		pthread_join(vcpu_threads[vcpu_id], NULL);
> -
>   	/* Disable dirty logging */
>   	clock_gettime(CLOCK_MONOTONIC, &start);
>   	vm_mem_region_set_flags(vm, PERF_TEST_MEM_SLOT_INDEX, 0);
> @@ -217,6 +212,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   	pr_info("Disabling dirty logging time: %ld.%.9lds\n",
>   		ts_diff.tv_sec, ts_diff.tv_nsec);
>   
> +	/* Tell the vcpu thread to quit */
> +	host_quit = true;
> +	for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++)
> +		pthread_join(vcpu_threads[vcpu_id], NULL);
> +
>   	avg = timespec_div(get_dirty_log_total, p->iterations);
>   	pr_info("Get dirty log over %lu iterations took %ld.%.9lds. (Avg %ld.%.9lds/iteration)\n",
>   		p->iterations, get_dirty_log_total.tv_sec,
> 

Queued the two selftests patches, because why not.

Paolo


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU
  2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
                   ` (27 preceding siblings ...)
  2021-02-02 18:57 ` [PATCH v2 28/28] KVM: selftests: Disable dirty logging with vCPUs running Ben Gardon
@ 2021-02-03 11:00 ` Paolo Bonzini
  2021-02-03 17:54   ` Sean Christopherson
  28 siblings, 1 reply; 65+ messages in thread
From: Paolo Bonzini @ 2021-02-03 11:00 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 02/02/21 19:57, Ben Gardon wrote:
> The TDP MMU was implemented to simplify and improve the performance of
> KVM's memory management on modern hardware with TDP (EPT / NPT). To build
> on the existing performance improvements of the TDP MMU, add the ability
> to handle vCPU page faults, enabling and disabling dirty logging, and
> removing mappings, in parallel. In the current implementation,
> vCPU page faults (actually EPT/NPT violations/misconfigurations) are the
> largest source of MMU lock contention on VMs with many vCPUs. This
> contention, and the resulting page fault latency, can soft-lock guests
> and degrade performance. Handling page faults in parallel is especially
> useful when booting VMs, enabling dirty logging, and handling demand
> paging. In all these cases vCPUs are constantly incurring  page faults on
> each new page accessed.
> 
> Broadly, the following changes were required to allow parallel page
> faults (and other MMU operations):
> -- Contention detection and yielding added to rwlocks to bring them up to
>     feature parity with spin locks, at least as far as the use of the MMU
>     lock is concerned.
> -- TDP MMU page table memory is protected with RCU and freed in RCU
>     callbacks to allow multiple threads to operate on that memory
>     concurrently.
> -- The MMU lock was changed to an rwlock on x86. This allows the page
>     fault handlers to acquire the MMU lock in read mode and handle page
>     faults in parallel, and other operations to maintain exclusive use of
>     the lock by acquiring it in write mode.
> -- An additional lock is added to protect some data structures needed by
>     the page fault handlers, for relatively infrequent operations.
> -- The page fault handler is modified to use atomic cmpxchgs to set SPTEs
>     and some page fault handler operations are modified slightly to work
>     concurrently with other threads.
> 
> This series also contains a few bug fixes and optimizations, related to
> the above, but not strictly part of enabling parallel page fault handling.
> 
> Correctness testing:
> The following tests were performed with an SMP kernel and DBX kernel on an
> Intel Skylake machine. The tests were run both with and without the TDP
> MMU enabled.
> -- This series introduces no new failures in kvm-unit-tests
> SMP + no TDP MMU no new failures
> SMP + TDP MMU no new failures
> DBX + no TDP MMU no new failures
> DBX + TDP MMU no new failures

What's DBX?  Lockdep etc.?

> -- All KVM selftests behave as expected
> SMP + no TDP MMU all pass except ./x86_64/vmx_preemption_timer_test
> SMP + TDP MMU all pass except ./x86_64/vmx_preemption_timer_test
> (./x86_64/vmx_preemption_timer_test also fails without this patch set,
> both with the TDP MMU on and off.)

Yes, it's flaky.  It depends on your host.

> DBX + no TDP MMU all pass
> DBX + TDP MMU all pass
> -- A VM can be booted running a Debian 9 and all memory accessed
> SMP + no TDP MMU works
> SMP + TDP MMU works
> DBX + no TDP MMU works
> DBX + TDP MMU works
> 
> This series can be viewed in Gerrit at:
> https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/7172

Looks good!  I'll wait for a few days of reviews, but I'd like to queue 
this for 5.12 and I plan to make it the default in 5.13 or 5.12-rc 
(depending on when I can ask Red Hat QE to give it a shake).

It also needs more documentation though.  I'll do that myself based on 
your KVM Forum talk so that I can teach myself more of it.

Paolo


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
  2021-02-02 18:57 ` [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map Ben Gardon
  2021-02-03  2:48   ` kernel test robot
@ 2021-02-03 11:14   ` Paolo Bonzini
  2021-02-06  0:26     ` Sean Christopherson
  2021-04-01 10:32   ` Paolo Bonzini
  2 siblings, 1 reply; 65+ messages in thread
From: Paolo Bonzini @ 2021-02-03 11:14 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 02/02/21 19:57, Ben Gardon wrote:
> To prepare for handling page faults in parallel, change the TDP MMU
> page fault handler to use atomic operations to set SPTEs so that changes
> are not lost if multiple threads attempt to modify the same SPTE.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> Signed-off-by: Ben Gardon <bgardon@google.com>
> 
> ---
> 
> v1 -> v2
> - Rename "atomic" arg to "shared" in multiple functions
> - Merged the commit that protects the lists of TDP MMU pages with a new
>    lock
> - Merged the commits to add an atomic option for setting SPTEs and to
>    use that option in the TDP MMU page fault handler

I'll look at the kernel test robot report if nobody beats me to it.  In 
the meanwhile here's some doc to squash in:

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index b21a34c34a21..bd03638f1e55 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -16,7 +16,14 @@ The acquisition orders for mutexes are as follows:
  - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
    them together is quite rare.

-On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
+On x86:
+
+- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock
+
+- kvm->arch.mmu_lock is an rwlock.  kvm->arch.tdp_mmu_pages_lock is
+  taken inside kvm->arch.mmu_lock, and cannot be taken without already
+  holding kvm->arch.mmu_lock (with ``read_lock``, otherwise there's
+  no need to take kvm->arch.tdp_mmu_pages_lock at all).

  Everything else is a leaf: no other lock is taken inside the critical
  sections.
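
As an illustrative sketch (mirroring what tdp_mmu_link_page does in
this series, not new code), the intended nesting is:

	read_lock(&kvm->mmu_lock);
	...
	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
	...
	read_unlock(&kvm->mmu_lock);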

Paolo

>   arch/x86/include/asm/kvm_host.h |  13 +++
>   arch/x86/kvm/mmu/tdp_mmu.c      | 142 ++++++++++++++++++++++++--------
>   2 files changed, 122 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b6ebf2558386..78ebf56f2b37 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1028,6 +1028,19 @@ struct kvm_arch {
>   	 * tdp_mmu_page set and a root_count of 0.
>   	 */
>   	struct list_head tdp_mmu_pages;
> +
> +	/*
> +	 * Protects accesses to the following fields when the MMU lock
> +	 * is held in read mode:
> +	 *  - tdp_mmu_pages (above)
> +	 *  - the link field of struct kvm_mmu_pages used by the TDP MMU
> +	 *  - lpage_disallowed_mmu_pages
> +	 *  - the lpage_disallowed_link field of struct kvm_mmu_pages used
> +	 *    by the TDP MMU
> +	 * It is acceptable, but not necessary, to acquire this lock when
> +	 * the thread holds the MMU lock in write mode.
> +	 */
> +	spinlock_t tdp_mmu_pages_lock;
>   };
>   
>   struct kvm_vm_stat {
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5a9e964e0178..0b5a9339ac55 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -7,6 +7,7 @@
>   #include "tdp_mmu.h"
>   #include "spte.h"
>   
> +#include <asm/cmpxchg.h>
>   #include <trace/events/kvm.h>
>   
>   #ifdef CONFIG_X86_64
> @@ -33,6 +34,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
>   	kvm->arch.tdp_mmu_enabled = true;
>   
>   	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
> +	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
>   	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
>   }
>   
> @@ -225,7 +227,8 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
>   }
>   
>   static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -				u64 old_spte, u64 new_spte, int level);
> +				u64 old_spte, u64 new_spte, int level,
> +				bool shared);
>   
>   static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
>   {
> @@ -267,17 +270,26 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
>    *
>    * @kvm: kvm instance
>    * @sp: the new page
> + * @shared: This operation may not be running under the exclusive use of
> + *	    the MMU lock and the operation must synchronize with other
> + *	    threads that might be adding or removing pages.
>    * @account_nx: This page replaces a NX large page and should be marked for
>    *		eventual reclaim.
>    */
>   static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> -			      bool account_nx)
> +			      bool shared, bool account_nx)
>   {
> -	lockdep_assert_held_write(&kvm->mmu_lock);
> +	if (shared)
> +		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> +	else
> +		lockdep_assert_held_write(&kvm->mmu_lock);
>   
>   	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
>   	if (account_nx)
>   		account_huge_nx_page(kvm, sp);
> +
> +	if (shared)
> +		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>   }
>   
>   /**
> @@ -285,14 +297,24 @@ static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>    *
>    * @kvm: kvm instance
>    * @sp: the page to be removed
> + * @shared: This operation may not be running under the exclusive use of
> + *	    the MMU lock and the operation must synchronize with other
> + *	    threads that might be adding or removing pages.
>    */
> -static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> +				bool shared)
>   {
> -	lockdep_assert_held_write(&kvm->mmu_lock);
> +	if (shared)
> +		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> +	else
> +		lockdep_assert_held_write(&kvm->mmu_lock);
>   
>   	list_del(&sp->link);
>   	if (sp->lpage_disallowed)
>   		unaccount_huge_nx_page(kvm, sp);
> +
> +	if (shared)
> +		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>   }
>   
>   /**
> @@ -300,28 +322,39 @@ static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>    *
>    * @kvm: kvm instance
>    * @pt: the page removed from the paging structure
> + * @shared: This operation may not be running under the exclusive use
> + *	    of the MMU lock and the operation must synchronize with other
> + *	    threads that might be modifying SPTEs.
>    *
>    * Given a page table that has been removed from the TDP paging structure,
>    * iterates through the page table to clear SPTEs and free child page tables.
>    */
> -static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt)
> +static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt,
> +					bool shared)
>   {
>   	struct kvm_mmu_page *sp = sptep_to_sp(pt);
>   	int level = sp->role.level;
>   	gfn_t gfn = sp->gfn;
>   	u64 old_child_spte;
> +	u64 *sptep;
>   	int i;
>   
>   	trace_kvm_mmu_prepare_zap_page(sp);
>   
> -	tdp_mmu_unlink_page(kvm, sp);
> +	tdp_mmu_unlink_page(kvm, sp, shared);
>   
>   	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> -		old_child_spte = READ_ONCE(*(pt + i));
> -		WRITE_ONCE(*(pt + i), 0);
> +		sptep = pt + i;
> +
> +		if (shared) {
> +			old_child_spte = xchg(sptep, 0);
> +		} else {
> +			old_child_spte = READ_ONCE(*sptep);
> +			WRITE_ONCE(*sptep, 0);
> +		}
>   		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp),
>   			gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
> -			old_child_spte, 0, level - 1);
> +			old_child_spte, 0, level - 1, shared);
>   	}
>   
>   	kvm_flush_remote_tlbs_with_address(kvm, gfn,
> @@ -338,12 +371,16 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, u64 *pt)
>    * @old_spte: The value of the SPTE before the change
>    * @new_spte: The value of the SPTE after the change
>    * @level: the level of the PT the SPTE is part of in the paging structure
> + * @shared: This operation may not be running under the exclusive use of
> + *	    the MMU lock and the operation must synchronize with other
> + *	    threads that might be modifying SPTEs.
>    *
>    * Handle bookkeeping that might result from the modification of a SPTE.
>    * This function must be called for all TDP SPTE modifications.
>    */
>   static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -				u64 old_spte, u64 new_spte, int level)
> +				  u64 old_spte, u64 new_spte, int level,
> +				  bool shared)
>   {
>   	bool was_present = is_shadow_present_pte(old_spte);
>   	bool is_present = is_shadow_present_pte(new_spte);
> @@ -415,18 +452,51 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>   	 */
>   	if (was_present && !was_leaf && (pfn_changed || !is_present))
>   		handle_removed_tdp_mmu_page(kvm,
> -				spte_to_child_pt(old_spte, level));
> +				spte_to_child_pt(old_spte, level), shared);
>   }
>   
>   static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -				u64 old_spte, u64 new_spte, int level)
> +				u64 old_spte, u64 new_spte, int level,
> +				bool shared)
>   {
> -	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
> +	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> +			      shared);
>   	handle_changed_spte_acc_track(old_spte, new_spte, level);
>   	handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
>   				      new_spte, level);
>   }
>   
> +/*
> + * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically and handle the
> + * associated bookkeeping
> + *
> + * @kvm: kvm instance
> + * @iter: a tdp_iter instance currently on the SPTE that should be set
> + * @new_spte: The value the SPTE should be set to
> + * Returns: true if the SPTE was set, false if it was not. If false is returned,
> + *	    this function will have no side-effects.
> + */
> +static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
> +					   struct tdp_iter *iter,
> +					   u64 new_spte)
> +{
> +	u64 *root_pt = tdp_iter_root_pt(iter);
> +	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
> +	int as_id = kvm_mmu_page_as_id(root);
> +
> +	lockdep_assert_held_read(&kvm->mmu_lock);
> +
> +	if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
> +		      new_spte) != iter->old_spte)
> +		return false;
> +
> +	handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
> +			    iter->level, true);
> +
> +	return true;
> +}
> +
> +
>   /*
>    * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
>    * @kvm: kvm instance
> @@ -456,7 +526,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
>   	WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
>   
>   	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
> -			      iter->level);
> +			      iter->level, false);
>   	if (record_acc_track)
>   		handle_changed_spte_acc_track(iter->old_spte, new_spte,
>   					      iter->level);
> @@ -630,23 +700,18 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
>   	int ret = 0;
>   	int make_spte_ret = 0;
>   
> -	if (unlikely(is_noslot_pfn(pfn))) {
> +	if (unlikely(is_noslot_pfn(pfn)))
>   		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> -		trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
> -				     new_spte);
> -	} else {
> +	else
>   		make_spte_ret = make_spte(vcpu, ACC_ALL, iter->level, iter->gfn,
>   					 pfn, iter->old_spte, prefault, true,
>   					 map_writable, !shadow_accessed_mask,
>   					 &new_spte);
> -		trace_kvm_mmu_set_spte(iter->level, iter->gfn,
> -				       rcu_dereference(iter->sptep));
> -	}
>   
>   	if (new_spte == iter->old_spte)
>   		ret = RET_PF_SPURIOUS;
> -	else
> -		tdp_mmu_set_spte(vcpu->kvm, iter, new_spte);
> +	else if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
> +		return RET_PF_RETRY;
>   
>   	/*
>   	 * If the page fault was caused by a write but the page is write
> @@ -660,8 +725,13 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
>   	}
>   
>   	/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
> -	if (unlikely(is_mmio_spte(new_spte)))
> +	if (unlikely(is_mmio_spte(new_spte))) {
> +		trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
> +				     new_spte);
>   		ret = RET_PF_EMULATE;
> +	} else
> +		trace_kvm_mmu_set_spte(iter->level, iter->gfn,
> +				       rcu_dereference(iter->sptep));
>   
>   	trace_kvm_mmu_set_spte(iter->level, iter->gfn,
>   			       rcu_dereference(iter->sptep));
> @@ -720,7 +790,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>   		 */
>   		if (is_shadow_present_pte(iter.old_spte) &&
>   		    is_large_pte(iter.old_spte)) {
> -			tdp_mmu_set_spte(vcpu->kvm, &iter, 0);
> +			if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, 0))
> +				break;
>   
>   			kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn,
>   					KVM_PAGES_PER_HPAGE(iter.level));
> @@ -737,19 +808,24 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>   			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
>   			child_pt = sp->spt;
>   
> -			tdp_mmu_link_page(vcpu->kvm, sp,
> -					  huge_page_disallowed &&
> -					  req_level >= iter.level);
> -
>   			new_spte = make_nonleaf_spte(child_pt,
>   						     !shadow_accessed_mask);
>   
> -			trace_kvm_mmu_get_page(sp, true);
> -			tdp_mmu_set_spte(vcpu->kvm, &iter, new_spte);
> +			if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter,
> +						    new_spte)) {
> +				tdp_mmu_link_page(vcpu->kvm, sp, true,
> +						  huge_page_disallowed &&
> +						  req_level >= iter.level);
> +
> +				trace_kvm_mmu_get_page(sp, true);
> +			} else {
> +				tdp_mmu_free_sp(sp);
> +				break;
> +			}
>   		}
>   	}
>   
> -	if (WARN_ON(iter.level != level)) {
> +	if (iter.level != level) {
>   		rcu_read_unlock();
>   		return RET_PF_RETRY;
>   	}
> 


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 22/28] KVM: x86/mmu: Mark SPTEs in disconnected pages as removed
  2021-02-02 18:57 ` [PATCH v2 22/28] KVM: x86/mmu: Mark SPTEs in disconnected pages as removed Ben Gardon
@ 2021-02-03 11:17   ` Paolo Bonzini
  0 siblings, 0 replies; 65+ messages in thread
From: Paolo Bonzini @ 2021-02-03 11:17 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 02/02/21 19:57, Ben Gardon wrote:
> 
> +			 * Marking the SPTE as a removed SPTE is not
> +			 * strictly necessary here as the MMU lock should

"should" is a bit too weak---the point of !shared is that the MMU lock 
*will* stop other threads from concurrent modifications of the SPTEs.

Paolo

> +			 * stop other threads from concurrently modifying
> +			 * this SPTE. Using the removed SPTE value keeps
> +			 * the shared and non-atomic cases consistent and
> +			 * simplifies the function.
> +			 */
> +			WRITE_ONCE(*sptep, REMOVED_SPTE);



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 24/28] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock
  2021-02-02 18:57 ` [PATCH v2 24/28] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock Ben Gardon
@ 2021-02-03 11:25   ` Paolo Bonzini
  2021-02-03 11:26   ` Paolo Bonzini
  1 sibling, 0 replies; 65+ messages in thread
From: Paolo Bonzini @ 2021-02-03 11:25 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 02/02/21 19:57, Ben Gardon wrote:
> 
> @@ -5518,13 +5518,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  		}
>  	}
>  
> +	kvm_mmu_unlock(kvm);
> +
>  	if (kvm->arch.tdp_mmu_enabled) {

Temporary compile error.

Paolo


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 24/28] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock
  2021-02-02 18:57 ` [PATCH v2 24/28] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock Ben Gardon
  2021-02-03 11:25   ` Paolo Bonzini
@ 2021-02-03 11:26   ` Paolo Bonzini
  2021-02-03 18:31     ` Ben Gardon
  1 sibling, 1 reply; 65+ messages in thread
From: Paolo Bonzini @ 2021-02-03 11:26 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 02/02/21 19:57, Ben Gardon wrote:
> +#ifdef CONFIG_LOCKDEP
> +	if (shared)
> +		lockdep_assert_held_read(&kvm->mmu_lock);
> +	else
> +		lockdep_assert_held_write(&kvm->mmu_lock);
> +#endif /* CONFIG_LOCKDEP */

Also, there's no need for the #ifdef here.  Do we want a helper 
kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm, bool shared)?
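
Something like this, derived from the quoted lines above (sketch):

	static inline void kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
							    bool shared)
	{
		if (shared)
			lockdep_assert_held_read(&kvm->mmu_lock);
		else
			lockdep_assert_held_write(&kvm->mmu_lock);
	}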

Paolo


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 25/28] KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU read lock
  2021-02-02 18:57 ` [PATCH v2 25/28] KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU " Ben Gardon
@ 2021-02-03 11:34   ` Paolo Bonzini
  2021-02-03 18:51     ` Ben Gardon
  0 siblings, 1 reply; 65+ messages in thread
From: Paolo Bonzini @ 2021-02-03 11:34 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 02/02/21 19:57, Ben Gardon wrote:
> @@ -1485,7 +1489,9 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
>  	struct kvm_mmu_page *root;
>  	int root_as_id;
>  
> -	for_each_tdp_mmu_root_yield_safe(kvm, root, false) {
> +	read_lock(&kvm->mmu_lock);
> +
> +	for_each_tdp_mmu_root_yield_safe(kvm, root, true) {
>  		root_as_id = kvm_mmu_page_as_id(root);
>  		if (root_as_id != slot->as_id)
>  			continue;
> @@ -1493,6 +1499,8 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
>  		zap_collapsible_spte_range(kvm, root, slot->base_gfn,
>  					   slot->base_gfn + slot->npages);
>  	}
> +
> +	read_unlock(&kvm->mmu_lock);
>  }


I'd prefer the functions to be consistent about who takes the lock, 
either mmu.c or tdp_mmu.c.  Since everywhere else you're doing it in 
mmu.c, that would be:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0554d9c5c5d4..386ee4b703d9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5567,10 +5567,13 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
  	write_lock(&kvm->mmu_lock);
  	slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
  			 kvm_mmu_zap_collapsible_spte, true);
+	write_unlock(&kvm->mmu_lock);

-	if (kvm->arch.tdp_mmu_enabled)
+	if (kvm->arch.tdp_mmu_enabled) {
+		read_lock(&kvm->mmu_lock);
  		kvm_tdp_mmu_zap_collapsible_sptes(kvm, memslot);
-	write_unlock(&kvm->mmu_lock);
+		read_unlock(&kvm->mmu_lock);
+	}
  }

  void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,

and just lockdep_assert_held_read here.

> -		tdp_mmu_set_spte(kvm, &iter, 0);
> -
> -		spte_set = true;

Is it correct to remove this assignment?

Paolo


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 26/28] KVM: x86/mmu: Allow enabling / disabling dirty logging under MMU read lock
  2021-02-02 18:57 ` [PATCH v2 26/28] KVM: x86/mmu: Allow enabling / disabling dirty logging under " Ben Gardon
@ 2021-02-03 11:38   ` Paolo Bonzini
  0 siblings, 0 replies; 65+ messages in thread
From: Paolo Bonzini @ 2021-02-03 11:38 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 02/02/21 19:57, Ben Gardon wrote:
> To reduce lock contention and interference with page fault handlers,
> allow the TDP MMU functions which enable and disable dirty logging
> to operate under the MMU read lock.
> 
> 
> Extend dirty logging enable disable functions read lock-ness
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>   arch/x86/kvm/mmu/mmu.c     | 14 +++---
>   arch/x86/kvm/mmu/tdp_mmu.c | 93 ++++++++++++++++++++++++++++++--------
>   arch/x86/kvm/mmu/tdp_mmu.h |  2 +-
>   3 files changed, 84 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e3cf868be6bd..6ba2a72d4330 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5638,9 +5638,10 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>   
>   	write_lock(&kvm->mmu_lock);
>   	flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false);
> +	write_unlock(&kvm->mmu_lock);
> +
>   	if (kvm->arch.tdp_mmu_enabled)
>   		flush |= kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
> -	write_unlock(&kvm->mmu_lock);
>   
>   	/*
>   	 * It's also safe to flush TLBs out of mmu lock here as currently this
> @@ -5661,9 +5662,10 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
>   	write_lock(&kvm->mmu_lock);
>   	flush = slot_handle_large_level(kvm, memslot, slot_rmap_write_protect,
>   					false);
> +	write_unlock(&kvm->mmu_lock);
> +
>   	if (kvm->arch.tdp_mmu_enabled)
>   		flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, PG_LEVEL_2M);
> -	write_unlock(&kvm->mmu_lock);
>   
>   	if (flush)
>   		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> @@ -5677,12 +5679,12 @@ void kvm_mmu_slot_set_dirty(struct kvm *kvm,
>   
>   	write_lock(&kvm->mmu_lock);
>   	flush = slot_handle_all_level(kvm, memslot, __rmap_set_dirty, false);
> -	if (kvm->arch.tdp_mmu_enabled)
> -		flush |= kvm_tdp_mmu_slot_set_dirty(kvm, memslot);
> -	write_unlock(&kvm->mmu_lock);
> -
>   	if (flush)
>   		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> +	write_unlock(&kvm->mmu_lock);
> +
> +	if (kvm->arch.tdp_mmu_enabled)
> +		kvm_tdp_mmu_slot_set_dirty(kvm, memslot);
>   }
>   EXPORT_SYMBOL_GPL(kvm_mmu_slot_set_dirty);
>   
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index cfe66b8d39fa..6093926a6bc5 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -553,18 +553,22 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>   }
>   
>   /*
> - * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically and handle the
> + * __tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically and handle the
>    * associated bookkeeping
>    *
>    * @kvm: kvm instance
>    * @iter: a tdp_iter instance currently on the SPTE that should be set
>    * @new_spte: The value the SPTE should be set to
> + * @record_dirty_log: Record the page as dirty in the dirty bitmap if
> + *		      appropriate for the change being made. Should be set
> + *		      unless performing certain dirty logging operations.
> + *		      Leaving record_dirty_log unset in that case prevents page
> + *		      writes from being double counted.
>    * Returns: true if the SPTE was set, false if it was not. If false is returned,
>    *	    this function will have no side-effects.
>    */
> -static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
> -					   struct tdp_iter *iter,
> -					   u64 new_spte)
> +static inline bool __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> +		struct tdp_iter *iter, u64 new_spte, bool record_dirty_log)

Instead of adding the bool argument, just name this 
tdp_mmu_set_spte_atomic_no_dirty_log...

>   {
>   	u64 *root_pt = tdp_iter_root_pt(iter);
>   	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
> @@ -583,12 +587,31 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>   		      new_spte) != iter->old_spte)
>   		return false;
>   
> -	handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
> -			    iter->level, true);
> +	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
> +			      iter->level, true);
> +	handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
> +	if (record_dirty_log)
> +		handle_changed_spte_dirty_log(kvm, as_id, iter->gfn,
> +					      iter->old_spte, new_spte,
> +					      iter->level);

... and tdp_mmu_set_spte_atomic becomes

	if (!tdp_mmu_set_spte_atomic_no_dirty_log(kvm, iter, new_spte))
		return false;

	handle_changed_spte_dirty_log(kvm, as_id, iter->gfn,
				      iter->old_spte, new_spte,
				      iter->level);
	return true;


> @@ -1301,7 +1344,8 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
>   	int root_as_id;
>   	bool spte_set = false;
>   
> -	for_each_tdp_mmu_root_yield_safe(kvm, root, false) {
> +	read_lock(&kvm->mmu_lock);
> +	for_each_tdp_mmu_root_yield_safe(kvm, root, true) {
>   		root_as_id = kvm_mmu_page_as_id(root);
>   		if (root_as_id != slot->as_id)
>   			continue;
> @@ -1309,6 +1353,7 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
>   		spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn,
>   				slot->base_gfn + slot->npages);
>   	}
> +	read_unlock(&kvm->mmu_lock);

Same remark as before.

>   	return spte_set;
>   }
> @@ -1397,7 +1442,8 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   	rcu_read_lock();
>   
>   	tdp_root_for_each_pte(iter, root, start, end) {
> -		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false))
> +retry:
> +		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
>   			continue;
>   
>   		if (!is_shadow_present_pte(iter.old_spte) ||
> @@ -1406,7 +1452,14 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   
>   		new_spte = iter.old_spte | shadow_dirty_mask;
>   
> -		tdp_mmu_set_spte(kvm, &iter, new_spte);
> +		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte)) {
> +			/*
> +			 * The iter must explicitly re-read the SPTE because
> +			 * the atomic cmpxchg failed.
> +			 */
> +			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> +			goto retry;
> +		}
>   		spte_set = true;

Yep, looks like that spte_set assignment should not have been removed. :)

>   	}
>   
> @@ -1417,15 +1470,15 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   /*
>    * Set the dirty status of all the SPTEs mapping GFNs in the memslot. This is
>    * only used for PML, and so will involve setting the dirty bit on each SPTE.
> - * Returns true if an SPTE has been changed and the TLBs need to be flushed.
>    */
> -bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
> +void kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
>   {
>   	struct kvm_mmu_page *root;
>   	int root_as_id;
>   	bool spte_set = false;
>   
> -	for_each_tdp_mmu_root_yield_safe(kvm, root, false) {
> +	read_lock(&kvm->mmu_lock);

And again here.

Paolo

> +	for_each_tdp_mmu_root_yield_safe(kvm, root, true) {
>   		root_as_id = kvm_mmu_page_as_id(root);
>   		if (root_as_id != slot->as_id)
>   			continue;
> @@ -1433,7 +1486,11 @@ bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
>   		spte_set |= set_dirty_gfn_range(kvm, root, slot->base_gfn,
>   				slot->base_gfn + slot->npages);
>   	}
> -	return spte_set;
> +
> +	if (spte_set)
> +		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> +
> +	read_unlock(&kvm->mmu_lock);
>   }
>   
>   /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 10ada884270b..848b41b20985 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -38,7 +38,7 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
>   				       struct kvm_memory_slot *slot,
>   				       gfn_t gfn, unsigned long mask,
>   				       bool wrprot);
> -bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot);
> +void kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot);
>   void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
>   				       const struct kvm_memory_slot *slot);
>   
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 23/28] KVM: x86/mmu: Allow parallel page faults for the TDP MMU
  2021-02-02 18:57 ` [PATCH v2 23/28] KVM: x86/mmu: Allow parallel page faults for the TDP MMU Ben Gardon
@ 2021-02-03 12:39   ` Paolo Bonzini
  2021-02-03 17:46     ` Ben Gardon
  0 siblings, 1 reply; 65+ messages in thread
From: Paolo Bonzini @ 2021-02-03 12:39 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 02/02/21 19:57, Ben Gardon wrote:
> 
> -	write_lock(&vcpu->kvm->mmu_lock);
> +
> +	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> +		read_lock(&vcpu->kvm->mmu_lock);
> +	else
> +		write_lock(&vcpu->kvm->mmu_lock);
> +

I'd like to make this into two helper functions, but I'm not sure about 
the naming:

- kvm_mmu_read_lock_for_root/kvm_mmu_read_unlock_for_root: not precise 
because it's really write-locked for shadow MMU roots

- kvm_mmu_lock_for_root/kvm_mmu_unlock_for_root: not clear that TDP MMU 
operations will need to operate in shared-lock mode

I prefer the first because at least it's the conservative option, but 
I'm open to other opinions and suggestions.

Paolo


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 23/28] KVM: x86/mmu: Allow parallel page faults for the TDP MMU
  2021-02-03 12:39   ` Paolo Bonzini
@ 2021-02-03 17:46     ` Ben Gardon
  2021-02-03 18:30       ` Paolo Bonzini
  0 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2021-02-03 17:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Feb 3, 2021 at 4:40 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 02/02/21 19:57, Ben Gardon wrote:
> >
> > -     write_lock(&vcpu->kvm->mmu_lock);
> > +
> > +     if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> > +             read_lock(&vcpu->kvm->mmu_lock);
> > +     else
> > +             write_lock(&vcpu->kvm->mmu_lock);
> > +
>
> I'd like to make this into two helper functions, but I'm not sure about
> the naming:
>
> - kvm_mmu_read_lock_for_root/kvm_mmu_read_unlock_for_root: not precise
> because it's really write-locked for shadow MMU roots
>
> - kvm_mmu_lock_for_root/kvm_mmu_unlock_for_root: not clear that TDP MMU
> operations will need to operate in shared-lock mode
>
> I prefer the first because at least it's the conservative option, but
> I'm open to other opinions and suggestions.
>
> Paolo
>

Of the above two options, I like the second one, though I'd be happy
with either. I agree the first is more conservative, in that it's
clear the MMU lock could be shared. It feels a little misleading,
though, to have "read" in the name of the function but then acquire the
write lock, especially since there is code below that expects the
write lock. I don't know of a good way to abstract this into a helper
without some comments to make it clear what's going on, but maybe
there's a slightly more open-coded compromise:
if (!kvm_mmu_read_lock_for_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
         write_lock(&vcpu->kvm->mmu_lock);
or
enum kvm_mmu_lock_mode lock_mode =
get_mmu_lock_mode_for_root(vcpu->kvm, vcpu->arch.mmu->root_hpa);
....
kvm_mmu_lock_for_mode(lock_mode);

Not sure if either of those is actually clearer, but the latter
trends in the direction the RFC took, having an enum to capture
read/write and whether or not to yield in a lock mode parameter.
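
A hypothetical sketch of that enum-based direction (all names here are
illustrative, not from the series):

enum kvm_mmu_lock_mode {
	KVM_MMU_LOCK_MODE_READ,
	KVM_MMU_LOCK_MODE_WRITE,
};

static enum kvm_mmu_lock_mode get_mmu_lock_mode_for_root(struct kvm *kvm,
							  hpa_t root_hpa)
{
	return is_tdp_mmu_root(kvm, root_hpa) ? KVM_MMU_LOCK_MODE_READ :
						KVM_MMU_LOCK_MODE_WRITE;
}

static void kvm_mmu_lock_for_mode(struct kvm *kvm,
				  enum kvm_mmu_lock_mode mode)
{
	if (mode == KVM_MMU_LOCK_MODE_READ)
		read_lock(&kvm->mmu_lock);
	else
		write_lock(&kvm->mmu_lock);
}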

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU
  2021-02-03 11:00 ` [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Paolo Bonzini
@ 2021-02-03 17:54   ` Sean Christopherson
  2021-02-03 18:13     ` Paolo Bonzini
  0 siblings, 1 reply; 65+ messages in thread
From: Sean Christopherson @ 2021-02-03 17:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ben Gardon, linux-kernel, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Feb 03, 2021, Paolo Bonzini wrote:
> Looks good!  I'll wait for a few days of reviews,

I guess I know what I'm doing this afternoon :-)

> but I'd like to queue this for 5.12 and I plan to make it the default in 5.13
> or 5.12-rc (depending on when I can ask Red Hat QE to give it a shake).

Hmm, given that kvm/queue doesn't seem to get widespread testing, I think it
should be enabled by default in rc1 for whatever kernel it targets.

Would it be too heinous to enable it by default in 5.12-rc1, knowing full well
that there's a good possibility it would get reverted?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU
  2021-02-03 17:54   ` Sean Christopherson
@ 2021-02-03 18:13     ` Paolo Bonzini
  0 siblings, 0 replies; 65+ messages in thread
From: Paolo Bonzini @ 2021-02-03 18:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ben Gardon, linux-kernel, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 03/02/21 18:54, Sean Christopherson wrote:
> On Wed, Feb 03, 2021, Paolo Bonzini wrote:
>> Looks good!  I'll wait for a few days of reviews,
> 
> I guess I know what I'm doing this afternoon :-)
> 
>> but I'd like to queue this for 5.12 and I plan to make it the default in 5.13
>> or 5.12-rc (depending on when I can ask Red Hat QE to give it a shake).
> 
> Hmm, given that kvm/queue doesn't seem to get widespread testing, I think it
> should be enabled by default in rc1 for whatever kernel it targets.
> 
> Would it be too heinous to enable it by default in 5.12-rc1, knowing full well
> that there's a good possibility it would get reverted?

Absolutely not.  However, to clarify my plan:

- what is now kvm/queue and has been reviewed will graduate to kvm/next 
in a couple of days, and then to 5.12-rc1.  Ben's patches are already in 
kvm/queue, but there's no problem in waiting another week before moving 
them to kvm/next because it's not enabled by default.  (Right now even 
CET is in kvm/queue, but it will not move to kvm/next until bare metal 
support is in).

- if this will not have been tested by Red Hat QE by say 5.12-rc3, I 
would enable it in kvm/next instead, and at that point the target would 
become the 5.13 merge window (and release).

Paolo


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 23/28] KVM: x86/mmu: Allow parallel page faults for the TDP MMU
  2021-02-03 17:46     ` Ben Gardon
@ 2021-02-03 18:30       ` Paolo Bonzini
  2021-02-06  0:12         ` Sean Christopherson
  0 siblings, 1 reply; 65+ messages in thread
From: Paolo Bonzini @ 2021-02-03 18:30 UTC (permalink / raw)
  To: Ben Gardon
  Cc: LKML, kvm, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 03/02/21 18:46, Ben Gardon wrote:
> enum kvm_mmu_lock_mode lock_mode =
> get_mmu_lock_mode_for_root(vcpu->kvm, vcpu->arch.mmu->root_hpa);
> ....
> kvm_mmu_lock_for_mode(lock_mode);
> 
> Not sure if either of those is actually clearer, but the latter
> trends in the direction the RFC took, having an enum to capture
> read/write and whether or not to yield in a lock mode parameter.

Could be a possibility.  Also:

enum kvm_mmu_lock_mode lock_mode =
   kvm_mmu_lock_for_root(vcpu->kvm, vcpu->arch.mmu->root_hpa);

kvm_mmu_unlock(vcpu->kvm, lock_mode);

Anyway it can be done on top.

Paolo


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 24/28] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock
  2021-02-03 11:26   ` Paolo Bonzini
@ 2021-02-03 18:31     ` Ben Gardon
  2021-02-03 18:32       ` Paolo Bonzini
  0 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2021-02-03 18:31 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Feb 3, 2021 at 3:26 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 02/02/21 19:57, Ben Gardon wrote:
> > +#ifdef CONFIG_LOCKDEP
> > +     if (shared)
> > +             lockdep_assert_held_read(&kvm->mmu_lock);
> > +     else
> > +             lockdep_assert_held_write(&kvm->mmu_lock);
> > +#endif /* CONFIG_LOCKDEP */
>
> Also, there's no need for the #ifdef here.

I agree, I must have misinterpreted some feedback on a previous commit
and gone overboard with it.


> Do we want a helper
> kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm, bool shared)?

There are only two places that try to assert both ways as far as I can
see on a cursory check, but it couldn't hurt.

>
> Paolo
>
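
For reference, such a helper is a thin wrapper around the assertions
quoted above; a sketch, using the name from Paolo's suggestion (note
that no CONFIG_LOCKDEP guard is needed because the
lockdep_assert_held_*() macros already compile away when lockdep is
disabled):

static inline void kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
						    bool shared)
{
	if (shared)
		lockdep_assert_held_read(&kvm->mmu_lock);
	else
		lockdep_assert_held_write(&kvm->mmu_lock);
}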


* Re: [PATCH v2 24/28] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock
  2021-02-03 18:31     ` Ben Gardon
@ 2021-02-03 18:32       ` Paolo Bonzini
  0 siblings, 0 replies; 65+ messages in thread
From: Paolo Bonzini @ 2021-02-03 18:32 UTC (permalink / raw)
  To: Ben Gardon
  Cc: LKML, kvm, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 03/02/21 19:31, Ben Gardon wrote:
> On Wed, Feb 3, 2021 at 3:26 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 02/02/21 19:57, Ben Gardon wrote:
>>> +#ifdef CONFIG_LOCKDEP
>>> +     if (shared)
>>> +             lockdep_assert_held_read(&kvm->mmu_lock);
>>> +     else
>>> +             lockdep_assert_held_write(&kvm->mmu_lock);
>>> +#endif /* CONFIG_LOCKDEP */
>>
>> Also, there's no need for the #ifdef here.
> 
> I agree, I must have misinterpreted some feedback on a previous commit
> and gone overboard with it.
> 
> 
>> Do we want a helper
>> kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm, bool shared)?
> 
> There are only two places that try to assert both ways as far as I can
> see on a cursory check, but it couldn't hurt.

I think there's a couple more after patches 25/26.  But there's no issue 
in having them in too (and therefore having a more complete picture) 
before figuring out what the locking API could look like.

Paolo



* Re: [PATCH v2 25/28] KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU read lock
  2021-02-03 11:34   ` Paolo Bonzini
@ 2021-02-03 18:51     ` Ben Gardon
  0 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2021-02-03 18:51 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Feb 3, 2021 at 3:34 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 02/02/21 19:57, Ben Gardon wrote:
> > @@ -1485,7 +1489,9 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
> >       struct kvm_mmu_page *root;
> >       int root_as_id;
> >
> > -     for_each_tdp_mmu_root_yield_safe(kvm, root, false) {
> > +     read_lock(&kvm->mmu_lock);
> > +
> > +     for_each_tdp_mmu_root_yield_safe(kvm, root, true) {
> >               root_as_id = kvm_mmu_page_as_id(root);
> >               if (root_as_id != slot->as_id)
> >                       continue;
> > @@ -1493,6 +1499,8 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
> >               zap_collapsible_spte_range(kvm, root, slot->base_gfn,
> >                                          slot->base_gfn + slot->npages);
> >       }
> > +
> > +     read_unlock(&kvm->mmu_lock);
> >  }
>
>
> I'd prefer the functions to be consistent about who takes the lock,
> either mmu.c or tdp_mmu.c.  Since everywhere else you're doing it in
> mmu.c, that would be:
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0554d9c5c5d4..386ee4b703d9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5567,10 +5567,13 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>         write_lock(&kvm->mmu_lock);
>         slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
>                          kvm_mmu_zap_collapsible_spte, true);
> +       write_unlock(&kvm->mmu_lock);
>
> -       if (kvm->arch.tdp_mmu_enabled)
> +       if (kvm->arch.tdp_mmu_enabled) {
> +               read_lock(&kvm->mmu_lock);
>                 kvm_tdp_mmu_zap_collapsible_sptes(kvm, memslot);
> -       write_unlock(&kvm->mmu_lock);
> +               read_unlock(&kvm->mmu_lock);
> +       }
>   }
>
>   void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
>
> and just lockdep_assert_held_read here.

That makes sense to me, I agree keeping it consistent is probably a good idea.

>
> > -             tdp_mmu_set_spte(kvm, &iter, 0);
> > -
> > -             spte_set = true;
>
> Is it correct to remove this assignment?

No, it was not correct to remove it. Thank you for catching that.

>
> Paolo
>


* Re: [PATCH v2 13/28] KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter
  2021-02-02 18:57 ` [PATCH v2 13/28] KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter Ben Gardon
@ 2021-02-05 23:42   ` Sean Christopherson
  0 siblings, 0 replies; 65+ messages in thread
From: Sean Christopherson @ 2021-02-05 23:42 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Feb 02, 2021, Ben Gardon wrote:
> @@ -505,8 +516,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  
>  		tdp_mmu_set_spte(kvm, &iter, 0);
>  
> -		flush_needed = !can_yield ||
> -			       !tdp_mmu_iter_cond_resched(kvm, &iter, true);
> +		flush_needed = !(can_yield &&
> +				 tdp_mmu_iter_cond_resched(kvm, &iter, true));

Unnecessary change to convert perfectly readable code into an abomination :-D

No need to "fix", it goes away in the next patch anyway, I just wanted to
complain.

>  	}
>  	return flush_needed;
>  }
> -- 
> 2.30.0.365.g02bc693789-goog
> 


* Re: [PATCH v2 23/28] KVM: x86/mmu: Allow parallel page faults for the TDP MMU
  2021-02-03 18:30       ` Paolo Bonzini
@ 2021-02-06  0:12         ` Sean Christopherson
  0 siblings, 0 replies; 65+ messages in thread
From: Sean Christopherson @ 2021-02-06  0:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ben Gardon, LKML, kvm, Peter Xu, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On Wed, Feb 03, 2021, Paolo Bonzini wrote:
> On 03/02/21 18:46, Ben Gardon wrote:
> > enum kvm_mmu_lock_mode lock_mode =
> > get_mmu_lock_mode_for_root(vcpu->kvm, vcpu->arch.mmu->root_hpa);
> > ....
> > kvm_mmu_lock_for_mode(lock_mode);
> > 
> > Not sure if either of those is actually clearer, but the latter
> > trends in the direction the RFC took, having an enum to capture
> > read/write and whether or not to yield in a lock mode parameter.
> 
> Could be a possibility.  Also:
> 
> enum kvm_mmu_lock_mode lock_mode =
>   kvm_mmu_lock_for_root(vcpu->kvm, vcpu->arch.mmu->root_hpa);
> 
> kvm_mmu_unlock(vcpu->kvm, lock_mode);
> 
> Anyway it can be done on top.

Maybe go with a literal name, unless we expect additional usage?  E.g. 
kvm_mmu_(un)lock_for_page_fault() isn't terrible.

I'm not a fan of the kvm_mmu_lock_for_root() variants.  "for_root" doesn't have
an obvious connection to the page fault handler or to the read/shared mode of
the TDP.  But, the name is also specific enough to pique my curiosity and make
me wonder what it's doing.


* Re: [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
  2021-02-03 11:14   ` Paolo Bonzini
@ 2021-02-06  0:26     ` Sean Christopherson
  2021-02-08 10:32       ` Paolo Bonzini
  0 siblings, 1 reply; 65+ messages in thread
From: Sean Christopherson @ 2021-02-06  0:26 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ben Gardon, linux-kernel, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Feb 03, 2021, Paolo Bonzini wrote:
> On 02/02/21 19:57, Ben Gardon wrote:
> > To prepare for handling page faults in parallel, change the TDP MMU
> > page fault handler to use atomic operations to set SPTEs so that changes
> > are not lost if multiple threads attempt to modify the same SPTE.
> > 
> > Reviewed-by: Peter Feiner <pfeiner@google.com>
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > 
> > ---
> > 
> > v1 -> v2
> > - Rename "atomic" arg to "shared" in multiple functions
> > - Merged the commit that protects the lists of TDP MMU pages with a new
> >    lock
> > - Merged the commits to add an atomic option for setting SPTEs and to
> >    use that option in the TDP MMU page fault handler
> 
> I'll look at the kernel test robot report if nobody beats me to it.

It's just a vanilla i386 compilation issue: the xchg() is on an 8-byte value.

We could fudge around it via #ifdef around the xchg().  Making all of tdp_mmu.c
x86-64 only would be nice to avoid future annoyance, though the number of stubs
required would be painful...


* Re: [PATCH v2 21/28] KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler
  2021-02-02 18:57 ` [PATCH v2 21/28] KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler Ben Gardon
@ 2021-02-06  0:29   ` Sean Christopherson
  0 siblings, 0 replies; 65+ messages in thread
From: Sean Christopherson @ 2021-02-06  0:29 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Feb 02, 2021, Ben Gardon wrote:
> +static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> +					   struct tdp_iter *iter)
> +{
> +	/*
> +	 * Freeze the SPTE by setting it to a special,
> +	 * non-present value. This will stop other threads from
> +	 * immediately installing a present entry in its place
> +	 * before the TLBs are flushed.
> +	 */
> +	if (!tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE))
> +		return false;
> +
> +	kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
> +					   KVM_PAGES_PER_HPAGE(iter->level));
> +
> +	/*
> +	 * No other thread can overwrite the removed SPTE as they
> +	 * must either wait on the MMU lock or use
> +	 * tdp_mmu_set_spte_atomic which will not overwrite the
> +	 * special removed SPTE value. No bookkeeping is needed
> +	 * here since the SPTE is going from non-present
> +	 * to non-present.
> +	 */

Can we expand these comments out to 80 chars before the final/official push?

> +	WRITE_ONCE(*iter->sptep, 0);
> +
> +	return true;
> +}
> +
>  
>  /*
>   * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
> @@ -523,6 +562,15 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
>  
>  	lockdep_assert_held_write(&kvm->mmu_lock);
>  
> +	/*
> +	 * No thread should be using this function to set SPTEs to the
> +	 * temporary removed SPTE value.
> +	 * If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
> +	 * should be used. If operating under the MMU lock in write mode, the
> +	 * use of the removed SPTE should not be necessary.
> +	 */
> +	WARN_ON(iter->old_spte == REMOVED_SPTE);
> +
>  	WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
>  
>  	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
> @@ -790,12 +838,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  		 */
>  		if (is_shadow_present_pte(iter.old_spte) &&
>  		    is_large_pte(iter.old_spte)) {
> -			if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, 0))
> +			if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
>  				break;
>  
> -			kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn,
> -					KVM_PAGES_PER_HPAGE(iter.level));
> -
>  			/*
>  			 * The iter must explicitly re-read the spte here
>  			 * because the new value informs the !present
> -- 
> 2.30.0.365.g02bc693789-goog
> 


* Re: [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
  2021-02-06  0:26     ` Sean Christopherson
@ 2021-02-08 10:32       ` Paolo Bonzini
  0 siblings, 0 replies; 65+ messages in thread
From: Paolo Bonzini @ 2021-02-08 10:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ben Gardon, linux-kernel, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 06/02/21 01:26, Sean Christopherson wrote:
> We could fudge around it via #ifdef around the xchg().  Making all of tdp_mmu.c
> x86-64 only would be nice to avoid future annoyance, though the number of stubs
> required would be painful...

It's really just a handful, so it's worth it.

Paolo
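
A sketch of the direction being agreed on here, assuming a Makefile
switch plus static inline stubs for the handful of entry points (the
exact stub list in a real patch would be longer):

# arch/x86/kvm/Makefile: build the TDP MMU only for 64-bit kernels
kvm-$(CONFIG_X86_64)	+= mmu/tdp_mmu.o

/* arch/x86/kvm/mmu/tdp_mmu.h: stub the API out for 32-bit builds */
#ifdef CONFIG_X86_64
void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
#else
static inline void kvm_mmu_init_tdp_mmu(struct kvm *kvm) {}
static inline bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root)
{
	return false;
}
#endif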



* Re: [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks
  2021-02-02 18:57 ` [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks Ben Gardon
@ 2021-02-09 20:39   ` Guenter Roeck
  2021-02-09 21:46     ` Waiman Long
  2021-02-10  3:32   ` Waiman Long
  1 sibling, 1 reply; 65+ messages in thread
From: Guenter Roeck @ 2021-02-09 20:39 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ingo Molnar, Will Deacon, Peter Zijlstra, Davidlohr Bueso,
	Waiman Long

On Tue, Feb 02, 2021 at 10:57:12AM -0800, Ben Gardon wrote:
> rwlocks do not currently have any facility to detect contention
> like spinlocks do. In order to allow users of rwlocks to better manage
> latency, add contention detection for queued rwlocks.
> 
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Will Deacon <will@kernel.org>
> Acked-by: Peter Zijlstra <peterz@infradead.org>
> Acked-by: Davidlohr Bueso <dbueso@suse.de>
> Acked-by: Waiman Long <longman@redhat.com>
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Ben Gardon <bgardon@google.com>

When building mips:defconfig, this patch results in:

Error log:
In file included from include/linux/spinlock.h:90,
                 from include/linux/ipc.h:5,
                 from include/uapi/linux/sem.h:5,
                 from include/linux/sem.h:5,
                 from include/linux/compat.h:14,
                 from arch/mips/kernel/asm-offsets.c:12:
arch/mips/include/asm/spinlock.h:17:28: error: redefinition of 'queued_spin_unlock'
   17 | #define queued_spin_unlock queued_spin_unlock
      |                            ^~~~~~~~~~~~~~~~~~
arch/mips/include/asm/spinlock.h:22:20: note: in expansion of macro 'queued_spin_unlock'
   22 | static inline void queued_spin_unlock(struct qspinlock *lock)
      |                    ^~~~~~~~~~~~~~~~~~
In file included from include/asm-generic/qrwlock.h:17,
                 from ./arch/mips/include/generated/asm/qrwlock.h:1,
                 from arch/mips/include/asm/spinlock.h:13,
                 from include/linux/spinlock.h:90,
                 from include/linux/ipc.h:5,
                 from include/uapi/linux/sem.h:5,
                 from include/linux/sem.h:5,
                 from include/linux/compat.h:14,
                 from arch/mips/kernel/asm-offsets.c:12:
include/asm-generic/qspinlock.h:94:29: note: previous definition of 'queued_spin_unlock' was here
   94 | static __always_inline void queued_spin_unlock(struct qspinlock *lock)
      |                             ^~~~~~~~~~~~~~~~~~

Bisect log attached.

Guenter

---
# bad: [a4bfd8d46ac357c12529e4eebb6c89502b03ecc9] Add linux-next specific files for 20210209
# good: [92bf22614b21a2706f4993b278017e437f7785b3] Linux 5.11-rc7
git bisect start 'HEAD' 'v5.11-rc7'
# good: [a8eb921ba7e8e77d994a1c6c69c8ef08456ecf53] Merge remote-tracking branch 'crypto/master'
git bisect good a8eb921ba7e8e77d994a1c6c69c8ef08456ecf53
# good: [21d507c41bdf83f6afc0e02976e43c10badfc6cd] Merge remote-tracking branch 'spi/for-next'
git bisect good 21d507c41bdf83f6afc0e02976e43c10badfc6cd
# bad: [30cd4c688a3bcf324f011d7716044b1a4681efc1] Merge remote-tracking branch 'soundwire/next'
git bisect bad 30cd4c688a3bcf324f011d7716044b1a4681efc1
# bad: [c43d2173d3eb4047bb62a7a393a298a1032cce18] Merge remote-tracking branch 'drivers-x86/for-next'
git bisect bad c43d2173d3eb4047bb62a7a393a298a1032cce18
# good: [973e9d8622a6fecc52f639680cbbde1519e1fcf8] Merge remote-tracking branch 'rcu/rcu/next'
git bisect good 973e9d8622a6fecc52f639680cbbde1519e1fcf8
# bad: [7b2aaf51d499e0372cbecafad04582c71ad03c73] Merge remote-tracking branch 'kvm/next'
git bisect bad 7b2aaf51d499e0372cbecafad04582c71ad03c73
# good: [04548ed0206ca895c8edd6a078c20a218423890b] KVM: SVM: Replace hard-coded value with #define
git bisect good 04548ed0206ca895c8edd6a078c20a218423890b
# bad: [92f4d400a407235783afd4399fa26c4c665024b5] KVM: x86/xen: Fix __user pointer handling for hypercall page installation
git bisect bad 92f4d400a407235783afd4399fa26c4c665024b5
# good: [ed5e484b79e8a9b8be714bd85b6fc70bd6dc99a7] KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter
git bisect good ed5e484b79e8a9b8be714bd85b6fc70bd6dc99a7
# bad: [f3d4b4b1dc1c5fb9ea17cac14133463bfe72f170] sched: Add cond_resched_rwlock
git bisect bad f3d4b4b1dc1c5fb9ea17cac14133463bfe72f170
# good: [f1b3b06a058bb5c636ffad0afae138fe30775881] KVM: x86/mmu: Clear dirtied pages mask bit before early break
git bisect good f1b3b06a058bb5c636ffad0afae138fe30775881
# bad: [26128cb6c7e6731fe644c687af97733adfdb5ee9] locking/rwlocks: Add contention detection for rwlocks
git bisect bad 26128cb6c7e6731fe644c687af97733adfdb5ee9
# good: [7cca2d0b7e7d9f3cd740d41afdc00051c9b508a0] KVM: x86/mmu: Protect TDP MMU page table memory with RCU
git bisect good 7cca2d0b7e7d9f3cd740d41afdc00051c9b508a0
# first bad commit: [26128cb6c7e6731fe644c687af97733adfdb5ee9] locking/rwlocks: Add contention detection for rwlocks


* Re: [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks
  2021-02-09 20:39   ` Guenter Roeck
@ 2021-02-09 21:46     ` Waiman Long
  2021-02-09 22:25       ` Guenter Roeck
  0 siblings, 1 reply; 65+ messages in thread
From: Waiman Long @ 2021-02-09 21:46 UTC (permalink / raw)
  To: Guenter Roeck, Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ingo Molnar, Will Deacon, Peter Zijlstra, Davidlohr Bueso

On 2/9/21 3:39 PM, Guenter Roeck wrote:
> On Tue, Feb 02, 2021 at 10:57:12AM -0800, Ben Gardon wrote:
>> rwlocks do not currently have any facility to detect contention
>> like spinlocks do. In order to allow users of rwlocks to better manage
>> latency, add contention detection for queued rwlocks.
>>
>> CC: Ingo Molnar <mingo@redhat.com>
>> CC: Will Deacon <will@kernel.org>
>> Acked-by: Peter Zijlstra <peterz@infradead.org>
>> Acked-by: Davidlohr Bueso <dbueso@suse.de>
>> Acked-by: Waiman Long <longman@redhat.com>
>> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Ben Gardon <bgardon@google.com>
> When building mips:defconfig, this patch results in:
>
> Error log:
> In file included from include/linux/spinlock.h:90,
>                   from include/linux/ipc.h:5,
>                   from include/uapi/linux/sem.h:5,
>                   from include/linux/sem.h:5,
>                   from include/linux/compat.h:14,
>                   from arch/mips/kernel/asm-offsets.c:12:
> arch/mips/include/asm/spinlock.h:17:28: error: redefinition of 'queued_spin_unlock'
>     17 | #define queued_spin_unlock queued_spin_unlock
>        |                            ^~~~~~~~~~~~~~~~~~
> arch/mips/include/asm/spinlock.h:22:20: note: in expansion of macro 'queued_spin_unlock'
>     22 | static inline void queued_spin_unlock(struct qspinlock *lock)
>        |                    ^~~~~~~~~~~~~~~~~~
> In file included from include/asm-generic/qrwlock.h:17,
>                   from ./arch/mips/include/generated/asm/qrwlock.h:1,
>                   from arch/mips/include/asm/spinlock.h:13,
>                   from include/linux/spinlock.h:90,
>                   from include/linux/ipc.h:5,
>                   from include/uapi/linux/sem.h:5,
>                   from include/linux/sem.h:5,
>                   from include/linux/compat.h:14,
>                   from arch/mips/kernel/asm-offsets.c:12:
> include/asm-generic/qspinlock.h:94:29: note: previous definition of 'queued_spin_unlock' was here
>     94 | static __always_inline void queued_spin_unlock(struct qspinlock *lock)
>        |                             ^~~~~~~~~~~~~~~~~~

I think the compile error is caused by the improper header file 
inclusion ordering. Can you try the following change to see if it can 
fix the compile error?

Cheers,
Longman

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 0020d3b820a7..d7178a9439b5 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -10,11 +10,11 @@
  #define __ASM_GENERIC_QRWLOCK_H

  #include <linux/atomic.h>
+#include <linux/spinlock.h>
  #include <asm/barrier.h>
  #include <asm/processor.h>

  #include <asm-generic/qrwlock_types.h>
-#include <asm-generic/qspinlock.h>

  /*
   * Writer states & reader shift and bias.





* Re: [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks
  2021-02-09 21:46     ` Waiman Long
@ 2021-02-09 22:25       ` Guenter Roeck
  2021-02-10  0:27         ` Waiman Long
  0 siblings, 1 reply; 65+ messages in thread
From: Guenter Roeck @ 2021-02-09 22:25 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ben Gardon, linux-kernel, kvm, Paolo Bonzini, Peter Xu,
	Sean Christopherson, Peter Shier, Peter Feiner, Junaid Shahid,
	Jim Mattson, Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov,
	Xiao Guangrong, Ingo Molnar, Will Deacon, Peter Zijlstra,
	Davidlohr Bueso

On Tue, Feb 09, 2021 at 04:46:02PM -0500, Waiman Long wrote:
> On 2/9/21 3:39 PM, Guenter Roeck wrote:
> > On Tue, Feb 02, 2021 at 10:57:12AM -0800, Ben Gardon wrote:
> > > rwlocks do not currently have any facility to detect contention
> > > like spinlocks do. In order to allow users of rwlocks to better manage
> > > latency, add contention detection for queued rwlocks.
> > > 
> > > CC: Ingo Molnar <mingo@redhat.com>
> > > CC: Will Deacon <will@kernel.org>
> > > Acked-by: Peter Zijlstra <peterz@infradead.org>
> > > Acked-by: Davidlohr Bueso <dbueso@suse.de>
> > > Acked-by: Waiman Long <longman@redhat.com>
> > > Acked-by: Paolo Bonzini <pbonzini@redhat.com>
> > > Signed-off-by: Ben Gardon <bgardon@google.com>
> > When building mips:defconfig, this patch results in:
> > 
> > Error log:
> > In file included from include/linux/spinlock.h:90,
> >                   from include/linux/ipc.h:5,
> >                   from include/uapi/linux/sem.h:5,
> >                   from include/linux/sem.h:5,
> >                   from include/linux/compat.h:14,
> >                   from arch/mips/kernel/asm-offsets.c:12:
> > arch/mips/include/asm/spinlock.h:17:28: error: redefinition of 'queued_spin_unlock'
> >     17 | #define queued_spin_unlock queued_spin_unlock
> >        |                            ^~~~~~~~~~~~~~~~~~
> > arch/mips/include/asm/spinlock.h:22:20: note: in expansion of macro 'queued_spin_unlock'
> >     22 | static inline void queued_spin_unlock(struct qspinlock *lock)
> >        |                    ^~~~~~~~~~~~~~~~~~
> > In file included from include/asm-generic/qrwlock.h:17,
> >                   from ./arch/mips/include/generated/asm/qrwlock.h:1,
> >                   from arch/mips/include/asm/spinlock.h:13,
> >                   from include/linux/spinlock.h:90,
> >                   from include/linux/ipc.h:5,
> >                   from include/uapi/linux/sem.h:5,
> >                   from include/linux/sem.h:5,
> >                   from include/linux/compat.h:14,
> >                   from arch/mips/kernel/asm-offsets.c:12:
> > include/asm-generic/qspinlock.h:94:29: note: previous definition of 'queued_spin_unlock' was here
> >     94 | static __always_inline void queued_spin_unlock(struct qspinlock *lock)
> >        |                             ^~~~~~~~~~~~~~~~~~
> 
> I think the compile error is caused by the improper header file inclusion
> ordering. Can you try the following change to see if it can fix the compile
> error?
> 

That results in:

In file included from ./arch/mips/include/generated/asm/qrwlock.h:1,
                 from ./arch/mips/include/asm/spinlock.h:13,
                 from ./include/linux/spinlock.h:90,
                 from ./include/linux/ipc.h:5,
                 from ./include/uapi/linux/sem.h:5,
                 from ./include/linux/sem.h:5,
                 from ./include/linux/compat.h:14,
                 from arch/mips/kernel/asm-offsets.c:12:
./include/asm-generic/qrwlock.h: In function 'queued_rwlock_is_contended':
./include/asm-generic/qrwlock.h:127:9: error: implicit declaration of function 'arch_spin_is_locked'

Guenter

> Cheers,
> Longman
> 
> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> index 0020d3b820a7..d7178a9439b5 100644
> --- a/include/asm-generic/qrwlock.h
> +++ b/include/asm-generic/qrwlock.h
> @@ -10,11 +10,11 @@
>  #define __ASM_GENERIC_QRWLOCK_H
> 
>  #include <linux/atomic.h>
> +#include <linux/spinlock.h>
>  #include <asm/barrier.h>
>  #include <asm/processor.h>
> 
>  #include <asm-generic/qrwlock_types.h>
> -#include <asm-generic/qspinlock.h>
> 
>  /*
>   * Writer states & reader shift and bias.
> 
> 
> 


* Re: [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks
  2021-02-09 22:25       ` Guenter Roeck
@ 2021-02-10  0:27         ` Waiman Long
  2021-02-10  0:41           ` Waiman Long
  2021-02-10  6:04           ` Guenter Roeck
  0 siblings, 2 replies; 65+ messages in thread
From: Waiman Long @ 2021-02-10  0:27 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Ben Gardon, linux-kernel, kvm, Paolo Bonzini, Peter Xu,
	Sean Christopherson, Peter Shier, Peter Feiner, Junaid Shahid,
	Jim Mattson, Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov,
	Xiao Guangrong, Ingo Molnar, Will Deacon, Peter Zijlstra,
	Davidlohr Bueso

On 2/9/21 5:25 PM, Guenter Roeck wrote:
> On Tue, Feb 09, 2021 at 04:46:02PM -0500, Waiman Long wrote:
>> On 2/9/21 3:39 PM, Guenter Roeck wrote:
>>> On Tue, Feb 02, 2021 at 10:57:12AM -0800, Ben Gardon wrote:
>>>> rwlocks do not currently have any facility to detect contention
>>>> like spinlocks do. In order to allow users of rwlocks to better manage
>>>> latency, add contention detection for queued rwlocks.
>>>>
>>>> CC: Ingo Molnar <mingo@redhat.com>
>>>> CC: Will Deacon <will@kernel.org>
>>>> Acked-by: Peter Zijlstra <peterz@infradead.org>
>>>> Acked-by: Davidlohr Bueso <dbueso@suse.de>
>>>> Acked-by: Waiman Long <longman@redhat.com>
>>>> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
>>>> Signed-off-by: Ben Gardon <bgardon@google.com>
>>> When building mips:defconfig, this patch results in:
>>>
>>> Error log:
>>> In file included from include/linux/spinlock.h:90,
>>>                    from include/linux/ipc.h:5,
>>>                    from include/uapi/linux/sem.h:5,
>>>                    from include/linux/sem.h:5,
>>>                    from include/linux/compat.h:14,
>>>                    from arch/mips/kernel/asm-offsets.c:12:
>>> arch/mips/include/asm/spinlock.h:17:28: error: redefinition of 'queued_spin_unlock'
>>>      17 | #define queued_spin_unlock queued_spin_unlock
>>>         |                            ^~~~~~~~~~~~~~~~~~
>>> arch/mips/include/asm/spinlock.h:22:20: note: in expansion of macro 'queued_spin_unlock'
>>>      22 | static inline void queued_spin_unlock(struct qspinlock *lock)
>>>         |                    ^~~~~~~~~~~~~~~~~~
>>> In file included from include/asm-generic/qrwlock.h:17,
>>>                    from ./arch/mips/include/generated/asm/qrwlock.h:1,
>>>                    from arch/mips/include/asm/spinlock.h:13,
>>>                    from include/linux/spinlock.h:90,
>>>                    from include/linux/ipc.h:5,
>>>                    from include/uapi/linux/sem.h:5,
>>>                    from include/linux/sem.h:5,
>>>                    from include/linux/compat.h:14,
>>>                    from arch/mips/kernel/asm-offsets.c:12:
>>> include/asm-generic/qspinlock.h:94:29: note: previous definition of 'queued_spin_unlock' was here
>>>      94 | static __always_inline void queued_spin_unlock(struct qspinlock *lock)
>>>         |                             ^~~~~~~~~~~~~~~~~~
>> I think the compile error is caused by the improper header file inclusion
>> ordering. Can you try the following change to see if it can fix the compile
>> error?
>>
> That results in:
>
> In file included from ./arch/mips/include/generated/asm/qrwlock.h:1,
>                   from ./arch/mips/include/asm/spinlock.h:13,
>                   from ./include/linux/spinlock.h:90,
>                   from ./include/linux/ipc.h:5,
>                   from ./include/uapi/linux/sem.h:5,
>                   from ./include/linux/sem.h:5,
>                   from ./include/linux/compat.h:14,
>                   from arch/mips/kernel/asm-offsets.c:12:
> ./include/asm-generic/qrwlock.h: In function 'queued_rwlock_is_contended':
> ./include/asm-generic/qrwlock.h:127:9: error: implicit declaration of function 'arch_spin_is_locked'
>
> Guenter

It is because in arch/mips/include/asm/spinlock.h, asm/qrwlock.h is 
included before asm/qspinlock.h. The compilation error should be gone if 
the asm/qrwlock.h is removed or moved after asm/qspinlock.h.

I did a x86 build and there was no compilation issue.

Cheers,
Longman



* Re: [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks
  2021-02-10  0:27         ` Waiman Long
@ 2021-02-10  0:41           ` Waiman Long
  2021-02-10  6:04           ` Guenter Roeck
  1 sibling, 0 replies; 65+ messages in thread
From: Waiman Long @ 2021-02-10  0:41 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Ben Gardon, linux-kernel, kvm, Paolo Bonzini, Peter Xu,
	Sean Christopherson, Peter Shier, Peter Feiner, Junaid Shahid,
	Jim Mattson, Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov,
	Xiao Guangrong, Ingo Molnar, Will Deacon, Peter Zijlstra,
	Davidlohr Bueso

On 2/9/21 7:27 PM, Waiman Long wrote:
> On 2/9/21 5:25 PM, Guenter Roeck wrote:
>> On Tue, Feb 09, 2021 at 04:46:02PM -0500, Waiman Long wrote:
>>> On 2/9/21 3:39 PM, Guenter Roeck wrote:
>>>> On Tue, Feb 02, 2021 at 10:57:12AM -0800, Ben Gardon wrote:
>>>>> rwlocks do not currently have any facility to detect contention
>>>>> like spinlocks do. In order to allow users of rwlocks to better 
>>>>> manage
>>>>> latency, add contention detection for queued rwlocks.
>>>>>
>>>>> CC: Ingo Molnar <mingo@redhat.com>
>>>>> CC: Will Deacon <will@kernel.org>
>>>>> Acked-by: Peter Zijlstra <peterz@infradead.org>
>>>>> Acked-by: Davidlohr Bueso <dbueso@suse.de>
>>>>> Acked-by: Waiman Long <longman@redhat.com>
>>>>> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
>>>>> Signed-off-by: Ben Gardon <bgardon@google.com>
>>>> When building mips:defconfig, this patch results in:
>>>>
>>>> Error log:
>>>> In file included from include/linux/spinlock.h:90,
>>>>                    from include/linux/ipc.h:5,
>>>>                    from include/uapi/linux/sem.h:5,
>>>>                    from include/linux/sem.h:5,
>>>>                    from include/linux/compat.h:14,
>>>>                    from arch/mips/kernel/asm-offsets.c:12:
>>>> arch/mips/include/asm/spinlock.h:17:28: error: redefinition of 
>>>> 'queued_spin_unlock'
>>>>      17 | #define queued_spin_unlock queued_spin_unlock
>>>>         |                            ^~~~~~~~~~~~~~~~~~
>>>> arch/mips/include/asm/spinlock.h:22:20: note: in expansion of macro 
>>>> 'queued_spin_unlock'
>>>>      22 | static inline void queued_spin_unlock(struct qspinlock 
>>>> *lock)
>>>>         |                    ^~~~~~~~~~~~~~~~~~
>>>> In file included from include/asm-generic/qrwlock.h:17,
>>>>                    from ./arch/mips/include/generated/asm/qrwlock.h:1,
>>>>                    from arch/mips/include/asm/spinlock.h:13,
>>>>                    from include/linux/spinlock.h:90,
>>>>                    from include/linux/ipc.h:5,
>>>>                    from include/uapi/linux/sem.h:5,
>>>>                    from include/linux/sem.h:5,
>>>>                    from include/linux/compat.h:14,
>>>>                    from arch/mips/kernel/asm-offsets.c:12:
>>>> include/asm-generic/qspinlock.h:94:29: note: previous definition of 
>>>> 'queued_spin_unlock' was here
>>>>      94 | static __always_inline void queued_spin_unlock(struct 
>>>> qspinlock *lock)
>>>>         |                             ^~~~~~~~~~~~~~~~~~
>>> I think the compile error is caused by the improper header file 
>>> inclusion
>>> ordering. Can you try the following change to see if it can fix the 
>>> compile
>>> error?
>>>
>> That results in:
>>
>> In file included from ./arch/mips/include/generated/asm/qrwlock.h:1,
>>                   from ./arch/mips/include/asm/spinlock.h:13,
>>                   from ./include/linux/spinlock.h:90,
>>                   from ./include/linux/ipc.h:5,
>>                   from ./include/uapi/linux/sem.h:5,
>>                   from ./include/linux/sem.h:5,
>>                   from ./include/linux/compat.h:14,
>>                   from arch/mips/kernel/asm-offsets.c:12:
>> ./include/asm-generic/qrwlock.h: In function 
>> 'queued_rwlock_is_contended':
>> ./include/asm-generic/qrwlock.h:127:9: error: implicit declaration of 
>> function 'arch_spin_is_locked'
>>
>> Guenter
>
> It is because in arch/mips/include/asm/spinlock.h, asm/qrwlock.h is 
> included before asm/qspinlock.h. The compilation error should be gone 
> if the asm/qrwlock.h is removed or moved after asm/qspinlock.h. 

After thinking a bit more, I think we should remove asm/qrwlock.h from 
arch/mips/include/asm/spinlock.h. qrwlock and qspinlock are 
independent. An architecture can include one but not the other. Also 
there is no point in including qrwlock.h in asm/spinlock.h.

Regards,
Longman



* Re: [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks
  2021-02-02 18:57 ` [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks Ben Gardon
  2021-02-09 20:39   ` Guenter Roeck
@ 2021-02-10  3:32   ` Waiman Long
  2021-02-10 15:15     ` Waiman Long
  1 sibling, 1 reply; 65+ messages in thread
From: Waiman Long @ 2021-02-10  3:32 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ingo Molnar,
	Will Deacon, Peter Zijlstra, Davidlohr Bueso

On 2/2/21 1:57 PM, Ben Gardon wrote:
> rwlocks do not currently have any facility to detect contention
> like spinlocks do. In order to allow users of rwlocks to better manage
> latency, add contention detection for queued rwlocks.
>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Will Deacon <will@kernel.org>
> Acked-by: Peter Zijlstra <peterz@infradead.org>
> Acked-by: Davidlohr Bueso <dbueso@suse.de>
> Acked-by: Waiman Long <longman@redhat.com>
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>   include/asm-generic/qrwlock.h | 24 ++++++++++++++++++------
>   include/linux/rwlock.h        |  7 +++++++
>   2 files changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> index 84ce841ce735..0020d3b820a7 100644
> --- a/include/asm-generic/qrwlock.h
> +++ b/include/asm-generic/qrwlock.h
> @@ -14,6 +14,7 @@
>   #include <asm/processor.h>
>   
>   #include <asm-generic/qrwlock_types.h>
> +#include <asm-generic/qspinlock.h>

As said in another thread, qspinlock and qrwlock can be independently 
enabled for an architecture. So we shouldn't include qspinlock.h here. 
Instead, just include the regular linux/spinlock.h file to make sure 
that arch_spin_is_locked() is available.


>   
>   /*
>    * Writer states & reader shift and bias.
> @@ -116,15 +117,26 @@ static inline void queued_write_unlock(struct qrwlock *lock)
>   	smp_store_release(&lock->wlocked, 0);
>   }
>   
> +/**
> + * queued_rwlock_is_contended - check if the lock is contended
> + * @lock : Pointer to queue rwlock structure
> + * Return: 1 if lock contended, 0 otherwise
> + */
> +static inline int queued_rwlock_is_contended(struct qrwlock *lock)
> +{
> +	return arch_spin_is_locked(&lock->wait_lock);
> +}
> +
>   /*
>    * Remapping rwlock architecture specific functions to the corresponding
>    * queue rwlock functions.
>    */
> -#define arch_read_lock(l)	queued_read_lock(l)
> -#define arch_write_lock(l)	queued_write_lock(l)
> -#define arch_read_trylock(l)	queued_read_trylock(l)
> -#define arch_write_trylock(l)	queued_write_trylock(l)
> -#define arch_read_unlock(l)	queued_read_unlock(l)
> -#define arch_write_unlock(l)	queued_write_unlock(l)
> +#define arch_read_lock(l)		queued_read_lock(l)
> +#define arch_write_lock(l)		queued_write_lock(l)
> +#define arch_read_trylock(l)		queued_read_trylock(l)
> +#define arch_write_trylock(l)		queued_write_trylock(l)
> +#define arch_read_unlock(l)		queued_read_unlock(l)
> +#define arch_write_unlock(l)		queued_write_unlock(l)
> +#define arch_rwlock_is_contended(l)	queued_rwlock_is_contended(l)
>   
>   #endif /* __ASM_GENERIC_QRWLOCK_H */
> diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h
> index 3dcd617e65ae..7ce9a51ae5c0 100644
> --- a/include/linux/rwlock.h
> +++ b/include/linux/rwlock.h
> @@ -128,4 +128,11 @@ do {								\
>   	1 : ({ local_irq_restore(flags); 0; }); \
>   })
>   
> +#ifdef arch_rwlock_is_contended
> +#define rwlock_is_contended(lock) \
> +	 arch_rwlock_is_contended(&(lock)->raw_lock)
> +#else
> +#define rwlock_is_contended(lock)	((void)(lock), 0)
> +#endif /* arch_rwlock_is_contended */
> +
>   #endif /* __LINUX_RWLOCK_H */

Cheers,
Longman
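
For context, the consumer this series has in mind looks roughly like
the following sketch (the helper name is hypothetical;
cond_resched_rwlock_read() is added by patches 07-08 of the series):

/*
 * Yield the MMU read lock if this thread needs to reschedule or if
 * another thread is waiting on the lock, as detected by the new
 * rwlock_is_contended().
 */
static inline bool tdp_mmu_maybe_resched(struct kvm *kvm)
{
	if (need_resched() || rwlock_is_contended(&kvm->mmu_lock)) {
		cond_resched_rwlock_read(&kvm->mmu_lock);
		return true;
	}

	return false;
}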



* Re: [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks
  2021-02-10  0:27         ` Waiman Long
  2021-02-10  0:41           ` Waiman Long
@ 2021-02-10  6:04           ` Guenter Roeck
  2021-02-10 14:57             ` Waiman Long
  1 sibling, 1 reply; 65+ messages in thread
From: Guenter Roeck @ 2021-02-10  6:04 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ben Gardon, linux-kernel, kvm, Paolo Bonzini, Peter Xu,
	Sean Christopherson, Peter Shier, Peter Feiner, Junaid Shahid,
	Jim Mattson, Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov,
	Xiao Guangrong, Ingo Molnar, Will Deacon, Peter Zijlstra,
	Davidlohr Bueso

On 2/9/21 4:27 PM, Waiman Long wrote:
[ ... ]

> 
> It is because in arch/mips/include/asm/spinlock.h, asm/qrwlock.h is included before asm/qspinlock.h. The compilation error should be gone if the asm/qrwlock.h is removed or moved after asm/qspinlock.h.
> 
> I did a x86 build and there was no compilation issue.
> 
I can not really comment on what exactly is wrong - I don't know the code well
enough to do that - but I don't think this is a valid argument.

Anyway, it seems like mips is the only architecture affected by the problem.
I am not entirely sure, though - linux-next is too broken for that.

Thanks,
Guenter


* Re: [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks
  2021-02-10  6:04           ` Guenter Roeck
@ 2021-02-10 14:57             ` Waiman Long
  0 siblings, 0 replies; 65+ messages in thread
From: Waiman Long @ 2021-02-10 14:57 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Ben Gardon, linux-kernel, kvm, Paolo Bonzini, Peter Xu,
	Sean Christopherson, Peter Shier, Peter Feiner, Junaid Shahid,
	Jim Mattson, Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov,
	Xiao Guangrong, Ingo Molnar, Will Deacon, Peter Zijlstra,
	Davidlohr Bueso

On 2/10/21 1:04 AM, Guenter Roeck wrote:
> On 2/9/21 4:27 PM, Waiman Long wrote:
> [ ... ]
>
>> It is because in arch/mips/include/asm/spinlock.h, asm/qrwlock.h is included before asm/qspinlock.h. The compilation error should be gone if the asm/qrwlock.h is removed or moved after asm/qspinlock.h.
>>
>> I did a x86 build and there was no compilation issue.
>>
> I can not really comment on what exactly is wrong - I don't know the code well
> enough to do that - but I don't think this is a valid argument.
>
> Anyway, it seems like mips is the only architecture affected by the problem.
> I am not entirely sure, though - linux-next is too broken for that.

It does look like a rather common practice to include both qrwlock.h and 
qspinlock.h in an arch's asm/spinlock.h file. I have a patch to make sure 
that qrwlock.h is always included after qspinlock.h if present. Hopefully 
that can fix the compilation problem.

Cheers,
Longman
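
Concretely, the reordering described would leave mips with something
like this (a sketch of the resulting include order, not the actual
patch):

/* arch/mips/include/asm/spinlock.h */
#include <asm/processor.h>
#include <asm-generic/qspinlock_types.h>

/* ... the mips-specific queued_spin_unlock() override ... */

#include <asm/qspinlock.h>
#include <asm/qrwlock.h>	/* now sees arch_spin_is_locked() */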



* Re: [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks
  2021-02-10  3:32   ` Waiman Long
@ 2021-02-10 15:15     ` Waiman Long
  0 siblings, 0 replies; 65+ messages in thread
From: Waiman Long @ 2021-02-10 15:15 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ingo Molnar,
	Will Deacon, Peter Zijlstra, Davidlohr Bueso

On 2/9/21 10:32 PM, Waiman Long wrote:
> On 2/2/21 1:57 PM, Ben Gardon wrote:
>> rwlocks do not currently have any facility to detect contention
>> like spinlocks do. In order to allow users of rwlocks to better manage
>> latency, add contention detection for queued rwlocks.
>>
>> CC: Ingo Molnar <mingo@redhat.com>
>> CC: Will Deacon <will@kernel.org>
>> Acked-by: Peter Zijlstra <peterz@infradead.org>
>> Acked-by: Davidlohr Bueso <dbueso@suse.de>
>> Acked-by: Waiman Long <longman@redhat.com>
>> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Ben Gardon <bgardon@google.com>
>> ---
>>   include/asm-generic/qrwlock.h | 24 ++++++++++++++++++------
>>   include/linux/rwlock.h        |  7 +++++++
>>   2 files changed, 25 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/asm-generic/qrwlock.h 
>> b/include/asm-generic/qrwlock.h
>> index 84ce841ce735..0020d3b820a7 100644
>> --- a/include/asm-generic/qrwlock.h
>> +++ b/include/asm-generic/qrwlock.h
>> @@ -14,6 +14,7 @@
>>   #include <asm/processor.h>
>>     #include <asm-generic/qrwlock_types.h>
>> +#include <asm-generic/qspinlock.h>
>
> As said in another thread, qspinlock and qrwlock can be independently 
> enabled for an architecture. So we shouldn't include qspinlock.h here. 
> Instead, just include the regular linux/spinlock.h file to make sure 
> that arch_spin_is_locked() is available.

The csky architecture uses qrwlock but not qspinlock. So this patch can 
be problematic when compiling for csky.

Cheers,
Longman



* Re: [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
  2021-02-02 18:57 ` [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map Ben Gardon
  2021-02-03  2:48   ` kernel test robot
  2021-02-03 11:14   ` Paolo Bonzini
@ 2021-04-01 10:32   ` Paolo Bonzini
  2021-04-01 16:50     ` Ben Gardon
  2 siblings, 1 reply; 65+ messages in thread
From: Paolo Bonzini @ 2021-04-01 10:32 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 02/02/21 19:57, Ben Gardon wrote:
> @@ -720,7 +790,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>   		 */
>   		if (is_shadow_present_pte(iter.old_spte) &&
>   		    is_large_pte(iter.old_spte)) {
> -			tdp_mmu_set_spte(vcpu->kvm, &iter, 0);
> +			if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, 0))
> +				break;
>   
>   			kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn,
>   					KVM_PAGES_PER_HPAGE(iter.level));
>
>  			/*
>  			 * The iter must explicitly re-read the spte here
>  			 * because the new value informs the !present
>                          * path below.
>                          */
>                         iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
>                 }
> 
>                 if (!is_shadow_present_pte(iter.old_spte)) {

Would it be easier to reason about this code by making it retry, like:

retry:
                 if (is_shadow_present_pte(iter.old_spte)) {
			if (is_large_pte(iter.old_spte)) {
	                        if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
	                                break;

				/*
				 * The iter must explicitly re-read the SPTE because
				 * the atomic cmpxchg failed.
				 */
	                        iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
				goto retry;
			}
                 } else {
			...
		}

?

Paolo



* Re: [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
  2021-04-01 10:32   ` Paolo Bonzini
@ 2021-04-01 16:50     ` Ben Gardon
  2021-04-01 17:32       ` Paolo Bonzini
  0 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2021-04-01 16:50 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Thu, Apr 1, 2021 at 3:32 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 02/02/21 19:57, Ben Gardon wrote:
> > @@ -720,7 +790,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> >                */
> >               if (is_shadow_present_pte(iter.old_spte) &&
> >                   is_large_pte(iter.old_spte)) {
> > -                     tdp_mmu_set_spte(vcpu->kvm, &iter, 0);
> > +                     if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, 0))
> > +                             break;
> >
> >                       kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn,
> >                                       KVM_PAGES_PER_HPAGE(iter.level));
> >
> >                       /*
> >                        * The iter must explicitly re-read the spte here
> >                        * because the new value informs the !present
> >                          * path below.
> >                          */
> >                         iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> >                 }
> >
> >                 if (!is_shadow_present_pte(iter.old_spte)) {
>
> Would it be easier to reason about this code by making it retry, like:
>
> retry:
>                  if (is_shadow_present_pte(iter.old_spte)) {
>                         if (is_large_pte(iter.old_spte)) {
>                                 if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
>                                         break;
>
>                                 /*
>                                  * The iter must explicitly re-read the SPTE because
>                                  * the atomic cmpxchg failed.
>                                  */
>                                 iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
>                                 goto retry;
>                         }
>                  } else {
>                         ...
>                 }
>
> ?

To be honest, that feels less readable to me. For me retry implies
that we failed to make progress and need to repeat an operation, but
the reality is that we did make progress and there are just multiple
steps to replace the large SPTE with a child PT.

Another option which could improve readability and performance would
be to use the retry to repeat failed cmpxchgs instead of breaking out
of the loop. Then we could avoid retrying the page fault each time a
cmpxchg failed, which may happen a lot as vCPUs allocate intermediate
page tables on boot. (Probably less common for leaf entries, but
possibly useful there too.)

Another-nother option would be to remove this two-part process by
eagerly splitting large page mappings in a single step. This would
substantially reduce the number of page faults incurred for NX
splitting / dirty logging splitting. It's been on our list of features
to send upstream for a while and I hope we'll be able to get it into
shape and send it out reasonably soon.

>
> Paolo
>


* Re: [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
  2021-04-01 16:50     ` Ben Gardon
@ 2021-04-01 17:32       ` Paolo Bonzini
  2021-04-01 18:09         ` Sean Christopherson
  0 siblings, 1 reply; 65+ messages in thread
From: Paolo Bonzini @ 2021-04-01 17:32 UTC (permalink / raw)
  To: Ben Gardon
  Cc: LKML, kvm, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 01/04/21 18:50, Ben Gardon wrote:
>> retry:
>>                   if (is_shadow_present_pte(iter.old_spte)) {
>>                          if (is_large_pte(iter.old_spte)) {
>>                                  if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
>>                                          break;
>>
>>                                  /*
>>                                   * The iter must explicitly re-read the SPTE because
>>                                   * the atomic cmpxchg failed.
>>                                   */
>>                                  iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
>>                                  goto retry;
>>                          }
>>                   } else {
>>                          ...
>>                  }
>>
>> ?
> To be honest, that feels less readable to me. For me retry implies
> that we failed to make progress and need to repeat an operation, but
> the reality is that we did make progress and there are just multiple
> steps to replace the large SPTE with a child PT.

You're right, it makes no sense---I misremembered the direction of
tdp_mmu_zap_spte_atomic's return value.  I was actually thinking of this:

> Another option which could improve readability and performance would
> be to use the retry to repeat failed cmpxchgs instead of breaking out
> of the loop. Then we could avoid retrying the page fault each time a
> cmpxchg failed, which may happen a lot as vCPUs allocate intermediate
> page tables on boot. (Probably less common for leaf entries, but
> possibly useful there too.)

which would be

retry:
                  if (is_shadow_present_pte(iter.old_spte)) {
                        if (is_large_pte(iter.old_spte) &&
                            !tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter)) {
                                 /*
                                  * The iter must explicitly re-read the SPTE because
                                  * the atomic cmpxchg failed.
                                  */
                                 iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
                                 goto retry;
                             }
                             /* XXX move this to tdp_mmu_zap_spte_atomic? */
                             iter.old_spte = 0;
                        } else {
                             continue;
                        }
                  }
                  sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
                  child_pt = sp->spt;

                  new_spte = make_nonleaf_spte(child_pt,
                                               !shadow_accessed_mask);

                  if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter,
                                              new_spte)) {
                       tdp_mmu_free_sp(sp);
                       /*
                        * The iter must explicitly re-read the SPTE because
                        * the atomic cmpxchg failed.
                        */
                       iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
                       goto retry;
                  }
                  tdp_mmu_link_page(vcpu->kvm, sp, true,
                                    huge_page_disallowed &&
                                    req_level >= iter.level);

                  trace_kvm_mmu_get_page(sp, true);

which survives at least a quick smoke test of booting a 20-vCPU Windows
guest.  If you agree, I'll turn this into an actual patch.



* Re: [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
  2021-04-01 17:32       ` Paolo Bonzini
@ 2021-04-01 18:09         ` Sean Christopherson
  0 siblings, 0 replies; 65+ messages in thread
From: Sean Christopherson @ 2021-04-01 18:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ben Gardon, LKML, kvm, Peter Xu, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On Thu, Apr 01, 2021, Paolo Bonzini wrote:
> On 01/04/21 18:50, Ben Gardon wrote:
> > > retry:
> > >                   if (is_shadow_present_pte(iter.old_spte)) {
> > >                          if (is_large_pte(iter.old_spte)) {
> > >                                  if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
> > >                                          break;
> > > 
> > >                                  /*
> > >                                   * The iter must explicitly re-read the SPTE because
> > >                                   * the atomic cmpxchg failed.
> > >                                   */
> > >                                  iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> > >                                  goto retry;
> > >                          }
> > >                   } else {
> > >                          ...
> > >                  }
> > > 
> > > ?
> > To be honest, that feels less readable to me. For me retry implies
> > that we failed to make progress and need to repeat an operation, but
> > the reality is that we did make progress and there are just multiple
> > steps to replace the large SPTE with a child PT.
> 
> You're right, it makes no sense---I misremembered the direction of
> tdp_mmu_zap_spte_atomic's return value.  I was actually thinking of this:
> 
> > Another option which could improve readability and performance would
> > be to use the retry to repeat failed cmpxchgs instead of breaking out
> > of the loop. Then we could avoid retrying the page fault each time a
> > cmpxchg failed, which may happen a lot as vCPUs allocate intermediate
> > page tables on boot. (Probably less common for leaf entries, but
> > possibly useful there too.)
> 
> which would be
> 
> retry:
>                  if (is_shadow_present_pte(iter.old_spte)) {
>                        if (is_large_pte(iter.old_spte) &&
>                            !tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter)) {
>                                 /*
>                                  * The iter must explicitly re-read the SPTE because
>                                  * the atomic cmpxchg failed.
>                                  */
>                                 iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
>                                 goto retry;
>                             }
>                             /* XXX move this to tdp_mmu_zap_spte_atomic? */
>                             iter.old_spte = 0;
>                        } else {
>                             continue;

This is wrong.  If a large PTE is successfully zapped, it will leave a !PRESENT
intermediate entry.  It's probably not fatal; I'm guessing it would lead to
RET_PF_RETRY and be cleaned up on the subsequent re-fault.

>                        }
>                  }
>                  sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
>                  child_pt = sp->spt;
> 
>                  new_spte = make_nonleaf_spte(child_pt,
>                                               !shadow_accessed_mask);
> 
>                  if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter,
>                                              new_spte)) {
>                       tdp_mmu_free_sp(sp);
>                       /*
>                        * The iter must explicitly re-read the SPTE because
>                        * the atomic cmpxchg failed.
>                        */
>                       iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
>                       goto retry;

I'm not sure that _always_ retrying is correct.  The conflict means something
else is writing the same SPTE.  That could be a different vCPU handling an
identical fault, but it could also be something else blasting away the SPTE.
If an upper level SPTE was zapped, e.g. because the entire MMU instance is
being zapped, installing a new SPTE would be wrong.

AFAICT, the only motivation for retrying in this loop is to handle the case
where a different vCPU is handling an identical fault.  It should be safe to
handle that, but if the conflicting SPTE is not-present, I believe this needs
to break to handle any pending updates.

			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
			if (!is_shadow_present_pte(iter.old_spte))
				break;
			goto retry;
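
Folded into the install path above, that would look something like the
below (same names and retry: label as the snippet being discussed; an
untested sketch, not a finished patch):

	sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
	child_pt = sp->spt;
	new_spte = make_nonleaf_spte(child_pt, !shadow_accessed_mask);

	if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
		tdp_mmu_free_sp(sp);

		/*
		 * The cmpxchg failed; re-read the SPTE to see what won.
		 * If the new value is !PRESENT, something zapped the SPTE
		 * (or a parent), so break and let any pending updates be
		 * handled.  Otherwise a different vCPU likely installed
		 * the same SPTE, so retry.
		 */
		iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
		if (!is_shadow_present_pte(iter.old_spte))
			break;
		goto retry;
	}

	tdp_mmu_link_page(vcpu->kvm, sp, true,
			  huge_page_disallowed && req_level >= iter.level);
	trace_kvm_mmu_get_page(sp, true);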

>                  }
>                  tdp_mmu_link_page(vcpu->kvm, sp, true,
>                                    huge_page_disallowed &&
>                                    req_level >= iter.level);
> 
>                  trace_kvm_mmu_get_page(sp, true);
> 
> which survives at least a quick smoke test of booting a 20-vCPU Windows
> guest.  If you agree, I'll turn this into an actual patch.
> 

Thread overview: 65+ messages
2021-02-02 18:57 [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Ben Gardon
2021-02-02 18:57 ` [PATCH v2 01/28] KVM: x86/mmu: change TDP MMU yield function returns to match cond_resched Ben Gardon
2021-02-02 18:57 ` [PATCH v2 02/28] KVM: x86/mmu: Add comment on __tdp_mmu_set_spte Ben Gardon
2021-02-02 18:57 ` [PATCH v2 03/28] KVM: x86/mmu: Add lockdep when setting a TDP MMU SPTE Ben Gardon
2021-02-02 18:57 ` [PATCH v2 04/28] KVM: x86/mmu: Don't redundantly clear TDP MMU pt memory Ben Gardon
2021-02-02 18:57 ` [PATCH v2 05/28] KVM: x86/mmu: Factor out handling of removed page tables Ben Gardon
2021-02-02 18:57 ` [PATCH v2 06/28] locking/rwlocks: Add contention detection for rwlocks Ben Gardon
2021-02-09 20:39   ` Guenter Roeck
2021-02-09 21:46     ` Waiman Long
2021-02-09 22:25       ` Guenter Roeck
2021-02-10  0:27         ` Waiman Long
2021-02-10  0:41           ` Waiman Long
2021-02-10  6:04           ` Guenter Roeck
2021-02-10 14:57             ` Waiman Long
2021-02-10  3:32   ` Waiman Long
2021-02-10 15:15     ` Waiman Long
2021-02-02 18:57 ` [PATCH v2 07/28] sched: Add needbreak " Ben Gardon
2021-02-02 18:57 ` [PATCH v2 08/28] sched: Add cond_resched_rwlock Ben Gardon
2021-02-02 18:57 ` [PATCH v2 09/28] KVM: x86/mmu: Fix braces in kvm_recover_nx_lpages Ben Gardon
2021-02-02 18:57 ` [PATCH v2 10/28] KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEs Ben Gardon
2021-02-03  9:43   ` Paolo Bonzini
2021-02-02 18:57 ` [PATCH v2 11/28] KVM: x86/mmu: Merge flush and non-flush tdp_mmu_iter_cond_resched Ben Gardon
2021-02-02 18:57 ` [PATCH v2 12/28] KVM: x86/mmu: Rename goal_gfn to next_last_level_gfn Ben Gardon
2021-02-02 18:57 ` [PATCH v2 13/28] KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter Ben Gardon
2021-02-05 23:42   ` Sean Christopherson
2021-02-02 18:57 ` [PATCH v2 14/28] KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed Ben Gardon
2021-02-02 18:57 ` [PATCH v2 15/28] KVM: x86/mmu: Skip no-op changes in TDP MMU functions Ben Gardon
2021-02-02 18:57 ` [PATCH v2 16/28] KVM: x86/mmu: Clear dirtied pages mask bit before early break Ben Gardon
2021-02-02 18:57 ` [PATCH v2 17/28] KVM: x86/mmu: Protect TDP MMU page table memory with RCU Ben Gardon
2021-02-02 18:57 ` [PATCH v2 18/28] KVM: x86/mmu: Use an rwlock for the x86 MMU Ben Gardon
2021-02-02 18:57 ` [PATCH v2 19/28] KVM: x86/mmu: Factor out functions to add/remove TDP MMU pages Ben Gardon
2021-02-02 18:57 ` [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map Ben Gardon
2021-02-03  2:48   ` kernel test robot
2021-02-03 11:14   ` Paolo Bonzini
2021-02-06  0:26     ` Sean Christopherson
2021-02-08 10:32       ` Paolo Bonzini
2021-04-01 10:32   ` Paolo Bonzini
2021-04-01 16:50     ` Ben Gardon
2021-04-01 17:32       ` Paolo Bonzini
2021-04-01 18:09         ` Sean Christopherson
2021-02-02 18:57 ` [PATCH v2 21/28] KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler Ben Gardon
2021-02-06  0:29   ` Sean Christopherson
2021-02-02 18:57 ` [PATCH v2 22/28] KVM: x86/mmu: Mark SPTEs in disconnected pages as removed Ben Gardon
2021-02-03 11:17   ` Paolo Bonzini
2021-02-02 18:57 ` [PATCH v2 23/28] KVM: x86/mmu: Allow parallel page faults for the TDP MMU Ben Gardon
2021-02-03 12:39   ` Paolo Bonzini
2021-02-03 17:46     ` Ben Gardon
2021-02-03 18:30       ` Paolo Bonzini
2021-02-06  0:12         ` Sean Christopherson
2021-02-02 18:57 ` [PATCH v2 24/28] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock Ben Gardon
2021-02-03 11:25   ` Paolo Bonzini
2021-02-03 11:26   ` Paolo Bonzini
2021-02-03 18:31     ` Ben Gardon
2021-02-03 18:32       ` Paolo Bonzini
2021-02-02 18:57 ` [PATCH v2 25/28] KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU " Ben Gardon
2021-02-03 11:34   ` Paolo Bonzini
2021-02-03 18:51     ` Ben Gardon
2021-02-02 18:57 ` [PATCH v2 26/28] KVM: x86/mmu: Allow enabling / disabling dirty logging under " Ben Gardon
2021-02-03 11:38   ` Paolo Bonzini
2021-02-02 18:57 ` [PATCH v2 27/28] KVM: selftests: Add backing src parameter to dirty_log_perf_test Ben Gardon
2021-02-02 18:57 ` [PATCH v2 28/28] KVM: selftests: Disable dirty logging with vCPUs running Ben Gardon
2021-02-03 10:07   ` Paolo Bonzini
2021-02-03 11:00 ` [PATCH v2 00/28] Allow parallel MMU operations with TDP MMU Paolo Bonzini
2021-02-03 17:54   ` Sean Christopherson
2021-02-03 18:13     ` Paolo Bonzini
