KVM Archive on lore.kernel.org
* [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers
@ 2021-04-02  0:56 Sean Christopherson
  2021-04-02  0:56 ` [PATCH v2 01/10] KVM: Assert that notifier count is elevated in .change_pte() Sean Christopherson
                   ` (10 more replies)
  0 siblings, 11 replies; 24+ messages in thread
From: Sean Christopherson @ 2021-04-02  0:56 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

The end goal of this series is to optimize the MMU notifiers to take
mmu_lock if and only if the notification is relevant to KVM, i.e. the hva
range overlaps a memslot.  Large VMs (hundreds of vCPUs) are very
sensitive to mmu_lock being taken for write at inopportune times, and
such VMs also tend to be "static", e.g. backed by HugeTLB with minimal
page shenanigans.  The vast majority of notifications for these VMs will
be spurious (for KVM), and eliding mmu_lock for spurious notifications
avoids an otherwise unacceptable disruption to the guest.

To get there without potentially degrading performance, e.g. due to
multiple memslot lookups, especially on non-x86 where the use cases are
largely unknown (from my perspective), first consolidate the MMU notifier
logic by moving the hva->gfn lookups into common KVM.
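
For illustration only (not code from this series, and the helper name is
hypothetical), the end-state idea is to precompute whether the notified hva
range touches any memslot and elide mmu_lock entirely when it does not:

/*
 * Illustrative sketch: restates the memslot-overlap check as a standalone
 * predicate; the series folds this into the common handler instead.
 */
static bool hva_range_overlaps_memslot(struct kvm *kvm, unsigned long start,
                                       unsigned long end)
{
        struct kvm_memory_slot *slot;
        struct kvm_memslots *slots;
        int i;

        for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
                slots = __kvm_memslots(kvm, i);
                kvm_for_each_memslot(slot, slots) {
                        unsigned long hva_start, hva_end;

                        hva_start = max(start, slot->userspace_addr);
                        hva_end = min(end, slot->userspace_addr +
                                           (slot->npages << PAGE_SHIFT));
                        if (hva_start < hva_end)
                                return true;
                }
        }
        return false;
}

/* Hypothetical usage in a notifier handler. */
if (!hva_range_overlaps_memslot(kvm, start, end))
        return 0;

KVM_MMU_LOCK(kvm);
/* ... process only the overlapping memslots ... */
KVM_MMU_UNLOCK(kvm);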

Based on kvm/queue, commit 5f986f748438 ("KVM: x86: dump_vmcs should
include the autoload/autostore MSR lists").

Well tested on Intel and AMD.  Compile tested for arm64, MIPS, PPC,
PPC e500, and s390.  Absolutely needs to be tested for real on non-x86,
I give it even odds that I introduced an off-by-one bug somewhere.

v2:
 - Drop the patches that have already been pushed to kvm/queue.
 - Drop two selftest changes that had snuck in via "git commit -a".
 - Add a patch to assert that mmu_notifier_count is elevated when
   .change_pte() runs. [Paolo]
 - Split out moving KVM_MMU_(UN)LOCK() to __kvm_handle_hva_range() to a
   separate patch.  Opted not to squash it with the introduction of the
   common hva walkers (patch 02), as that prevented sharing code between
   the old and new APIs. [Paolo]
 - Tweak the comment in kvm_vm_destroy() above the smashing of the new
   slots lock. [Paolo]
 - Make mmu_notifier_slots_lock unconditional to avoid #ifdefs. [Paolo]

v1:
 - https://lkml.kernel.org/r/20210326021957.1424875-1-seanjc@google.com

Sean Christopherson (10):
  KVM: Assert that notifier count is elevated in .change_pte()
  KVM: Move x86's MMU notifier memslot walkers to generic code
  KVM: arm64: Convert to the gfn-based MMU notifier callbacks
  KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks
  KVM: PPC: Convert to the gfn-based MMU notifier callbacks
  KVM: Kill off the old hva-based MMU notifier callbacks
  KVM: Move MMU notifier's mmu_lock acquisition into common helper
  KVM: Take mmu_lock when handling MMU notifier iff the hva hits a
    memslot
  KVM: Don't take mmu_lock for range invalidation unless necessary
  KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if
    possible

 arch/arm64/kvm/mmu.c                   | 117 +++------
 arch/mips/kvm/mmu.c                    |  97 ++------
 arch/powerpc/include/asm/kvm_book3s.h  |  12 +-
 arch/powerpc/include/asm/kvm_ppc.h     |   9 +-
 arch/powerpc/kvm/book3s.c              |  18 +-
 arch/powerpc/kvm/book3s.h              |  10 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c    |  98 ++------
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  25 +-
 arch/powerpc/kvm/book3s_hv.c           |  12 +-
 arch/powerpc/kvm/book3s_pr.c           |  56 ++---
 arch/powerpc/kvm/e500_mmu_host.c       |  27 +-
 arch/x86/kvm/mmu/mmu.c                 | 127 ++++------
 arch/x86/kvm/mmu/tdp_mmu.c             | 245 +++++++------------
 arch/x86/kvm/mmu/tdp_mmu.h             |  14 +-
 include/linux/kvm_host.h               |  22 +-
 virt/kvm/kvm_main.c                    | 325 +++++++++++++++++++------
 16 files changed, 552 insertions(+), 662 deletions(-)

-- 
2.31.0.208.g409f899ff0-goog



* [PATCH v2 01/10] KVM: Assert that notifier count is elevated in .change_pte()
  2021-04-02  0:56 [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
@ 2021-04-02  0:56 ` Sean Christopherson
  2021-04-02 11:08   ` Paolo Bonzini
  2021-04-02  0:56 ` [PATCH v2 02/10] KVM: Move x86's MMU notifier memslot walkers to generic code Sean Christopherson
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2021-04-02  0:56 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

In KVM's .change_pte() notification callback, replace the notifier
sequence bump with a WARN_ON assertion that the notifier count is
elevated.  An elevated count provides stricter protections than bumping
the sequence, and the sequence is guaranteed to be bumped before the
count hits zero.

When .change_pte() was added by commit 828502d30073 ("ksm: add
mmu_notifier set_pte_at_notify()"), bumping the sequence was necessary
as .change_pte() would be invoked without any surrounding notifications.

However, since commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify
with invalidate_range_start and invalidate_range_end"), all calls to
.change_pte() are guaranteed to be bookended by start() and end(), and
so are guaranteed to run with an elevated notifier count.
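
For reference, the mm-side ordering that provides this guarantee looks
roughly as follows (a simplified sketch along the lines of KSM's
replace_page(); not code from this series, details elided):

        struct mmu_notifier_range range;

        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
                                addr, addr + PAGE_SIZE);
        mmu_notifier_invalidate_range_start(&range);  /* mmu_notifier_count++ */

        set_pte_at_notify(mm, addr, ptep, newpte);    /* invokes .change_pte() */

        mmu_notifier_invalidate_range_end(&range);    /* mmu_notifier_count-- */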

Note, wrapping .change_pte() with .invalidate_range_{start,end}() is a
bug of sorts, as invalidating the secondary MMU's (KVM's) PTE defeats
the purpose of .change_pte().  Every arch's kvm_set_spte_hva() assumes
.change_pte() is called when the relevant SPTE is present in KVM's MMU,
as the original goal was to accelerate Kernel Samepage Merging (KSM) by
updating KVM's SPTEs without requiring a VM-Exit (due to invalidating
the SPTE).  I.e. it means that .change_pte() is effectively dead code
on _all_ architectures.

x86 and MIPS are clearcut nops if the old SPTE is not-present, and that
is guaranteed due to the prior invalidation.  PPC simply unmaps the SPTE,
which again should be a nop due to the invalidation.  arm64 is a bit
murky, but it's also likely a nop because kvm_pgtable_stage2_map() is
called without a cache pointer, which means it will map an entry if and
only if an existing PTE was found.

For now, take advantage of the bug to simplify future consolidation of
KVM's MMU notifier code.  Doing so will not greatly complicate fixing
.change_pte(), assuming it's even worth fixing.  .change_pte() has been
broken for 8+ years and no one has complained.  Even if there are
KSM+KVM users that care deeply about its performance, the benefits of
avoiding VM-Exits via .change_pte() need to be reevaluated to justify
the added complexity and testing burden.  Ripping out .change_pte()
entirely would be a lot easier.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d1de843b7618..8df091950161 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -461,12 +461,17 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 
 	trace_kvm_set_spte_hva(address);
 
+	/*
+	 * .change_pte() must be bookended by .invalidate_range_{start,end}(),
+	 * and so always runs with an elevated notifier count.  This obviates
+	 * the need to bump the sequence count.
+	 */
+	WARN_ON_ONCE(!kvm->mmu_notifier_count);
+
 	idx = srcu_read_lock(&kvm->srcu);
 
 	KVM_MMU_LOCK(kvm);
 
-	kvm->mmu_notifier_seq++;
-
 	if (kvm_set_spte_hva(kvm, address, pte))
 		kvm_flush_remote_tlbs(kvm);
 
-- 
2.31.0.208.g409f899ff0-goog



* [PATCH v2 02/10] KVM: Move x86's MMU notifier memslot walkers to generic code
  2021-04-02  0:56 [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
  2021-04-02  0:56 ` [PATCH v2 01/10] KVM: Assert that notifier count is elevated in .change_pte() Sean Christopherson
@ 2021-04-02  0:56 ` Sean Christopherson
  2021-04-02  0:56 ` [PATCH v2 03/10] KVM: arm64: Convert to the gfn-based MMU notifier callbacks Sean Christopherson
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2021-04-02  0:56 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Move the hva->gfn lookup for MMU notifiers into common code.  Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.

In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.

The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.

Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.

Lastly, future enhancements to the memslot implementation, e.g. adding an
interval tree to track host addresses, will need to touch far less
arch-specific code.

MIPS, PPC, and arm64 will be converted one at a time in future patches.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |   1 +
 arch/x86/kvm/mmu/mmu.c          | 127 +++++++----------
 arch/x86/kvm/mmu/tdp_mmu.c      | 241 ++++++++++++--------------------
 arch/x86/kvm/mmu/tdp_mmu.h      |  14 +-
 include/linux/kvm_host.h        |  14 ++
 virt/kvm/kvm_main.c             | 169 +++++++++++++++++++++-
 6 files changed, 317 insertions(+), 249 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 99778ac51243..a21e3698f4dc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1718,6 +1718,7 @@ asmlinkage void kvm_spurious_fault(void);
 	_ASM_EXTABLE(666b, 667b)
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
+#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index efb41f31e80a..f2046c41eb93 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1298,26 +1298,25 @@ static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	return flush;
 }
 
-static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			   struct kvm_memory_slot *slot, gfn_t gfn, int level,
-			   unsigned long data)
+static bool kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+			    struct kvm_memory_slot *slot, gfn_t gfn, int level,
+			    pte_t unused)
 {
 	return kvm_zap_rmapp(kvm, rmap_head, slot);
 }
 
-static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			     struct kvm_memory_slot *slot, gfn_t gfn, int level,
-			     unsigned long data)
+static bool kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+			      struct kvm_memory_slot *slot, gfn_t gfn, int level,
+			      pte_t pte)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
 	int need_flush = 0;
 	u64 new_spte;
-	pte_t *ptep = (pte_t *)data;
 	kvm_pfn_t new_pfn;
 
-	WARN_ON(pte_huge(*ptep));
-	new_pfn = pte_pfn(*ptep);
+	WARN_ON(pte_huge(pte));
+	new_pfn = pte_pfn(pte);
 
 restart:
 	for_each_rmap_spte(rmap_head, &iter, sptep) {
@@ -1326,7 +1325,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 		need_flush = 1;
 
-		if (pte_write(*ptep)) {
+		if (pte_write(pte)) {
 			pte_list_remove(rmap_head, sptep);
 			goto restart;
 		} else {
@@ -1414,86 +1413,52 @@ static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
 	     slot_rmap_walk_okay(_iter_);				\
 	     slot_rmap_walk_next(_iter_))
 
-typedef int (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			      struct kvm_memory_slot *slot, gfn_t gfn,
-			      int level, unsigned long data);
+typedef bool (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+			       struct kvm_memory_slot *slot, gfn_t gfn,
+			       int level, pte_t pte);
 
-static __always_inline int kvm_handle_hva_range(struct kvm *kvm,
-						unsigned long start,
-						unsigned long end,
-						unsigned long data,
-						rmap_handler_t handler)
+static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
+						 struct kvm_gfn_range *range,
+						 rmap_handler_t handler)
 {
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *memslot;
 	struct slot_rmap_walk_iterator iterator;
-	int ret = 0;
-	int i;
+	bool ret = false;
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		slots = __kvm_memslots(kvm, i);
-		kvm_for_each_memslot(memslot, slots) {
-			unsigned long hva_start, hva_end;
-			gfn_t gfn_start, gfn_end;
-
-			hva_start = max(start, memslot->userspace_addr);
-			hva_end = min(end, memslot->userspace_addr +
-				      (memslot->npages << PAGE_SHIFT));
-			if (hva_start >= hva_end)
-				continue;
-			/*
-			 * {gfn(page) | page intersects with [hva_start, hva_end)} =
-			 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
-			 */
-			gfn_start = hva_to_gfn_memslot(hva_start, memslot);
-			gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
-
-			for_each_slot_rmap_range(memslot, PG_LEVEL_4K,
-						 KVM_MAX_HUGEPAGE_LEVEL,
-						 gfn_start, gfn_end - 1,
-						 &iterator)
-				ret |= handler(kvm, iterator.rmap, memslot,
-					       iterator.gfn, iterator.level, data);
-		}
-	}
+	for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
+				 range->start, range->end - 1, &iterator)
+		ret |= handler(kvm, iterator.rmap, range->slot, iterator.gfn,
+			       iterator.level, range->pte);
 
 	return ret;
 }
 
-static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
-			  unsigned long data, rmap_handler_t handler)
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm_handle_hva_range(kvm, hva, hva + 1, data, handler);
-}
-
-int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
-			unsigned flags)
-{
-	int r;
+	bool flush;
 
-	r = kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
+	flush = kvm_handle_gfn_range(kvm, range, kvm_unmap_rmapp);
 
 	if (is_tdp_mmu_enabled(kvm))
-		r |= kvm_tdp_mmu_zap_hva_range(kvm, start, end);
+		flush |= kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush);
 
-	return r;
+	return flush;
 }
 
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	int r;
+	bool flush;
 
-	r = kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
+	flush = kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmapp);
 
 	if (is_tdp_mmu_enabled(kvm))
-		r |= kvm_tdp_mmu_set_spte_hva(kvm, hva, &pte);
+		flush |= kvm_tdp_mmu_set_spte_gfn(kvm, range);
 
-	return r;
+	return flush;
 }
 
-static int kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			 struct kvm_memory_slot *slot, gfn_t gfn, int level,
-			 unsigned long data)
+static bool kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+			  struct kvm_memory_slot *slot, gfn_t gfn, int level,
+			  pte_t unused)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1505,9 +1470,9 @@ static int kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	return young;
 }
 
-static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			      struct kvm_memory_slot *slot, gfn_t gfn,
-			      int level, unsigned long data)
+static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+			       struct kvm_memory_slot *slot, gfn_t gfn,
+			       int level, pte_t unused)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1529,29 +1494,31 @@ static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
 
 	rmap_head = gfn_to_rmap(vcpu->kvm, gfn, sp);
 
-	kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, 0);
+	kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
 	kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
 			KVM_PAGES_PER_HPAGE(sp->role.level));
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	int young = false;
+	bool young;
+
+	young = kvm_handle_gfn_range(kvm, range, kvm_age_rmapp);
 
-	young = kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);
 	if (is_tdp_mmu_enabled(kvm))
-		young |= kvm_tdp_mmu_age_hva_range(kvm, start, end);
+		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
 
 	return young;
 }
 
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	int young = false;
+	bool young;
+
+	young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmapp);
 
-	young = kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
 	if (is_tdp_mmu_enabled(kvm))
-		young |= kvm_tdp_mmu_test_age_hva(kvm, hva);
+		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
 
 	return young;
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index eeb644d2eb6f..7797d24f0937 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -878,142 +878,118 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	return ret;
 }
 
-typedef int (*tdp_handler_t)(struct kvm *kvm, struct kvm_memory_slot *slot,
-			     struct kvm_mmu_page *root, gfn_t start, gfn_t end,
-			     unsigned long data);
-
-static __always_inline int kvm_tdp_mmu_handle_hva_range(struct kvm *kvm,
-							unsigned long start,
-							unsigned long end,
-							unsigned long data,
-							tdp_handler_t handler)
+bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
+				 bool flush)
 {
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *memslot;
 	struct kvm_mmu_page *root;
-	int ret = 0;
-	int as_id;
 
-	for (as_id = 0; as_id < KVM_ADDRESS_SPACE_NUM; as_id++) {
-		for_each_tdp_mmu_root_yield_safe(kvm, root, as_id) {
-			slots = __kvm_memslots(kvm, as_id);
-			kvm_for_each_memslot(memslot, slots) {
-				unsigned long hva_start, hva_end;
-				gfn_t gfn_start, gfn_end;
+	for_each_tdp_mmu_root(kvm, root, range->slot->as_id)
+		flush |= zap_gfn_range(kvm, root, range->start, range->end,
+				       false, flush);
+
+	return flush;
+}
+
+typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
+			      struct kvm_gfn_range *range);
+
+static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
+						   struct kvm_gfn_range *range,
+						   tdp_handler_t handler)
+{
+	struct kvm_mmu_page *root;
+	struct tdp_iter iter;
+	bool ret = false;
 
-				hva_start = max(start, memslot->userspace_addr);
-				hva_end = min(end, memslot->userspace_addr +
-					(memslot->npages << PAGE_SHIFT));
-				if (hva_start >= hva_end)
-					continue;
-				/*
-				 * {gfn(page) | page intersects with [hva_start, hva_end)} =
-				 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
-				 */
-				gfn_start = hva_to_gfn_memslot(hva_start, memslot);
-				gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
+	rcu_read_lock();
 
-				ret |= handler(kvm, memslot, root, gfn_start,
-					gfn_end, data);
-			}
-		}
+	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
+			ret |= handler(kvm, &iter, range);
 	}
 
+	rcu_read_unlock();
+
 	return ret;
 }
 
-static __always_inline int kvm_tdp_mmu_handle_hva(struct kvm *kvm,
-						  unsigned long addr,
-						  unsigned long data,
-						  tdp_handler_t handler)
-{
-	return kvm_tdp_mmu_handle_hva_range(kvm, addr, addr + 1, data, handler);
-}
-
-static int zap_gfn_range_hva_wrapper(struct kvm *kvm,
-				     struct kvm_memory_slot *slot,
-				     struct kvm_mmu_page *root, gfn_t start,
-				     gfn_t end, unsigned long unused)
-{
-	return zap_gfn_range(kvm, root, start, end, false, false);
-}
-
-int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
-			      unsigned long end)
-{
-	return kvm_tdp_mmu_handle_hva_range(kvm, start, end, 0,
-					    zap_gfn_range_hva_wrapper);
-}
-
 /*
  * Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
  * if any of the GFNs in the range have been accessed.
  */
-static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
-			 struct kvm_mmu_page *root, gfn_t start, gfn_t end,
-			 unsigned long unused)
+static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
+			  struct kvm_gfn_range *range)
 {
-	struct tdp_iter iter;
-	int young = 0;
-	u64 new_spte;
+	u64 new_spte = 0;
 
-	rcu_read_lock();
+	/* If we have a non-accessed entry we don't need to change the pte. */
+	if (!is_accessed_spte(iter->old_spte))
+		return false;
 
-	tdp_root_for_each_leaf_pte(iter, root, start, end) {
+	new_spte = iter->old_spte;
+
+	if (spte_ad_enabled(new_spte)) {
+		new_spte &= ~shadow_accessed_mask;
+	} else {
 		/*
-		 * If we have a non-accessed entry we don't need to change the
-		 * pte.
+		 * Capture the dirty status of the page, so that it doesn't get
+		 * lost when the SPTE is marked for access tracking.
 		 */
-		if (!is_accessed_spte(iter.old_spte))
-			continue;
-
-		new_spte = iter.old_spte;
-
-		if (spte_ad_enabled(new_spte)) {
-			new_spte &= ~shadow_accessed_mask;
-		} else {
-			/*
-			 * Capture the dirty status of the page, so that it doesn't get
-			 * lost when the SPTE is marked for access tracking.
-			 */
-			if (is_writable_pte(new_spte))
-				kvm_set_pfn_dirty(spte_to_pfn(new_spte));
-
-			new_spte = mark_spte_for_access_track(new_spte);
-		}
-
-		tdp_mmu_set_spte_no_acc_track(kvm, &iter, new_spte);
-		young = 1;
+		if (is_writable_pte(new_spte))
+			kvm_set_pfn_dirty(spte_to_pfn(new_spte));
+
+		new_spte = mark_spte_for_access_track(new_spte);
 	}
 
-	rcu_read_unlock();
+	tdp_mmu_set_spte_no_acc_track(kvm, iter, new_spte);
 
-	return young;
+	return true;
 }
 
-int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
-			      unsigned long end)
+bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm_tdp_mmu_handle_hva_range(kvm, start, end, 0,
-					    age_gfn_range);
+	return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
 }
 
-static int test_age_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
-			struct kvm_mmu_page *root, gfn_t gfn, gfn_t end,
-			unsigned long unused)
+static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
+			 struct kvm_gfn_range *range)
 {
-	struct tdp_iter iter;
-
-	tdp_root_for_each_leaf_pte(iter, root, gfn, end)
-		if (is_accessed_spte(iter.old_spte))
-			return 1;
+	return is_accessed_spte(iter->old_spte);
+}
 
-	return 0;
+bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
 }
 
-int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva)
+static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
+			 struct kvm_gfn_range *range)
 {
-	return kvm_tdp_mmu_handle_hva(kvm, hva, 0, test_age_gfn);
+	u64 new_spte;
+
+	/* Huge pages aren't expected to be modified without first being zapped. */
+	WARN_ON(pte_huge(range->pte) || range->start + 1 != range->end);
+
+	if (iter->level != PG_LEVEL_4K ||
+	    !is_shadow_present_pte(iter->old_spte))
+		return false;
+
+	/*
+	 * Note, when changing a read-only SPTE, it's not strictly necessary to
+	 * zero the SPTE before setting the new PFN, but doing so preserves the
+	 * invariant that the PFN of a present leaf SPTE can never change.
+	 * See __handle_changed_spte().
+	 */
+	tdp_mmu_set_spte(kvm, iter, 0);
+
+	if (!pte_write(range->pte)) {
+		new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
+								  pte_pfn(range->pte));
+
+		tdp_mmu_set_spte(kvm, iter, new_spte);
+	}
+
+	return true;
 }
 
 /*
@@ -1022,60 +998,15 @@ int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva)
  * notifier.
  * Returns non-zero if a flush is needed before releasing the MMU lock.
  */
-static int set_tdp_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
-			struct kvm_mmu_page *root, gfn_t gfn, gfn_t end,
-			unsigned long data)
+bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	struct tdp_iter iter;
-	pte_t *ptep = (pte_t *)data;
-	kvm_pfn_t new_pfn;
-	u64 new_spte;
-	int need_flush = 0;
+	bool flush = kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
 
-	rcu_read_lock();
+	/* FIXME: return 'flush' instead of flushing here. */
+	if (flush)
+		kvm_flush_remote_tlbs_with_address(kvm, range->start, 1);
 
-	WARN_ON(pte_huge(*ptep) || (gfn + 1) != end);
-
-	new_pfn = pte_pfn(*ptep);
-
-	tdp_root_for_each_leaf_pte(iter, root, gfn, gfn + 1) {
-		if (iter.level != PG_LEVEL_4K)
-			continue;
-
-		if (!is_shadow_present_pte(iter.old_spte))
-			break;
-
-		/*
-		 * Note, when changing a read-only SPTE, it's not strictly
-		 * necessary to zero the SPTE before setting the new PFN, but
-		 * doing so preserves the invariant that the PFN of a present
-		 * leaf SPTE can never change.  See __handle_changed_spte().
-		 */
-		tdp_mmu_set_spte(kvm, &iter, 0);
-
-		if (!pte_write(*ptep)) {
-			new_spte = kvm_mmu_changed_pte_notifier_make_spte(
-					iter.old_spte, new_pfn);
-
-			tdp_mmu_set_spte(kvm, &iter, new_spte);
-		}
-
-		need_flush = 1;
-	}
-
-	if (need_flush)
-		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
-
-	rcu_read_unlock();
-
-	return 0;
-}
-
-int kvm_tdp_mmu_set_spte_hva(struct kvm *kvm, unsigned long address,
-			     pte_t *host_ptep)
-{
-	return kvm_tdp_mmu_handle_hva(kvm, address, (unsigned long)host_ptep,
-				      set_tdp_spte);
+	return false;
 }
 
 /*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index bf3ce169122e..ee8efa58902f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -38,15 +38,11 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		    int map_writable, int max_level, kvm_pfn_t pfn,
 		    bool prefault);
 
-int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
-			      unsigned long end);
-
-int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
-			      unsigned long end);
-int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva);
-
-int kvm_tdp_mmu_set_spte_hva(struct kvm *kvm, unsigned long address,
-			     pte_t *host_ptep);
+bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
+				 bool flush);
+bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 
 bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
 			     int min_level);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e6d77353025c..e6bb401dd856 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -219,11 +219,25 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+struct kvm_gfn_range {
+	struct kvm_memory_slot *slot;
+	gfn_t start;
+	gfn_t end;
+	pte_t pte;
+	bool may_block;
+};
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+#else
 int kvm_unmap_hva_range(struct kvm *kvm,
 			unsigned long start, unsigned long end, unsigned flags);
 int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
+#endif /* KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS */
 #endif
 
 enum {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8df091950161..7a7e62ae5eb4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -451,14 +451,131 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+
+typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+
+struct kvm_hva_range {
+	unsigned long start;
+	unsigned long end;
+	pte_t pte;
+	hva_handler_t handler;
+	bool flush_on_ret;
+	bool may_block;
+};
+
+static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
+						  const struct kvm_hva_range *range)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	struct kvm_gfn_range gfn_range;
+	bool ret = false;
+	int i, idx;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	idx = srcu_read_lock(&kvm->srcu);
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+		kvm_for_each_memslot(slot, slots) {
+			unsigned long hva_start, hva_end;
+
+			hva_start = max(range->start, slot->userspace_addr);
+			hva_end = min(range->end, slot->userspace_addr +
+						  (slot->npages << PAGE_SHIFT));
+			if (hva_start >= hva_end)
+				continue;
+
+			/*
+			 * To optimize for the likely case where the address
+			 * range is covered by zero or one memslots, don't
+			 * bother making these conditional (to avoid writes on
+			 * the second or later invocation of the handler).
+			 */
+			gfn_range.pte = range->pte;
+			gfn_range.may_block = range->may_block;
+
+			/*
+			 * {gfn(page) | page intersects with [hva_start, hva_end)} =
+			 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
+			 */
+			gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
+			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
+			gfn_range.slot = slot;
+
+			ret |= range->handler(kvm, &gfn_range);
+		}
+	}
+
+	if (range->flush_on_ret && (ret || kvm->tlbs_dirty))
+		kvm_flush_remote_tlbs(kvm);
+
+	srcu_read_unlock(&kvm->srcu, idx);
+
+	/* The notifiers are averse to booleans. :-( */
+	return (int)ret;
+}
+
+static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
+						unsigned long start,
+						unsigned long end,
+						pte_t pte,
+						hva_handler_t handler)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_hva_range range = {
+		.start		= start,
+		.end		= end,
+		.pte		= pte,
+		.handler	= handler,
+		.flush_on_ret	= true,
+		.may_block	= false,
+	};
+	int ret;
+
+	KVM_MMU_LOCK(kvm);
+	ret = __kvm_handle_hva_range(kvm, &range);
+	KVM_MMU_UNLOCK(kvm);
+
+	return ret;
+}
+
+static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
+							 unsigned long start,
+							 unsigned long end,
+							 hva_handler_t handler)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_hva_range range = {
+		.start		= start,
+		.end		= end,
+		.pte		= __pte(0),
+		.handler	= handler,
+		.flush_on_ret	= false,
+		.may_block	= false,
+	};
+	int ret;
+
+	KVM_MMU_LOCK(kvm);
+	ret = __kvm_handle_hva_range(kvm, &range);
+	KVM_MMU_UNLOCK(kvm);
+
+	return ret;
+}
+#endif /* KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS */
+
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
 					pte_t pte)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	int idx;
 
+#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	int idx;
+#endif
 	trace_kvm_set_spte_hva(address);
 
 	/*
@@ -468,6 +585,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	 */
 	WARN_ON_ONCE(!kvm->mmu_notifier_count);
 
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
+#else
 	idx = srcu_read_lock(&kvm->srcu);
 
 	KVM_MMU_LOCK(kvm);
@@ -477,17 +597,32 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 
 	KVM_MMU_UNLOCK(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
+#endif
 }
 
 static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	const struct kvm_hva_range hva_range = {
+		.start		= range->start,
+		.end		= range->end,
+		.pte		= __pte(0),
+		.handler	= kvm_unmap_gfn_range,
+		.flush_on_ret	= true,
+		.may_block	= mmu_notifier_range_blockable(range),
+	};
+#else
 	int need_tlb_flush = 0, idx;
+#endif
 
 	trace_kvm_unmap_hva_range(range->start, range->end);
 
+#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	idx = srcu_read_lock(&kvm->srcu);
+#endif
+
 	KVM_MMU_LOCK(kvm);
 	/*
 	 * The count increase must become visible at unlock time as no
@@ -513,14 +648,21 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		kvm->mmu_notifier_range_end =
 			max(kvm->mmu_notifier_range_end, range->end);
 	}
+
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	__kvm_handle_hva_range(kvm, &hva_range);
+#else
 	need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end,
 					     range->flags);
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush || kvm->tlbs_dirty)
 		kvm_flush_remote_tlbs(kvm);
+#endif
 
 	KVM_MMU_UNLOCK(kvm);
+#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	srcu_read_unlock(&kvm->srcu, idx);
+#endif
 
 	return 0;
 }
@@ -554,11 +696,15 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 					      unsigned long start,
 					      unsigned long end)
 {
+#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int young, idx;
-
+#endif
 	trace_kvm_age_hva(start, end);
 
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	return kvm_handle_hva_range(mn, start, end, __pte(0), kvm_age_gfn);
+#else
 	idx = srcu_read_lock(&kvm->srcu);
 	KVM_MMU_LOCK(kvm);
 
@@ -570,6 +716,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
+#endif
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
@@ -577,13 +724,13 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 					unsigned long start,
 					unsigned long end)
 {
+#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int young, idx;
+#endif
 
 	trace_kvm_age_hva(start, end);
 
-	idx = srcu_read_lock(&kvm->srcu);
-	KVM_MMU_LOCK(kvm);
 	/*
 	 * Even though we do not flush TLB, this will still adversely
 	 * affect performance on pre-Haswell Intel EPT, where there is
@@ -597,22 +744,33 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	 * cadence. If we find this inaccurate, we might come up with a
 	 * more sophisticated heuristic later.
 	 */
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+#else
+	idx = srcu_read_lock(&kvm->srcu);
+	KVM_MMU_LOCK(kvm);
 	young = kvm_age_hva(kvm, start, end);
 	KVM_MMU_UNLOCK(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
+#endif
 }
 
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long address)
 {
+#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int young, idx;
-
+#endif
 	trace_kvm_test_age_hva(address);
 
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	return kvm_handle_hva_range_no_flush(mn, address, address + 1,
+					     kvm_test_age_gfn);
+#else
 	idx = srcu_read_lock(&kvm->srcu);
 	KVM_MMU_LOCK(kvm);
 	young = kvm_test_age_hva(kvm, address);
@@ -620,6 +778,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
+#endif
 }
 
 static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
-- 
2.31.0.208.g409f899ff0-goog



* [PATCH v2 03/10] KVM: arm64: Convert to the gfn-based MMU notifier callbacks
  2021-04-02  0:56 [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
  2021-04-02  0:56 ` [PATCH v2 01/10] KVM: Assert that notifier count is elevated in .change_pte() Sean Christopherson
  2021-04-02  0:56 ` [PATCH v2 02/10] KVM: Move x86's MMU notifier memslot walkers to generic code Sean Christopherson
@ 2021-04-02  0:56 ` Sean Christopherson
  2021-04-12 10:12   ` Marc Zyngier
  2021-04-02  0:56 ` [PATCH v2 04/10] KVM: MIPS/MMU: " Sean Christopherson
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2021-04-02  0:56 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Move arm64 to the gfn-based MMU notifier APIs, which do the hva->gfn
lookup in common code.

No meaningful functional change intended, though the exact order of
operations is slightly different since the memslot lookups occur before
calling into arch code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/include/asm/kvm_host.h |   1 +
 arch/arm64/kvm/mmu.c              | 117 ++++++++----------------------
 2 files changed, 33 insertions(+), 85 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 72e6b4600264..1ad729cf7b0d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -582,6 +582,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
 			      struct kvm_vcpu_events *events);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
+#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 4b7e1e327337..35728231e9a0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -839,7 +839,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * gfn_to_pfn_prot (which calls get_user_pages), so that we don't risk
 	 * the page we just got a reference to gets unmapped before we have a
 	 * chance to grab the mmu_lock, which ensure that if the page gets
-	 * unmapped afterwards, the call to kvm_unmap_hva will take it away
+	 * unmapped afterwards, the call to kvm_unmap_gfn will take it away
 	 * from us again properly. This smp_rmb() interacts with the smp_wmb()
 	 * in kvm_mmu_notifier_invalidate_<page|range_end>.
 	 */
@@ -1064,123 +1064,70 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 	return ret;
 }
 
-static int handle_hva_to_gpa(struct kvm *kvm,
-			     unsigned long start,
-			     unsigned long end,
-			     int (*handler)(struct kvm *kvm,
-					    gpa_t gpa, u64 size,
-					    void *data),
-			     void *data)
-{
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *memslot;
-	int ret = 0;
-
-	slots = kvm_memslots(kvm);
-
-	/* we only care about the pages that the guest sees */
-	kvm_for_each_memslot(memslot, slots) {
-		unsigned long hva_start, hva_end;
-		gfn_t gpa;
-
-		hva_start = max(start, memslot->userspace_addr);
-		hva_end = min(end, memslot->userspace_addr +
-					(memslot->npages << PAGE_SHIFT));
-		if (hva_start >= hva_end)
-			continue;
-
-		gpa = hva_to_gfn_memslot(hva_start, memslot) << PAGE_SHIFT;
-		ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);
-	}
-
-	return ret;
-}
-
-static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
-{
-	unsigned flags = *(unsigned *)data;
-	bool may_block = flags & MMU_NOTIFIER_RANGE_BLOCKABLE;
-
-	__unmap_stage2_range(&kvm->arch.mmu, gpa, size, may_block);
-	return 0;
-}
-
-int kvm_unmap_hva_range(struct kvm *kvm,
-			unsigned long start, unsigned long end, unsigned flags)
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	if (!kvm->arch.mmu.pgt)
 		return 0;
 
-	handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, &flags);
-	return 0;
-}
+	__unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT,
+			     (range->end - range->start) << PAGE_SHIFT,
+			     range->may_block);
 
-static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
-{
-	kvm_pfn_t *pfn = (kvm_pfn_t *)data;
-
-	WARN_ON(size != PAGE_SIZE);
-
-	/*
-	 * The MMU notifiers will have unmapped a huge PMD before calling
-	 * ->change_pte() (which in turn calls kvm_set_spte_hva()) and
-	 * therefore we never need to clear out a huge PMD through this
-	 * calling path and a memcache is not required.
-	 */
-	kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, gpa, PAGE_SIZE,
-			       __pfn_to_phys(*pfn), KVM_PGTABLE_PROT_R, NULL);
 	return 0;
 }
 
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	unsigned long end = hva + PAGE_SIZE;
-	kvm_pfn_t pfn = pte_pfn(pte);
+	kvm_pfn_t pfn = pte_pfn(range->pte);
 
 	if (!kvm->arch.mmu.pgt)
 		return 0;
 
+	WARN_ON(range->end - range->start != 1);
+
 	/*
 	 * We've moved a page around, probably through CoW, so let's treat it
 	 * just like a translation fault and clean the cache to the PoC.
 	 */
 	clean_dcache_guest_page(pfn, PAGE_SIZE);
-	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pfn);
+
+	/*
+	 * The MMU notifiers will have unmapped a huge PMD before calling
+	 * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
+	 * therefore we never need to clear out a huge PMD through this
+	 * calling path and a memcache is not required.
+	 */
+	kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
+			       PAGE_SIZE, __pfn_to_phys(pfn),
+			       KVM_PGTABLE_PROT_R, NULL);
+
 	return 0;
 }
 
-static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	pte_t pte;
+	u64 size = (range->end - range->start) << PAGE_SHIFT;
 	kvm_pte_t kpte;
+	pte_t pte;
+
+	if (!kvm->arch.mmu.pgt)
+		return 0;
 
 	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
-	kpte = kvm_pgtable_stage2_mkold(kvm->arch.mmu.pgt, gpa);
+
+	kpte = kvm_pgtable_stage2_mkold(kvm->arch.mmu.pgt,
+					range->start << PAGE_SHIFT);
 	pte = __pte(kpte);
 	return pte_valid(pte) && pte_young(pte);
 }
 
-static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
-{
-	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
-	return kvm_pgtable_stage2_is_young(kvm->arch.mmu.pgt, gpa);
-}
-
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
-{
-	if (!kvm->arch.mmu.pgt)
-		return 0;
-
-	return handle_hva_to_gpa(kvm, start, end, kvm_age_hva_handler, NULL);
-}
-
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	if (!kvm->arch.mmu.pgt)
 		return 0;
 
-	return handle_hva_to_gpa(kvm, hva, hva + PAGE_SIZE,
-				 kvm_test_age_hva_handler, NULL);
+	return kvm_pgtable_stage2_is_young(kvm->arch.mmu.pgt,
+					   range->start << PAGE_SHIFT);
 }
 
 phys_addr_t kvm_mmu_get_httbr(void)
-- 
2.31.0.208.g409f899ff0-goog



* [PATCH v2 04/10] KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks
  2021-04-02  0:56 [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (2 preceding siblings ...)
  2021-04-02  0:56 ` [PATCH v2 03/10] KVM: arm64: Convert to the gfn-based MMU notifier callbacks Sean Christopherson
@ 2021-04-02  0:56 ` Sean Christopherson
  2021-04-02  0:56 ` [PATCH v2 05/10] KVM: PPC: " Sean Christopherson
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2021-04-02  0:56 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Move MIPS to the gfn-based MMU notifier APIs, which do the hva->gfn
lookup in common code; the common lookup logic is nearly identical to
MIPS' existing lookup.

No meaningful functional change intended, though the exact order of
operations is slightly different since the memslot lookups occur before
calling into arch code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/mips/include/asm/kvm_host.h |  1 +
 arch/mips/kvm/mmu.c              | 97 ++++++--------------------------
 2 files changed, 17 insertions(+), 81 deletions(-)

diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index feaa77036b67..374a3c8806e8 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -967,6 +967,7 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct kvm_vcpu *vcpu,
 						   bool write);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
+#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 /* Emulation */
 int kvm_get_inst(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index 3dabeda82458..3dc885df2e32 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -439,85 +439,36 @@ static int kvm_mips_mkold_gpa_pt(struct kvm *kvm, gfn_t start_gfn,
 				  end_gfn << PAGE_SHIFT);
 }
 
-static int handle_hva_to_gpa(struct kvm *kvm,
-			     unsigned long start,
-			     unsigned long end,
-			     int (*handler)(struct kvm *kvm, gfn_t gfn,
-					    gpa_t gfn_end,
-					    struct kvm_memory_slot *memslot,
-					    void *data),
-			     void *data)
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *memslot;
-	int ret = 0;
-
-	slots = kvm_memslots(kvm);
-
-	/* we only care about the pages that the guest sees */
-	kvm_for_each_memslot(memslot, slots) {
-		unsigned long hva_start, hva_end;
-		gfn_t gfn, gfn_end;
-
-		hva_start = max(start, memslot->userspace_addr);
-		hva_end = min(end, memslot->userspace_addr +
-					(memslot->npages << PAGE_SHIFT));
-		if (hva_start >= hva_end)
-			continue;
-
-		/*
-		 * {gfn(page) | page intersects with [hva_start, hva_end)} =
-		 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
-		 */
-		gfn = hva_to_gfn_memslot(hva_start, memslot);
-		gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
-
-		ret |= handler(kvm, gfn, gfn_end, memslot, data);
-	}
-
-	return ret;
-}
-
-
-static int kvm_unmap_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
-				 struct kvm_memory_slot *memslot, void *data)
-{
-	kvm_mips_flush_gpa_pt(kvm, gfn, gfn_end);
-	return 1;
-}
-
-int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
-			unsigned flags)
-{
-	handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, NULL);
+	kvm_mips_flush_gpa_pt(kvm, range->start, range->end);
 
 	kvm_mips_callbacks->flush_shadow_all(kvm);
 	return 0;
 }
 
-static int kvm_set_spte_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
-				struct kvm_memory_slot *memslot, void *data)
+static bool __kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	gpa_t gpa = gfn << PAGE_SHIFT;
-	pte_t hva_pte = *(pte_t *)data;
+	gpa_t gpa = range->start << PAGE_SHIFT;
+	pte_t hva_pte = range->pte;
 	pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
 	pte_t old_pte;
 
 	if (!gpa_pte)
-		return 0;
+		return false;
 
 	/* Mapping may need adjusting depending on memslot flags */
 	old_pte = *gpa_pte;
-	if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte))
+	if (range->slot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte))
 		hva_pte = pte_mkclean(hva_pte);
-	else if (memslot->flags & KVM_MEM_READONLY)
+	else if (range->slot->flags & KVM_MEM_READONLY)
 		hva_pte = pte_wrprotect(hva_pte);
 
 	set_pte(gpa_pte, hva_pte);
 
 	/* Replacing an absent or old page doesn't need flushes */
 	if (!pte_present(old_pte) || !pte_young(old_pte))
-		return 0;
+		return false;
 
 	/* Pages swapped, aged, moved, or cleaned require flushes */
 	return !pte_present(hva_pte) ||
@@ -526,27 +477,21 @@ static int kvm_set_spte_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
 	       (pte_dirty(old_pte) && !pte_dirty(hva_pte));
 }
 
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	unsigned long end = hva + PAGE_SIZE;
-	int ret;
-
-	ret = handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pte);
-	if (ret)
+	if (__kvm_set_spte_gfn(kvm, range))
 		kvm_mips_callbacks->flush_shadow_all(kvm);
-	return 0;
+	return false;
 }
 
-static int kvm_age_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
-			       struct kvm_memory_slot *memslot, void *data)
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm_mips_mkold_gpa_pt(kvm, gfn, gfn_end);
+	return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end);
 }
 
-static int kvm_test_age_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
-				    struct kvm_memory_slot *memslot, void *data)
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	gpa_t gpa = gfn << PAGE_SHIFT;
+	gpa_t gpa = range->start << PAGE_SHIFT;
 	pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
 
 	if (!gpa_pte)
@@ -554,16 +499,6 @@ static int kvm_test_age_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
 	return pte_young(*gpa_pte);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
-{
-	return handle_hva_to_gpa(kvm, start, end, kvm_age_hva_handler, NULL);
-}
-
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
-{
-	return handle_hva_to_gpa(kvm, hva, hva, kvm_test_age_hva_handler, NULL);
-}
-
 /**
  * _kvm_mips_map_page_fast() - Fast path GPA fault handler.
  * @vcpu:		VCPU pointer.
-- 
2.31.0.208.g409f899ff0-goog



* [PATCH v2 05/10] KVM: PPC: Convert to the gfn-based MMU notifier callbacks
  2021-04-02  0:56 [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (3 preceding siblings ...)
  2021-04-02  0:56 ` [PATCH v2 04/10] KVM: MIPS/MMU: " Sean Christopherson
@ 2021-04-02  0:56 ` Sean Christopherson
  2021-04-02  0:56 ` [PATCH v2 06/10] KVM: Kill off the old hva-based " Sean Christopherson
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2021-04-02  0:56 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Move PPC to the gfn-based MMU notifier APIs, and update all 15 bajillion
PPC-internal hooks to work with gfns instead of hvas.

No meaningful functional change intended, though the exact order of
operations is slightly different since the memslot lookups occur before
calling into arch code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/powerpc/include/asm/kvm_book3s.h  | 12 ++--
 arch/powerpc/include/asm/kvm_host.h    |  1 +
 arch/powerpc/include/asm/kvm_ppc.h     |  9 ++-
 arch/powerpc/kvm/book3s.c              | 18 +++--
 arch/powerpc/kvm/book3s.h              | 10 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c    | 98 +++++++-------------------
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 25 +++----
 arch/powerpc/kvm/book3s_hv.c           | 12 ++--
 arch/powerpc/kvm/book3s_pr.c           | 56 +++++----------
 arch/powerpc/kvm/e500_mmu_host.c       | 27 +++----
 10 files changed, 95 insertions(+), 173 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 2f5f919f6cd3..2d03f2930767 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -210,12 +210,12 @@ extern void kvmppc_free_pgtable_radix(struct kvm *kvm, pgd_t *pgd,
 				      unsigned int lpid);
 extern int kvmppc_radix_init(void);
 extern void kvmppc_radix_exit(void);
-extern int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			unsigned long gfn);
-extern int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			unsigned long gfn);
-extern int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			unsigned long gfn);
+extern bool kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			    unsigned long gfn);
+extern bool kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			  unsigned long gfn);
+extern bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			       unsigned long gfn);
 extern long kvmppc_hv_get_dirty_log_radix(struct kvm *kvm,
 			struct kvm_memory_slot *memslot, unsigned long *map);
 extern void kvmppc_radix_flush_memslot(struct kvm *kvm,
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 1e83359f286b..1335f0001bdd 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -55,6 +55,7 @@
 #include <linux/mmu_notifier.h>
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
+#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 #define HPTEG_CACHE_NUM			(1 << 15)
 #define HPTEG_HASH_BITS_PTE		13
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 8aacd76bb702..21ab0332eb42 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -281,11 +281,10 @@ struct kvmppc_ops {
 				     const struct kvm_memory_slot *old,
 				     const struct kvm_memory_slot *new,
 				     enum kvm_mr_change change);
-	int (*unmap_hva_range)(struct kvm *kvm, unsigned long start,
-			   unsigned long end);
-	int (*age_hva)(struct kvm *kvm, unsigned long start, unsigned long end);
-	int (*test_age_hva)(struct kvm *kvm, unsigned long hva);
-	void (*set_spte_hva)(struct kvm *kvm, unsigned long hva, pte_t pte);
+	bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
+	bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
+	bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
+	bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
 	void (*free_memslot)(struct kvm_memory_slot *slot);
 	int (*init_vm)(struct kvm *kvm);
 	void (*destroy_vm)(struct kvm *kvm);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 44bf567b6589..2b691f4d1f26 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -834,26 +834,24 @@ void kvmppc_core_commit_memory_region(struct kvm *kvm,
 	kvm->arch.kvm_ops->commit_memory_region(kvm, mem, old, new, change);
 }
 
-int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
-			unsigned flags)
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end);
+	return kvm->arch.kvm_ops->unmap_gfn_range(kvm, range);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm->arch.kvm_ops->age_hva(kvm, start, end);
+	return kvm->arch.kvm_ops->age_gfn(kvm, range);
 }
 
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm->arch.kvm_ops->test_age_hva(kvm, hva);
+	return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
 }
 
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	kvm->arch.kvm_ops->set_spte_hva(kvm, hva, pte);
-	return 0;
+	return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
 }
 
 int kvmppc_core_init_vm(struct kvm *kvm)
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 9b6323ec8e60..740e51def5a5 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -9,12 +9,10 @@
 
 extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 					 struct kvm_memory_slot *memslot);
-extern int kvm_unmap_hva_range_hv(struct kvm *kvm, unsigned long start,
-				  unsigned long end);
-extern int kvm_age_hva_hv(struct kvm *kvm, unsigned long start,
-			  unsigned long end);
-extern int kvm_test_age_hva_hv(struct kvm *kvm, unsigned long hva);
-extern void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long hva, pte_t pte);
+extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
+extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
+extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
+extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 
 extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
 extern void kvmppc_mmu_destroy_pr(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index bb6773594cf8..b7bd9ca040b8 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -752,51 +752,6 @@ void kvmppc_rmap_reset(struct kvm *kvm)
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
 }
 
-typedef int (*hva_handler_fn)(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			      unsigned long gfn);
-
-static int kvm_handle_hva_range(struct kvm *kvm,
-				unsigned long start,
-				unsigned long end,
-				hva_handler_fn handler)
-{
-	int ret;
-	int retval = 0;
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *memslot;
-
-	slots = kvm_memslots(kvm);
-	kvm_for_each_memslot(memslot, slots) {
-		unsigned long hva_start, hva_end;
-		gfn_t gfn, gfn_end;
-
-		hva_start = max(start, memslot->userspace_addr);
-		hva_end = min(end, memslot->userspace_addr +
-					(memslot->npages << PAGE_SHIFT));
-		if (hva_start >= hva_end)
-			continue;
-		/*
-		 * {gfn(page) | page intersects with [hva_start, hva_end)} =
-		 * {gfn, gfn+1, ..., gfn_end-1}.
-		 */
-		gfn = hva_to_gfn_memslot(hva_start, memslot);
-		gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
-
-		for (; gfn < gfn_end; ++gfn) {
-			ret = handler(kvm, memslot, gfn);
-			retval |= ret;
-		}
-	}
-
-	return retval;
-}
-
-static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
-			  hva_handler_fn handler)
-{
-	return kvm_handle_hva_range(kvm, hva, hva + 1, handler);
-}
-
 /* Must be called with both HPTE and rmap locked */
 static void kvmppc_unmap_hpte(struct kvm *kvm, unsigned long i,
 			      struct kvm_memory_slot *memslot,
@@ -840,8 +795,8 @@ static void kvmppc_unmap_hpte(struct kvm *kvm, unsigned long i,
 	}
 }
 
-static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			   unsigned long gfn)
+static bool kvm_unmap_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			    unsigned long gfn)
 {
 	unsigned long i;
 	__be64 *hptep;
@@ -874,16 +829,15 @@ static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		unlock_rmap(rmapp);
 		__unlock_hpte(hptep, be64_to_cpu(hptep[0]));
 	}
-	return 0;
+	return false;
 }
 
-int kvm_unmap_hva_range_hv(struct kvm *kvm, unsigned long start, unsigned long end)
+bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	hva_handler_fn handler;
+	if (kvm_is_radix(kvm))
+		return kvm_unmap_radix(kvm, range->slot, range->start);
 
-	handler = kvm_is_radix(kvm) ? kvm_unmap_radix : kvm_unmap_rmapp;
-	kvm_handle_hva_range(kvm, start, end, handler);
-	return 0;
+	return kvm_unmap_rmapp(kvm, range->slot, range->start);
 }
 
 void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
@@ -913,8 +867,8 @@ void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 	}
 }
 
-static int kvm_age_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			 unsigned long gfn)
+static bool kvm_age_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			  unsigned long gfn)
 {
 	struct revmap_entry *rev = kvm->arch.hpt.rev;
 	unsigned long head, i, j;
@@ -968,26 +922,26 @@ static int kvm_age_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	return ret;
 }
 
-int kvm_age_hva_hv(struct kvm *kvm, unsigned long start, unsigned long end)
+bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	hva_handler_fn handler;
+	if (kvm_is_radix(kvm))
+		return kvm_age_radix(kvm, range->slot, range->start);
 
-	handler = kvm_is_radix(kvm) ? kvm_age_radix : kvm_age_rmapp;
-	return kvm_handle_hva_range(kvm, start, end, handler);
+	return kvm_age_rmapp(kvm, range->slot, range->start);
 }
 
-static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			      unsigned long gfn)
+static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			       unsigned long gfn)
 {
 	struct revmap_entry *rev = kvm->arch.hpt.rev;
 	unsigned long head, i, j;
 	unsigned long *hp;
-	int ret = 1;
+	bool ret = true;
 	unsigned long *rmapp;
 
 	rmapp = &memslot->arch.rmap[gfn - memslot->base_gfn];
 	if (*rmapp & KVMPPC_RMAP_REFERENCED)
-		return 1;
+		return true;
 
 	lock_rmap(rmapp);
 	if (*rmapp & KVMPPC_RMAP_REFERENCED)
@@ -1002,27 +956,27 @@ static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
 				goto out;
 		} while ((i = j) != head);
 	}
-	ret = 0;
+	ret = false;
 
  out:
 	unlock_rmap(rmapp);
 	return ret;
 }
 
-int kvm_test_age_hva_hv(struct kvm *kvm, unsigned long hva)
+bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	hva_handler_fn handler;
+	if (kvm_is_radix(kvm))
+		return kvm_test_age_radix(kvm, range->slot, range->start);
 
-	handler = kvm_is_radix(kvm) ? kvm_test_age_radix : kvm_test_age_rmapp;
-	return kvm_handle_hva(kvm, hva, handler);
+	return kvm_test_age_rmapp(kvm, range->slot, range->start);
 }
 
-void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long hva, pte_t pte)
+bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	hva_handler_fn handler;
+	if (kvm_is_radix(kvm))
+		return kvm_unmap_radix(kvm, range->slot, range->start);
 
-	handler = kvm_is_radix(kvm) ? kvm_unmap_radix : kvm_unmap_rmapp;
-	kvm_handle_hva(kvm, hva, handler);
+	return kvm_unmap_rmapp(kvm, range->slot, range->start);
 }
 
 static int vcpus_running(struct kvm *kvm)
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index e603de7ade52..ec4f58fa9f5a 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -993,8 +993,8 @@ int kvmppc_book3s_radix_page_fault(struct kvm_vcpu *vcpu,
 }
 
 /* Called with kvm->mmu_lock held */
-int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-		    unsigned long gfn)
+bool kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+		     unsigned long gfn)
 {
 	pte_t *ptep;
 	unsigned long gpa = gfn << PAGE_SHIFT;
@@ -1002,24 +1002,24 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 
 	if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE) {
 		uv_page_inval(kvm->arch.lpid, gpa, PAGE_SHIFT);
-		return 0;
+		return false;
 	}
 
 	ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
 	if (ptep && pte_present(*ptep))
 		kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
 				 kvm->arch.lpid);
-	return 0;
+	return false;
 }
 
 /* Called with kvm->mmu_lock held */
-int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-		  unsigned long gfn)
+bool kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+		   unsigned long gfn)
 {
 	pte_t *ptep;
 	unsigned long gpa = gfn << PAGE_SHIFT;
 	unsigned int shift;
-	int ref = 0;
+	bool ref = false;
 	unsigned long old, *rmapp;
 
 	if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
@@ -1035,26 +1035,27 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		kvmhv_update_nest_rmap_rc_list(kvm, rmapp, _PAGE_ACCESSED, 0,
 					       old & PTE_RPN_MASK,
 					       1UL << shift);
-		ref = 1;
+		ref = true;
 	}
 	return ref;
 }
 
 /* Called with kvm->mmu_lock held */
-int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-		       unsigned long gfn)
+bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			unsigned long gfn)
+
 {
 	pte_t *ptep;
 	unsigned long gpa = gfn << PAGE_SHIFT;
 	unsigned int shift;
-	int ref = 0;
+	bool ref = false;
 
 	if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
 		return ref;
 
 	ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
 	if (ptep && pte_present(*ptep) && pte_young(*ptep))
-		ref = 1;
+		ref = true;
 	return ref;
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 13bad6bf4c95..07682ad4110e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4770,7 +4770,7 @@ int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
 		kvmhv_release_all_nested(kvm);
 	kvmppc_rmap_reset(kvm);
 	kvm->arch.process_table = 0;
-	/* Mutual exclusion with kvm_unmap_hva_range etc. */
+	/* Mutual exclusion with kvm_unmap_gfn_range etc. */
 	spin_lock(&kvm->mmu_lock);
 	kvm->arch.radix = 0;
 	spin_unlock(&kvm->mmu_lock);
@@ -4792,7 +4792,7 @@ int kvmppc_switch_mmu_to_radix(struct kvm *kvm)
 	if (err)
 		return err;
 	kvmppc_rmap_reset(kvm);
-	/* Mutual exclusion with kvm_unmap_hva_range etc. */
+	/* Mutual exclusion with kvm_unmap_gfn_range etc. */
 	spin_lock(&kvm->mmu_lock);
 	kvm->arch.radix = 1;
 	spin_unlock(&kvm->mmu_lock);
@@ -5654,10 +5654,10 @@ static struct kvmppc_ops kvm_ops_hv = {
 	.flush_memslot  = kvmppc_core_flush_memslot_hv,
 	.prepare_memory_region = kvmppc_core_prepare_memory_region_hv,
 	.commit_memory_region  = kvmppc_core_commit_memory_region_hv,
-	.unmap_hva_range = kvm_unmap_hva_range_hv,
-	.age_hva  = kvm_age_hva_hv,
-	.test_age_hva = kvm_test_age_hva_hv,
-	.set_spte_hva = kvm_set_spte_hva_hv,
+	.unmap_gfn_range = kvm_unmap_gfn_range_hv,
+	.age_gfn = kvm_age_gfn_hv,
+	.test_age_gfn = kvm_test_age_gfn_hv,
+	.set_spte_gfn = kvm_set_spte_gfn_hv,
 	.free_memslot = kvmppc_core_free_memslot_hv,
 	.init_vm =  kvmppc_core_init_vm_hv,
 	.destroy_vm = kvmppc_core_destroy_vm_hv,
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 913944dc3620..d7733b07f489 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -425,61 +425,39 @@ static int kvmppc_core_check_requests_pr(struct kvm_vcpu *vcpu)
 }
 
 /************* MMU Notifiers *************/
-static void do_kvm_unmap_hva(struct kvm *kvm, unsigned long start,
-			     unsigned long end)
+static bool do_kvm_unmap_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	long i;
 	struct kvm_vcpu *vcpu;
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *memslot;
 
-	slots = kvm_memslots(kvm);
-	kvm_for_each_memslot(memslot, slots) {
-		unsigned long hva_start, hva_end;
-		gfn_t gfn, gfn_end;
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvmppc_mmu_pte_pflush(vcpu, range->start << PAGE_SHIFT,
+				      range->end << PAGE_SHIFT);
 
-		hva_start = max(start, memslot->userspace_addr);
-		hva_end = min(end, memslot->userspace_addr +
-					(memslot->npages << PAGE_SHIFT));
-		if (hva_start >= hva_end)
-			continue;
-		/*
-		 * {gfn(page) | page intersects with [hva_start, hva_end)} =
-		 * {gfn, gfn+1, ..., gfn_end-1}.
-		 */
-		gfn = hva_to_gfn_memslot(hva_start, memslot);
-		gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
-		kvm_for_each_vcpu(i, vcpu, kvm)
-			kvmppc_mmu_pte_pflush(vcpu, gfn << PAGE_SHIFT,
-					      gfn_end << PAGE_SHIFT);
-	}
+	return false;
 }
 
-static int kvm_unmap_hva_range_pr(struct kvm *kvm, unsigned long start,
-				  unsigned long end)
+static bool kvm_unmap_gfn_range_pr(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	do_kvm_unmap_hva(kvm, start, end);
-
-	return 0;
+	return do_kvm_unmap_gfn(kvm, range);
 }
 
-static int kvm_age_hva_pr(struct kvm *kvm, unsigned long start,
-			  unsigned long end)
+static bool kvm_age_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* XXX could be more clever ;) */
-	return 0;
+	return false;
 }
 
-static int kvm_test_age_hva_pr(struct kvm *kvm, unsigned long hva)
+static bool kvm_test_age_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* XXX could be more clever ;) */
-	return 0;
+	return false;
 }
 
-static void kvm_set_spte_hva_pr(struct kvm *kvm, unsigned long hva, pte_t pte)
+static bool kvm_set_spte_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* The page will get remapped properly on its next fault */
-	do_kvm_unmap_hva(kvm, hva, hva + PAGE_SIZE);
+	return do_kvm_unmap_gfn(kvm, range);
 }
 
 /*****************************************/
@@ -2079,10 +2057,10 @@ static struct kvmppc_ops kvm_ops_pr = {
 	.flush_memslot = kvmppc_core_flush_memslot_pr,
 	.prepare_memory_region = kvmppc_core_prepare_memory_region_pr,
 	.commit_memory_region = kvmppc_core_commit_memory_region_pr,
-	.unmap_hva_range = kvm_unmap_hva_range_pr,
-	.age_hva  = kvm_age_hva_pr,
-	.test_age_hva = kvm_test_age_hva_pr,
-	.set_spte_hva = kvm_set_spte_hva_pr,
+	.unmap_gfn_range = kvm_unmap_gfn_range_pr,
+	.age_gfn  = kvm_age_gfn_pr,
+	.test_age_gfn = kvm_test_age_gfn_pr,
+	.set_spte_gfn = kvm_set_spte_gfn_pr,
 	.free_memslot = kvmppc_core_free_memslot_pr,
 	.init_vm = kvmppc_core_init_vm_pr,
 	.destroy_vm = kvmppc_core_destroy_vm_pr,
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 648aefe1a3e7..7f16afc331ef 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -721,43 +721,36 @@ int kvmppc_load_last_inst(struct kvm_vcpu *vcpu,
 
 /************* MMU Notifiers *************/
 
-static int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
+static bool kvm_e500_mmu_unmap_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/*
 	 * Flush all shadow tlb entries everywhere. This is slow, but
 	 * we are 100% sure that we catch the to be unmapped page
 	 */
-	kvm_flush_remote_tlbs(kvm);
-
-	return 0;
+	return true;
 }
 
-int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
-			unsigned flags)
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	/* kvm_unmap_hva flushes everything anyways */
-	kvm_unmap_hva(kvm, start);
-
-	return 0;
+	return kvm_e500_mmu_unmap_gfn(kvm, range);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* XXX could be more clever ;) */
-	return 0;
+	return false;
 }
 
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* XXX could be more clever ;) */
-	return 0;
+	return false;
 }
 
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* The page will get remapped properly on its next fault */
-	kvm_unmap_hva(kvm, hva);
-	return 0;
+	return kvm_e500_mmu_unmap_gfn(kvm, range);
 }
 
 /*****************************************/
-- 
2.31.0.208.g409f899ff0-goog


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 06/10] KVM: Kill off the old hva-based MMU notifier callbacks
  2021-04-02  0:56 [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (4 preceding siblings ...)
  2021-04-02  0:56 ` [PATCH v2 05/10] KVM: PPC: " Sean Christopherson
@ 2021-04-02  0:56 ` Sean Christopherson
  2021-04-02  0:56 ` [PATCH v2 07/10] KVM: Move MMU notifier's mmu_lock acquisition into common helper Sean Christopherson
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2021-04-02  0:56 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Yank out the hva-based MMU notifier APIs now that all architectures that
use the notifiers have moved to the gfn-based APIs.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
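For reviewers skimming the removals, the per-arch MMU notifier surface left
behind is just the gfn-based hooks.  A sketch of the declarations follows;
the kvm_gfn_range field list is reconstructed from how the series uses it,
so treat it as illustrative rather than authoritative:

struct kvm_gfn_range {
	struct kvm_memory_slot *slot;
	gfn_t start;
	gfn_t end;
	pte_t pte;
	bool may_block;
};

bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);

Each hook operates on gfns in [start, end) within a single memslot and
returns true if the caller needs to flush TLBs.
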
 arch/arm64/include/asm/kvm_host.h   |  1 -
 arch/mips/include/asm/kvm_host.h    |  1 -
 arch/powerpc/include/asm/kvm_host.h |  1 -
 arch/x86/include/asm/kvm_host.h     |  1 -
 include/linux/kvm_host.h            |  8 ---
 virt/kvm/kvm_main.c                 | 85 -----------------------------
 6 files changed, 97 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 1ad729cf7b0d..72e6b4600264 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -582,7 +582,6 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
 			      struct kvm_vcpu_events *events);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 374a3c8806e8..feaa77036b67 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -967,7 +967,6 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct kvm_vcpu *vcpu,
 						   bool write);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 /* Emulation */
 int kvm_get_inst(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 1335f0001bdd..1e83359f286b 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -55,7 +55,6 @@
 #include <linux/mmu_notifier.h>
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 #define HPTEG_CACHE_NUM			(1 << 15)
 #define HPTEG_HASH_BITS_PTE		13
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a21e3698f4dc..99778ac51243 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1718,7 +1718,6 @@ asmlinkage void kvm_spurious_fault(void);
 	_ASM_EXTABLE(666b, 667b)
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e6bb401dd856..40ac2d40bb5a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -219,7 +219,6 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
 	gfn_t start;
@@ -231,13 +230,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-#else
-int kvm_unmap_hva_range(struct kvm *kvm,
-			unsigned long start, unsigned long end, unsigned flags);
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
-#endif /* KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS */
 #endif
 
 enum {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7a7e62ae5eb4..2e809d73c7f1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -451,8 +451,6 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
-
 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 struct kvm_hva_range {
@@ -564,8 +562,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 
 	return ret;
 }
-#endif /* KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS */
-
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
@@ -573,9 +569,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
-	int idx;
-#endif
 	trace_kvm_set_spte_hva(address);
 
 	/*
@@ -585,26 +578,13 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	 */
 	WARN_ON_ONCE(!kvm->mmu_notifier_count);
 
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
-#else
-	idx = srcu_read_lock(&kvm->srcu);
-
-	KVM_MMU_LOCK(kvm);
-
-	if (kvm_set_spte_hva(kvm, address, pte))
-		kvm_flush_remote_tlbs(kvm);
-
-	KVM_MMU_UNLOCK(kvm);
-	srcu_read_unlock(&kvm->srcu, idx);
-#endif
 }
 
 static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	const struct kvm_hva_range hva_range = {
 		.start		= range->start,
 		.end		= range->end,
@@ -613,16 +593,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
-#else
-	int need_tlb_flush = 0, idx;
-#endif
 
 	trace_kvm_unmap_hva_range(range->start, range->end);
 
-#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
-	idx = srcu_read_lock(&kvm->srcu);
-#endif
-
 	KVM_MMU_LOCK(kvm);
 	/*
 	 * The count increase must become visible at unlock time as no
@@ -649,20 +622,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 			max(kvm->mmu_notifier_range_end, range->end);
 	}
 
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	__kvm_handle_hva_range(kvm, &hva_range);
-#else
-	need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end,
-					     range->flags);
-	/* we've to flush the tlb before the pages can be freed */
-	if (need_tlb_flush || kvm->tlbs_dirty)
-		kvm_flush_remote_tlbs(kvm);
-#endif
 
 	KVM_MMU_UNLOCK(kvm);
-#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
-	srcu_read_unlock(&kvm->srcu, idx);
-#endif
 
 	return 0;
 }
@@ -696,27 +658,9 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 					      unsigned long start,
 					      unsigned long end)
 {
-#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	int young, idx;
-#endif
 	trace_kvm_age_hva(start, end);
 
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	return kvm_handle_hva_range(mn, start, end, __pte(0), kvm_age_gfn);
-#else
-	idx = srcu_read_lock(&kvm->srcu);
-	KVM_MMU_LOCK(kvm);
-
-	young = kvm_age_hva(kvm, start, end);
-	if (young)
-		kvm_flush_remote_tlbs(kvm);
-
-	KVM_MMU_UNLOCK(kvm);
-	srcu_read_unlock(&kvm->srcu, idx);
-
-	return young;
-#endif
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
@@ -724,11 +668,6 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 					unsigned long start,
 					unsigned long end)
 {
-#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	int young, idx;
-#endif
-
 	trace_kvm_age_hva(start, end);
 
 	/*
@@ -744,41 +683,17 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	 * cadence. If we find this inaccurate, we might come up with a
 	 * more sophisticated heuristic later.
 	 */
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
-#else
-	idx = srcu_read_lock(&kvm->srcu);
-	KVM_MMU_LOCK(kvm);
-	young = kvm_age_hva(kvm, start, end);
-	KVM_MMU_UNLOCK(kvm);
-	srcu_read_unlock(&kvm->srcu, idx);
-
-	return young;
-#endif
 }
 
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long address)
 {
-#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	int young, idx;
-#endif
 	trace_kvm_test_age_hva(address);
 
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	return kvm_handle_hva_range_no_flush(mn, address, address + 1,
 					     kvm_test_age_gfn);
-#else
-	idx = srcu_read_lock(&kvm->srcu);
-	KVM_MMU_LOCK(kvm);
-	young = kvm_test_age_hva(kvm, address);
-	KVM_MMU_UNLOCK(kvm);
-	srcu_read_unlock(&kvm->srcu, idx);
-
-	return young;
-#endif
 }
 
 static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
-- 
2.31.0.208.g409f899ff0-goog


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 07/10] KVM: Move MMU notifier's mmu_lock acquisition into common helper
  2021-04-02  0:56 [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (5 preceding siblings ...)
  2021-04-02  0:56 ` [PATCH v2 06/10] KVM: Kill off the old hva-based " Sean Christopherson
@ 2021-04-02  0:56 ` Sean Christopherson
  2021-04-02  9:35   ` Paolo Bonzini
  2021-04-02  0:56 ` [PATCH v2 08/10] KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot Sean Christopherson
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2021-04-02  0:56 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Acquire and release mmu_lock in the __kvm_handle_hva_range() helper
instead of requiring the caller to do the same.  This paves the way for
future patches to take mmu_lock if and only if an overlapping memslot is
found, without also having to introduce the on_lock() shenanigans used
to manipulate the notifier count and sequence.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---

Note, the WARN_ON_ONCE that asserts on_lock and handler aren't both null
is optimized out of all functions on recent gcc (for x86).  I wanted to
make it a BUILD_BUG_ON, but older versions of gcc aren't aggressive/smart
enough to optimize it out, and using __builtin_constant_p() to get it to
build on older compilers prevents the assertion from firing on newer
compilers when given bad input.

I'm also a-ok dropping the check altogether, it just felt wrong having
the semi-funky on_lock -> !handler combo without documenting that handler
isn't allowed to be null in the common case.
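
Tangentially, here is a toy, userspace-only illustration of the kvm_null_fn
trick (not from this patch; all names below are made up), mirroring the
reasoning in the comment added in the patch: when the callback is a
compile-time-known function after inlining, the "cb != stub" comparison
folds to a constant, whereas a "cb != NULL" check generally has to be
emitted.

static void stub_fn(void)
{
}

static void real_fn(void)
{
}

static inline void invoke(void (*cb)(void))
{
	/* Folds away entirely when the caller passes stub_fn. */
	if (cb != stub_fn)
		cb();
}

int main(void)
{
	invoke(stub_fn);	/* no call emitted (with optimization) */
	invoke(real_fn);	/* calls real_fn() directly */
	return 0;
}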

 virt/kvm/kvm_main.c | 125 +++++++++++++++++++++++++++++---------------
 1 file changed, 82 insertions(+), 43 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2e809d73c7f1..25ecb5235e17 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -453,28 +453,57 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 
 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
+typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
+			     unsigned long end);
+
 struct kvm_hva_range {
 	unsigned long start;
 	unsigned long end;
 	pte_t pte;
 	hva_handler_t handler;
+	on_lock_fn_t on_lock;
 	bool flush_on_ret;
 	bool may_block;
 };
 
+/*
+ * Use a dedicated stub instead of NULL to indicate that there is no callback
+ * function/handler.  The compiler technically can't guarantee that a real
+ * function will have a non-zero address, and so it will generate code to
+ * check for !NULL, whereas comparing against a stub will be elided at compile
+ * time (unless the compiler is getting long in the tooth, e.g. gcc 4.9).
+ */
+static void kvm_null_fn(void)
+{
+
+}
+#define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
+
 static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 						  const struct kvm_hva_range *range)
 {
-	struct kvm_memory_slot *slot;
-	struct kvm_memslots *slots;
 	struct kvm_gfn_range gfn_range;
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
 	bool ret = false;
 	int i, idx;
 
-	lockdep_assert_held_write(&kvm->mmu_lock);
+	/* A null handler is allowed if and only if on_lock() is provided. */
+	if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
+			 IS_KVM_NULL_FN(range->handler)))
+		return 0;
+
+	KVM_MMU_LOCK(kvm);
 
 	idx = srcu_read_lock(&kvm->srcu);
 
+	if (!IS_KVM_NULL_FN(range->on_lock)) {
+		range->on_lock(kvm, range->start, range->end);
+
+		if (IS_KVM_NULL_FN(range->handler))
+			goto out_unlock;
+	}
+
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 		slots = __kvm_memslots(kvm, i);
 		kvm_for_each_memslot(slot, slots) {
@@ -510,6 +539,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 	if (range->flush_on_ret && (ret || kvm->tlbs_dirty))
 		kvm_flush_remote_tlbs(kvm);
 
+out_unlock:
+	KVM_MMU_UNLOCK(kvm);
+
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	/* The notifiers are averse to booleans. :-( */
@@ -528,16 +560,12 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 		.end		= end,
 		.pte		= pte,
 		.handler	= handler,
+		.on_lock	= (void *)kvm_null_fn,
 		.flush_on_ret	= true,
 		.may_block	= false,
 	};
-	int ret;
 
-	KVM_MMU_LOCK(kvm);
-	ret = __kvm_handle_hva_range(kvm, &range);
-	KVM_MMU_UNLOCK(kvm);
-
-	return ret;
+	return __kvm_handle_hva_range(kvm, &range);
 }
 
 static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
@@ -551,16 +579,12 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 		.end		= end,
 		.pte		= __pte(0),
 		.handler	= handler,
+		.on_lock	= (void *)kvm_null_fn,
 		.flush_on_ret	= false,
 		.may_block	= false,
 	};
-	int ret;
 
-	KVM_MMU_LOCK(kvm);
-	ret = __kvm_handle_hva_range(kvm, &range);
-	KVM_MMU_UNLOCK(kvm);
-
-	return ret;
+	return __kvm_handle_hva_range(kvm, &range);
 }
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
@@ -581,22 +605,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }
 
-static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
-					const struct mmu_notifier_range *range)
+static void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
+				   unsigned long end)
 {
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range hva_range = {
-		.start		= range->start,
-		.end		= range->end,
-		.pte		= __pte(0),
-		.handler	= kvm_unmap_gfn_range,
-		.flush_on_ret	= true,
-		.may_block	= mmu_notifier_range_blockable(range),
-	};
-
-	trace_kvm_unmap_hva_range(range->start, range->end);
-
-	KVM_MMU_LOCK(kvm);
 	/*
 	 * The count increase must become visible at unlock time as no
 	 * spte can be established without taking the mmu_lock and
@@ -604,8 +615,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 */
 	kvm->mmu_notifier_count++;
 	if (likely(kvm->mmu_notifier_count == 1)) {
-		kvm->mmu_notifier_range_start = range->start;
-		kvm->mmu_notifier_range_end = range->end;
+		kvm->mmu_notifier_range_start = start;
+		kvm->mmu_notifier_range_end = end;
 	} else {
 		/*
 		 * Fully tracking multiple concurrent ranges has dimishing
@@ -617,24 +628,36 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		 * complete.
 		 */
 		kvm->mmu_notifier_range_start =
-			min(kvm->mmu_notifier_range_start, range->start);
+			min(kvm->mmu_notifier_range_start, start);
 		kvm->mmu_notifier_range_end =
-			max(kvm->mmu_notifier_range_end, range->end);
+			max(kvm->mmu_notifier_range_end, end);
 	}
-
-	__kvm_handle_hva_range(kvm, &hva_range);
-
-	KVM_MMU_UNLOCK(kvm);
-
-	return 0;
 }
 
-static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
+static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_hva_range hva_range = {
+		.start		= range->start,
+		.end		= range->end,
+		.pte		= __pte(0),
+		.handler	= kvm_unmap_gfn_range,
+		.on_lock	= kvm_inc_notifier_count,
+		.flush_on_ret	= true,
+		.may_block	= mmu_notifier_range_blockable(range),
+	};
 
-	KVM_MMU_LOCK(kvm);
+	trace_kvm_unmap_hva_range(range->start, range->end);
+
+	__kvm_handle_hva_range(kvm, &hva_range);
+
+	return 0;
+}
+
+static void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
+				   unsigned long end)
+{
 	/*
 	 * This sequence increase will notify the kvm page fault that
 	 * the page that is going to be mapped in the spte could have
@@ -648,7 +671,23 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	 * in conjunction with the smp_rmb in mmu_notifier_retry().
 	 */
 	kvm->mmu_notifier_count--;
-	KVM_MMU_UNLOCK(kvm);
+}
+
+static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
+					const struct mmu_notifier_range *range)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_hva_range hva_range = {
+		.start		= range->start,
+		.end		= range->end,
+		.pte		= __pte(0),
+		.handler	= (void *)kvm_null_fn,
+		.on_lock	= kvm_dec_notifier_count,
+		.flush_on_ret	= true,
+		.may_block	= mmu_notifier_range_blockable(range),
+	};
+
+	__kvm_handle_hva_range(kvm, &hva_range);
 
 	BUG_ON(kvm->mmu_notifier_count < 0);
 }
-- 
2.31.0.208.g409f899ff0-goog


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 08/10] KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot
  2021-04-02  0:56 [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (6 preceding siblings ...)
  2021-04-02  0:56 ` [PATCH v2 07/10] KVM: Move MMU notifier's mmu_lock acquisition into common helper Sean Christopherson
@ 2021-04-02  0:56 ` Sean Christopherson
  2021-04-02  0:56 ` [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary Sean Christopherson
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2021-04-02  0:56 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Defer acquiring mmu_lock in the MMU notifier paths until a "hit" has been
detected in the memslots, i.e. don't take the lock for notifications that
don't affect the guest.

For small VMs, spurious locking is a minor annoyance.  And for "volatile"
setups where the majority of notifications _are_ relevant, this barely
qualifies as an optimization.

But, for large VMs (hundreds of threads) with static setups, e.g. no
page migration, no swapping, etc..., the vast majority of MMU notifier
callbacks will be unrelated to the guest, e.g. will often be in response
to the userspace VMM adjusting its own virtual address space.  In such
large VMs, acquiring mmu_lock can be painful as it blocks vCPUs from
handling page faults.  In some scenarios it can even be "fatal" in the
sense that it causes unacceptable brownouts, e.g. when rebuilding huge
pages after live migration, a significant percentage of vCPUs will be
attempting to handle page faults.

x86's TDP MMU implementation is especially susceptible to spurious
locking due to it taking mmu_lock for read when handling page faults.
Because rwlock is fair, a single writer will stall future readers, while
the writer is itself stalled waiting for in-progress readers to complete.
This is exacerbated by the MMU notifiers often firing multiple times in
quick succession, e.g. moving a page will (always?) invoke three separate
notifiers: .invalidate_range_start(), .invalidate_range_end(), and
.change_pte().  Unnecessarily taking mmu_lock each time means even a
single spurious sequence can be problematic.

Note, this optimizes only the unpaired callbacks.  Optimizing the
.invalidate_range_{start,end}() pairs is more complex and will be done in
a future patch.

Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
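A condensed view of the resulting __kvm_handle_hva_range() flow (sketch
only, pared down from the hunk below; the on_lock() path and the return
value munging are omitted):

	idx = srcu_read_lock(&kvm->srcu);

	kvm_for_each_memslot(slot, slots) {
		/* skip slots that don't overlap [range->start, range->end) */

		if (!locked) {
			locked = true;
			KVM_MMU_LOCK(kvm);	/* taken on the first hit only */
		}
		ret |= range->handler(kvm, &gfn_range);
	}

	if (locked)
		KVM_MMU_UNLOCK(kvm);		/* skipped entirely on a miss */

	srcu_read_unlock(&kvm->srcu, idx);
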
 virt/kvm/kvm_main.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 25ecb5235e17..f6697ad741ed 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -482,10 +482,10 @@ static void kvm_null_fn(void)
 static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 						  const struct kvm_hva_range *range)
 {
+	bool ret = false, locked = false;
 	struct kvm_gfn_range gfn_range;
 	struct kvm_memory_slot *slot;
 	struct kvm_memslots *slots;
-	bool ret = false;
 	int i, idx;
 
 	/* A null handler is allowed if and only if on_lock() is provided. */
@@ -493,11 +493,13 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			 IS_KVM_NULL_FN(range->handler)))
 		return 0;
 
-	KVM_MMU_LOCK(kvm);
-
 	idx = srcu_read_lock(&kvm->srcu);
 
+	/* The on_lock() path does not yet support lock elision. */
 	if (!IS_KVM_NULL_FN(range->on_lock)) {
+		locked = true;
+		KVM_MMU_LOCK(kvm);
+
 		range->on_lock(kvm, range->start, range->end);
 
 		if (IS_KVM_NULL_FN(range->handler))
@@ -532,6 +534,10 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 			gfn_range.slot = slot;
 
+			if (!locked) {
+				locked = true;
+				KVM_MMU_LOCK(kvm);
+			}
 			ret |= range->handler(kvm, &gfn_range);
 		}
 	}
@@ -540,7 +546,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 		kvm_flush_remote_tlbs(kvm);
 
 out_unlock:
-	KVM_MMU_UNLOCK(kvm);
+	if (locked)
+		KVM_MMU_UNLOCK(kvm);
 
 	srcu_read_unlock(&kvm->srcu, idx);
 
-- 
2.31.0.208.g409f899ff0-goog


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-04-02  0:56 [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (7 preceding siblings ...)
  2021-04-02  0:56 ` [PATCH v2 08/10] KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot Sean Christopherson
@ 2021-04-02  0:56 ` Sean Christopherson
  2021-04-02  9:34   ` Paolo Bonzini
  2021-04-19  8:49   ` Wanpeng Li
  2021-04-02  0:56 ` [PATCH v2 10/10] KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible Sean Christopherson
  2021-04-02 12:17 ` [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Paolo Bonzini
  10 siblings, 2 replies; 24+ messages in thread
From: Sean Christopherson @ 2021-04-02  0:56 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Avoid taking mmu_lock for unrelated .invalidate_range_{start,end}()
notifications.  Because mmu_notifier_count must be modified while holding
mmu_lock for write, and must always be paired across start->end to stay
balanced, lock elision must happen in both or none.  To meet that
requirement, add a rwsem to prevent memslot updates across range_start()
and range_end().

Use a rwsem instead of a rwlock since most notifiers _allow_ blocking,
and the lock will be held across the entire start() ... end() sequence.
If anything in the sequence sleeps, including the caller or a different
notifier, holding the spinlock would be disastrous.

For notifiers that _disallow_ blocking, e.g. OOM reaping, simply go down
the slow path of unconditionally acquiring mmu_lock.  The sane
alternative would be to try to acquire the lock and force the notifier
to retry on failure.  But since OOM is currently the _only_ scenario
where blocking is disallowed, attempting to optimize a guest that has been
marked for death is pointless.

Unconditionally define and use mmu_notifier_slots_lock in the memslots
code, purely to avoid more #ifdefs.  The overhead of acquiring the lock
is negligible when the lock is uncontested, which will always be the case
when the MMU notifiers are not used.

Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.

Based heavily on code from Ben Gardon.

Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
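At a glance, the new lock pairs up as follows (sketch pared down from the
hunks below, not compilable as-is):

	/* MMU notifier, blockable case: reader */
	kvm_mmu_notifier_invalidate_range_start()
		down_read(&kvm->mmu_notifier_slots_lock);
		/* mmu_notifier_count++, zap overlapping memslots */

	kvm_mmu_notifier_invalidate_range_end()
		/* mmu_notifier_count-- */
		up_read(&kvm->mmu_notifier_slots_lock);

	/* memslot update: writer */
	install_new_memslots()
		down_write(&kvm->mmu_notifier_slots_lock);
		rcu_assign_pointer(kvm->memslots[as_id], slots);
		up_write(&kvm->mmu_notifier_slots_lock);

	/* non-blockable notifiers skip the rwsem and instead take mmu_lock
	 * unconditionally (must_lock) */
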
 include/linux/kvm_host.h |  6 ++-
 virt/kvm/kvm_main.c      | 96 +++++++++++++++++++++++++++++++---------
 2 files changed, 80 insertions(+), 22 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 40ac2d40bb5a..bc3dd2838bb8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -472,6 +472,7 @@ struct kvm {
 #endif /* KVM_HAVE_MMU_RWLOCK */
 
 	struct mutex slots_lock;
+	struct rw_semaphore mmu_notifier_slots_lock;
 	struct mm_struct *mm; /* userspace tied to this vm */
 	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
@@ -660,8 +661,9 @@ static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
 {
 	as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
 	return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
-			lockdep_is_held(&kvm->slots_lock) ||
-			!refcount_read(&kvm->users_count));
+				      lockdep_is_held(&kvm->slots_lock) ||
+				      lockdep_is_held(&kvm->mmu_notifier_slots_lock) ||
+				      !refcount_read(&kvm->users_count));
 }
 
 static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f6697ad741ed..af28f39817a5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -462,6 +462,7 @@ struct kvm_hva_range {
 	pte_t pte;
 	hva_handler_t handler;
 	on_lock_fn_t on_lock;
+	bool must_lock;
 	bool flush_on_ret;
 	bool may_block;
 };
@@ -479,6 +480,25 @@ static void kvm_null_fn(void)
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
 
+
+/* Acquire mmu_lock if necessary.  Returns %true if @handler is "null" */
+static __always_inline bool kvm_mmu_lock_and_check_handler(struct kvm *kvm,
+							   const struct kvm_hva_range *range,
+							   bool *locked)
+{
+	if (*locked)
+		return false;
+
+	*locked = true;
+
+	KVM_MMU_LOCK(kvm);
+
+	if (!IS_KVM_NULL_FN(range->on_lock))
+		range->on_lock(kvm, range->start, range->end);
+
+	return IS_KVM_NULL_FN(range->handler);
+}
+
 static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 						  const struct kvm_hva_range *range)
 {
@@ -495,16 +515,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 
 	idx = srcu_read_lock(&kvm->srcu);
 
-	/* The on_lock() path does not yet support lock elision. */
-	if (!IS_KVM_NULL_FN(range->on_lock)) {
-		locked = true;
-		KVM_MMU_LOCK(kvm);
-
-		range->on_lock(kvm, range->start, range->end);
-
-		if (IS_KVM_NULL_FN(range->handler))
-			goto out_unlock;
-	}
+	if (range->must_lock &&
+	    kvm_mmu_lock_and_check_handler(kvm, range, &locked))
+		goto out_unlock;
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 		slots = __kvm_memslots(kvm, i);
@@ -534,10 +547,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 			gfn_range.slot = slot;
 
-			if (!locked) {
-				locked = true;
-				KVM_MMU_LOCK(kvm);
-			}
+			if (kvm_mmu_lock_and_check_handler(kvm, range, &locked))
+				goto out_unlock;
+
 			ret |= range->handler(kvm, &gfn_range);
 		}
 	}
@@ -568,6 +580,7 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 		.pte		= pte,
 		.handler	= handler,
 		.on_lock	= (void *)kvm_null_fn,
+		.must_lock	= false,
 		.flush_on_ret	= true,
 		.may_block	= false,
 	};
@@ -587,6 +600,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 		.pte		= __pte(0),
 		.handler	= handler,
 		.on_lock	= (void *)kvm_null_fn,
+		.must_lock	= false,
 		.flush_on_ret	= false,
 		.may_block	= false,
 	};
@@ -603,11 +617,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	trace_kvm_set_spte_hva(address);
 
 	/*
-	 * .change_pte() must be bookended by .invalidate_range_{start,end}(),
-	 * and so always runs with an elevated notifier count.  This obviates
-	 * the need to bump the sequence count.
+	 * .change_pte() must be bookended by .invalidate_range_{start,end}().
+	 * If mmu_notifier_count is zero, then start() didn't find a relevant
+	 * memslot and wasn't forced down the slow path; rechecking here is
+	 * unnecessary.  This can only occur if memslot updates are blocked.
 	 */
-	WARN_ON_ONCE(!kvm->mmu_notifier_count);
+	if (!kvm->mmu_notifier_count) {
+		lockdep_assert_held(&kvm->mmu_notifier_slots_lock);
+		return;
+	}
 
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }
@@ -644,6 +662,7 @@ static void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
 static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
+	bool blockable = mmu_notifier_range_blockable(range);
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	const struct kvm_hva_range hva_range = {
 		.start		= range->start,
@@ -651,12 +670,29 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.pte		= __pte(0),
 		.handler	= kvm_unmap_gfn_range,
 		.on_lock	= kvm_inc_notifier_count,
+		.must_lock	= !blockable,
 		.flush_on_ret	= true,
-		.may_block	= mmu_notifier_range_blockable(range),
+		.may_block	= blockable,
 	};
 
 	trace_kvm_unmap_hva_range(range->start, range->end);
 
+	/*
+	 * Prevent memslot modification between range_start() and range_end()
+	 * so that conditionally locking provides the same result in both
+	 * functions.  Without that guarantee, the mmu_notifier_count
+	 * adjustments will be imbalanced.
+	 *
+	 * Skip the memslot-lookup lock elision (set @must_lock above) to avoid
+	 * having to take the semaphore on non-blockable calls, e.g. OOM kill.
+	 * The complexity required to handle conditional locking for this case
+	 * is not worth the marginal benefits, the VM is likely doomed anyways.
+	 *
+	 * Pairs with the unlock in range_end().
+	 */
+	if (blockable)
+		down_read(&kvm->mmu_notifier_slots_lock);
+
 	__kvm_handle_hva_range(kvm, &hva_range);
 
 	return 0;
@@ -683,6 +719,7 @@ static void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
+	bool blockable = mmu_notifier_range_blockable(range);
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	const struct kvm_hva_range hva_range = {
 		.start		= range->start,
@@ -690,12 +727,17 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 		.pte		= __pte(0),
 		.handler	= (void *)kvm_null_fn,
 		.on_lock	= kvm_dec_notifier_count,
+		.must_lock	= !blockable,
 		.flush_on_ret	= true,
-		.may_block	= mmu_notifier_range_blockable(range),
+		.may_block	= blockable,
 	};
 
 	__kvm_handle_hva_range(kvm, &hva_range);
 
+	/* Pairs with the lock in range_start(). */
+	if (blockable)
+		up_read(&kvm->mmu_notifier_slots_lock);
+
 	BUG_ON(kvm->mmu_notifier_count < 0);
 }
 
@@ -908,6 +950,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	mutex_init(&kvm->lock);
 	mutex_init(&kvm->irq_lock);
 	mutex_init(&kvm->slots_lock);
+	init_rwsem(&kvm->mmu_notifier_slots_lock);
 	INIT_LIST_HEAD(&kvm->devices);
 
 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
@@ -1028,6 +1071,16 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	kvm_coalesced_mmio_free(kvm);
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+	/*
+	 * Reset the lock used to prevent memslot updates between MMU notifier
+	 * invalidate_range_start() and invalidate_range_end().  At this point,
+	 * no more MMU notifiers will run and pending calls to ...start() have
+	 * completed.  But, the lock could still be held if KVM's notifier was
+	 * removed between ...start() and ...end().  No threads can be waiting
+	 * on the lock as the last reference on KVM has been dropped.  If the
+	 * lock is still held, freeing memslots will deadlock.
+	 */
+	init_rwsem(&kvm->mmu_notifier_slots_lock);
 #else
 	kvm_arch_flush_shadow_all(kvm);
 #endif
@@ -1279,7 +1332,10 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
 	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
 	slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
 
+	down_write(&kvm->mmu_notifier_slots_lock);
 	rcu_assign_pointer(kvm->memslots[as_id], slots);
+	up_write(&kvm->mmu_notifier_slots_lock);
+
 	synchronize_srcu_expedited(&kvm->srcu);
 
 	/*
-- 
2.31.0.208.g409f899ff0-goog


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 10/10] KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible
  2021-04-02  0:56 [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (8 preceding siblings ...)
  2021-04-02  0:56 ` [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary Sean Christopherson
@ 2021-04-02  0:56 ` Sean Christopherson
  2021-04-02 12:17 ` [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Paolo Bonzini
  10 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2021-04-02  0:56 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Let the TDP MMU yield when unmapping a range in response to a MMU
notification, if yielding is allowed by said notification.  There is no
reason to disallow yielding in this case, and in theory the range being
invalidated could be quite large.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
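For context, range->may_block is simply whether the originating notification
allows blocking; the snippet below is stitched together from earlier patches
in the series (a sketch, not part of this diff):

	/* kvm_mmu_notifier_invalidate_range_start() */
	const struct kvm_hva_range hva_range = {
		.start		= range->start,
		.end		= range->end,
		.handler	= kvm_unmap_gfn_range,
		.may_block	= mmu_notifier_range_blockable(range),
	};

Passing that through to zap_gfn_range() lets the TDP MMU yield exactly when
the notifier said it could.
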
 arch/x86/kvm/mmu/tdp_mmu.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7797d24f0937..dd17d9673ff2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -885,7 +885,7 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 
 	for_each_tdp_mmu_root(kvm, root, range->slot->as_id)
 		flush |= zap_gfn_range(kvm, root, range->start, range->end,
-				       false, flush);
+				       range->may_block, flush);
 
 	return flush;
 }
@@ -903,6 +903,10 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 
 	rcu_read_lock();
 
+	/*
+	 * Don't support rescheduling, none of the MMU notifiers that funnel
+	 * into this helper allow blocking; it'd be dead, wasteful code.
+	 */
 	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
 		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
 			ret |= handler(kvm, &iter, range);
-- 
2.31.0.208.g409f899ff0-goog


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-04-02  0:56 ` [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary Sean Christopherson
@ 2021-04-02  9:34   ` Paolo Bonzini
  2021-04-02 14:59     ` Sean Christopherson
  2021-04-19  8:49   ` Wanpeng Li
  1 sibling, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2021-04-02  9:34 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Paul Mackerras
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 02/04/21 02:56, Sean Christopherson wrote:
> Avoid taking mmu_lock for unrelated .invalidate_range_{start,end}()
> notifications.  Because mmu_notifier_count must be modified while holding
> mmu_lock for write, and must always be paired across start->end to stay
> balanced, lock elision must happen in both or none.  To meet that
> requirement, add a rwsem to prevent memslot updates across range_start()
> and range_end().
> 
> Use a rwsem instead of a rwlock since most notifiers _allow_ blocking,
> and the lock will be held across the entire start() ... end() sequence.
> If anything in the sequence sleeps, including the caller or a different
> notifier, holding the spinlock would be disastrous.
> 
> For notifiers that _disallow_ blocking, e.g. OOM reaping, simply go down
> the slow path of unconditionally acquiring mmu_lock.  The sane
> alternative would be to try to acquire the lock and force the notifier
> to retry on failure.  But since OOM is currently the _only_ scenario
> where blocking is disallowed, attempting to optimize a guest that has been
> marked for death is pointless.
> 
> Unconditionally define and use mmu_notifier_slots_lock in the memslots
> code, purely to avoid more #ifdefs.  The overhead of acquiring the lock
> is negligible when the lock is uncontested, which will always be the case
> when the MMU notifiers are not used.
> 
> Note, technically flag-only memslot updates could be allowed in parallel,
> but stalling a memslot update for a relatively short amount of time is
> not a scalability issue, and this is all more than complex enough.

Proposal for the locking documentation:

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index b21a34c34a21..3e4ad7de36cb 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -16,6 +16,13 @@ The acquisition orders for mutexes are as follows:
  - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
    them together is quite rare.
  
+- The kvm->mmu_notifier_slots_lock rwsem ensures that pairs of
+  invalidate_range_start() and invalidate_range_end() callbacks
+  use the same memslots array.  kvm->slots_lock is taken outside the
+  write-side critical section of kvm->mmu_notifier_slots_lock, so
+  MMU notifiers must not take kvm->slots_lock.  No other write-side
+  critical sections should be added.
+
  On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
  
  Everything else is a leaf: no other lock is taken inside the critical

Paolo


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 07/10] KVM: Move MMU notifier's mmu_lock acquisition into common helper
  2021-04-02  0:56 ` [PATCH v2 07/10] KVM: Move MMU notifier's mmu_lock acquisition into common helper Sean Christopherson
@ 2021-04-02  9:35   ` Paolo Bonzini
  2021-04-02 14:59     ` Sean Christopherson
  0 siblings, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2021-04-02  9:35 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Paul Mackerras
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 02/04/21 02:56, Sean Christopherson wrote:
> +		.handler	= (void *)kvm_null_fn,
> +		.on_lock	= kvm_dec_notifier_count,
> +		.flush_on_ret	= true,

Doesn't really matter since the handler is null, but I think it's 
cleaner to have false here.

Paolo


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 01/10] KVM: Assert that notifier count is elevated in .change_pte()
  2021-04-02  0:56 ` [PATCH v2 01/10] KVM: Assert that notifier count is elevated in .change_pte() Sean Christopherson
@ 2021-04-02 11:08   ` Paolo Bonzini
  0 siblings, 0 replies; 24+ messages in thread
From: Paolo Bonzini @ 2021-04-02 11:08 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Paul Mackerras
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 02/04/21 02:56, Sean Christopherson wrote:
> In KVM's .change_pte() notification callback, replace the notifier
> sequence bump with a WARN_ON assertion that the notifier count is
> elevated.  An elevated count provides stricter protections than bumping
> the sequence, and the sequence is guaranteed to be bumped before the
> count hits zero.
> 
> When .change_pte() was added by commit 828502d30073 ("ksm: add
> mmu_notifier set_pte_at_notify()"), bumping the sequence was necessary
> as .change_pte() would be invoked without any surrounding notifications.
> 
> However, since commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify
> with invalidate_range_start and invalidate_range_end"), all calls to
> .change_pte() are guaranteed to be bookended by start() and end(), and
> so are guaranteed to run with an elevated notifier count.
> 
> Note, wrapping .change_pte() with .invalidate_range_{start,end}() is a
> bug of sorts, as invalidating the secondary MMU's (KVM's) PTE defeats
> the purpose of .change_pte().  Every arch's kvm_set_spte_hva() assumes
> .change_pte() is called when the relevant SPTE is present in KVM's MMU,
> as the original goal was to accelerate Kernel Samepage Merging (KSM) by
> updating KVM's SPTEs without requiring a VM-Exit (due to invalidating
> the SPTE).  I.e. it means that .change_pte() is effectively dead code
> on _all_ architectures.
> 
> x86 and MIPS are clearcut nops if the old SPTE is not-present, and that
> is guaranteed due to the prior invalidation.  PPC simply unmaps the SPTE,
> which again should be a nop due to the invalidation.  arm64 is a bit
> murky, but it's also likely a nop because kvm_pgtable_stage2_map() is
> called without a cache pointer, which means it will map an entry if and
> only if an existing PTE was found.
> 
> For now, take advantage of the bug to simplify future consolidation of
> KVM's MMU notifier code.  Doing so will not greatly complicate fixing
> .change_pte(), assuming it's even worth fixing.  .change_pte() has been
> broken for 8+ years and no one has complained.  Even if there are
> KSM+KVM users that care deeply about its performance, the benefits of
> avoiding VM-Exits via .change_pte() need to be reevaluated to justify
> the added complexity and testing burden.  Ripping out .change_pte()
> entirely would be a lot easier.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   virt/kvm/kvm_main.c | 9 +++++++--
>   1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d1de843b7618..8df091950161 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -461,12 +461,17 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>   
>   	trace_kvm_set_spte_hva(address);
>   
> +	/*
> +	 * .change_pte() must be bookended by .invalidate_range_{start,end}(),

Changed to "surrounded" for the benefit of non-native speakers. :)

Paolo

> +	 * and so always runs with an elevated notifier count.  This obviates
> +	 * the need to bump the sequence count.
> +	 */
> +	WARN_ON_ONCE(!kvm->mmu_notifier_count);
> +
>   	idx = srcu_read_lock(&kvm->srcu);
>   
>   	KVM_MMU_LOCK(kvm);
>   
> -	kvm->mmu_notifier_seq++;
> -
>   	if (kvm_set_spte_hva(kvm, address, pte))
>   		kvm_flush_remote_tlbs(kvm);
>   
> 
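
As an aside, the mm-side flow that provides the bookending looks roughly
like this since 6bdb913f0a70 (a sketch of the caller, not KVM code; the
mmu_notifier_range setup is omitted):

	mmu_notifier_invalidate_range_start(&range);	/* elevates kvm->mmu_notifier_count */
	...
	set_pte_at_notify(mm, addr, ptep, entry);	/* ends up in .change_pte() */
	...
	mmu_notifier_invalidate_range_end(&range);	/* the sequence is bumped before the
							 * count can drop back to zero */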


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers
  2021-04-02  0:56 [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (9 preceding siblings ...)
  2021-04-02  0:56 ` [PATCH v2 10/10] KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible Sean Christopherson
@ 2021-04-02 12:17 ` Paolo Bonzini
  2021-04-12 10:27   ` Marc Zyngier
  10 siblings, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2021-04-02 12:17 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Paul Mackerras
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 02/04/21 02:56, Sean Christopherson wrote:
> The end goal of this series is to optimize the MMU notifiers to take
> mmu_lock if and only if the notification is relevant to KVM, i.e. the hva
> range overlaps a memslot.   Large VMs (hundreds of vCPUs) are very
> sensitive to mmu_lock being taken for write at inopportune times, and
> such VMs also tend to be "static", e.g. backed by HugeTLB with minimal
> page shenanigans.  The vast majority of notifications for these VMs will
> be spurious (for KVM), and eliding mmu_lock for spurious notifications
> avoids an otherwise unacceptable disruption to the guest.
> 
> To get there without potentially degrading performance, e.g. due to
> multiple memslot lookups, especially on non-x86 where the use cases are
> largely unknown (from my perspective), first consolidate the MMU notifier
> logic by moving the hva->gfn lookups into common KVM.
> 
> Based on kvm/queue, commit 5f986f748438 ("KVM: x86: dump_vmcs should
> include the autoload/autostore MSR lists").
> 
> Well tested on Intel and AMD.  Compile tested for arm64, MIPS, PPC,
> PPC e500, and s390.  Absolutely needs to be tested for real on non-x86,
> I give it even odds that I introduced an off-by-one bug somewhere.
> 
> v2:
>   - Drop the patches that have already been pushed to kvm/queue.
>   - Drop two selftest changes that had snuck in via "git commit -a".
>   - Add a patch to assert that mmu_notifier_count is elevated when
>     .change_pte() runs. [Paolo]
>   - Split out moving KVM_MMU_(UN)LOCK() to __kvm_handle_hva_range() to a
>     separate patch.  Opted not to squash it with the introduction of the
>     common hva walkers (patch 02), as that prevented sharing code between
>     the old and new APIs. [Paolo]
>   - Tweak the comment in kvm_vm_destroy() above the smashing of the new
>     slots lock. [Paolo]
>   - Make mmu_notifier_slots_lock unconditional to avoid #ifdefs. [Paolo]
> 
> v1:
>   - https://lkml.kernel.org/r/20210326021957.1424875-1-seanjc@google.com
> 
> Sean Christopherson (10):
>    KVM: Assert that notifier count is elevated in .change_pte()
>    KVM: Move x86's MMU notifier memslot walkers to generic code
>    KVM: arm64: Convert to the gfn-based MMU notifier callbacks
>    KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks
>    KVM: PPC: Convert to the gfn-based MMU notifier callbacks
>    KVM: Kill off the old hva-based MMU notifier callbacks
>    KVM: Move MMU notifier's mmu_lock acquisition into common helper
>    KVM: Take mmu_lock when handling MMU notifier iff the hva hits a
>      memslot
>    KVM: Don't take mmu_lock for range invalidation unless necessary
>    KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if
>      possible
> 
>   arch/arm64/kvm/mmu.c                   | 117 +++------
>   arch/mips/kvm/mmu.c                    |  97 ++------
>   arch/powerpc/include/asm/kvm_book3s.h  |  12 +-
>   arch/powerpc/include/asm/kvm_ppc.h     |   9 +-
>   arch/powerpc/kvm/book3s.c              |  18 +-
>   arch/powerpc/kvm/book3s.h              |  10 +-
>   arch/powerpc/kvm/book3s_64_mmu_hv.c    |  98 ++------
>   arch/powerpc/kvm/book3s_64_mmu_radix.c |  25 +-
>   arch/powerpc/kvm/book3s_hv.c           |  12 +-
>   arch/powerpc/kvm/book3s_pr.c           |  56 ++---
>   arch/powerpc/kvm/e500_mmu_host.c       |  27 +-
>   arch/x86/kvm/mmu/mmu.c                 | 127 ++++------
>   arch/x86/kvm/mmu/tdp_mmu.c             | 245 +++++++------------
>   arch/x86/kvm/mmu/tdp_mmu.h             |  14 +-
>   include/linux/kvm_host.h               |  22 +-
>   virt/kvm/kvm_main.c                    | 325 +++++++++++++++++++------
>   16 files changed, 552 insertions(+), 662 deletions(-)
> 

For MIPS, I am going to post a series that simplifies TLB flushing 
further.  I applied it, and rebased this one on top, to 
kvm/mmu-notifier-queue.

Architecture maintainers, please look at the branch and review/test/ack 
your parts.

Thanks!

Paolo


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-04-02  9:34   ` Paolo Bonzini
@ 2021-04-02 14:59     ` Sean Christopherson
  0 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2021-04-02 14:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On Fri, Apr 02, 2021, Paolo Bonzini wrote:
> On 02/04/21 02:56, Sean Christopherson wrote:
> > Avoid taking mmu_lock for unrelated .invalidate_range_{start,end}()
> > notifications.  Because mmu_notifier_count must be modified while holding
> > mmu_lock for write, and must always be paired across start->end to stay
> > balanced, lock elision must happen in both or none.  To meet that
> > requirement, add a rwsem to prevent memslot updates across range_start()
> > and range_end().
> > 
> > Use a rwsem instead of a rwlock since most notifiers _allow_ blocking,
> > and the lock will be held across the entire start() ... end() sequence.
> > If anything in the sequence sleeps, including the caller or a different
> > notifier, holding the spinlock would be disastrous.
> > 
> > For notifiers that _disallow_ blocking, e.g. OOM reaping, simply go down
> > the slow path of unconditionally acquiring mmu_lock.  The sane
> > alternative would be to try to acquire the lock and force the notifier
> > to retry on failure.  But since OOM is currently the _only_ scenario
> > where blocking is disallowed, attempting to optimize a guest that has been
> > marked for death is pointless.
> > 
> > Unconditionally define and use mmu_notifier_slots_lock in the memslots
> > code, purely to avoid more #ifdefs.  The overhead of acquiring the lock
> > is negligible when the lock is uncontested, which will always be the case
> > when the MMU notifiers are not used.
> > 
> > Note, technically flag-only memslot updates could be allowed in parallel,
> > but stalling a memslot update for a relatively short amount of time is
> > not a scalability issue, and this is all more than complex enough.
> 
> Proposal for the locking documentation:

Argh, sorry!  Looks great, I owe you.

> diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
> index b21a34c34a21..3e4ad7de36cb 100644
> --- a/Documentation/virt/kvm/locking.rst
> +++ b/Documentation/virt/kvm/locking.rst
> @@ -16,6 +16,13 @@ The acquisition orders for mutexes are as follows:
>  - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
>    them together is quite rare.
> +- The kvm->mmu_notifier_slots_lock rwsem ensures that pairs of
> +  invalidate_range_start() and invalidate_range_end() callbacks
> +  use the same memslots array.  kvm->slots_lock is taken outside the
> +  write-side critical section of kvm->mmu_notifier_slots_lock, so
> +  MMU notifiers must not take kvm->slots_lock.  No other write-side
> +  critical sections should be added.
> +
>  On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
>  Everything else is a leaf: no other lock is taken inside the critical
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 07/10] KVM: Move MMU notifier's mmu_lock acquisition into common helper
  2021-04-02  9:35   ` Paolo Bonzini
@ 2021-04-02 14:59     ` Sean Christopherson
  0 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2021-04-02 14:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On Fri, Apr 02, 2021, Paolo Bonzini wrote:
> On 02/04/21 02:56, Sean Christopherson wrote:
> > +		.handler	= (void *)kvm_null_fn,
> > +		.on_lock	= kvm_dec_notifier_count,
> > +		.flush_on_ret	= true,
> 
> Doesn't really matter since the handler is null, but I think it's cleaner to
> have false here.

Agreed.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 03/10] KVM: arm64: Convert to the gfn-based MMU notifier callbacks
  2021-04-02  0:56 ` [PATCH v2 03/10] KVM: arm64: Convert to the gfn-based MMU notifier callbacks Sean Christopherson
@ 2021-04-12 10:12   ` Marc Zyngier
  0 siblings, 0 replies; 24+ messages in thread
From: Marc Zyngier @ 2021-04-12 10:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Huacai Chen, Aleksandar Markovic, Paul Mackerras, Paolo Bonzini,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On Fri, 02 Apr 2021 01:56:51 +0100,
Sean Christopherson <seanjc@google.com> wrote:
> 
> Move arm64 to the gfn-based MMU notifier APIs, which do the hva->gfn
> lookup in common code.
> 
> No meaningful functional change intended, though the exact order of
> operations is slightly different since the memslot lookups occur before
> calling into arch code.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Marc Zyngier <maz@kernel.org>

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers
  2021-04-02 12:17 ` [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Paolo Bonzini
@ 2021-04-12 10:27   ` Marc Zyngier
  0 siblings, 0 replies; 24+ messages in thread
From: Marc Zyngier @ 2021-04-12 10:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, James Morse, Julien Thierry, Suzuki K Poulose,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc, linux-kernel,
	Ben Gardon

On Fri, 02 Apr 2021 13:17:45 +0100,
Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 02/04/21 02:56, Sean Christopherson wrote:
> > The end goal of this series is to optimize the MMU notifiers to take
> > mmu_lock if and only if the notification is relevant to KVM, i.e. the hva
> > range overlaps a memslot.   Large VMs (hundreds of vCPUs) are very
> > sensitive to mmu_lock being taken for write at inopportune times, and
> > such VMs also tend to be "static", e.g. backed by HugeTLB with minimal
> > page shenanigans.  The vast majority of notifications for these VMs will
> > be spurious (for KVM), and eliding mmu_lock for spurious notifications
> > avoids an otherwise unacceptable disruption to the guest.
> > 
> > To get there without potentially degrading performance, e.g. due to
> > multiple memslot lookups, especially on non-x86 where the use cases are
> > largely unknown (from my perspective), first consolidate the MMU notifier
> > logic by moving the hva->gfn lookups into common KVM.
> > 
> > Based on kvm/queue, commit 5f986f748438 ("KVM: x86: dump_vmcs should
> > include the autoload/autostore MSR lists").
> > 
> > Well tested on Intel and AMD.  Compile tested for arm64, MIPS, PPC,
> > PPC e500, and s390.  Absolutely needs to be tested for real on non-x86,
> > I give it even odds that I introduced an off-by-one bug somewhere.
> > 
> > v2:
> >   - Drop the patches that have already been pushed to kvm/queue.
> >   - Drop two selftest changes that had snuck in via "git commit -a".
> >   - Add a patch to assert that mmu_notifier_count is elevated when
> >     .change_pte() runs. [Paolo]
> >   - Split out moving KVM_MMU_(UN)LOCK() to __kvm_handle_hva_range() to a
> >     separate patch.  Opted not to squash it with the introduction of the
> >     common hva walkers (patch 02), as that prevented sharing code between
> >     the old and new APIs. [Paolo]
> >   - Tweak the comment in kvm_vm_destroy() above the smashing of the new
> >     slots lock. [Paolo]
> >   - Make mmu_notifier_slots_lock unconditional to avoid #ifdefs. [Paolo]
> > 
> > v1:
> >   - https://lkml.kernel.org/r/20210326021957.1424875-1-seanjc@google.com
> > 
> > Sean Christopherson (10):
> >    KVM: Assert that notifier count is elevated in .change_pte()
> >    KVM: Move x86's MMU notifier memslot walkers to generic code
> >    KVM: arm64: Convert to the gfn-based MMU notifier callbacks
> >    KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks
> >    KVM: PPC: Convert to the gfn-based MMU notifier callbacks
> >    KVM: Kill off the old hva-based MMU notifier callbacks
> >    KVM: Move MMU notifier's mmu_lock acquisition into common helper
> >    KVM: Take mmu_lock when handling MMU notifier iff the hva hits a
> >      memslot
> >    KVM: Don't take mmu_lock for range invalidation unless necessary
> >    KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if
> >      possible
> > 
> >   arch/arm64/kvm/mmu.c                   | 117 +++------
> >   arch/mips/kvm/mmu.c                    |  97 ++------
> >   arch/powerpc/include/asm/kvm_book3s.h  |  12 +-
> >   arch/powerpc/include/asm/kvm_ppc.h     |   9 +-
> >   arch/powerpc/kvm/book3s.c              |  18 +-
> >   arch/powerpc/kvm/book3s.h              |  10 +-
> >   arch/powerpc/kvm/book3s_64_mmu_hv.c    |  98 ++------
> >   arch/powerpc/kvm/book3s_64_mmu_radix.c |  25 +-
> >   arch/powerpc/kvm/book3s_hv.c           |  12 +-
> >   arch/powerpc/kvm/book3s_pr.c           |  56 ++---
> >   arch/powerpc/kvm/e500_mmu_host.c       |  27 +-
> >   arch/x86/kvm/mmu/mmu.c                 | 127 ++++------
> >   arch/x86/kvm/mmu/tdp_mmu.c             | 245 +++++++------------
> >   arch/x86/kvm/mmu/tdp_mmu.h             |  14 +-
> >   include/linux/kvm_host.h               |  22 +-
> >   virt/kvm/kvm_main.c                    | 325 +++++++++++++++++++------
> >   16 files changed, 552 insertions(+), 662 deletions(-)
> > 
> 
> For MIPS, I am going to post a series that simplifies TLB flushing
> further.  I applied it, and rebased this one on top, to
> kvm/mmu-notifier-queue.
> 
> Architecture maintainers, please look at the branch and
> review/test/ack your parts.

I've given this a reasonably good beating on arm64 for both VHE and
nVHE HW, and nothing caught fire, although I was left with a conflict
in the x86 code after merging with linux/master.

Feel free to add a

Tested-by: Marc Zyngier <maz@kernel.org>

for the arm64 side.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-04-02  0:56 ` [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary Sean Christopherson
  2021-04-02  9:34   ` Paolo Bonzini
@ 2021-04-19  8:49   ` Wanpeng Li
  2021-04-19 13:50     ` Paolo Bonzini
  1 sibling, 1 reply; 24+ messages in thread
From: Wanpeng Li @ 2021-04-19  8:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini, James Morse, Julien Thierry, Suzuki K Poulose,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, LAK,
	kvmarm, linux-mips, kvm, kvm-ppc, LKML, Ben Gardon

On Fri, 2 Apr 2021 at 08:59, Sean Christopherson <seanjc@google.com> wrote:
>
> Avoid taking mmu_lock for unrelated .invalidate_range_{start,end}()
> notifications.  Because mmu_notifier_count must be modified while holding
> mmu_lock for write, and must always be paired across start->end to stay
> balanced, lock elision must happen in both or none.  To meet that
> requirement, add a rwsem to prevent memslot updates across range_start()
> and range_end().
>
> Use a rwsem instead of a rwlock since most notifiers _allow_ blocking,
> and the lock will be held across the entire start() ... end() sequence.
> If anything in the sequence sleeps, including the caller or a different
> notifier, holding the spinlock would be disastrous.
>
> For notifiers that _disallow_ blocking, e.g. OOM reaping, simply go down
> the slow path of unconditionally acquiring mmu_lock.  The sane
> alternative would be to try to acquire the lock and force the notifier
> to retry on failure.  But since OOM is currently the _only_ scenario
> where blocking is disallowed, attempting to optimize a guest that has been
> marked for death is pointless.
>
> Unconditionally define and use mmu_notifier_slots_lock in the memslots
> code, purely to avoid more #ifdefs.  The overhead of acquiring the lock
> is negligible when the lock is uncontested, which will always be the case
> when the MMU notifiers are not used.
>
> Note, technically flag-only memslot updates could be allowed in parallel,
> but stalling a memslot update for a relatively short amount of time is
> not a scalability issue, and this is all more than complex enough.
>
> Based heavily on code from Ben Gardon.
>
> Suggested-by: Ben Gardon <bgardon@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

I saw this splatting:

 ======================================================
 WARNING: possible circular locking dependency detected
 5.12.0-rc3+ #6 Tainted: G           OE
 ------------------------------------------------------
 qemu-system-x86/3069 is trying to acquire lock:
 ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0},
at: __mmu_notifier_invalidate_range_end+0x5/0x190

 but task is already holding lock:
 ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at:
kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]

 which lock already depends on the new lock.


 the existing dependency chain (in reverse order) is:

 -> #1 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}:
        down_read+0x48/0x250
        kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
        __mmu_notifier_invalidate_range_start+0xe8/0x260
        wp_page_copy+0x82b/0xa30
        do_wp_page+0xde/0x420
        __handle_mm_fault+0x935/0x1230
        handle_mm_fault+0x179/0x420
        do_user_addr_fault+0x1b3/0x690
        exc_page_fault+0x82/0x2b0
        asm_exc_page_fault+0x1e/0x30

 -> #0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}:
        __lock_acquire+0x110f/0x1980
        lock_acquire+0x1bc/0x400
        __mmu_notifier_invalidate_range_end+0x47/0x190
        wp_page_copy+0x796/0xa30
        do_wp_page+0xde/0x420
        __handle_mm_fault+0x935/0x1230
        handle_mm_fault+0x179/0x420
        do_user_addr_fault+0x1b3/0x690
        exc_page_fault+0x82/0x2b0
        asm_exc_page_fault+0x1e/0x30

 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&kvm->mmu_notifier_slots_lock);
                                lock(mmu_notifier_invalidate_range_start);
                                lock(&kvm->mmu_notifier_slots_lock);
   lock(mmu_notifier_invalidate_range_start);

  *** DEADLOCK ***

 2 locks held by qemu-system-x86/3069:
  #0: ffff9e4269f8a9e0 (&mm->mmap_lock#2){++++}-{3:3}, at:
do_user_addr_fault+0x10e/0x690
  #1: ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3},
at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]

 stack backtrace:
 CPU: 0 PID: 3069 Comm: qemu-system-x86 Tainted: G           OE
5.12.0-rc3+ #6
 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS
FBKTC1AUS 02/16/2016
 Call Trace:
  dump_stack+0x87/0xb7
  print_circular_bug.isra.39+0x1b4/0x210
  check_noncircular+0x103/0x150
  __lock_acquire+0x110f/0x1980
  ? __lock_acquire+0x110f/0x1980
  lock_acquire+0x1bc/0x400
  ? __mmu_notifier_invalidate_range_end+0x5/0x190
  ? find_held_lock+0x40/0xb0
  __mmu_notifier_invalidate_range_end+0x47/0x190
  ? __mmu_notifier_invalidate_range_end+0x5/0x190
  wp_page_copy+0x796/0xa30
  do_wp_page+0xde/0x420
  __handle_mm_fault+0x935/0x1230
  handle_mm_fault+0x179/0x420
  do_user_addr_fault+0x1b3/0x690
  ? rcu_read_lock_sched_held+0x4f/0x80
  exc_page_fault+0x82/0x2b0
  ? asm_exc_page_fault+0x8/0x30
  asm_exc_page_fault+0x1e/0x30
 RIP: 0033:0x55f5bef2560f

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-04-19  8:49   ` Wanpeng Li
@ 2021-04-19 13:50     ` Paolo Bonzini
  2021-04-19 15:09       ` Sean Christopherson
  0 siblings, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2021-04-19 13:50 UTC (permalink / raw)
  To: Wanpeng Li, Sean Christopherson
  Cc: Marc Zyngier, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML, Ben Gardon

On 19/04/21 10:49, Wanpeng Li wrote:
> I saw this splatting:
> 
>   ======================================================
>   WARNING: possible circular locking dependency detected
>   5.12.0-rc3+ #6 Tainted: G           OE
>   ------------------------------------------------------
>   qemu-system-x86/3069 is trying to acquire lock:
>   ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0},
> at: __mmu_notifier_invalidate_range_end+0x5/0x190
> 
>   but task is already holding lock:
>   ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at:
> kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]

I guess it is possible to open-code the wait using a readers count and a
spinlock (see patch after signature).  This allows including the
rcu_assign_pointer in the same critical section that checks the number
of readers.  Also on the plus side, the init_rwsem() is replaced by
slightly nicer code.

IIUC this could be extended to non-sleeping invalidations too, but I
am not really sure about that.

There are some issues with the patch though:

- I am not sure if this should be a raw spin lock to avoid the same issue
on PREEMPT_RT kernels.  That said, the critical section is so tiny that using
a raw spin lock may make sense anyway.

- this loses the rwsem fairness.  On the other hand, mm/mmu_notifier.c's
own interval-tree-based filter is also using a similar mechanism that is
likewise not fair, so it should be okay.

Any opinions?  For now I placed the change below in kvm/queue, but I'm
leaning towards delaying this optimization to the next merge window.

Paolo

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index 8f5d5bcf5689..e628f48dfdda 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -16,12 +16,11 @@ The acquisition orders for mutexes are as follows:
  - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
    them together is quite rare.
  
-- The kvm->mmu_notifier_slots_lock rwsem ensures that pairs of
+- kvm->mn_active_invalidate_count ensures that pairs of
    invalidate_range_start() and invalidate_range_end() callbacks
-  use the same memslots array.  kvm->slots_lock is taken outside the
-  write-side critical section of kvm->mmu_notifier_slots_lock, so
-  MMU notifiers must not take kvm->slots_lock.  No other write-side
-  critical sections should be added.
+  use the same memslots array.  kvm->slots_lock is taken on the
+  waiting side in install_new_memslots, so MMU notifiers must not
+  take kvm->slots_lock.
  
  On x86:
  
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 76b340dd6981..44a4a0c5148a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -472,11 +472,15 @@ struct kvm {
  #endif /* KVM_HAVE_MMU_RWLOCK */
  
  	struct mutex slots_lock;
-	struct rw_semaphore mmu_notifier_slots_lock;
  	struct mm_struct *mm; /* userspace tied to this vm */
  	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
  	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
  
+	/* Used to wait for completion of MMU notifiers.  */
+	spinlock_t mn_invalidate_lock;
+	unsigned long mn_active_invalidate_count;
+	struct rcuwait mn_memslots_update_rcuwait;
+
  	/*
  	 * created_vcpus is protected by kvm->lock, and is incremented
  	 * at the beginning of KVM_CREATE_VCPU.  online_vcpus is only
@@ -662,7 +666,7 @@ static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
  	as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
  	return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
  				      lockdep_is_held(&kvm->slots_lock) ||
-				      lockdep_is_held(&kvm->mmu_notifier_slots_lock) ||
+				      READ_ONCE(kvm->mn_active_invalidate_count) ||
  				      !refcount_read(&kvm->users_count));
  }
  
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff9e95eb6960..cdaa1841e725 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -624,7 +624,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
  	 * otherwise, mmu_notifier_count is incremented unconditionally.
  	 */
  	if (!kvm->mmu_notifier_count) {
-		lockdep_assert_held(&kvm->mmu_notifier_slots_lock);
+		WARN_ON(!READ_ONCE(kvm->mn_active_invalidate_count));
  		return;
  	}
  
@@ -689,10 +689,13 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
  	 * The complexity required to handle conditional locking for this case
  	 * is not worth the marginal benefits, the VM is likely doomed anyways.
  	 *
-	 * Pairs with the up_read in range_end().
+	 * Pairs with the decrement in range_end().
  	 */
-	if (blockable)
-		down_read(&kvm->mmu_notifier_slots_lock);
+	if (blockable) {
+		spin_lock(&kvm->mn_invalidate_lock);
+		kvm->mn_active_invalidate_count++;
+		spin_unlock(&kvm->mn_invalidate_lock);
+	}
  
  	__kvm_handle_hva_range(kvm, &hva_range);
  
@@ -735,9 +738,20 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
  
  	__kvm_handle_hva_range(kvm, &hva_range);
  
-	/* Pairs with the down_read in range_start(). */
-	if (blockable)
-		up_read(&kvm->mmu_notifier_slots_lock);
+	/* Pairs with the increment in range_start(). */
+	if (blockable) {
+		bool wake;
+		spin_lock(&kvm->mn_invalidate_lock);
+		wake = (--kvm->mn_active_invalidate_count == 0);
+		spin_unlock(&kvm->mn_invalidate_lock);
+
+		/*
+		 * There can only be one waiter, since the wait happens under
+		 * slots_lock.
+		 */
+		if (wake)
+			rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
+	}
  
  	BUG_ON(kvm->mmu_notifier_count < 0);
  }
@@ -951,7 +965,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
  	mutex_init(&kvm->lock);
  	mutex_init(&kvm->irq_lock);
  	mutex_init(&kvm->slots_lock);
-	init_rwsem(&kvm->mmu_notifier_slots_lock);
+	spin_lock_init(&kvm->mn_invalidate_lock);
+	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
+
  	INIT_LIST_HEAD(&kvm->devices);
  
  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
@@ -1073,15 +1089,17 @@ static void kvm_destroy_vm(struct kvm *kvm)
  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
  	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
  	/*
-	 * Reset the lock used to prevent memslot updates between MMU notifier
-	 * invalidate_range_start() and invalidate_range_end().  At this point,
-	 * no more MMU notifiers will run and pending calls to ...start() have
-	 * completed.  But, the lock could still be held if KVM's notifier was
-	 * removed between ...start() and ...end().  No threads can be waiting
-	 * on the lock as the last reference on KVM has been dropped.  If the
-	 * lock is still held, freeing memslots will deadlock.
+	 * At this point, pending calls to invalidate_range_start()
+	 * have completed but no more MMU notifiers will run, so
+	 * mn_active_invalidate_count may remain unbalanced.
+	 * No threads can be waiting in install_new_memslots as the
+	 * last reference on KVM has been dropped, but freeing
+	 * memslots will deadlock without manual intervention.
  	 */
-	init_rwsem(&kvm->mmu_notifier_slots_lock);
+	spin_lock(&kvm->mn_invalidate_lock);
+	kvm->mn_active_invalidate_count = 0;
+	WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
+	spin_unlock(&kvm->mn_invalidate_lock);
  #else
  	kvm_arch_flush_shadow_all(kvm);
  #endif
@@ -1333,9 +1351,22 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
  	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
  	slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
  
-	down_write(&kvm->mmu_notifier_slots_lock);
+	/*
+	 * This cannot be an rwsem because the MMU notifier must not run
+	 * inside the critical section.  A sleeping rwsem cannot exclude
+	 * that.
+	 */
+	spin_lock(&kvm->mn_invalidate_lock);
+	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
+	while (kvm->mn_active_invalidate_count) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		spin_unlock(&kvm->mn_invalidate_lock);
+		schedule();
+		spin_lock(&kvm->mn_invalidate_lock);
+	}
+	finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
  	rcu_assign_pointer(kvm->memslots[as_id], slots);
-	up_write(&kvm->mmu_notifier_slots_lock);
+	spin_unlock(&kvm->mn_invalidate_lock);
  
  	synchronize_srcu_expedited(&kvm->srcu);
  


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-04-19 13:50     ` Paolo Bonzini
@ 2021-04-19 15:09       ` Sean Christopherson
  2021-04-19 22:09         ` Paolo Bonzini
  0 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2021-04-19 15:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Wanpeng Li, Marc Zyngier, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, LKML, Ben Gardon

On Mon, Apr 19, 2021, Paolo Bonzini wrote:
> On 19/04/21 10:49, Wanpeng Li wrote:
> > I saw this splatting:
> > 
> >   ======================================================
> >   WARNING: possible circular locking dependency detected
> >   5.12.0-rc3+ #6 Tainted: G           OE
> >   ------------------------------------------------------
> >   qemu-system-x86/3069 is trying to acquire lock:
> >   ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0},
> > at: __mmu_notifier_invalidate_range_end+0x5/0x190
> > 
> >   but task is already holding lock:
> >   ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at:
> > kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
> 
> I guess it is possible to open-code the wait using a readers count and a
> spinlock (see patch after signature).  This allows including the
> rcu_assign_pointer in the same critical section that checks the number
> of readers.  Also on the plus side, the init_rwsem() is replaced by
> slightly nicer code.

Ugh, the count approach is nearly identical to Ben's original code.  Using a
rwsem seemed so clever :-/

> IIUC this could be extended to non-sleeping invalidations too, but I
> am not really sure about that.

Yes, that should be fine.

> There are some issues with the patch though:
> 
> - I am not sure if this should be a raw spin lock to avoid the same issue
> on PREEMPT_RT kernel.  That said the critical section is so tiny that using
> a raw spin lock may make sense anyway

If using spinlock_t is problematic, wouldn't mmu_lock already be an issue?  Or
am I misunderstanding your concern?

> - this loses the rwsem fairness.  On the other hand, mm/mmu_notifier.c's
> own interval-tree-based filter is also using a similar mechanism that is
> likewise not fair, so it should be okay.

The one concern I had with an unfair mechanism of this nature is that, in theory,
the memslot update could be blocked indefinitely.

> Any opinions?  For now I placed the change below in kvm/queue, but I'm
> leaning towards delaying this optimization to the next merge window.

I think delaying it makes sense.

> @@ -1333,9 +1351,22 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
>  	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
>  	slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
> -	down_write(&kvm->mmu_notifier_slots_lock);
> +	/*
> +	 * This cannot be an rwsem because the MMU notifier must not run
> +	 * inside the critical section.  A sleeping rwsem cannot exclude
> +	 * that.

How on earth did you decipher that from the splat?  I stared at it for a good
five minutes and was completely befuddled.

> +	 */
> +	spin_lock(&kvm->mn_invalidate_lock);
> +	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
> +	while (kvm->mn_active_invalidate_count) {
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +		spin_unlock(&kvm->mn_invalidate_lock);
> +		schedule();
> +		spin_lock(&kvm->mn_invalidate_lock);
> +	}
> +	finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
>  	rcu_assign_pointer(kvm->memslots[as_id], slots);
> -	up_write(&kvm->mmu_notifier_slots_lock);
> +	spin_unlock(&kvm->mn_invalidate_lock);
>  	synchronize_srcu_expedited(&kvm->srcu);
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-04-19 15:09       ` Sean Christopherson
@ 2021-04-19 22:09         ` Paolo Bonzini
  2021-04-20  1:17           ` Sean Christopherson
  0 siblings, 1 reply; 24+ messages in thread
From: Paolo Bonzini @ 2021-04-19 22:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Wanpeng Li, Marc Zyngier, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, LKML, Ben Gardon

On 19/04/21 17:09, Sean Christopherson wrote:
>> - this loses the rwsem fairness.  On the other hand, mm/mmu_notifier.c's
>> own interval-tree-based filter is also using a similar mechanism that is
>> likewise not fair, so it should be okay.
> 
> The one concern I had with an unfair mechanism of this nature is that, in theory,
> the memslot update could be blocked indefinitely.

Yep, that's why I mentioned it.

>> @@ -1333,9 +1351,22 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
>>   	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
>>   	slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
>> -	down_write(&kvm->mmu_notifier_slots_lock);
>> +	/*
>> +	 * This cannot be an rwsem because the MMU notifier must not run
>> +	 * inside the critical section.  A sleeping rwsem cannot exclude
>> +	 * that.
> 
> How on earth did you decipher that from the splat?  I stared at it for a good
> five minutes and was completely befuddled.

Just scratch that, it makes no sense.  It's much simpler, but you have
to look at include/linux/mmu_notifier.h to figure it out:

     invalidate_range_start
       take pseudo lock
       down_read()           (*)
       release pseudo lock
     invalidate_range_end
       take pseudo lock      (**)
       up_read()
       release pseudo lock

At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.

This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):

- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots

- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes

- invalidate_range_end sits waiting on the pseudo lock, held by
invalidate_range_start.

Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers).  This also means that there's no need for a raw spinlock.

Given this simple explanation, I think it's okay to include this
patch in the merge window pull request, with the fix after my
signature squashed in.  The fix actually undoes a lot of the
changes to __kvm_handle_hva_range that this patch made, so the
result is relatively simple.  You can already find the result
in kvm/queue.

Paolo

 From daefeeb229ba8be5bd819a51875bc1fd5e74fc85 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Mon, 19 Apr 2021 09:01:46 -0400
Subject: [PATCH] KVM: avoid "deadlock" between install_new_memslots and MMU
  notifier

Wanpeng Li is reporting this splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.12.0-rc3+ #6 Tainted: G           OE
  ------------------------------------------------------
  qemu-system-x86/3069 is trying to acquire lock:
  ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190

  but task is already holding lock:
  ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]

  which lock already depends on the new lock.

This corresponds to the following MMU notifier logic:

     invalidate_range_start
       take pseudo lock
       down_read()           (*)
       release pseudo lock
     invalidate_range_end
       take pseudo lock      (**)
       up_read()
       release pseudo lock

At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.

This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):

- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots

- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes

- invalidate_range_end sits waiting on the pseudo lock, held by
invalidate_range_start.

Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock.  This also allows handling blockable and non-blockable
critical sections in the same way.

Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever.  Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
  Documentation/virt/kvm/locking.rst |   9 +--
  include/linux/kvm_host.h           |   8 +-
  virt/kvm/kvm_main.c                | 119 ++++++++++++++---------------
  3 files changed, 67 insertions(+), 69 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index 8f5d5bcf5689..e628f48dfdda 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -16,12 +16,11 @@ The acquisition orders for mutexes are as follows:
  - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
    them together is quite rare.
  
-- The kvm->mmu_notifier_slots_lock rwsem ensures that pairs of
+- kvm->mn_active_invalidate_count ensures that pairs of
    invalidate_range_start() and invalidate_range_end() callbacks
-  use the same memslots array.  kvm->slots_lock is taken outside the
-  write-side critical section of kvm->mmu_notifier_slots_lock, so
-  MMU notifiers must not take kvm->slots_lock.  No other write-side
-  critical sections should be added.
+  use the same memslots array.  kvm->slots_lock is taken on the
+  waiting side in install_new_memslots, so MMU notifiers must not
+  take kvm->slots_lock.
  
  On x86:
  
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5808c259b92b..5ef09a4bc9c9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -472,11 +472,15 @@ struct kvm {
  #endif /* KVM_HAVE_MMU_RWLOCK */
  
  	struct mutex slots_lock;
-	struct rw_semaphore mmu_notifier_slots_lock;
  	struct mm_struct *mm; /* userspace tied to this vm */
  	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
  	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
  
+	/* Used to wait for completion of MMU notifiers.  */
+	spinlock_t mn_invalidate_lock;
+	unsigned long mn_active_invalidate_count;
+	struct rcuwait mn_memslots_update_rcuwait;
+
  	/*
  	 * created_vcpus is protected by kvm->lock, and is incremented
  	 * at the beginning of KVM_CREATE_VCPU.  online_vcpus is only
@@ -663,7 +667,7 @@ static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
  	as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
  	return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
  				      lockdep_is_held(&kvm->slots_lock) ||
-				      lockdep_is_held(&kvm->mmu_notifier_slots_lock) ||
+				      READ_ONCE(kvm->mn_active_invalidate_count) ||
  				      !refcount_read(&kvm->users_count));
  }
  
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 90f579e996e5..6a94ce073690 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -462,7 +462,6 @@ struct kvm_hva_range {
  	pte_t pte;
  	hva_handler_t handler;
  	on_lock_fn_t on_lock;
-	bool must_lock;
  	bool flush_on_ret;
  	bool may_block;
  };
@@ -480,25 +479,6 @@ static void kvm_null_fn(void)
  }
  #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
  
-
-/* Acquire mmu_lock if necessary.  Returns %true if @handler is "null" */
-static __always_inline bool kvm_mmu_lock_and_check_handler(struct kvm *kvm,
-							   const struct kvm_hva_range *range,
-							   bool *locked)
-{
-	if (*locked)
-		return false;
-
-	*locked = true;
-
-	KVM_MMU_LOCK(kvm);
-
-	if (!IS_KVM_NULL_FN(range->on_lock))
-		range->on_lock(kvm, range->start, range->end);
-
-	return IS_KVM_NULL_FN(range->handler);
-}
-
  static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
  						  const struct kvm_hva_range *range)
  {
@@ -515,10 +495,6 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
  
  	idx = srcu_read_lock(&kvm->srcu);
  
-	if (range->must_lock &&
-	    kvm_mmu_lock_and_check_handler(kvm, range, &locked))
-		goto out_unlock;
-
  	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
  		slots = __kvm_memslots(kvm, i);
  		kvm_for_each_memslot(slot, slots) {
@@ -547,8 +523,14 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
  			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
  			gfn_range.slot = slot;
  
-			if (kvm_mmu_lock_and_check_handler(kvm, range, &locked))
-				goto out_unlock;
+			if (!locked) {
+				locked = true;
+				KVM_MMU_LOCK(kvm);
+				if (!IS_KVM_NULL_FN(range->on_lock))
+					range->on_lock(kvm, range->start, range->end);
+				if (IS_KVM_NULL_FN(range->handler))
+					break;
+			}
  
  			ret |= range->handler(kvm, &gfn_range);
  		}
@@ -557,7 +539,6 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
  	if (range->flush_on_ret && (ret || kvm->tlbs_dirty))
  		kvm_flush_remote_tlbs(kvm);
  
-out_unlock:
  	if (locked)
  		KVM_MMU_UNLOCK(kvm);
  
@@ -580,7 +561,6 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
  		.pte		= pte,
  		.handler	= handler,
  		.on_lock	= (void *)kvm_null_fn,
-		.must_lock	= false,
  		.flush_on_ret	= true,
  		.may_block	= false,
  	};
@@ -600,7 +580,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
  		.pte		= __pte(0),
  		.handler	= handler,
  		.on_lock	= (void *)kvm_null_fn,
-		.must_lock	= false,
  		.flush_on_ret	= false,
  		.may_block	= false,
  	};
@@ -620,13 +599,11 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
  	 * .change_pte() must be surrounded by .invalidate_range_{start,end}(),
  	 * If mmu_notifier_count is zero, then start() didn't find a relevant
  	 * memslot and wasn't forced down the slow path; rechecking here is
-	 * unnecessary.  This can only occur if memslot updates are blocked;
-	 * otherwise, mmu_notifier_count is incremented unconditionally.
+	 * unnecessary.
  	 */
-	if (!kvm->mmu_notifier_count) {
-		lockdep_assert_held(&kvm->mmu_notifier_slots_lock);
+	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
+	if (!kvm->mmu_notifier_count)
  		return;
-	}
  
  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
  }
@@ -663,7 +640,6 @@ static void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
  					const struct mmu_notifier_range *range)
  {
-	bool blockable = mmu_notifier_range_blockable(range);
  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
  	const struct kvm_hva_range hva_range = {
  		.start		= range->start,
@@ -671,9 +647,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
  		.pte		= __pte(0),
  		.handler	= kvm_unmap_gfn_range,
  		.on_lock	= kvm_inc_notifier_count,
-		.must_lock	= !blockable,
  		.flush_on_ret	= true,
-		.may_block	= blockable,
+		.may_block	= mmu_notifier_range_blockable(range),
  	};
  
  	trace_kvm_unmap_hva_range(range->start, range->end);
@@ -684,15 +659,11 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
  	 * functions.  Without that guarantee, the mmu_notifier_count
  	 * adjustments will be imbalanced.
  	 *
-	 * Skip the memslot-lookup lock elision (set @must_lock above) to avoid
-	 * having to take the semaphore on non-blockable calls, e.g. OOM kill.
-	 * The complexity required to handle conditional locking for this case
-	 * is not worth the marginal benefits, the VM is likely doomed anyways.
-	 *
-	 * Pairs with the up_read in range_end().
+	 * Pairs with the decrement in range_end().
  	 */
-	if (blockable)
-		down_read(&kvm->mmu_notifier_slots_lock);
+	spin_lock(&kvm->mn_invalidate_lock);
+	kvm->mn_active_invalidate_count++;
+	spin_unlock(&kvm->mn_invalidate_lock);
  
  	__kvm_handle_hva_range(kvm, &hva_range);
  
@@ -720,7 +691,6 @@ static void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
  					const struct mmu_notifier_range *range)
  {
-	bool blockable = mmu_notifier_range_blockable(range);
  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
  	const struct kvm_hva_range hva_range = {
  		.start		= range->start,
@@ -728,16 +698,24 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
  		.pte		= __pte(0),
  		.handler	= (void *)kvm_null_fn,
  		.on_lock	= kvm_dec_notifier_count,
-		.must_lock	= !blockable,
  		.flush_on_ret	= false,
-		.may_block	= blockable,
+		.may_block	= mmu_notifier_range_blockable(range),
  	};
+	bool wake;
  
  	__kvm_handle_hva_range(kvm, &hva_range);
  
-	/* Pairs with the down_read in range_start(). */
-	if (blockable)
-		up_read(&kvm->mmu_notifier_slots_lock);
+	/* Pairs with the increment in range_start(). */
+	spin_lock(&kvm->mn_invalidate_lock);
+	wake = (--kvm->mn_active_invalidate_count == 0);
+	spin_unlock(&kvm->mn_invalidate_lock);
+
+	/*
+	 * There can only be one waiter, since the wait happens under
+	 * slots_lock.
+	 */
+	if (wake)
+		rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
  
  	BUG_ON(kvm->mmu_notifier_count < 0);
  }
@@ -951,7 +929,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
  	mutex_init(&kvm->lock);
  	mutex_init(&kvm->irq_lock);
  	mutex_init(&kvm->slots_lock);
-	init_rwsem(&kvm->mmu_notifier_slots_lock);
+	spin_lock_init(&kvm->mn_invalidate_lock);
+	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
+
  	INIT_LIST_HEAD(&kvm->devices);
  
  	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
@@ -1073,15 +1053,17 @@ static void kvm_destroy_vm(struct kvm *kvm)
  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
  	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
  	/*
-	 * Reset the lock used to prevent memslot updates between MMU notifier
-	 * invalidate_range_start() and invalidate_range_end().  At this point,
-	 * no more MMU notifiers will run and pending calls to ...start() have
-	 * completed.  But, the lock could still be held if KVM's notifier was
-	 * removed between ...start() and ...end().  No threads can be waiting
-	 * on the lock as the last reference on KVM has been dropped.  If the
-	 * lock is still held, freeing memslots will deadlock.
+	 * At this point, pending calls to invalidate_range_start()
+	 * have completed but no more MMU notifiers will run, so
+	 * mn_active_invalidate_count may remain unbalanced.
+	 * No threads can be waiting in install_new_memslots as the
+	 * last reference on KVM has been dropped, but freeing
+	 * memslots will deadlock without manual intervention.
  	 */
-	init_rwsem(&kvm->mmu_notifier_slots_lock);
+	spin_lock(&kvm->mn_invalidate_lock);
+	kvm->mn_active_invalidate_count = 0;
+	WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
+	spin_unlock(&kvm->mn_invalidate_lock);
  #else
  	kvm_arch_flush_shadow_all(kvm);
  #endif
@@ -1333,9 +1315,22 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
  	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
  	slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
  
-	down_write(&kvm->mmu_notifier_slots_lock);
+	/*
+	 * This cannot be an rwsem because the MMU notifier must not run
+	 * inside the critical section, which cannot be excluded with a
+	 * sleeping rwsem.
+	 */
+	spin_lock(&kvm->mn_invalidate_lock);
+	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
+	while (kvm->mn_active_invalidate_count) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		spin_unlock(&kvm->mn_invalidate_lock);
+		schedule();
+		spin_lock(&kvm->mn_invalidate_lock);
+	}
+	finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
  	rcu_assign_pointer(kvm->memslots[as_id], slots);
-	up_write(&kvm->mmu_notifier_slots_lock);
+	spin_unlock(&kvm->mn_invalidate_lock);
  
  	synchronize_srcu_expedited(&kvm->srcu);
  
-- 
2.26.2


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-04-19 22:09         ` Paolo Bonzini
@ 2021-04-20  1:17           ` Sean Christopherson
  0 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2021-04-20  1:17 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Wanpeng Li, Marc Zyngier, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, LKML, Ben Gardon

On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> On 19/04/21 17:09, Sean Christopherson wrote:
> > > - this loses the rwsem fairness.  On the other hand, mm/mmu_notifier.c's
> > > own interval-tree-based filter is also using a similar mechanism that is
> > > likewise not fair, so it should be okay.
> > 
> > The one concern I had with an unfair mechanism of this nature is that, in theory,
> > the memslot update could be blocked indefinitely.
> 
> Yep, that's why I mentioned it.
> 
> > > @@ -1333,9 +1351,22 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
> > >   	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
> > >   	slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
> > > -	down_write(&kvm->mmu_notifier_slots_lock);
> > > +	/*
> > > +	 * This cannot be an rwsem because the MMU notifier must not run
> > > +	 * inside the critical section.  A sleeping rwsem cannot exclude
> > > +	 * that.
> > 
> > How on earth did you decipher that from the splat?  I stared at it for a good
> > five minutes and was completely befuddled.
> 
> Just scratch that, it makes no sense.  It's much simpler, but you have
> to look at include/linux/mmu_notifier.h to figure it out:

LOL, glad you could figure it out, I wasn't getting anywhere, mmu_notifier.h or
not.

>     invalidate_range_start
>       take pseudo lock
>       down_read()           (*)
>       release pseudo lock
>     invalidate_range_end
>       take pseudo lock      (**)
>       up_read()
>       release pseudo lock
> 
> At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
> at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
> 
> This could cause a deadlock (ignoring for a second that the pseudo lock
> is not a lock):
> 
> - invalidate_range_start waits on down_read(), because the rwsem is
> held by install_new_memslots
> 
> - install_new_memslots waits on down_write(), because the rwsem is
> held till (another) invalidate_range_end finishes
> 
> - invalidate_range_end sits waits on the pseudo lock, held by
> invalidate_range_start.
> 
> Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
> it would change the *shared* rwsem readers into *shared recursive*
> readers).  This also means that there's no need for a raw spinlock.

Ahh, thanks, this finally made things click.

> Given this simple explanation, I think it's okay to include this

LOL, "simple".

> patch in the merge window pull request, with the fix after my
> signature squashed in.  The fix actually undoes a lot of the
> changes to __kvm_handle_hva_range that this patch made, so the
> result is relatively simple.  You can already find the result
> in kvm/queue.

...

>  static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  						  const struct kvm_hva_range *range)
>  {
> @@ -515,10 +495,6 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  	idx = srcu_read_lock(&kvm->srcu);
> -	if (range->must_lock &&
> -	    kvm_mmu_lock_and_check_handler(kvm, range, &locked))
> -		goto out_unlock;
> -
>  	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
>  		slots = __kvm_memslots(kvm, i);
>  		kvm_for_each_memslot(slot, slots) {
> @@ -547,8 +523,14 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
>  			gfn_range.slot = slot;
> -			if (kvm_mmu_lock_and_check_handler(kvm, range, &locked))
> -				goto out_unlock;
> +			if (!locked) {
> +				locked = true;
> +				KVM_MMU_LOCK(kvm);
> +				if (!IS_KVM_NULL_FN(range->on_lock))
> +					range->on_lock(kvm, range->start, range->end);
> +				if (IS_KVM_NULL_FN(range->handler))
> +					break;

This can/should be "goto out_unlock"; "break" only takes us out of the memslots
walk, whereas we want to get out of the address space loop entirely (rough sketch
below, after the rest of the hunk).  Not a functional problem, but we might walk
all SMM memslots unnecessarily.

> +			}
>  			ret |= range->handler(kvm, &gfn_range);
>  		}
> @@ -557,7 +539,6 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  	if (range->flush_on_ret && (ret || kvm->tlbs_dirty))
>  		kvm_flush_remote_tlbs(kvm);
> -out_unlock:
>  	if (locked)
>  		KVM_MMU_UNLOCK(kvm);
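
E.g. something like this (rough sketch of the suggested tweak, keeping the
out_unlock label that the fix otherwise removes):

  			if (!locked) {
  				locked = true;
  				KVM_MMU_LOCK(kvm);
  				if (!IS_KVM_NULL_FN(range->on_lock))
  					range->on_lock(kvm, range->start, range->end);
  				/* No handler => nothing to do for _any_ address space. */
  				if (IS_KVM_NULL_FN(range->handler))
  					goto out_unlock;
  			}
  			ret |= range->handler(kvm, &gfn_range);
  		}
  	}

  	if (range->flush_on_ret && (ret || kvm->tlbs_dirty))
  		kvm_flush_remote_tlbs(kvm);

  out_unlock:
  	if (locked)
  		KVM_MMU_UNLOCK(kvm);
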
> @@ -580,7 +561,6 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
>  		.pte		= pte,
>  		.handler	= handler,
>  		.on_lock	= (void *)kvm_null_fn,
> -		.must_lock	= false,
>  		.flush_on_ret	= true,
>  		.may_block	= false,
>  	};
> @@ -600,7 +580,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
>  		.pte		= __pte(0),
>  		.handler	= handler,
>  		.on_lock	= (void *)kvm_null_fn,
> -		.must_lock	= false,
>  		.flush_on_ret	= false,
>  		.may_block	= false,
>  	};
> @@ -620,13 +599,11 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	 * .change_pte() must be surrounded by .invalidate_range_{start,end}(),

While you're squashing, want to change the above comma to a period?

>  	 * If mmu_notifier_count is zero, then start() didn't find a relevant
>  	 * memslot and wasn't forced down the slow path; rechecking here is
> -	 * unnecessary.  This can only occur if memslot updates are blocked;
> -	 * otherwise, mmu_notifier_count is incremented unconditionally.
> +	 * unnecessary.
>  	 */
> -	if (!kvm->mmu_notifier_count) {
> -		lockdep_assert_held(&kvm->mmu_notifier_slots_lock);
> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> +	if (!kvm->mmu_notifier_count)
>  		return;
> -	}
>  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }

...

> @@ -1333,9 +1315,22 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
>  	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
>  	slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
> -	down_write(&kvm->mmu_notifier_slots_lock);
> +	/*
> +	 * This cannot be an rwsem because the MMU notifier must not run
> +	 * inside the critical section, which cannot be excluded with a
> +	 * sleeping rwsem.

Any objection to replacing this comment with a rephrased version of your
statement about "shared" vs. "shared recursive" readers and breaking the fairness
cycle?  IIUC, it's not "running inside the critical section" that's problematic;
it's that sleeping in down_write() can deadlock by blocking future readers.
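
Something along these lines, maybe (rough wording):

  	/*
  	 * This cannot be a (fair) rwsem: once a writer is waiting, a fair
  	 * rwsem blocks new readers, so sleeping in down_write() here could
  	 * deadlock against the notifier's start/end pseudo-lock ordering.
  	 * Waiting on a count of in-flight invalidations instead makes the
  	 * notifier side behave like "shared recursive" readers, which
  	 * breaks the cycle.
  	 */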

Thanks much!

> +	 */
> +	spin_lock(&kvm->mn_invalidate_lock);
> +	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
> +	while (kvm->mn_active_invalidate_count) {
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +		spin_unlock(&kvm->mn_invalidate_lock);
> +		schedule();
> +		spin_lock(&kvm->mn_invalidate_lock);
> +	}
> +	finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
>  	rcu_assign_pointer(kvm->memslots[as_id], slots);
> -	up_write(&kvm->mmu_notifier_slots_lock);
> +	spin_unlock(&kvm->mn_invalidate_lock);
>  	synchronize_srcu_expedited(&kvm->srcu);
> -- 
> 2.26.2
> 
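
For completeness, I assume the notifier side pairs with this roughly as
follows (a sketch, assuming the count is bumped in invalidate_range_start()
and the waiter is kicked via rcuwait_wake_up() in invalidate_range_end()):

  	/* in kvm_mmu_notifier_invalidate_range_start() */
  	spin_lock(&kvm->mn_invalidate_lock);
  	kvm->mn_active_invalidate_count++;
  	spin_unlock(&kvm->mn_invalidate_lock);

  	/* in kvm_mmu_notifier_invalidate_range_end() */
  	bool wake;

  	spin_lock(&kvm->mn_invalidate_lock);
  	wake = !--kvm->mn_active_invalidate_count;
  	spin_unlock(&kvm->mn_invalidate_lock);

  	/*
  	 * Wake install_new_memslots() only when the last in-flight
  	 * invalidation drains; rcuwait handles the race against the
  	 * prepare_to_rcuwait()/schedule() loop.
  	 */
  	if (wake)
  		rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);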

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, back to index

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
2021-04-02  0:56 [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
2021-04-02  0:56 ` [PATCH v2 01/10] KVM: Assert that notifier count is elevated in .change_pte() Sean Christopherson
2021-04-02 11:08   ` Paolo Bonzini
2021-04-02  0:56 ` [PATCH v2 02/10] KVM: Move x86's MMU notifier memslot walkers to generic code Sean Christopherson
2021-04-02  0:56 ` [PATCH v2 03/10] KVM: arm64: Convert to the gfn-based MMU notifier callbacks Sean Christopherson
2021-04-12 10:12   ` Marc Zyngier
2021-04-02  0:56 ` [PATCH v2 04/10] KVM: MIPS/MMU: " Sean Christopherson
2021-04-02  0:56 ` [PATCH v2 05/10] KVM: PPC: " Sean Christopherson
2021-04-02  0:56 ` [PATCH v2 06/10] KVM: Kill off the old hva-based " Sean Christopherson
2021-04-02  0:56 ` [PATCH v2 07/10] KVM: Move MMU notifier's mmu_lock acquisition into common helper Sean Christopherson
2021-04-02  9:35   ` Paolo Bonzini
2021-04-02 14:59     ` Sean Christopherson
2021-04-02  0:56 ` [PATCH v2 08/10] KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot Sean Christopherson
2021-04-02  0:56 ` [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary Sean Christopherson
2021-04-02  9:34   ` Paolo Bonzini
2021-04-02 14:59     ` Sean Christopherson
2021-04-19  8:49   ` Wanpeng Li
2021-04-19 13:50     ` Paolo Bonzini
2021-04-19 15:09       ` Sean Christopherson
2021-04-19 22:09         ` Paolo Bonzini
2021-04-20  1:17           ` Sean Christopherson
2021-04-02  0:56 ` [PATCH v2 10/10] KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible Sean Christopherson
2021-04-02 12:17 ` [PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers Paolo Bonzini
2021-04-12 10:27   ` Marc Zyngier
