* [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers
From: Sean Christopherson @ 2021-03-26  2:19 UTC
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

The end goal of this series is to optimize the MMU notifiers to take
mmu_lock if and only if the notification is relevant to KVM, i.e. the hva
range overlaps a memslot.  Large VMs (hundreds of vCPUs) are very
sensitive to mmu_lock being taken for write at inopportune times, and
such VMs also tend to be "static", e.g. backed by HugeTLB with minimal
page shenanigans.  The vast majority of notifications for these VMs will
be spurious (for KVM), and eliding mmu_lock for spurious notifications
avoids an otherwise unacceptable disruption to the guest.
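
Concretely, the end state gates mmu_lock on a per-memslot hva intersection
test along the lines of the sketch below.  This is illustrative only (the
helper name is hypothetical; the actual plumbing lands in patches 15-16),
but it reuses the same intersection math the per-arch walkers already do:

  static bool kvm_hva_range_hits_memslot(struct kvm_memory_slot *slot,
                                         unsigned long start, unsigned long end)
  {
          unsigned long hva_start, hva_end;

          hva_start = max(start, slot->userspace_addr);
          hva_end = min(end, slot->userspace_addr +
                             (slot->npages << PAGE_SHIFT));

          /* Take mmu_lock only if [start, end) intersects this memslot. */
          return hva_start < hva_end;
  }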

To get there without potentially degrading performance, e.g. due to
multiple memslot lookups, especially on non-x86 where the use cases are
largely unknown (from my perspective), first consolidate the MMU notifier
logic by moving the hva->gfn lookups into common KVM.
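
For reference, the per-memslot hva->gfn conversion that gets hoisted is the
existing hva_to_gfn_memslot() helper from include/linux/kvm_host.h, which is
roughly:

  static inline gfn_t hva_to_gfn_memslot(unsigned long hva,
                                         struct kvm_memory_slot *slot)
  {
          /* Offset into the slot, in pages, added to the slot's base gfn. */
          return slot->base_gfn + ((hva - slot->userspace_addr) >> PAGE_SHIFT);
  }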

This series applies on top of my TDP MMU TLB flushing bug fixes[*], which
conflict horribly with the TDP MMU changes below.  Those fixes in turn
apply on kvm/queue (commit 4a98623d5d90, "KVM: x86/mmu: Mark the PAE roots
as decrypted for shadow paging").

Speaking of conflicts, Ben will soon be posting a series to convert a
bunch of TDP MMU flows to take mmu_lock only for read.  Presumably there
will be an absurd number of conflicts; Ben and I will sort out the
conflicts in whichever series loses the race.

Well tested on Intel and AMD.  Compile tested for arm64, MIPS, PPC,
PPC e500, and s390.  This absolutely needs to be tested for real on
non-x86; I give it even odds that I introduced an off-by-one bug somewhere.

[*] https://lkml.kernel.org/r/20210325200119.1359384-1-seanjc@google.com


Patches 1-7 are x86 specific prep patches to play nice with moving
the hva->gfn memslot lookups into common code.  There ended up being waaay
more of these than I expected/wanted, but I had a hell of a time getting
the flushing logic right when shuffling the memslot and address space
loops.  In the end, I was more confident I got things correct by batching
the flushes.

Patch 8 moves the existing API prototypes into common code.  It could
technically be dropped since the old APIs are gone in the end, but I
thought the switch to the new APIs would suck a bit less this way.

Patch 9 moves arm64's MMU notifier tracepoints into common code so that
they are not lost when arm64 is converted to the new APIs, and so that all
architectures can benefit.

Patch 10 moves x86's memslot walkers into common KVM.  I chose x86 purely
because I could actually test it.  All architectures use nearly identical
code, so I don't think it actually matters in the end.

Patches 11-13 move arm64, MIPS, and PPC to the new APIs.

Patch 14 yanks out the old APIs.

Patch 15 adds the mmu_lock elision, but only for unpaired notifications.

Patch 16 adds mmu_lock elision for paired .invalidate_range_{start,end}().
This is quite nasty and no small part of me thinks the patch should be
burned with fire (I won't spoil it any further), but it's also the most
problematic scenario for our particular use case.  :-/
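
For context on why the paired case is nasty: .invalidate_range_start() and
.invalidate_range_end() must reach the same "take mmu_lock or not" verdict,
even though memslots can change between the two calls.  A simplified sketch
of the existing bookkeeping in virt/kvm/kvm_main.c that the elision has to
preserve (the real callbacks also zap and flush; details elided here):

  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
                                  const struct mmu_notifier_range *range)
  {
          struct kvm *kvm = mmu_notifier_to_kvm(mn);

          KVM_MMU_LOCK(kvm);
          /* Page faults racing with the invalidation must retry... */
          kvm->mmu_notifier_count++;
          KVM_MMU_UNLOCK(kvm);
          return 0;
  }

  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
                                  const struct mmu_notifier_range *range)
  {
          struct kvm *kvm = mmu_notifier_to_kvm(mn);

          KVM_MMU_LOCK(kvm);
          kvm->mmu_notifier_seq++;
          /* ...until the paired decrement below. */
          kvm->mmu_notifier_count--;
          KVM_MMU_UNLOCK(kvm);
  }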

Patches 17-18 are additional x86 cleanups.

Sean Christopherson (18):
  KVM: x86/mmu: Coalesce TDP MMU TLB flushes when zapping collapsible
    SPTEs
  KVM: x86/mmu: Move flushing for "slot" handlers to caller for legacy
    MMU
  KVM: x86/mmu: Coalesce TLB flushes when zapping collapsible SPTEs
  KVM: x86/mmu: Coalesce TLB flushes across address spaces for gfn range
    zap
  KVM: x86/mmu: Pass address space ID to __kvm_tdp_mmu_zap_gfn_range()
  KVM: x86/mmu: Pass address space ID to TDP MMU root walkers
  KVM: x86/mmu: Use leaf-only loop for walking TDP SPTEs when changing
    SPTE
  KVM: Move prototypes for MMU notifier callbacks to generic code
  KVM: Move arm64's MMU notifier trace events to generic code
  KVM: Move x86's MMU notifier memslot walkers to generic code
  KVM: arm64: Convert to the gfn-based MMU notifier callbacks
  KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks
  KVM: PPC: Convert to the gfn-based MMU notifier callbacks
  KVM: Kill off the old hva-based MMU notifier callbacks
  KVM: Take mmu_lock when handling MMU notifier iff the hva hits a
    memslot
  KVM: Don't take mmu_lock for range invalidation unless necessary
  KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if
    possible
  KVM: x86/mmu: Drop trace_kvm_age_page() tracepoint

 arch/arm64/include/asm/kvm_host.h             |   5 -
 arch/arm64/kvm/mmu.c                          | 118 ++----
 arch/arm64/kvm/trace_arm.h                    |  66 ----
 arch/mips/include/asm/kvm_host.h              |   5 -
 arch/mips/kvm/mmu.c                           |  97 +----
 arch/powerpc/include/asm/kvm_book3s.h         |  12 +-
 arch/powerpc/include/asm/kvm_host.h           |   7 -
 arch/powerpc/include/asm/kvm_ppc.h            |   9 +-
 arch/powerpc/kvm/book3s.c                     |  18 +-
 arch/powerpc/kvm/book3s.h                     |  10 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c           |  98 ++---
 arch/powerpc/kvm/book3s_64_mmu_radix.c        |  25 +-
 arch/powerpc/kvm/book3s_hv.c                  |  12 +-
 arch/powerpc/kvm/book3s_pr.c                  |  56 +--
 arch/powerpc/kvm/e500_mmu_host.c              |  29 +-
 arch/powerpc/kvm/trace_booke.h                |  15 -
 arch/x86/include/asm/kvm_host.h               |   6 +-
 arch/x86/kvm/mmu/mmu.c                        | 180 ++++-----
 arch/x86/kvm/mmu/mmu_internal.h               |  10 +
 arch/x86/kvm/mmu/tdp_mmu.c                    | 344 +++++++-----------
 arch/x86/kvm/mmu/tdp_mmu.h                    |  31 +-
 include/linux/kvm_host.h                      |  22 +-
 include/trace/events/kvm.h                    |  90 +++--
 tools/testing/selftests/kvm/lib/kvm_util.c    |   4 -
 .../selftests/kvm/lib/x86_64/processor.c      |   2 +
 virt/kvm/kvm_main.c                           | 312 ++++++++++++----
 26 files changed, 697 insertions(+), 886 deletions(-)

-- 
2.31.0.291.g576ba9dcdaf-goog


* [PATCH 01/18] KVM: x86/mmu: Coalesce TDP MMU TLB flushes when zapping collapsible SPTEs
From: Sean Christopherson @ 2021-03-26  2:19 UTC
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

When zapping collapsible SPTEs across multiple roots, gather pending
flushes and perform a single remote TLB flush at the end, as opposed to
flushing after processing every root.

Note, the pending "flush" may be cleared by the return value of
zap_collapsible_spte_range().  This is intended and correct: yielding
inside the helper may have already serviced a prior pending flush.
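
Distilled, the resulting pattern is (illustrative, mirroring the diff below):

  bool flush = false;

  for_each_tdp_mmu_root_yield_safe(kvm, root)
          flush = zap_collapsible_spte_range(kvm, root, slot, flush);

  /* A yield inside the helper may have flushed and reset "flush". */
  if (flush)
          kvm_flush_remote_tlbs(kvm);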

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 08667e3cf091..463f1be6ff0d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1278,21 +1278,21 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
  * Clear leaf entries which could be replaced by large mappings, for
  * GFNs within the slot.
  */
-static void zap_collapsible_spte_range(struct kvm *kvm,
+static bool zap_collapsible_spte_range(struct kvm *kvm,
 				       struct kvm_mmu_page *root,
-				       struct kvm_memory_slot *slot)
+				       struct kvm_memory_slot *slot,
+				       bool flush)
 {
 	gfn_t start = slot->base_gfn;
 	gfn_t end = start + slot->npages;
 	struct tdp_iter iter;
 	kvm_pfn_t pfn;
-	bool spte_set = false;
 
 	rcu_read_lock();
 
 	tdp_root_for_each_pte(iter, root, start, end) {
-		if (tdp_mmu_iter_cond_resched(kvm, &iter, spte_set)) {
-			spte_set = false;
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, flush)) {
+			flush = false;
 			continue;
 		}
 
@@ -1308,12 +1308,12 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 
 		tdp_mmu_set_spte(kvm, &iter, 0);
 
-		spte_set = true;
+		flush = true;
 	}
 
 	rcu_read_unlock();
-	if (spte_set)
-		kvm_flush_remote_tlbs(kvm);
+
+	return flush;
 }
 
 /*
@@ -1324,6 +1324,7 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				       struct kvm_memory_slot *slot)
 {
 	struct kvm_mmu_page *root;
+	bool flush = false;
 	int root_as_id;
 
 	for_each_tdp_mmu_root_yield_safe(kvm, root) {
@@ -1331,8 +1332,11 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 		if (root_as_id != slot->as_id)
 			continue;
 
-		zap_collapsible_spte_range(kvm, root, slot);
+		flush = zap_collapsible_spte_range(kvm, root, slot, flush);
 	}
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
 }
 
 /*
-- 
2.31.0.291.g576ba9dcdaf-goog


* [PATCH 02/18] KVM: x86/mmu: Move flushing for "slot" handlers to caller for legacy MMU
From: Sean Christopherson @ 2021-03-26  2:19 UTC
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Place the onus on the caller of slot_handle_*() to flush the TLB, rather
than handling the flush in the helper, and rename parameters accordingly.
This will allow future patches to coalesce flushes between address spaces
and between the legacy and TDP MMUs.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7a99e59c8c1c..d5c9fb34971a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5227,7 +5227,7 @@ typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head *rmap_
 static __always_inline bool
 slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
 			slot_level_handler fn, int start_level, int end_level,
-			gfn_t start_gfn, gfn_t end_gfn, bool lock_flush_tlb)
+			gfn_t start_gfn, gfn_t end_gfn, bool flush_on_yield)
 {
 	struct slot_rmap_walk_iterator iterator;
 	bool flush = false;
@@ -5238,7 +5238,7 @@ slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
 			flush |= fn(kvm, iterator.rmap, memslot);
 
 		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
-			if (flush && lock_flush_tlb) {
+			if (flush && flush_on_yield) {
 				kvm_flush_remote_tlbs_with_address(kvm,
 						start_gfn,
 						iterator.gfn - start_gfn + 1);
@@ -5248,32 +5248,26 @@ slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		}
 	}
 
-	if (flush && lock_flush_tlb) {
-		kvm_flush_remote_tlbs_with_address(kvm, start_gfn,
-						   end_gfn - start_gfn + 1);
-		flush = false;
-	}
-
 	return flush;
 }
 
 static __always_inline bool
 slot_handle_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		  slot_level_handler fn, int start_level, int end_level,
-		  bool lock_flush_tlb)
+		  bool flush_on_yield)
 {
 	return slot_handle_level_range(kvm, memslot, fn, start_level,
 			end_level, memslot->base_gfn,
 			memslot->base_gfn + memslot->npages - 1,
-			lock_flush_tlb);
+			flush_on_yield);
 }
 
 static __always_inline bool
 slot_handle_leaf(struct kvm *kvm, struct kvm_memory_slot *memslot,
-		 slot_level_handler fn, bool lock_flush_tlb)
+		 slot_level_handler fn, bool flush_on_yield)
 {
 	return slot_handle_level(kvm, memslot, fn, PG_LEVEL_4K,
-				 PG_LEVEL_4K, lock_flush_tlb);
+				 PG_LEVEL_4K, flush_on_yield);
 }
 
 static void free_mmu_pages(struct kvm_mmu *mmu)
@@ -5509,10 +5503,14 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 			if (start >= end)
 				continue;
 
-			slot_handle_level_range(kvm, memslot, kvm_zap_rmapp,
-						PG_LEVEL_4K,
-						KVM_MAX_HUGEPAGE_LEVEL,
-						start, end - 1, true);
+			flush = slot_handle_level_range(kvm, memslot, kvm_zap_rmapp,
+							PG_LEVEL_4K,
+							KVM_MAX_HUGEPAGE_LEVEL,
+							start, end - 1, true);
+
+			if (flush)
+				kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
+								   gfn_end);
 		}
 	}
 
@@ -5605,9 +5603,12 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 {
 	/* FIXME: const-ify all uses of struct kvm_memory_slot.  */
 	struct kvm_memory_slot *slot = (struct kvm_memory_slot *)memslot;
+	bool flush;
 
 	write_lock(&kvm->mmu_lock);
-	slot_handle_leaf(kvm, slot, kvm_mmu_zap_collapsible_spte, true);
+	flush = slot_handle_leaf(kvm, slot, kvm_mmu_zap_collapsible_spte, true);
+	if (flush)
+		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
 
 	if (is_tdp_mmu_enabled(kvm))
 		kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot);
@@ -5619,7 +5620,7 @@ void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
 {
 	/*
 	 * All current use cases for flushing the TLBs for a specific memslot
-	 * are related to dirty logging, and do the TLB flush out of mmu_lock.
+	 * are related to dirty logging, and many do the TLB flush out of mmu_lock.
 	 * The interaction between the various operations on memslot must be
 	 * serialized by slots_locks to ensure the TLB flush from one operation
 	 * is observed by any other operation on the same memslot.
-- 
2.31.0.291.g576ba9dcdaf-goog


* [PATCH 03/18] KVM: x86/mmu: Coalesce TLB flushes when zapping collapsible SPTEs
From: Sean Christopherson @ 2021-03-26  2:19 UTC
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Gather pending TLB flushes across both the legacy and TDP MMUs when
zapping collapsible SPTEs to avoid multiple flushes if both the legacy
MMU (for nested guests) and TDP MMU have mappings for the memslot.

Note, this also optimizes the TDP MMU to flush only the relevant range
when running as L1 with Hyper-V enlightenments, as the coalesced flush now
goes through kvm_arch_flush_remote_tlbs_memslot() instead of an
unconditional kvm_flush_remote_tlbs().

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 6 ++++--
 arch/x86/kvm/mmu/tdp_mmu.c | 8 +++-----
 arch/x86/kvm/mmu/tdp_mmu.h | 4 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d5c9fb34971a..37e2432c78ca 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5607,11 +5607,13 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 
 	write_lock(&kvm->mmu_lock);
 	flush = slot_handle_leaf(kvm, slot, kvm_mmu_zap_collapsible_spte, true);
+
+	if (is_tdp_mmu_enabled(kvm))
+		flush = kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot, flush);
+
 	if (flush)
 		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
 
-	if (is_tdp_mmu_enabled(kvm))
-		kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot);
 	write_unlock(&kvm->mmu_lock);
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 463f1be6ff0d..ff2bb0c8012e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1320,11 +1320,10 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
  * Clear non-leaf entries (and free associated page tables) which could
  * be replaced by large mappings, for GFNs within the slot.
  */
-void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
-				       struct kvm_memory_slot *slot)
+bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
+				       struct kvm_memory_slot *slot, bool flush)
 {
 	struct kvm_mmu_page *root;
-	bool flush = false;
 	int root_as_id;
 
 	for_each_tdp_mmu_root_yield_safe(kvm, root) {
@@ -1335,8 +1334,7 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 		flush = zap_collapsible_spte_range(kvm, root, slot, flush);
 	}
 
-	if (flush)
-		kvm_flush_remote_tlbs(kvm);
+	return flush;
 }
 
 /*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 715aa4e0196d..9ecd8f79f861 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -54,8 +54,8 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 				       struct kvm_memory_slot *slot,
 				       gfn_t gfn, unsigned long mask,
 				       bool wrprot);
-void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
-				       struct kvm_memory_slot *slot);
+bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
+				       struct kvm_memory_slot *slot, bool flush);
 
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn);
-- 
2.31.0.291.g576ba9dcdaf-goog


* [PATCH 04/18] KVM: x86/mmu: Coalesce TLB flushes across address spaces for gfn range zap
From: Sean Christopherson @ 2021-03-26  2:19 UTC
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Gather pending TLB flushes across both address spaces when zapping a
given gfn range.  This requires feeding "flush" back into subsequent
calls, but on the plus side sets the stage for further batching
between the legacy MMU and TDP MMU.  It also allows refactoring the
address space iteration to cover the legacy and TDP MMUs without
introducing truly ugly code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 37e2432c78ca..e6e02360ef67 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5227,10 +5227,10 @@ typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head *rmap_
 static __always_inline bool
 slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
 			slot_level_handler fn, int start_level, int end_level,
-			gfn_t start_gfn, gfn_t end_gfn, bool flush_on_yield)
+			gfn_t start_gfn, gfn_t end_gfn, bool flush_on_yield,
+			bool flush)
 {
 	struct slot_rmap_walk_iterator iterator;
-	bool flush = false;
 
 	for_each_slot_rmap_range(memslot, start_level, end_level, start_gfn,
 			end_gfn, &iterator) {
@@ -5259,7 +5259,7 @@ slot_handle_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	return slot_handle_level_range(kvm, memslot, fn, start_level,
 			end_level, memslot->base_gfn,
 			memslot->base_gfn + memslot->npages - 1,
-			flush_on_yield);
+			flush_on_yield, false);
 }
 
 static __always_inline bool
@@ -5490,7 +5490,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
 	int i;
-	bool flush;
+	bool flush = false;
 
 	write_lock(&kvm->mmu_lock);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
@@ -5506,14 +5506,13 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 			flush = slot_handle_level_range(kvm, memslot, kvm_zap_rmapp,
 							PG_LEVEL_4K,
 							KVM_MAX_HUGEPAGE_LEVEL,
-							start, end - 1, true);
-
-			if (flush)
-				kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
-								   gfn_end);
+							start, end - 1, true, flush);
 		}
 	}
 
+	if (flush)
+		kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);
+
 	if (is_tdp_mmu_enabled(kvm)) {
 		flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end);
 		if (flush)
-- 
2.31.0.291.g576ba9dcdaf-goog


* [PATCH 05/18] KVM: x86/mmu: Pass address space ID to __kvm_tdp_mmu_zap_gfn_range()
From: Sean Christopherson @ 2021-03-26  2:19 UTC
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Pass the address space ID to TDP MMU's primary "zap gfn range" helper to
allow the MMU notifier paths to iterate over memslots exactly once.
Currently, both the legacy MMU and TDP MMU iterate over memslots when
looking for an overlapping hva range, which can be quite costly if there
are a large number of memslots.

Add a "flush" parameter so that iterating over multiple address spaces
in the caller will continue to do the right thing when yielding while a
flush is pending from a previous address space.

Note, this also makes a functional change by coalescing TLB flushes
across multiple address spaces in kvm_zap_gfn_range(), and optimizes the
TDP MMU to utilize range-based flushing when running as L1 with Hyper-V
enlightenments.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 10 ++++------
 arch/x86/kvm/mmu/mmu_internal.h |  5 +++++
 arch/x86/kvm/mmu/tdp_mmu.c      | 22 +++++++++++-----------
 arch/x86/kvm/mmu/tdp_mmu.h      | 13 +++++++------
 4 files changed, 27 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e6e02360ef67..36c231d6bff9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5508,17 +5508,15 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 							KVM_MAX_HUGEPAGE_LEVEL,
 							start, end - 1, true, flush);
 		}
+
+		if (is_tdp_mmu_enabled(kvm))
+			flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
+							  gfn_end, flush);
 	}
 
 	if (flush)
 		kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);
 
-	if (is_tdp_mmu_enabled(kvm)) {
-		flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end);
-		if (flush)
-			kvm_flush_remote_tlbs(kvm);
-	}
-
 	write_unlock(&kvm->mmu_lock);
 }
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 5fe9123fc932..db2faa806ab7 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -129,6 +129,11 @@ static inline bool kvm_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *sp)
 	return !sp->root_count;
 }
 
+static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
+{
+	return sp->role.smm ? 1 : 0;
+}
+
 /*
  * Return values of handle_mmio_page_fault, mmu.page_fault, and fast_page_fault().
  *
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ff2bb0c8012e..bf279fff70ea 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -190,11 +190,6 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 				u64 old_spte, u64 new_spte, int level,
 				bool shared);
 
-static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
-{
-	return sp->role.smm ? 1 : 0;
-}
-
 static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
 {
 	if (!is_shadow_present_pte(old_spte) || !is_last_spte(old_spte, level))
@@ -709,14 +704,16 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
  * SPTEs have been cleared and a TLB flush is needed before releasing the
  * MMU lock.
  */
-bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end,
-				 bool can_yield)
+bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
+				 gfn_t end, bool can_yield, bool flush)
 {
 	struct kvm_mmu_page *root;
-	bool flush = false;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root)
+	for_each_tdp_mmu_root_yield_safe(kvm, root) {
+		if (kvm_mmu_page_as_id(root) != as_id)
+			continue;
 		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
+	}
 
 	return flush;
 }
@@ -724,9 +721,12 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end,
 void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 {
 	gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
-	bool flush;
+	bool flush = false;
+	int i;
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
+		flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, max_gfn, flush);
 
-	flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn);
 	if (flush)
 		kvm_flush_remote_tlbs(kvm);
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 9ecd8f79f861..f224df334382 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -8,12 +8,12 @@
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
 void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);
 
-bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end,
-				 bool can_yield);
-static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start,
-					     gfn_t end)
+bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
+				 gfn_t end, bool can_yield, bool flush);
+static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
+					     gfn_t start, gfn_t end, bool flush)
 {
-	return __kvm_tdp_mmu_zap_gfn_range(kvm, start, end, true);
+	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
 }
 static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
@@ -28,7 +28,8 @@ static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 	 * requirement), its "step sideways" will always step beyond the bounds
 	 * of the shadow page's gfn range and stop iterating before yielding.
 	 */
-	return __kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, end, false);
+	return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
+					   sp->gfn, end, false, false);
 }
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 
-- 
2.31.0.291.g576ba9dcdaf-goog


* [PATCH 06/18] KVM: x86/mmu: Pass address space ID to TDP MMU root walkers
From: Sean Christopherson @ 2021-03-26  2:19 UTC
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Move the address space ID check that is performed when iterating over
roots into the macro helpers to consolidate code.
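
The diff below relies on the empty "if { } else" idiom to bake the filter
into the iterator macros.  A toy illustration (not from the series) of why
that shape is used: the caller's loop body becomes the "else" arm, so it
runs only for matching entries, and the macro still expands to a single
statement that composes safely with an unbraced body and a trailing "else":

  /* Iterate only the even values in [0, _n). */
  #define for_each_even(_i, _n)                           \
          for (_i = 0; _i < (_n); _i++)                   \
                  if ((_i) & 1) {                         \
                  } else

  /* Usage, inside some function: */
  int i;

  for_each_even(i, 10)
          pr_info("%d is even\n", i);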

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu_internal.h |  7 ++-
 arch/x86/kvm/mmu/tdp_mmu.c      | 99 ++++++++++++---------------------
 2 files changed, 42 insertions(+), 64 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index db2faa806ab7..cead1d81e663 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -129,9 +129,14 @@ static inline bool kvm_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *sp)
 	return !sp->root_count;
 }
 
+static inline int kvm_mmu_role_as_id(union kvm_mmu_page_role role)
+{
+	return role.smm ? 1 : 0;
+}
+
 static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
 {
-	return sp->role.smm ? 1 : 0;
+	return kvm_mmu_role_as_id(sp->role);
 }
 
 /*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bf279fff70ea..7fe5004b1565 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -76,14 +76,18 @@ static inline struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
  * if exiting the loop early, the caller must drop the reference to the most
  * recent root. (Unless keeping a live reference is desirable.)
  */
-#define for_each_tdp_mmu_root_yield_safe(_kvm, _root)				\
+#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id)		\
 	for (_root = list_first_entry(&_kvm->arch.tdp_mmu_roots,	\
 				      typeof(*_root), link);		\
 	     tdp_mmu_next_root_valid(_kvm, _root);			\
-	     _root = tdp_mmu_next_root(_kvm, _root))
+	     _root = tdp_mmu_next_root(_kvm, _root))			\
+		if (kvm_mmu_page_as_id(_root) != _as_id) {		\
+		} else
 
-#define for_each_tdp_mmu_root(_kvm, _root)				\
-	list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)
+#define for_each_tdp_mmu_root(_kvm, _root, _as_id)			\
+	list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)	\
+		if (kvm_mmu_page_as_id(_root) != _as_id) {		\
+		} else
 
 static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			  gfn_t start, gfn_t end, bool can_yield, bool flush);
@@ -148,7 +152,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 	role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
 
 	/* Check for an existing root before allocating a new one. */
-	for_each_tdp_mmu_root(kvm, root) {
+	for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
 		if (root->role.word == role.word) {
 			kvm_mmu_get_root(kvm, root);
 			goto out;
@@ -709,11 +713,8 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
 {
 	struct kvm_mmu_page *root;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root) {
-		if (kvm_mmu_page_as_id(root) != as_id)
-			continue;
+	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
 		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
-	}
 
 	return flush;
 }
@@ -893,27 +894,28 @@ static __always_inline int kvm_tdp_mmu_handle_hva_range(struct kvm *kvm,
 	int ret = 0;
 	int as_id;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root) {
-		as_id = kvm_mmu_page_as_id(root);
-		slots = __kvm_memslots(kvm, as_id);
-		kvm_for_each_memslot(memslot, slots) {
-			unsigned long hva_start, hva_end;
-			gfn_t gfn_start, gfn_end;
+	for (as_id = 0; as_id < KVM_ADDRESS_SPACE_NUM; as_id++) {
+		for_each_tdp_mmu_root_yield_safe(kvm, root, as_id) {
+			slots = __kvm_memslots(kvm, as_id);
+			kvm_for_each_memslot(memslot, slots) {
+				unsigned long hva_start, hva_end;
+				gfn_t gfn_start, gfn_end;
 
-			hva_start = max(start, memslot->userspace_addr);
-			hva_end = min(end, memslot->userspace_addr +
-				      (memslot->npages << PAGE_SHIFT));
-			if (hva_start >= hva_end)
-				continue;
-			/*
-			 * {gfn(page) | page intersects with [hva_start, hva_end)} =
-			 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
-			 */
-			gfn_start = hva_to_gfn_memslot(hva_start, memslot);
-			gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
+				hva_start = max(start, memslot->userspace_addr);
+				hva_end = min(end, memslot->userspace_addr +
+					(memslot->npages << PAGE_SHIFT));
+				if (hva_start >= hva_end)
+					continue;
+				/*
+				 * {gfn(page) | page intersects with [hva_start, hva_end)} =
+				 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
+				 */
+				gfn_start = hva_to_gfn_memslot(hva_start, memslot);
+				gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
 
-			ret |= handler(kvm, memslot, root, gfn_start,
-				       gfn_end, data);
+				ret |= handler(kvm, memslot, root, gfn_start,
+					gfn_end, data);
+			}
 		}
 	}
 
@@ -1125,17 +1127,11 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
 			     int min_level)
 {
 	struct kvm_mmu_page *root;
-	int root_as_id;
 	bool spte_set = false;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root) {
-		root_as_id = kvm_mmu_page_as_id(root);
-		if (root_as_id != slot->as_id)
-			continue;
-
+	for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id)
 		spte_set |= wrprot_gfn_range(kvm, root, slot->base_gfn,
 			     slot->base_gfn + slot->npages, min_level);
-	}
 
 	return spte_set;
 }
@@ -1190,17 +1186,11 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
 	struct kvm_mmu_page *root;
-	int root_as_id;
 	bool spte_set = false;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root) {
-		root_as_id = kvm_mmu_page_as_id(root);
-		if (root_as_id != slot->as_id)
-			continue;
-
+	for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id)
 		spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn,
 				slot->base_gfn + slot->npages);
-	}
 
 	return spte_set;
 }
@@ -1262,16 +1252,10 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 				       bool wrprot)
 {
 	struct kvm_mmu_page *root;
-	int root_as_id;
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
-	for_each_tdp_mmu_root(kvm, root) {
-		root_as_id = kvm_mmu_page_as_id(root);
-		if (root_as_id != slot->as_id)
-			continue;
-
+	for_each_tdp_mmu_root(kvm, root, slot->as_id)
 		clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
-	}
 }
 
 /*
@@ -1324,15 +1308,9 @@ bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				       struct kvm_memory_slot *slot, bool flush)
 {
 	struct kvm_mmu_page *root;
-	int root_as_id;
-
-	for_each_tdp_mmu_root_yield_safe(kvm, root) {
-		root_as_id = kvm_mmu_page_as_id(root);
-		if (root_as_id != slot->as_id)
-			continue;
 
+	for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id)
 		flush = zap_collapsible_spte_range(kvm, root, slot, flush);
-	}
 
 	return flush;
 }
@@ -1376,17 +1354,12 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	struct kvm_mmu_page *root;
-	int root_as_id;
 	bool spte_set = false;
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
-	for_each_tdp_mmu_root(kvm, root) {
-		root_as_id = kvm_mmu_page_as_id(root);
-		if (root_as_id != slot->as_id)
-			continue;
-
+	for_each_tdp_mmu_root(kvm, root, slot->as_id)
 		spte_set |= write_protect_gfn(kvm, root, gfn);
-	}
+
 	return spte_set;
 }
 
-- 
2.31.0.291.g576ba9dcdaf-goog


* [PATCH 07/18] KVM: x86/mmu: Use leaf-only loop for walking TDP SPTEs when changing SPTE
From: Sean Christopherson @ 2021-03-26  2:19 UTC
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Use the leaf-only TDP iterator when changing the SPTE in reaction to an
MMU notifier.  Practically speaking, this is a nop since the guts of the
loop explicitly look for 4k SPTEs, which are always leaf SPTEs.  Switch
the iterator to match age_gfn_range() and test_age_gfn() so that a future
patch can consolidate the core iterating logic.

No real functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7fe5004b1565..a2b3d9699320 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1042,7 +1042,7 @@ static int set_tdp_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
 
 	new_pfn = pte_pfn(*ptep);
 
-	tdp_root_for_each_pte(iter, root, gfn, gfn + 1) {
+	tdp_root_for_each_leaf_pte(iter, root, gfn, gfn + 1) {
 		if (iter.level != PG_LEVEL_4K)
 			continue;
 
-- 
2.31.0.291.g576ba9dcdaf-goog


* [PATCH 08/18] KVM: Move prototypes for MMU notifier callbacks to generic code
From: Sean Christopherson @ 2021-03-26  2:19 UTC
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Move the prototypes for the MMU notifier callbacks out of arch code and
into common code.  There is no benefit to having each arch replicate the
prototypes since any deviation from the invocation in common code will
explode.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/include/asm/kvm_host.h   | 5 -----
 arch/mips/include/asm/kvm_host.h    | 5 -----
 arch/powerpc/include/asm/kvm_host.h | 7 -------
 arch/x86/include/asm/kvm_host.h     | 6 +-----
 include/linux/kvm_host.h            | 8 ++++++++
 5 files changed, 9 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 3d10e6527f7d..72e6b4600264 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -582,11 +582,6 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
 			      struct kvm_vcpu_events *events);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-int kvm_unmap_hva_range(struct kvm *kvm,
-			unsigned long start, unsigned long end, unsigned flags);
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 3a5612e7304c..feaa77036b67 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -967,11 +967,6 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct kvm_vcpu *vcpu,
 						   bool write);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-int kvm_unmap_hva_range(struct kvm *kvm,
-			unsigned long start, unsigned long end, unsigned flags);
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
 /* Emulation */
 int kvm_get_inst(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 05fb00d37609..1e83359f286b 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -56,13 +56,6 @@
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 
-extern int kvm_unmap_hva_range(struct kvm *kvm,
-			       unsigned long start, unsigned long end,
-			       unsigned flags);
-extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
-extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
-extern int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
-
 #define HPTEG_CACHE_NUM			(1 << 15)
 #define HPTEG_HASH_BITS_PTE		13
 #define HPTEG_HASH_BITS_PTE_LONG	12
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a52f973bdff6..99778ac51243 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1718,11 +1718,7 @@ asmlinkage void kvm_spurious_fault(void);
 	_ASM_EXTABLE(666b, 667b)
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
-			unsigned flags);
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
+
 int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_cpu_has_extint(struct kvm_vcpu *v);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1b65e7204344..e6d77353025c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -218,6 +218,14 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
+#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+int kvm_unmap_hva_range(struct kvm *kvm,
+			unsigned long start, unsigned long end, unsigned flags);
+int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
+int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
+#endif
+
 enum {
 	OUTSIDE_GUEST_MODE,
 	IN_GUEST_MODE,
-- 
2.31.0.291.g576ba9dcdaf-goog


* [PATCH 09/18] KVM: Move arm64's MMU notifier trace events to generic code
From: Sean Christopherson @ 2021-03-26  2:19 UTC
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Move arm64's MMU notifier trace events into common code in preparation
for doing the hva->gfn lookup in common code.  The alternative would be
to trace the gfn instead of hva, but that's not obviously better and
could also be done in common code.  Tracing the notifiers is also quite
handy for debugging regardless of architecture.

Remove a completely redundant tracepoint from PPC e500.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/kvm/mmu.c             |  7 +---
 arch/arm64/kvm/trace_arm.h       | 66 --------------------------------
 arch/powerpc/kvm/e500_mmu_host.c |  2 -
 arch/powerpc/kvm/trace_booke.h   | 15 --------
 include/trace/events/kvm.h       | 66 ++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c              | 10 +++++
 6 files changed, 78 insertions(+), 88 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..4b7e1e327337 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1111,7 +1111,6 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 	if (!kvm->arch.mmu.pgt)
 		return 0;
 
-	trace_kvm_unmap_hva_range(start, end);
 	handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, &flags);
 	return 0;
 }
@@ -1141,8 +1140,6 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 	if (!kvm->arch.mmu.pgt)
 		return 0;
 
-	trace_kvm_set_spte_hva(hva);
-
 	/*
 	 * We've moved a page around, probably through CoW, so let's treat it
 	 * just like a translation fault and clean the cache to the PoC.
@@ -1173,7 +1170,7 @@ int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 {
 	if (!kvm->arch.mmu.pgt)
 		return 0;
-	trace_kvm_age_hva(start, end);
+
 	return handle_hva_to_gpa(kvm, start, end, kvm_age_hva_handler, NULL);
 }
 
@@ -1181,7 +1178,7 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
 {
 	if (!kvm->arch.mmu.pgt)
 		return 0;
-	trace_kvm_test_age_hva(hva);
+
 	return handle_hva_to_gpa(kvm, hva, hva + PAGE_SIZE,
 				 kvm_test_age_hva_handler, NULL);
 }
diff --git a/arch/arm64/kvm/trace_arm.h b/arch/arm64/kvm/trace_arm.h
index ff0444352bba..33e4e7dd2719 100644
--- a/arch/arm64/kvm/trace_arm.h
+++ b/arch/arm64/kvm/trace_arm.h
@@ -135,72 +135,6 @@ TRACE_EVENT(kvm_mmio_emulate,
 		  __entry->vcpu_pc, __entry->instr, __entry->cpsr)
 );
 
-TRACE_EVENT(kvm_unmap_hva_range,
-	TP_PROTO(unsigned long start, unsigned long end),
-	TP_ARGS(start, end),
-
-	TP_STRUCT__entry(
-		__field(	unsigned long,	start		)
-		__field(	unsigned long,	end		)
-	),
-
-	TP_fast_assign(
-		__entry->start		= start;
-		__entry->end		= end;
-	),
-
-	TP_printk("mmu notifier unmap range: %#016lx -- %#016lx",
-		  __entry->start, __entry->end)
-);
-
-TRACE_EVENT(kvm_set_spte_hva,
-	TP_PROTO(unsigned long hva),
-	TP_ARGS(hva),
-
-	TP_STRUCT__entry(
-		__field(	unsigned long,	hva		)
-	),
-
-	TP_fast_assign(
-		__entry->hva		= hva;
-	),
-
-	TP_printk("mmu notifier set pte hva: %#016lx", __entry->hva)
-);
-
-TRACE_EVENT(kvm_age_hva,
-	TP_PROTO(unsigned long start, unsigned long end),
-	TP_ARGS(start, end),
-
-	TP_STRUCT__entry(
-		__field(	unsigned long,	start		)
-		__field(	unsigned long,	end		)
-	),
-
-	TP_fast_assign(
-		__entry->start		= start;
-		__entry->end		= end;
-	),
-
-	TP_printk("mmu notifier age hva: %#016lx -- %#016lx",
-		  __entry->start, __entry->end)
-);
-
-TRACE_EVENT(kvm_test_age_hva,
-	TP_PROTO(unsigned long hva),
-	TP_ARGS(hva),
-
-	TP_STRUCT__entry(
-		__field(	unsigned long,	hva		)
-	),
-
-	TP_fast_assign(
-		__entry->hva		= hva;
-	),
-
-	TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
-);
-
 TRACE_EVENT(kvm_set_way_flush,
 	    TP_PROTO(unsigned long vcpu_pc, bool cache),
 	    TP_ARGS(vcpu_pc, cache),
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index ed0c9c43d0cf..648aefe1a3e7 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -723,8 +723,6 @@ int kvmppc_load_last_inst(struct kvm_vcpu *vcpu,
 
 static int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
 {
-	trace_kvm_unmap_hva(hva);
-
 	/*
 	 * Flush all shadow tlb entries everywhere. This is slow, but
 	 * we are 100% sure that we catch the to be unmapped page
diff --git a/arch/powerpc/kvm/trace_booke.h b/arch/powerpc/kvm/trace_booke.h
index 3837842986aa..eff6e82dbcd4 100644
--- a/arch/powerpc/kvm/trace_booke.h
+++ b/arch/powerpc/kvm/trace_booke.h
@@ -69,21 +69,6 @@ TRACE_EVENT(kvm_exit,
 		)
 );
 
-TRACE_EVENT(kvm_unmap_hva,
-	TP_PROTO(unsigned long hva),
-	TP_ARGS(hva),
-
-	TP_STRUCT__entry(
-		__field(	unsigned long,	hva		)
-	),
-
-	TP_fast_assign(
-		__entry->hva		= hva;
-	),
-
-	TP_printk("unmap hva 0x%lx\n", __entry->hva)
-);
-
 TRACE_EVENT(kvm_booke206_stlb_write,
 	TP_PROTO(__u32 mas0, __u32 mas8, __u32 mas1, __u64 mas2, __u64 mas7_3),
 	TP_ARGS(mas0, mas8, mas1, mas2, mas7_3),
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 49d7d0fe29f6..200eb7465530 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -462,6 +462,72 @@ TRACE_EVENT(kvm_dirty_ring_exit,
 	TP_printk("vcpu %d", __entry->vcpu_id)
 );
 
+TRACE_EVENT(kvm_unmap_hva_range,
+	TP_PROTO(unsigned long start, unsigned long end),
+	TP_ARGS(start, end),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	start		)
+		__field(	unsigned long,	end		)
+	),
+
+	TP_fast_assign(
+		__entry->start		= start;
+		__entry->end		= end;
+	),
+
+	TP_printk("mmu notifier unmap range: %#016lx -- %#016lx",
+		  __entry->start, __entry->end)
+);
+
+TRACE_EVENT(kvm_set_spte_hva,
+	TP_PROTO(unsigned long hva),
+	TP_ARGS(hva),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	hva		)
+	),
+
+	TP_fast_assign(
+		__entry->hva		= hva;
+	),
+
+	TP_printk("mmu notifier set pte hva: %#016lx", __entry->hva)
+);
+
+TRACE_EVENT(kvm_age_hva,
+	TP_PROTO(unsigned long start, unsigned long end),
+	TP_ARGS(start, end),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	start		)
+		__field(	unsigned long,	end		)
+	),
+
+	TP_fast_assign(
+		__entry->start		= start;
+		__entry->end		= end;
+	),
+
+	TP_printk("mmu notifier age hva: %#016lx -- %#016lx",
+		  __entry->start, __entry->end)
+);
+
+TRACE_EVENT(kvm_test_age_hva,
+	TP_PROTO(unsigned long hva),
+	TP_ARGS(hva),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	hva		)
+	),
+
+	TP_fast_assign(
+		__entry->hva		= hva;
+	),
+
+	TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
+);
+
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 383df23514b9..d1de843b7618 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -459,6 +459,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int idx;
 
+	trace_kvm_set_spte_hva(address);
+
 	idx = srcu_read_lock(&kvm->srcu);
 
 	KVM_MMU_LOCK(kvm);
@@ -478,6 +480,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
 
+	trace_kvm_unmap_hva_range(range->start, range->end);
+
 	idx = srcu_read_lock(&kvm->srcu);
 	KVM_MMU_LOCK(kvm);
 	/*
@@ -548,6 +552,8 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int young, idx;
 
+	trace_kvm_age_hva(start, end);
+
 	idx = srcu_read_lock(&kvm->srcu);
 	KVM_MMU_LOCK(kvm);
 
@@ -569,6 +575,8 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int young, idx;
 
+	trace_kvm_age_hva(start, end);
+
 	idx = srcu_read_lock(&kvm->srcu);
 	KVM_MMU_LOCK(kvm);
 	/*
@@ -598,6 +606,8 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int young, idx;
 
+	trace_kvm_test_age_hva(address);
+
 	idx = srcu_read_lock(&kvm->srcu);
 	KVM_MMU_LOCK(kvm);
 	young = kvm_test_age_hva(kvm, address);
-- 
2.31.0.291.g576ba9dcdaf-goog


* [PATCH 10/18] KVM: Move x86's MMU notifier memslot walkers to generic code
From: Sean Christopherson @ 2021-03-26  2:19 UTC
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Move the hva->gfn lookup for MMU notifiers into common code.  Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.

In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.

The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.

Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.

Lastly, future enhancements to the memslot implementation, e.g. adding an
interval tree to track host addresses, will need to touch far less
arch-specific code.

MIPS, PPC, and arm64 will be converted one at a time in future patches.
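
For readers jumping straight to this patch: the new arch hooks operate on a
gfn-range descriptor instead of raw hvas.  Roughly (a sketch of the series'
kvm_host.h additions; see the actual hunk for the authoritative definition),
common code hands each arch:

  struct kvm_gfn_range {
          struct kvm_memory_slot *slot;
          gfn_t start;
          gfn_t end;
          pte_t pte;        /* payload for the change_pte() notifier */
          bool may_block;   /* may the handler yield, e.g. drop mmu_lock? */
  };

  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);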

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h               |   1 +
 arch/x86/kvm/mmu/mmu.c                        | 127 ++++-----
 arch/x86/kvm/mmu/tdp_mmu.c                    | 247 +++++++-----------
 arch/x86/kvm/mmu/tdp_mmu.h                    |  14 +-
 include/linux/kvm_host.h                      |  14 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |   4 -
 .../selftests/kvm/lib/x86_64/processor.c      |   2 +
 virt/kvm/kvm_main.c                           | 153 +++++++++++
 8 files changed, 311 insertions(+), 251 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 99778ac51243..a21e3698f4dc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1718,6 +1718,7 @@ asmlinkage void kvm_spurious_fault(void);
 	_ASM_EXTABLE(666b, 667b)
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
+#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 36c231d6bff9..2070c7a91fdd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1298,26 +1298,25 @@ static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	return flush;
 }
 
-static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			   struct kvm_memory_slot *slot, gfn_t gfn, int level,
-			   unsigned long data)
+static bool kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+			    struct kvm_memory_slot *slot, gfn_t gfn, int level,
+			    pte_t unused)
 {
 	return kvm_zap_rmapp(kvm, rmap_head, slot);
 }
 
-static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			     struct kvm_memory_slot *slot, gfn_t gfn, int level,
-			     unsigned long data)
+static bool kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+			      struct kvm_memory_slot *slot, gfn_t gfn, int level,
+			      pte_t pte)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
 	int need_flush = 0;
 	u64 new_spte;
-	pte_t *ptep = (pte_t *)data;
 	kvm_pfn_t new_pfn;
 
-	WARN_ON(pte_huge(*ptep));
-	new_pfn = pte_pfn(*ptep);
+	WARN_ON(pte_huge(pte));
+	new_pfn = pte_pfn(pte);
 
 restart:
 	for_each_rmap_spte(rmap_head, &iter, sptep) {
@@ -1326,7 +1325,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 		need_flush = 1;
 
-		if (pte_write(*ptep)) {
+		if (pte_write(pte)) {
 			pte_list_remove(rmap_head, sptep);
 			goto restart;
 		} else {
@@ -1414,86 +1413,52 @@ static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
 	     slot_rmap_walk_okay(_iter_);				\
 	     slot_rmap_walk_next(_iter_))
 
-typedef int (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			      struct kvm_memory_slot *slot, gfn_t gfn,
-			      int level, unsigned long data);
+typedef bool (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+			       struct kvm_memory_slot *slot, gfn_t gfn,
+			       int level, pte_t pte);
 
-static __always_inline int kvm_handle_hva_range(struct kvm *kvm,
-						unsigned long start,
-						unsigned long end,
-						unsigned long data,
-						rmap_handler_t handler)
+static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
+						 struct kvm_gfn_range *range,
+						 rmap_handler_t handler)
 {
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *memslot;
 	struct slot_rmap_walk_iterator iterator;
-	int ret = 0;
-	int i;
+	bool ret = false;
 
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-		slots = __kvm_memslots(kvm, i);
-		kvm_for_each_memslot(memslot, slots) {
-			unsigned long hva_start, hva_end;
-			gfn_t gfn_start, gfn_end;
-
-			hva_start = max(start, memslot->userspace_addr);
-			hva_end = min(end, memslot->userspace_addr +
-				      (memslot->npages << PAGE_SHIFT));
-			if (hva_start >= hva_end)
-				continue;
-			/*
-			 * {gfn(page) | page intersects with [hva_start, hva_end)} =
-			 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
-			 */
-			gfn_start = hva_to_gfn_memslot(hva_start, memslot);
-			gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
-
-			for_each_slot_rmap_range(memslot, PG_LEVEL_4K,
-						 KVM_MAX_HUGEPAGE_LEVEL,
-						 gfn_start, gfn_end - 1,
-						 &iterator)
-				ret |= handler(kvm, iterator.rmap, memslot,
-					       iterator.gfn, iterator.level, data);
-		}
-	}
+	for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
+				 range->start, range->end - 1, &iterator)
+		ret |= handler(kvm, iterator.rmap, range->slot, iterator.gfn,
+			       iterator.level, range->pte);
 
 	return ret;
 }
 
-static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
-			  unsigned long data, rmap_handler_t handler)
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm_handle_hva_range(kvm, hva, hva + 1, data, handler);
-}
-
-int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
-			unsigned flags)
-{
-	int r;
+	bool flush;
 
-	r = kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
+	flush = kvm_handle_gfn_range(kvm, range, kvm_unmap_rmapp);
 
 	if (is_tdp_mmu_enabled(kvm))
-		r |= kvm_tdp_mmu_zap_hva_range(kvm, start, end);
+		flush |= kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush);
 
-	return r;
+	return flush;
 }
 
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	int r;
+	bool flush;
 
-	r = kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
+	flush = kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmapp);
 
 	if (is_tdp_mmu_enabled(kvm))
-		r |= kvm_tdp_mmu_set_spte_hva(kvm, hva, &pte);
+		flush |= kvm_tdp_mmu_set_spte_gfn(kvm, range);
 
-	return r;
+	return flush;
 }
 
-static int kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			 struct kvm_memory_slot *slot, gfn_t gfn, int level,
-			 unsigned long data)
+static bool kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+			  struct kvm_memory_slot *slot, gfn_t gfn, int level,
+			  pte_t unused)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1506,9 +1471,9 @@ static int kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	return young;
 }
 
-static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			      struct kvm_memory_slot *slot, gfn_t gfn,
-			      int level, unsigned long data)
+static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+			       struct kvm_memory_slot *slot, gfn_t gfn,
+			       int level, pte_t unused)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1530,29 +1495,31 @@ static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
 
 	rmap_head = gfn_to_rmap(vcpu->kvm, gfn, sp);
 
-	kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, 0);
+	kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
 	kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
 			KVM_PAGES_PER_HPAGE(sp->role.level));
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	int young = false;
+	bool young;
+
+	young = kvm_handle_gfn_range(kvm, range, kvm_age_rmapp);
 
-	young = kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);
 	if (is_tdp_mmu_enabled(kvm))
-		young |= kvm_tdp_mmu_age_hva_range(kvm, start, end);
+		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
 
 	return young;
 }
 
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	int young = false;
+	bool young;
+
+	young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmapp);
 
-	young = kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
 	if (is_tdp_mmu_enabled(kvm))
-		young |= kvm_tdp_mmu_test_age_hva(kvm, hva);
+		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
 
 	return young;
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a2b3d9699320..f6a82230baa5 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -878,146 +878,122 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	return ret;
 }
 
-typedef int (*tdp_handler_t)(struct kvm *kvm, struct kvm_memory_slot *slot,
-			     struct kvm_mmu_page *root, gfn_t start, gfn_t end,
-			     unsigned long data);
-
-static __always_inline int kvm_tdp_mmu_handle_hva_range(struct kvm *kvm,
-							unsigned long start,
-							unsigned long end,
-							unsigned long data,
-							tdp_handler_t handler)
+bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
+				 bool flush)
 {
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *memslot;
 	struct kvm_mmu_page *root;
-	int ret = 0;
-	int as_id;
 
-	for (as_id = 0; as_id < KVM_ADDRESS_SPACE_NUM; as_id++) {
-		for_each_tdp_mmu_root_yield_safe(kvm, root, as_id) {
-			slots = __kvm_memslots(kvm, as_id);
-			kvm_for_each_memslot(memslot, slots) {
-				unsigned long hva_start, hva_end;
-				gfn_t gfn_start, gfn_end;
+	for_each_tdp_mmu_root(kvm, root, range->slot->as_id)
+		flush |= zap_gfn_range(kvm, root, range->start, range->end,
+				       false, flush);
+
+	return flush;
+}
+
+typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
+			      struct kvm_gfn_range *range);
+
+static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
+						   struct kvm_gfn_range *range,
+						   tdp_handler_t handler)
+{
+	struct kvm_mmu_page *root;
+	struct tdp_iter iter;
+	bool ret = false;
 
-				hva_start = max(start, memslot->userspace_addr);
-				hva_end = min(end, memslot->userspace_addr +
-					(memslot->npages << PAGE_SHIFT));
-				if (hva_start >= hva_end)
-					continue;
-				/*
-				 * {gfn(page) | page intersects with [hva_start, hva_end)} =
-				 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
-				 */
-				gfn_start = hva_to_gfn_memslot(hva_start, memslot);
-				gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
+	rcu_read_lock();
 
-				ret |= handler(kvm, memslot, root, gfn_start,
-					gfn_end, data);
-			}
-		}
+	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
+			ret |= handler(kvm, &iter, range);
 	}
 
+	rcu_read_unlock();
+
 	return ret;
 }
 
-static __always_inline int kvm_tdp_mmu_handle_hva(struct kvm *kvm,
-						  unsigned long addr,
-						  unsigned long data,
-						  tdp_handler_t handler)
-{
-	return kvm_tdp_mmu_handle_hva_range(kvm, addr, addr + 1, data, handler);
-}
-
-static int zap_gfn_range_hva_wrapper(struct kvm *kvm,
-				     struct kvm_memory_slot *slot,
-				     struct kvm_mmu_page *root, gfn_t start,
-				     gfn_t end, unsigned long unused)
-{
-	return zap_gfn_range(kvm, root, start, end, false, false);
-}
-
-int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
-			      unsigned long end)
-{
-	return kvm_tdp_mmu_handle_hva_range(kvm, start, end, 0,
-					    zap_gfn_range_hva_wrapper);
-}
-
 /*
  * Mark SPTEs in the range of GFNs [start, end) unaccessed and return true
  * if any of the GFNs in the range have been accessed.
  */
-static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
-			 struct kvm_mmu_page *root, gfn_t start, gfn_t end,
-			 unsigned long unused)
+static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
+			  struct kvm_gfn_range *range)
 {
-	struct tdp_iter iter;
-	int young = 0;
 	u64 new_spte = 0;
 
-	rcu_read_lock();
+	/* If we have a non-accessed entry we don't need to change the pte. */
+	if (!is_accessed_spte(iter->old_spte))
+		return false;
 
-	tdp_root_for_each_leaf_pte(iter, root, start, end) {
+	new_spte = iter->old_spte;
+
+	if (spte_ad_enabled(new_spte)) {
+		clear_bit((ffs(shadow_accessed_mask) - 1),
+			  (unsigned long *)&new_spte);
+	} else {
 		/*
-		 * If we have a non-accessed entry we don't need to change the
-		 * pte.
+		 * Capture the dirty status of the page, so that it doesn't get
+		 * lost when the SPTE is marked for access tracking.
 		 */
-		if (!is_accessed_spte(iter.old_spte))
-			continue;
-
-		new_spte = iter.old_spte;
-
-		if (spte_ad_enabled(new_spte)) {
-			clear_bit((ffs(shadow_accessed_mask) - 1),
-				  (unsigned long *)&new_spte);
-		} else {
-			/*
-			 * Capture the dirty status of the page, so that it doesn't get
-			 * lost when the SPTE is marked for access tracking.
-			 */
-			if (is_writable_pte(new_spte))
-				kvm_set_pfn_dirty(spte_to_pfn(new_spte));
-
-			new_spte = mark_spte_for_access_track(new_spte);
-		}
-		new_spte &= ~shadow_dirty_mask;
-
-		tdp_mmu_set_spte_no_acc_track(kvm, &iter, new_spte);
-		young = 1;
-
-		trace_kvm_age_page(iter.gfn, iter.level, slot, young);
+		if (is_writable_pte(new_spte))
+			kvm_set_pfn_dirty(spte_to_pfn(new_spte));
+
+		new_spte = mark_spte_for_access_track(new_spte);
 	}
+	new_spte &= ~shadow_dirty_mask;
+
+	tdp_mmu_set_spte_no_acc_track(kvm, iter, new_spte);
 
-	rcu_read_unlock();
+	trace_kvm_age_page(iter->gfn, iter->level, range->slot, true);
 
-	return young;
+	return true;
 }
 
-int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
-			      unsigned long end)
+bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm_tdp_mmu_handle_hva_range(kvm, start, end, 0,
-					    age_gfn_range);
+	return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
 }
 
-static int test_age_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
-			struct kvm_mmu_page *root, gfn_t gfn, gfn_t end,
-			unsigned long unused)
+static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
+			 struct kvm_gfn_range *range)
 {
-	struct tdp_iter iter;
-
-	tdp_root_for_each_leaf_pte(iter, root, gfn, end)
-		if (is_accessed_spte(iter.old_spte))
-			return 1;
+	return is_accessed_spte(iter->old_spte);
+}
 
-	return 0;
+bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
 }
 
-int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva)
+static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
+			 struct kvm_gfn_range *range)
 {
-	return kvm_tdp_mmu_handle_hva(kvm, hva, 0, test_age_gfn);
+	u64 new_spte;
+
+	/* Huge pages aren't expected to be modified without first being zapped. */
+	WARN_ON(pte_huge(range->pte) || range->start + 1 != range->end);
+
+	if (iter->level != PG_LEVEL_4K ||
+	    !is_shadow_present_pte(iter->old_spte))
+		return false;
+
+	/*
+	 * Note, when changing a read-only SPTE, it's not strictly necessary to
+	 * zero the SPTE before setting the new PFN, but doing so preserves the
+	 * invariant that the PFN of a present leaf SPTE can never change.
+	 * See __handle_changed_spte().
+	 */
+	tdp_mmu_set_spte(kvm, iter, 0);
+
+	if (!pte_write(range->pte)) {
+		new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
+								  pte_pfn(range->pte));
+
+		tdp_mmu_set_spte(kvm, iter, new_spte);
+	}
+
+	return true;
 }
 
 /*
@@ -1026,60 +1002,15 @@ int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva)
  * notifier.
  * Returns non-zero if a flush is needed before releasing the MMU lock.
  */
-static int set_tdp_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
-			struct kvm_mmu_page *root, gfn_t gfn, gfn_t end,
-			unsigned long data)
+bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	struct tdp_iter iter;
-	pte_t *ptep = (pte_t *)data;
-	kvm_pfn_t new_pfn;
-	u64 new_spte;
-	int need_flush = 0;
+	bool flush = kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
 
-	rcu_read_lock();
+	/* FIXME: return 'flush' instead of flushing here. */
+	if (flush)
+		kvm_flush_remote_tlbs_with_address(kvm, range->start, 1);
 
-	WARN_ON(pte_huge(*ptep) || (gfn + 1) != end);
-
-	new_pfn = pte_pfn(*ptep);
-
-	tdp_root_for_each_leaf_pte(iter, root, gfn, gfn + 1) {
-		if (iter.level != PG_LEVEL_4K)
-			continue;
-
-		if (!is_shadow_present_pte(iter.old_spte))
-			break;
-
-		/*
-		 * Note, when changing a read-only SPTE, it's not strictly
-		 * necessary to zero the SPTE before setting the new PFN, but
-		 * doing so preserves the invariant that the PFN of a present
-		 * leaf SPTE can never change.  See __handle_changed_spte().
-		 */
-		tdp_mmu_set_spte(kvm, &iter, 0);
-
-		if (!pte_write(*ptep)) {
-			new_spte = kvm_mmu_changed_pte_notifier_make_spte(
-					iter.old_spte, new_pfn);
-
-			tdp_mmu_set_spte(kvm, &iter, new_spte);
-		}
-
-		need_flush = 1;
-	}
-
-	if (need_flush)
-		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
-
-	rcu_read_unlock();
-
-	return 0;
-}
-
-int kvm_tdp_mmu_set_spte_hva(struct kvm *kvm, unsigned long address,
-			     pte_t *host_ptep)
-{
-	return kvm_tdp_mmu_handle_hva(kvm, address, (unsigned long)host_ptep,
-				      set_tdp_spte);
+	return false;
 }
 
 /*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index f224df334382..302581743e1d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -37,15 +37,11 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		    int map_writable, int max_level, kvm_pfn_t pfn,
 		    bool prefault);
 
-int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
-			      unsigned long end);
-
-int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
-			      unsigned long end);
-int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva);
-
-int kvm_tdp_mmu_set_spte_hva(struct kvm *kvm, unsigned long address,
-			     pte_t *host_ptep);
+bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
+				 bool flush);
+bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 
 bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
 			     int min_level);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e6d77353025c..e6bb401dd856 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -219,11 +219,25 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+struct kvm_gfn_range {
+	struct kvm_memory_slot *slot;
+	gfn_t start;
+	gfn_t end;
+	pte_t pte;
+	bool may_block;
+};
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+#else
 int kvm_unmap_hva_range(struct kvm *kvm,
 			unsigned long start, unsigned long end, unsigned flags);
 int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
+#endif /* KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS */
 #endif
 
 enum {
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index e5fbf16f725b..62a5e9d1c94e 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -310,10 +310,6 @@ struct kvm_vm *vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus,
 		uint32_t vcpuid = vcpuids ? vcpuids[i] : i;
 
 		vm_vcpu_add_default(vm, vcpuid, guest_code);
-
-#ifdef __x86_64__
-		vcpu_set_cpuid(vm, vcpuid, kvm_get_supported_cpuid());
-#endif
 	}
 
 	return vm;
diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index a8906e60a108..1d81f2b24feb 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -590,6 +590,8 @@ void vm_vcpu_add_default(struct kvm_vm *vm, uint32_t vcpuid, void *guest_code)
 	vm_vcpu_add(vm, vcpuid);
 	vcpu_setup(vm, vcpuid, 0, 0);
 
+	vcpu_set_cpuid(vm, vcpuid, kvm_get_supported_cpuid());
+
 	/* Setup guest general purpose registers */
 	vcpu_regs_get(vm, vcpuid, &regs);
 	regs.rflags = regs.rflags | 0x2;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d1de843b7618..de74379a96ca 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -451,11 +451,129 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+
+typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+
+struct kvm_hva_range {
+	unsigned long start;
+	unsigned long end;
+	pte_t pte;
+	hva_handler_t handler;
+	bool flush_on_ret;
+	bool may_block;
+};
+
+static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
+						  const struct kvm_hva_range *range)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	struct kvm_gfn_range gfn_range;
+	bool ret = false;
+	int i, idx;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	idx = srcu_read_lock(&kvm->srcu);
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+		kvm_for_each_memslot(slot, slots) {
+			unsigned long hva_start, hva_end;
+
+			hva_start = max(range->start, slot->userspace_addr);
+			hva_end = min(range->end, slot->userspace_addr +
+						  (slot->npages << PAGE_SHIFT));
+			if (hva_start >= hva_end)
+				continue;
+
+			/*
+			 * To optimize for the likely case where the address
+			 * range is covered by zero or one memslots, don't
+			 * bother making these conditional (to avoid writes on
+			 * the second or later invocation of the handler).
+			 */
+			gfn_range.pte = range->pte;
+			gfn_range.may_block = range->may_block;
+
+			/*
+			 * {gfn(page) | page intersects with [hva_start, hva_end)} =
+			 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
+			 */
+			gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
+			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
+			gfn_range.slot = slot;
+
+			ret |= range->handler(kvm, &gfn_range);
+		}
+	}
+
+	if (range->flush_on_ret && (ret || kvm->tlbs_dirty))
+		kvm_flush_remote_tlbs(kvm);
+
+	srcu_read_unlock(&kvm->srcu, idx);
+
+	/* The notifiers are averse to booleans. :-( */
+	return (int)ret;
+}
+
+static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
+						unsigned long start,
+						unsigned long end,
+						pte_t pte,
+						hva_handler_t handler)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_hva_range range = {
+		.start		= start,
+		.end		= end,
+		.pte		= pte,
+		.handler	= handler,
+		.flush_on_ret	= true,
+		.may_block	= false,
+	};
+	int ret;
+
+	KVM_MMU_LOCK(kvm);
+	ret = __kvm_handle_hva_range(kvm, &range);
+	KVM_MMU_UNLOCK(kvm);
+
+	return ret;
+}
+
+static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
+							 unsigned long start,
+							 unsigned long end,
+							 hva_handler_t handler)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_hva_range range = {
+		.start		= start,
+		.end		= end,
+		.pte		= __pte(0),
+		.handler	= handler,
+		.flush_on_ret	= false,
+		.may_block	= false,
+	};
+	int ret;
+
+	KVM_MMU_LOCK(kvm);
+	ret = __kvm_handle_hva_range(kvm, &range);
+	KVM_MMU_UNLOCK(kvm);
+
+	return ret;
+}
+#endif /* KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS */
+
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
 					pte_t pte)
 {
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
+#else
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int idx;
 
@@ -472,17 +590,30 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 
 	KVM_MMU_UNLOCK(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
+#endif
 }
 
 static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	const struct kvm_hva_range hva_range = {
+		.start		= range->start,
+		.end		= range->end,
+		.pte		= __pte(0),
+		.handler	= kvm_unmap_gfn_range,
+		.flush_on_ret	= true,
+		.may_block	= mmu_notifier_range_blockable(range),
+	};
+#else
 	int need_tlb_flush = 0, idx;
 
 	trace_kvm_unmap_hva_range(range->start, range->end);
 
 	idx = srcu_read_lock(&kvm->srcu);
+#endif
+
 	KVM_MMU_LOCK(kvm);
 	/*
 	 * The count increase must become visible at unlock time as no
@@ -508,14 +639,21 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		kvm->mmu_notifier_range_end =
 			max(kvm->mmu_notifier_range_end, range->end);
 	}
+
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	__kvm_handle_hva_range(kvm, &hva_range);
+#else
 	need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end,
 					     range->flags);
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush || kvm->tlbs_dirty)
 		kvm_flush_remote_tlbs(kvm);
+#endif
 
 	KVM_MMU_UNLOCK(kvm);
+#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	srcu_read_unlock(&kvm->srcu, idx);
+#endif
 
 	return 0;
 }
@@ -549,6 +687,9 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 					      unsigned long start,
 					      unsigned long end)
 {
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	return kvm_handle_hva_range(mn, start, end, __pte(0), kvm_age_gfn);
+#else
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int young, idx;
 
@@ -565,6 +706,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
+#endif
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
@@ -572,6 +714,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 					unsigned long start,
 					unsigned long end)
 {
+#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int young, idx;
 
@@ -579,6 +722,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 
 	idx = srcu_read_lock(&kvm->srcu);
 	KVM_MMU_LOCK(kvm);
+#endif
 	/*
 	 * Even though we do not flush TLB, this will still adversely
 	 * affect performance on pre-Haswell Intel EPT, where there is
@@ -592,17 +736,25 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	 * cadence. If we find this inaccurate, we might come up with a
 	 * more sophisticated heuristic later.
 	 */
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+#else
 	young = kvm_age_hva(kvm, start, end);
 	KVM_MMU_UNLOCK(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
+#endif
 }
 
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long address)
 {
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	return kvm_handle_hva_range_no_flush(mn, address, address + 1,
+					     kvm_test_age_gfn);
+#else
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int young, idx;
 
@@ -615,6 +767,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
+#endif
 }
 
 static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 11/18] KVM: arm64: Convert to the gfn-based MMU notifier callbacks
  2021-03-26  2:19 [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (9 preceding siblings ...)
  2021-03-26  2:19 ` [PATCH 10/18] KVM: Move x86's MMU notifier memslot walkers " Sean Christopherson
@ 2021-03-26  2:19 ` Sean Christopherson
  2021-03-26  2:19 ` [PATCH 12/18] KVM: MIPS/MMU: " Sean Christopherson
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2021-03-26  2:19 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Move arm64 to the gfn-based MMU notifier APIs, which do the hva->gfn
lookup in common code.

Note, because arch code is now called if and only if a memslot is found,
the clean_dcache_guest_page() call in kvm_set_spte_*() will no longer be
made for addresses that are not covered by a guest memslot.
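
The conversion itself is mostly mechanical: the gfn range handed over
by common code is shifted back into an IPA span for the stage-2
helpers.  A minimal sketch of that math (example_gfn_range_to_ipa is
illustrative only; it matches the shifts in the diff below):

	static void example_gfn_range_to_ipa(struct kvm_gfn_range *range,
					     phys_addr_t *ipa, u64 *size)
	{
		*ipa  = (phys_addr_t)range->start << PAGE_SHIFT;
		*size = (u64)(range->end - range->start) << PAGE_SHIFT;
	}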

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/include/asm/kvm_host.h |   1 +
 arch/arm64/kvm/mmu.c              | 117 ++++++++----------------------
 2 files changed, 33 insertions(+), 85 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 72e6b4600264..1ad729cf7b0d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -582,6 +582,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
 			      struct kvm_vcpu_events *events);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
+#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 4b7e1e327337..35728231e9a0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -839,7 +839,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * gfn_to_pfn_prot (which calls get_user_pages), so that we don't risk
 	 * the page we just got a reference to gets unmapped before we have a
 	 * chance to grab the mmu_lock, which ensure that if the page gets
-	 * unmapped afterwards, the call to kvm_unmap_hva will take it away
+	 * unmapped afterwards, the call to kvm_unmap_gfn_range will take it
 	 * from us again properly. This smp_rmb() interacts with the smp_wmb()
 	 * in kvm_mmu_notifier_invalidate_<page|range_end>.
 	 */
@@ -1064,123 +1064,70 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 	return ret;
 }
 
-static int handle_hva_to_gpa(struct kvm *kvm,
-			     unsigned long start,
-			     unsigned long end,
-			     int (*handler)(struct kvm *kvm,
-					    gpa_t gpa, u64 size,
-					    void *data),
-			     void *data)
-{
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *memslot;
-	int ret = 0;
-
-	slots = kvm_memslots(kvm);
-
-	/* we only care about the pages that the guest sees */
-	kvm_for_each_memslot(memslot, slots) {
-		unsigned long hva_start, hva_end;
-		gfn_t gpa;
-
-		hva_start = max(start, memslot->userspace_addr);
-		hva_end = min(end, memslot->userspace_addr +
-					(memslot->npages << PAGE_SHIFT));
-		if (hva_start >= hva_end)
-			continue;
-
-		gpa = hva_to_gfn_memslot(hva_start, memslot) << PAGE_SHIFT;
-		ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);
-	}
-
-	return ret;
-}
-
-static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
-{
-	unsigned flags = *(unsigned *)data;
-	bool may_block = flags & MMU_NOTIFIER_RANGE_BLOCKABLE;
-
-	__unmap_stage2_range(&kvm->arch.mmu, gpa, size, may_block);
-	return 0;
-}
-
-int kvm_unmap_hva_range(struct kvm *kvm,
-			unsigned long start, unsigned long end, unsigned flags)
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	if (!kvm->arch.mmu.pgt)
 		return 0;
 
-	handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, &flags);
-	return 0;
-}
+	__unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT,
+			     (range->end - range->start) << PAGE_SHIFT,
+			     range->may_block);
 
-static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
-{
-	kvm_pfn_t *pfn = (kvm_pfn_t *)data;
-
-	WARN_ON(size != PAGE_SIZE);
-
-	/*
-	 * The MMU notifiers will have unmapped a huge PMD before calling
-	 * ->change_pte() (which in turn calls kvm_set_spte_hva()) and
-	 * therefore we never need to clear out a huge PMD through this
-	 * calling path and a memcache is not required.
-	 */
-	kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, gpa, PAGE_SIZE,
-			       __pfn_to_phys(*pfn), KVM_PGTABLE_PROT_R, NULL);
 	return 0;
 }
 
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	unsigned long end = hva + PAGE_SIZE;
-	kvm_pfn_t pfn = pte_pfn(pte);
+	kvm_pfn_t pfn = pte_pfn(range->pte);
 
 	if (!kvm->arch.mmu.pgt)
 		return 0;
 
+	WARN_ON(range->end - range->start != 1);
+
 	/*
 	 * We've moved a page around, probably through CoW, so let's treat it
 	 * just like a translation fault and clean the cache to the PoC.
 	 */
 	clean_dcache_guest_page(pfn, PAGE_SIZE);
-	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pfn);
+
+	/*
+	 * The MMU notifiers will have unmapped a huge PMD before calling
+	 * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
+	 * therefore we never need to clear out a huge PMD through this
+	 * calling path and a memcache is not required.
+	 */
+	kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
+			       PAGE_SIZE, __pfn_to_phys(pfn),
+			       KVM_PGTABLE_PROT_R, NULL);
+
 	return 0;
 }
 
-static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	pte_t pte;
+	u64 size = (range->end - range->start) << PAGE_SHIFT;
 	kvm_pte_t kpte;
+	pte_t pte;
+
+	if (!kvm->arch.mmu.pgt)
+		return false;
 
 	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
-	kpte = kvm_pgtable_stage2_mkold(kvm->arch.mmu.pgt, gpa);
+
+	kpte = kvm_pgtable_stage2_mkold(kvm->arch.mmu.pgt,
+					range->start << PAGE_SHIFT);
 	pte = __pte(kpte);
 	return pte_valid(pte) && pte_young(pte);
 }
 
-static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
-{
-	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
-	return kvm_pgtable_stage2_is_young(kvm->arch.mmu.pgt, gpa);
-}
-
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
-{
-	if (!kvm->arch.mmu.pgt)
-		return 0;
-
-	return handle_hva_to_gpa(kvm, start, end, kvm_age_hva_handler, NULL);
-}
-
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	if (!kvm->arch.mmu.pgt)
 		return 0;
 
-	return handle_hva_to_gpa(kvm, hva, hva + PAGE_SIZE,
-				 kvm_test_age_hva_handler, NULL);
+	return kvm_pgtable_stage2_is_young(kvm->arch.mmu.pgt,
+					   range->start << PAGE_SHIFT);
 }
 
 phys_addr_t kvm_mmu_get_httbr(void)
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 12/18] KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks
  2021-03-26  2:19 [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (10 preceding siblings ...)
  2021-03-26  2:19 ` [PATCH 11/18] KVM: arm64: Convert to the gfn-based MMU notifier callbacks Sean Christopherson
@ 2021-03-26  2:19 ` Sean Christopherson
  2021-03-31  7:41   ` Paolo Bonzini
  2021-03-26  2:19 ` [PATCH 13/18] KVM: PPC: " Sean Christopherson
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2021-03-26  2:19 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Move MIPS to the gfn-based MMU notifier APIs, which do the hva->gfn
lookup in common code; the common-code lookup is nearly identical to the
one MIPS currently open-codes.

No meaningful functional change intended, though the exact order of
operations is slightly different since the memslot lookups occur before
calling into arch code.
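
One visible effect of the resolved-slot calling convention: memslot
flags are read straight off range->slot instead of being passed as a
handler argument.  A trimmed, illustrative sketch of the pattern
(example_adjust_for_slot is hypothetical; the real logic in
__kvm_set_spte_gfn below also consults the old PTE's dirty state):

	static pte_t example_adjust_for_slot(struct kvm_gfn_range *range,
					     pte_t hva_pte)
	{
		if (range->slot->flags & KVM_MEM_LOG_DIRTY_PAGES)
			hva_pte = pte_mkclean(hva_pte);
		else if (range->slot->flags & KVM_MEM_READONLY)
			hva_pte = pte_wrprotect(hva_pte);
		return hva_pte;
	}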

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/mips/include/asm/kvm_host.h |  1 +
 arch/mips/kvm/mmu.c              | 97 ++++++--------------------------
 2 files changed, 17 insertions(+), 81 deletions(-)

diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index feaa77036b67..374a3c8806e8 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -967,6 +967,7 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct kvm_vcpu *vcpu,
 						   bool write);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
+#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 /* Emulation */
 int kvm_get_inst(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index 3dabeda82458..3dc885df2e32 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -439,85 +439,36 @@ static int kvm_mips_mkold_gpa_pt(struct kvm *kvm, gfn_t start_gfn,
 				  end_gfn << PAGE_SHIFT);
 }
 
-static int handle_hva_to_gpa(struct kvm *kvm,
-			     unsigned long start,
-			     unsigned long end,
-			     int (*handler)(struct kvm *kvm, gfn_t gfn,
-					    gpa_t gfn_end,
-					    struct kvm_memory_slot *memslot,
-					    void *data),
-			     void *data)
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *memslot;
-	int ret = 0;
-
-	slots = kvm_memslots(kvm);
-
-	/* we only care about the pages that the guest sees */
-	kvm_for_each_memslot(memslot, slots) {
-		unsigned long hva_start, hva_end;
-		gfn_t gfn, gfn_end;
-
-		hva_start = max(start, memslot->userspace_addr);
-		hva_end = min(end, memslot->userspace_addr +
-					(memslot->npages << PAGE_SHIFT));
-		if (hva_start >= hva_end)
-			continue;
-
-		/*
-		 * {gfn(page) | page intersects with [hva_start, hva_end)} =
-		 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
-		 */
-		gfn = hva_to_gfn_memslot(hva_start, memslot);
-		gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
-
-		ret |= handler(kvm, gfn, gfn_end, memslot, data);
-	}
-
-	return ret;
-}
-
-
-static int kvm_unmap_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
-				 struct kvm_memory_slot *memslot, void *data)
-{
-	kvm_mips_flush_gpa_pt(kvm, gfn, gfn_end);
-	return 1;
-}
-
-int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
-			unsigned flags)
-{
-	handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, NULL);
+	kvm_mips_flush_gpa_pt(kvm, range->start, range->end);
 
 	kvm_mips_callbacks->flush_shadow_all(kvm);
 	return 0;
 }
 
-static int kvm_set_spte_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
-				struct kvm_memory_slot *memslot, void *data)
+static bool __kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	gpa_t gpa = gfn << PAGE_SHIFT;
-	pte_t hva_pte = *(pte_t *)data;
+	gpa_t gpa = range->start << PAGE_SHIFT;
+	pte_t hva_pte = range->pte;
 	pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
 	pte_t old_pte;
 
 	if (!gpa_pte)
-		return 0;
+		return false;
 
 	/* Mapping may need adjusting depending on memslot flags */
 	old_pte = *gpa_pte;
-	if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte))
+	if (range->slot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte))
 		hva_pte = pte_mkclean(hva_pte);
-	else if (memslot->flags & KVM_MEM_READONLY)
+	else if (range->slot->flags & KVM_MEM_READONLY)
 		hva_pte = pte_wrprotect(hva_pte);
 
 	set_pte(gpa_pte, hva_pte);
 
 	/* Replacing an absent or old page doesn't need flushes */
 	if (!pte_present(old_pte) || !pte_young(old_pte))
-		return 0;
+		return false;
 
 	/* Pages swapped, aged, moved, or cleaned require flushes */
 	return !pte_present(hva_pte) ||
@@ -526,27 +477,21 @@ static int kvm_set_spte_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
 	       (pte_dirty(old_pte) && !pte_dirty(hva_pte));
 }
 
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	unsigned long end = hva + PAGE_SIZE;
-	int ret;
-
-	ret = handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pte);
-	if (ret)
+	if (__kvm_set_spte_gfn(kvm, range))
 		kvm_mips_callbacks->flush_shadow_all(kvm);
-	return 0;
+	return false;
 }
 
-static int kvm_age_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
-			       struct kvm_memory_slot *memslot, void *data)
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm_mips_mkold_gpa_pt(kvm, gfn, gfn_end);
+	return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end);
 }
 
-static int kvm_test_age_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
-				    struct kvm_memory_slot *memslot, void *data)
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	gpa_t gpa = gfn << PAGE_SHIFT;
+	gpa_t gpa = range->start << PAGE_SHIFT;
 	pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
 
 	if (!gpa_pte)
@@ -554,16 +499,6 @@ static int kvm_test_age_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
 	return pte_young(*gpa_pte);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
-{
-	return handle_hva_to_gpa(kvm, start, end, kvm_age_hva_handler, NULL);
-}
-
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
-{
-	return handle_hva_to_gpa(kvm, hva, hva, kvm_test_age_hva_handler, NULL);
-}
-
 /**
  * _kvm_mips_map_page_fast() - Fast path GPA fault handler.
  * @vcpu:		VCPU pointer.
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 13/18] KVM: PPC: Convert to the gfn-based MMU notifier callbacks
  2021-03-26  2:19 [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (11 preceding siblings ...)
  2021-03-26  2:19 ` [PATCH 12/18] KVM: MIPS/MMU: " Sean Christopherson
@ 2021-03-26  2:19 ` Sean Christopherson
  2021-03-26  2:19 ` [PATCH 14/18] KVM: Kill off the old hva-based " Sean Christopherson
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2021-03-26  2:19 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Move PPC to the gfn-based MMU notifier APIs, and update all 15 bajillion
PPC-internal hooks to work with gfns instead of hvas.

No meaningful functional change intended, though the exact order of
operations is slightly different since the memslot lookups occur before
calling into arch code.
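
Because PPC dispatches MMU notifier work through kvmppc_ops, the
conversion is largely a matter of repointing the ops at gfn-based
handlers, which common code then reaches via thin wrappers.  An
abbreviated sketch (example_ops is illustrative; the real
kvm_ops_hv/kvm_ops_pr initializers in the diff set many more fields):

	static struct kvmppc_ops example_ops = {
		.unmap_gfn_range = kvm_unmap_gfn_range_hv,
		.age_gfn	 = kvm_age_gfn_hv,
		.test_age_gfn	 = kvm_test_age_gfn_hv,
		.set_spte_gfn	 = kvm_set_spte_gfn_hv,
	};

	bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
	{
		return kvm->arch.kvm_ops->unmap_gfn_range(kvm, range);
	}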

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/powerpc/include/asm/kvm_book3s.h  | 12 ++--
 arch/powerpc/include/asm/kvm_host.h    |  1 +
 arch/powerpc/include/asm/kvm_ppc.h     |  9 ++-
 arch/powerpc/kvm/book3s.c              | 18 +++--
 arch/powerpc/kvm/book3s.h              | 10 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c    | 98 +++++++-------------------
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 25 +++----
 arch/powerpc/kvm/book3s_hv.c           | 12 ++--
 arch/powerpc/kvm/book3s_pr.c           | 56 +++++----------
 arch/powerpc/kvm/e500_mmu_host.c       | 27 +++----
 10 files changed, 95 insertions(+), 173 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 2f5f919f6cd3..2d03f2930767 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -210,12 +210,12 @@ extern void kvmppc_free_pgtable_radix(struct kvm *kvm, pgd_t *pgd,
 				      unsigned int lpid);
 extern int kvmppc_radix_init(void);
 extern void kvmppc_radix_exit(void);
-extern int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			unsigned long gfn);
-extern int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			unsigned long gfn);
-extern int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			unsigned long gfn);
+extern bool kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			    unsigned long gfn);
+extern bool kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			  unsigned long gfn);
+extern bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			       unsigned long gfn);
 extern long kvmppc_hv_get_dirty_log_radix(struct kvm *kvm,
 			struct kvm_memory_slot *memslot, unsigned long *map);
 extern void kvmppc_radix_flush_memslot(struct kvm *kvm,
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 1e83359f286b..1335f0001bdd 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -55,6 +55,7 @@
 #include <linux/mmu_notifier.h>
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
+#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 #define HPTEG_CACHE_NUM			(1 << 15)
 #define HPTEG_HASH_BITS_PTE		13
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 8aacd76bb702..21ab0332eb42 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -281,11 +281,10 @@ struct kvmppc_ops {
 				     const struct kvm_memory_slot *old,
 				     const struct kvm_memory_slot *new,
 				     enum kvm_mr_change change);
-	int (*unmap_hva_range)(struct kvm *kvm, unsigned long start,
-			   unsigned long end);
-	int (*age_hva)(struct kvm *kvm, unsigned long start, unsigned long end);
-	int (*test_age_hva)(struct kvm *kvm, unsigned long hva);
-	void (*set_spte_hva)(struct kvm *kvm, unsigned long hva, pte_t pte);
+	bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
+	bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
+	bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
+	bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
 	void (*free_memslot)(struct kvm_memory_slot *slot);
 	int (*init_vm)(struct kvm *kvm);
 	void (*destroy_vm)(struct kvm *kvm);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 44bf567b6589..2b691f4d1f26 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -834,26 +834,24 @@ void kvmppc_core_commit_memory_region(struct kvm *kvm,
 	kvm->arch.kvm_ops->commit_memory_region(kvm, mem, old, new, change);
 }
 
-int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
-			unsigned flags)
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end);
+	return kvm->arch.kvm_ops->unmap_gfn_range(kvm, range);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm->arch.kvm_ops->age_hva(kvm, start, end);
+	return kvm->arch.kvm_ops->age_gfn(kvm, range);
 }
 
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm->arch.kvm_ops->test_age_hva(kvm, hva);
+	return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
 }
 
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	kvm->arch.kvm_ops->set_spte_hva(kvm, hva, pte);
-	return 0;
+	return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
 }
 
 int kvmppc_core_init_vm(struct kvm *kvm)
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 9b6323ec8e60..740e51def5a5 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -9,12 +9,10 @@
 
 extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 					 struct kvm_memory_slot *memslot);
-extern int kvm_unmap_hva_range_hv(struct kvm *kvm, unsigned long start,
-				  unsigned long end);
-extern int kvm_age_hva_hv(struct kvm *kvm, unsigned long start,
-			  unsigned long end);
-extern int kvm_test_age_hva_hv(struct kvm *kvm, unsigned long hva);
-extern void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long hva, pte_t pte);
+extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
+extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
+extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
+extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 
 extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
 extern void kvmppc_mmu_destroy_pr(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index bb6773594cf8..b7bd9ca040b8 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -752,51 +752,6 @@ void kvmppc_rmap_reset(struct kvm *kvm)
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
 }
 
-typedef int (*hva_handler_fn)(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			      unsigned long gfn);
-
-static int kvm_handle_hva_range(struct kvm *kvm,
-				unsigned long start,
-				unsigned long end,
-				hva_handler_fn handler)
-{
-	int ret;
-	int retval = 0;
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *memslot;
-
-	slots = kvm_memslots(kvm);
-	kvm_for_each_memslot(memslot, slots) {
-		unsigned long hva_start, hva_end;
-		gfn_t gfn, gfn_end;
-
-		hva_start = max(start, memslot->userspace_addr);
-		hva_end = min(end, memslot->userspace_addr +
-					(memslot->npages << PAGE_SHIFT));
-		if (hva_start >= hva_end)
-			continue;
-		/*
-		 * {gfn(page) | page intersects with [hva_start, hva_end)} =
-		 * {gfn, gfn+1, ..., gfn_end-1}.
-		 */
-		gfn = hva_to_gfn_memslot(hva_start, memslot);
-		gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
-
-		for (; gfn < gfn_end; ++gfn) {
-			ret = handler(kvm, memslot, gfn);
-			retval |= ret;
-		}
-	}
-
-	return retval;
-}
-
-static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
-			  hva_handler_fn handler)
-{
-	return kvm_handle_hva_range(kvm, hva, hva + 1, handler);
-}
-
 /* Must be called with both HPTE and rmap locked */
 static void kvmppc_unmap_hpte(struct kvm *kvm, unsigned long i,
 			      struct kvm_memory_slot *memslot,
@@ -840,8 +795,8 @@ static void kvmppc_unmap_hpte(struct kvm *kvm, unsigned long i,
 	}
 }
 
-static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			   unsigned long gfn)
+static bool kvm_unmap_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			    unsigned long gfn)
 {
 	unsigned long i;
 	__be64 *hptep;
@@ -874,16 +829,15 @@ static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		unlock_rmap(rmapp);
 		__unlock_hpte(hptep, be64_to_cpu(hptep[0]));
 	}
-	return 0;
+	return false;
 }
 
-int kvm_unmap_hva_range_hv(struct kvm *kvm, unsigned long start, unsigned long end)
+bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	hva_handler_fn handler;
+	if (kvm_is_radix(kvm))
+		return kvm_unmap_radix(kvm, range->slot, range->start);
 
-	handler = kvm_is_radix(kvm) ? kvm_unmap_radix : kvm_unmap_rmapp;
-	kvm_handle_hva_range(kvm, start, end, handler);
-	return 0;
+	return kvm_unmap_rmapp(kvm, range->slot, range->start);
 }
 
 void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
@@ -913,8 +867,8 @@ void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 	}
 }
 
-static int kvm_age_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			 unsigned long gfn)
+static bool kvm_age_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			  unsigned long gfn)
 {
 	struct revmap_entry *rev = kvm->arch.hpt.rev;
 	unsigned long head, i, j;
@@ -968,26 +922,26 @@ static int kvm_age_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	return ret;
 }
 
-int kvm_age_hva_hv(struct kvm *kvm, unsigned long start, unsigned long end)
+bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	hva_handler_fn handler;
+	if (kvm_is_radix(kvm))
+		return kvm_age_radix(kvm, range->slot, range->start);
 
-	handler = kvm_is_radix(kvm) ? kvm_age_radix : kvm_age_rmapp;
-	return kvm_handle_hva_range(kvm, start, end, handler);
+	return kvm_age_rmapp(kvm, range->slot, range->start);
 }
 
-static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			      unsigned long gfn)
+static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			       unsigned long gfn)
 {
 	struct revmap_entry *rev = kvm->arch.hpt.rev;
 	unsigned long head, i, j;
 	unsigned long *hp;
-	int ret = 1;
+	bool ret = true;
 	unsigned long *rmapp;
 
 	rmapp = &memslot->arch.rmap[gfn - memslot->base_gfn];
 	if (*rmapp & KVMPPC_RMAP_REFERENCED)
-		return 1;
+		return true;
 
 	lock_rmap(rmapp);
 	if (*rmapp & KVMPPC_RMAP_REFERENCED)
@@ -1002,27 +956,27 @@ static int kvm_test_age_rmapp(struct kvm *kvm, struct kvm_memory_slot *memslot,
 				goto out;
 		} while ((i = j) != head);
 	}
-	ret = 0;
+	ret = false;
 
  out:
 	unlock_rmap(rmapp);
 	return ret;
 }
 
-int kvm_test_age_hva_hv(struct kvm *kvm, unsigned long hva)
+bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	hva_handler_fn handler;
+	if (kvm_is_radix(kvm))
+		return kvm_test_age_radix(kvm, range->slot, range->start);
 
-	handler = kvm_is_radix(kvm) ? kvm_test_age_radix : kvm_test_age_rmapp;
-	return kvm_handle_hva(kvm, hva, handler);
+	return kvm_test_age_rmapp(kvm, range->slot, range->start);
 }
 
-void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long hva, pte_t pte)
+bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	hva_handler_fn handler;
+	if (kvm_is_radix(kvm))
+		return kvm_unmap_radix(kvm, range->slot, range->start);
 
-	handler = kvm_is_radix(kvm) ? kvm_unmap_radix : kvm_unmap_rmapp;
-	kvm_handle_hva(kvm, hva, handler);
+	return kvm_unmap_rmapp(kvm, range->slot, range->start);
 }
 
 static int vcpus_running(struct kvm *kvm)
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index e603de7ade52..9e19ea451354 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -993,8 +993,8 @@ int kvmppc_book3s_radix_page_fault(struct kvm_vcpu *vcpu,
 }
 
 /* Called with kvm->mmu_lock held */
-int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-		    unsigned long gfn)
+bool kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+		     unsigned long gfn)
 {
 	pte_t *ptep;
 	unsigned long gpa = gfn << PAGE_SHIFT;
@@ -1002,24 +1002,24 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 
 	if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE) {
 		uv_page_inval(kvm->arch.lpid, gpa, PAGE_SHIFT);
-		return 0;
+		return false;
 	}
 
 	ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
 	if (ptep && pte_present(*ptep))
 		kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
 				 kvm->arch.lpid);
-	return 0;
+	return false;
 }
 
 /* Called with kvm->mmu_lock held */
-int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-		  unsigned long gfn)
+bool kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+		   unsigned long gfn)
 {
 	pte_t *ptep;
 	unsigned long gpa = gfn << PAGE_SHIFT;
 	unsigned int shift;
-	int ref = 0;
+	bool ref = false;
 	unsigned long old, *rmapp;
 
 	if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
@@ -1035,26 +1035,27 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		kvmhv_update_nest_rmap_rc_list(kvm, rmapp, _PAGE_ACCESSED, 0,
 					       old & PTE_RPN_MASK,
 					       1UL << shift);
-		ref = 1;
+		ref = true;
 	}
 	return ref;
 }
 
 /* Called with kvm->mmu_lock held */
-int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-		       unsigned long gfn)
+bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+			unsigned long gfn)
 {
 	pte_t *ptep;
 	unsigned long gpa = gfn << PAGE_SHIFT;
 	unsigned int shift;
-	int ref = 0;
+	bool ref = false;
 
 	if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
 		return ref;
 
 	ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
 	if (ptep && pte_present(*ptep) && pte_young(*ptep))
-		ref = 1;
+		ref = true;
 	return ref;
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 13bad6bf4c95..07682ad4110e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4770,7 +4770,7 @@ int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
 		kvmhv_release_all_nested(kvm);
 	kvmppc_rmap_reset(kvm);
 	kvm->arch.process_table = 0;
-	/* Mutual exclusion with kvm_unmap_hva_range etc. */
+	/* Mutual exclusion with kvm_unmap_gfn_range etc. */
 	spin_lock(&kvm->mmu_lock);
 	kvm->arch.radix = 0;
 	spin_unlock(&kvm->mmu_lock);
@@ -4792,7 +4792,7 @@ int kvmppc_switch_mmu_to_radix(struct kvm *kvm)
 	if (err)
 		return err;
 	kvmppc_rmap_reset(kvm);
-	/* Mutual exclusion with kvm_unmap_hva_range etc. */
+	/* Mutual exclusion with kvm_unmap_gfn_range etc. */
 	spin_lock(&kvm->mmu_lock);
 	kvm->arch.radix = 1;
 	spin_unlock(&kvm->mmu_lock);
@@ -5654,10 +5654,10 @@ static struct kvmppc_ops kvm_ops_hv = {
 	.flush_memslot  = kvmppc_core_flush_memslot_hv,
 	.prepare_memory_region = kvmppc_core_prepare_memory_region_hv,
 	.commit_memory_region  = kvmppc_core_commit_memory_region_hv,
-	.unmap_hva_range = kvm_unmap_hva_range_hv,
-	.age_hva  = kvm_age_hva_hv,
-	.test_age_hva = kvm_test_age_hva_hv,
-	.set_spte_hva = kvm_set_spte_hva_hv,
+	.unmap_gfn_range = kvm_unmap_gfn_range_hv,
+	.age_gfn = kvm_age_gfn_hv,
+	.test_age_gfn = kvm_test_age_gfn_hv,
+	.set_spte_gfn = kvm_set_spte_gfn_hv,
 	.free_memslot = kvmppc_core_free_memslot_hv,
 	.init_vm =  kvmppc_core_init_vm_hv,
 	.destroy_vm = kvmppc_core_destroy_vm_hv,
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 913944dc3620..d7733b07f489 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -425,61 +425,39 @@ static int kvmppc_core_check_requests_pr(struct kvm_vcpu *vcpu)
 }
 
 /************* MMU Notifiers *************/
-static void do_kvm_unmap_hva(struct kvm *kvm, unsigned long start,
-			     unsigned long end)
+static bool do_kvm_unmap_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	long i;
 	struct kvm_vcpu *vcpu;
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *memslot;
 
-	slots = kvm_memslots(kvm);
-	kvm_for_each_memslot(memslot, slots) {
-		unsigned long hva_start, hva_end;
-		gfn_t gfn, gfn_end;
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvmppc_mmu_pte_pflush(vcpu, range->start << PAGE_SHIFT,
+				      range->end << PAGE_SHIFT);
 
-		hva_start = max(start, memslot->userspace_addr);
-		hva_end = min(end, memslot->userspace_addr +
-					(memslot->npages << PAGE_SHIFT));
-		if (hva_start >= hva_end)
-			continue;
-		/*
-		 * {gfn(page) | page intersects with [hva_start, hva_end)} =
-		 * {gfn, gfn+1, ..., gfn_end-1}.
-		 */
-		gfn = hva_to_gfn_memslot(hva_start, memslot);
-		gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
-		kvm_for_each_vcpu(i, vcpu, kvm)
-			kvmppc_mmu_pte_pflush(vcpu, gfn << PAGE_SHIFT,
-					      gfn_end << PAGE_SHIFT);
-	}
+	return false;
 }
 
-static int kvm_unmap_hva_range_pr(struct kvm *kvm, unsigned long start,
-				  unsigned long end)
+static bool kvm_unmap_gfn_range_pr(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	do_kvm_unmap_hva(kvm, start, end);
-
-	return 0;
+	return do_kvm_unmap_gfn(kvm, range);
 }
 
-static int kvm_age_hva_pr(struct kvm *kvm, unsigned long start,
-			  unsigned long end)
+static bool kvm_age_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* XXX could be more clever ;) */
-	return 0;
+	return false;
 }
 
-static int kvm_test_age_hva_pr(struct kvm *kvm, unsigned long hva)
+static bool kvm_test_age_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* XXX could be more clever ;) */
-	return 0;
+	return false;
 }
 
-static void kvm_set_spte_hva_pr(struct kvm *kvm, unsigned long hva, pte_t pte)
+static bool kvm_set_spte_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* The page will get remapped properly on its next fault */
-	do_kvm_unmap_hva(kvm, hva, hva + PAGE_SIZE);
+	return do_kvm_unmap_gfn(kvm, range);
 }
 
 /*****************************************/
@@ -2079,10 +2057,10 @@ static struct kvmppc_ops kvm_ops_pr = {
 	.flush_memslot = kvmppc_core_flush_memslot_pr,
 	.prepare_memory_region = kvmppc_core_prepare_memory_region_pr,
 	.commit_memory_region = kvmppc_core_commit_memory_region_pr,
-	.unmap_hva_range = kvm_unmap_hva_range_pr,
-	.age_hva  = kvm_age_hva_pr,
-	.test_age_hva = kvm_test_age_hva_pr,
-	.set_spte_hva = kvm_set_spte_hva_pr,
+	.unmap_gfn_range = kvm_unmap_gfn_range_pr,
+	.age_gfn  = kvm_age_gfn_pr,
+	.test_age_gfn = kvm_test_age_gfn_pr,
+	.set_spte_gfn = kvm_set_spte_gfn_pr,
 	.free_memslot = kvmppc_core_free_memslot_pr,
 	.init_vm = kvmppc_core_init_vm_pr,
 	.destroy_vm = kvmppc_core_destroy_vm_pr,
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 648aefe1a3e7..7f16afc331ef 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -721,43 +721,36 @@ int kvmppc_load_last_inst(struct kvm_vcpu *vcpu,
 
 /************* MMU Notifiers *************/
 
-static int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
+static bool kvm_e500_mmu_unmap_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/*
 	 * Flush all shadow tlb entries everywhere. This is slow, but
 	 * we are 100% sure that we catch the to be unmapped page
 	 */
-	kvm_flush_remote_tlbs(kvm);
-
-	return 0;
+	return true;
 }
 
-int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
-			unsigned flags)
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	/* kvm_unmap_hva flushes everything anyways */
-	kvm_unmap_hva(kvm, start);
-
-	return 0;
+	return kvm_e500_mmu_unmap_gfn(kvm, range);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* XXX could be more clever ;) */
-	return 0;
+	return false;
 }
 
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* XXX could be more clever ;) */
-	return 0;
+	return false;
 }
 
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	/* The page will get remapped properly on its next fault */
-	kvm_unmap_hva(kvm, hva);
-	return 0;
+	return kvm_e500_mmu_unmap_gfn(kvm, range);
 }
 
 /*****************************************/
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread
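
Patch 13 above deletes PPC's private copy of the hva->gfn slot-intersection
math that patches 10-13 move into common KVM.  For readers following along,
here is a minimal user-space C sketch of that computation; the memslot
layout and the fixed 4KiB page size are simplifying assumptions for
illustration, not the kernel's real types:

#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

struct memslot {
	unsigned long userspace_addr;	/* hva where the slot starts */
	unsigned long npages;
	unsigned long base_gfn;
};

static unsigned long hva_to_gfn_memslot(unsigned long hva, struct memslot *slot)
{
	return slot->base_gfn + ((hva - slot->userspace_addr) >> PAGE_SHIFT);
}

int main(void)
{
	struct memslot slot = { .userspace_addr = 0x7f0000000000UL,
				.npages = 512, .base_gfn = 0x100 };
	unsigned long start = 0x7f0000003800UL, end = 0x7f0000005000UL;
	unsigned long hva_start, hva_end, gfn, gfn_end;

	/* Clamp the notification range to the slot's hva range. */
	hva_start = start > slot.userspace_addr ? start : slot.userspace_addr;
	hva_end = slot.userspace_addr + (slot.npages << PAGE_SHIFT);
	if (end < hva_end)
		hva_end = end;
	if (hva_start >= hva_end)
		return 0;	/* notification doesn't touch this slot */

	/* {gfn(page) | page intersects with [hva_start, hva_end)} */
	gfn = hva_to_gfn_memslot(hva_start, &slot);
	gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, &slot);

	printf("gfn range [%#lx, %#lx)\n", gfn, gfn_end);
	return 0;
}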

* [PATCH 14/18] KVM: Kill off the old hva-based MMU notifier callbacks
  2021-03-26  2:19 [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (12 preceding siblings ...)
  2021-03-26  2:19 ` [PATCH 13/18] KVM: PPC: " Sean Christopherson
@ 2021-03-26  2:19 ` Sean Christopherson
  2021-03-26  2:19 ` [PATCH 15/18] KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot Sean Christopherson
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2021-03-26  2:19 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Yank out the hva-based MMU notifier APIs now that all architectures that
use the notifiers have moved to the gfn-based APIs.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/include/asm/kvm_host.h   |  1 -
 arch/mips/include/asm/kvm_host.h    |  1 -
 arch/powerpc/include/asm/kvm_host.h |  1 -
 arch/x86/include/asm/kvm_host.h     |  1 -
 include/linux/kvm_host.h            |  8 ---
 virt/kvm/kvm_main.c                 | 93 ++---------------------------
 6 files changed, 5 insertions(+), 100 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 1ad729cf7b0d..72e6b4600264 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -582,7 +582,6 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
 			      struct kvm_vcpu_events *events);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 374a3c8806e8..feaa77036b67 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -967,7 +967,6 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct kvm_vcpu *vcpu,
 						   bool write);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 /* Emulation */
 int kvm_get_inst(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 1335f0001bdd..1e83359f286b 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -55,7 +55,6 @@
 #include <linux/mmu_notifier.h>
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 #define HPTEG_CACHE_NUM			(1 << 15)
 #define HPTEG_HASH_BITS_PTE		13
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a21e3698f4dc..99778ac51243 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1718,7 +1718,6 @@ asmlinkage void kvm_spurious_fault(void);
 	_ASM_EXTABLE(666b, 667b)
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e6bb401dd856..40ac2d40bb5a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -219,7 +219,6 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
 	gfn_t start;
@@ -231,13 +230,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-#else
-int kvm_unmap_hva_range(struct kvm *kvm,
-			unsigned long start, unsigned long end, unsigned flags);
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
-#endif /* KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS */
 #endif
 
 enum {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index de74379a96ca..bfa43eea891a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -451,8 +451,6 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
-
 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 struct kvm_hva_range {
@@ -564,40 +562,20 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 
 	return ret;
 }
-#endif /* KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS */
-
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
 					pte_t pte)
 {
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
-	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
-#else
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	int idx;
-
 	trace_kvm_set_spte_hva(address);
 
-	idx = srcu_read_lock(&kvm->srcu);
-
-	KVM_MMU_LOCK(kvm);
-
-	kvm->mmu_notifier_seq++;
-
-	if (kvm_set_spte_hva(kvm, address, pte))
-		kvm_flush_remote_tlbs(kvm);
-
-	KVM_MMU_UNLOCK(kvm);
-	srcu_read_unlock(&kvm->srcu, idx);
-#endif
+	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }
 
 static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	const struct kvm_hva_range hva_range = {
 		.start		= range->start,
 		.end		= range->end,
@@ -606,14 +584,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
-#else
-	int need_tlb_flush = 0, idx;
 
 	trace_kvm_unmap_hva_range(range->start, range->end);
 
-	idx = srcu_read_lock(&kvm->srcu);
-#endif
-
 	KVM_MMU_LOCK(kvm);
 	/*
 	 * The count increase must become visible at unlock time as no
@@ -640,20 +613,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 			max(kvm->mmu_notifier_range_end, range->end);
 	}
 
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	__kvm_handle_hva_range(kvm, &hva_range);
-#else
-	need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end,
-					     range->flags);
-	/* we've to flush the tlb before the pages can be freed */
-	if (need_tlb_flush || kvm->tlbs_dirty)
-		kvm_flush_remote_tlbs(kvm);
-#endif
 
 	KVM_MMU_UNLOCK(kvm);
-#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
-	srcu_read_unlock(&kvm->srcu, idx);
-#endif
 
 	return 0;
 }
@@ -687,26 +649,9 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 					      unsigned long start,
 					      unsigned long end)
 {
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	trace_kvm_age_hva(start, end);
+
 	return kvm_handle_hva_range(mn, start, end, __pte(0), kvm_age_gfn);
-#else
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	int young, idx;
-
-	trace_kvm_age_hva(start, end);
-
-	idx = srcu_read_lock(&kvm->srcu);
-	KVM_MMU_LOCK(kvm);
-
-	young = kvm_age_hva(kvm, start, end);
-	if (young)
-		kvm_flush_remote_tlbs(kvm);
-
-	KVM_MMU_UNLOCK(kvm);
-	srcu_read_unlock(&kvm->srcu, idx);
-
-	return young;
-#endif
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
@@ -714,15 +659,8 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 					unsigned long start,
 					unsigned long end)
 {
-#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	int young, idx;
-
 	trace_kvm_age_hva(start, end);
 
-	idx = srcu_read_lock(&kvm->srcu);
-	KVM_MMU_LOCK(kvm);
-#endif
 	/*
 	 * Even though we do not flush TLB, this will still adversely
 	 * affect performance on pre-Haswell Intel EPT, where there is
@@ -736,38 +674,17 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	 * cadence. If we find this inaccurate, we might come up with a
 	 * more sophisticated heuristic later.
 	 */
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 	return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
-#else
-	young = kvm_age_hva(kvm, start, end);
-	KVM_MMU_UNLOCK(kvm);
-	srcu_read_unlock(&kvm->srcu, idx);
-
-	return young;
-#endif
 }
 
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long address)
 {
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	trace_kvm_test_age_hva(address);
+
 	return kvm_handle_hva_range_no_flush(mn, address, address + 1,
 					     kvm_test_age_gfn);
-#else
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	int young, idx;
-
-	trace_kvm_test_age_hva(address);
-
-	idx = srcu_read_lock(&kvm->srcu);
-	KVM_MMU_LOCK(kvm);
-	young = kvm_test_age_hva(kvm, address);
-	KVM_MMU_UNLOCK(kvm);
-	srcu_read_unlock(&kvm->srcu, idx);
-
-	return young;
-#endif
 }
 
 static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread
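
With the old hva-based callbacks gone, every notification funnels through
one common walker that hands each overlapping slot to a gfn-based handler
and batches the TLB flush.  A toy user-space model of that shape, with
made-up names and a plain array standing in for the memslot lookup:

#include <stdbool.h>
#include <stdio.h>

struct gfn_range { unsigned long start, end; int slot_id; };

typedef bool (*gfn_handler_t)(struct gfn_range *range);

static bool unmap_gfn_range(struct gfn_range *range)
{
	printf("zap slot %d gfns [%#lx, %#lx)\n",
	       range->slot_id, range->start, range->end);
	return true;	/* "TLB flush needed" */
}

static void handle_gfn_ranges(struct gfn_range *hits, int nr_hits,
			      gfn_handler_t handler, bool flush_on_ret)
{
	bool flush = false;
	int i;

	/* One handler call per overlapping slot; flushes are batched. */
	for (i = 0; i < nr_hits; i++)
		flush |= handler(&hits[i]);

	if (flush_on_ret && flush)
		printf("flush remote TLBs (once, not per slot)\n");
}

int main(void)
{
	struct gfn_range hits[] = { { 0x100, 0x105, 0 }, { 0x2000, 0x2001, 1 } };

	handle_gfn_ranges(hits, 2, unmap_gfn_range, true);
	return 0;
}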

* [PATCH 15/18] KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot
  2021-03-26  2:19 [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (13 preceding siblings ...)
  2021-03-26  2:19 ` [PATCH 14/18] KVM: Kill off the old hva-based " Sean Christopherson
@ 2021-03-26  2:19 ` Sean Christopherson
  2021-03-26  2:19 ` [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary Sean Christopherson
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2021-03-26  2:19 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Defer acquiring mmu_lock in the MMU notifier paths until a "hit" has been
detected in the memslots, i.e. don't take the lock for notifications that
don't affect the guest.

For small VMs, spurious locking is a minor annoyance.  And for "volatile"
setups where the majority of notifications _are_ relevant, this barely
qualifies as an optimization.

But, for large VMs (hundreds of threads) with static setups, e.g. no
page migration, no swapping, etc..., the vast majority of MMU notifier
callbacks will be unrelated to the guest, e.g. will often be in response
to the userspace VMM adjusting its own virtual address space.  In such
large VMs, acquiring mmu_lock can be painful as it blocks vCPUs from
handling page faults.  In some scenarios it can even be "fatal" in the
sense that it causes unacceptable brownouts, e.g. when rebuilding huge
pages after live migration, a significant percentage of vCPUs will be
attempting to handle page faults.

x86's TDP MMU implementation is especially susceptible to spurious
locking due to it taking mmu_lock for read when handling page faults.
Because rwlock is fair, a single writer will stall future readers, while
the writer is itself stalled waiting for in-progress readers to complete.
This is exacerbated by the MMU notifiers often firing multiple times in
quick succession, e.g. moving a page will (always?) invoke three separate
notifiers: .invalidate_range_start(), .invalidate_range_end(), and
.change_pte().  Unnecessarily taking mmu_lock each time means even a
single spurious sequence can be problematic.

Note, this optimizes only the unpaired callbacks.  Optimizing the
.invalidate_range_{start,end}() pairs is more complex and will be done in
a future patch.

Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index bfa43eea891a..0c2aff8a4aa1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -458,6 +458,7 @@ struct kvm_hva_range {
 	unsigned long end;
 	pte_t pte;
 	hva_handler_t handler;
+	bool caller_locked;
 	bool flush_on_ret;
 	bool may_block;
 };
@@ -465,14 +466,12 @@ struct kvm_hva_range {
 static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 						  const struct kvm_hva_range *range)
 {
-	struct kvm_memory_slot *slot;
-	struct kvm_memslots *slots;
+	bool ret = false, locked = range->caller_locked;
 	struct kvm_gfn_range gfn_range;
-	bool ret = false;
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
 	int i, idx;
 
-	lockdep_assert_held_write(&kvm->mmu_lock);
-
 	idx = srcu_read_lock(&kvm->srcu);
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
@@ -503,6 +502,10 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 			gfn_range.slot = slot;
 
+			if (!locked) {
+				locked = true;
+				KVM_MMU_LOCK(kvm);
+			}
 			ret |= range->handler(kvm, &gfn_range);
 		}
 	}
@@ -510,6 +513,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 	if (range->flush_on_ret && (ret || kvm->tlbs_dirty))
 		kvm_flush_remote_tlbs(kvm);
 
+	if (locked && !range->caller_locked)
+		KVM_MMU_UNLOCK(kvm);
+
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	/* The notifiers are averse to booleans. :-( */
@@ -528,16 +534,11 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 		.end		= end,
 		.pte		= pte,
 		.handler	= handler,
+		.caller_locked	= false,
 		.flush_on_ret	= true,
 		.may_block	= false,
 	};
-	int ret;
-
-	KVM_MMU_LOCK(kvm);
-	ret = __kvm_handle_hva_range(kvm, &range);
-	KVM_MMU_UNLOCK(kvm);
-
-	return ret;
+	return __kvm_handle_hva_range(kvm, &range);
 }
 
 static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
@@ -551,16 +552,12 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 		.end		= end,
 		.pte		= __pte(0),
 		.handler	= handler,
+		.caller_locked	= false,
 		.flush_on_ret	= false,
 		.may_block	= false,
 	};
-	int ret;
 
-	KVM_MMU_LOCK(kvm);
-	ret = __kvm_handle_hva_range(kvm, &range);
-	KVM_MMU_UNLOCK(kvm);
-
-	return ret;
+	return __kvm_handle_hva_range(kvm, &range);
 }
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
@@ -581,6 +578,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.end		= range->end,
 		.pte		= __pte(0),
 		.handler	= kvm_unmap_gfn_range,
+		.caller_locked	= true,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread
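
The core trick in patch 15 is that the lock is taken lazily, on the first
memslot hit, so a notification that misses every slot never touches
mmu_lock at all.  A self-contained sketch of the pattern, with a pthread
mutex standing in for mmu_lock and hard-coded slots for illustration:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER;

struct slot { unsigned long hva_start, hva_end; };

int main(void)
{
	struct slot slots[] = { { 0x1000, 0x5000 }, { 0x9000, 0xa000 } };
	unsigned long start = 0x6000, end = 0x8000;	/* misses both slots */
	bool locked = false;
	unsigned i;

	for (i = 0; i < sizeof(slots) / sizeof(slots[0]); i++) {
		unsigned long lo = start > slots[i].hva_start ? start : slots[i].hva_start;
		unsigned long hi = end < slots[i].hva_end ? end : slots[i].hva_end;

		if (lo >= hi)
			continue;	/* no overlap, no lock needed */

		if (!locked) {		/* first hit: take the lock */
			locked = true;
			pthread_mutex_lock(&mmu_lock);
		}
		printf("handle slot %u under mmu_lock\n", i);
	}

	if (locked)
		pthread_mutex_unlock(&mmu_lock);
	else
		printf("spurious notification, mmu_lock never taken\n");

	return 0;
}

Compile with -lpthread; the interesting property is the "spurious" path,
which is the common case for the large static VMs the cover letter
describes.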

* [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-26  2:19 [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (14 preceding siblings ...)
  2021-03-26  2:19 ` [PATCH 15/18] KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot Sean Christopherson
@ 2021-03-26  2:19 ` Sean Christopherson
  2021-03-31  7:52   ` Paolo Bonzini
  2021-03-31  8:35   ` Paolo Bonzini
  2021-03-26  2:19 ` [PATCH 17/18] KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible Sean Christopherson
                   ` (3 subsequent siblings)
  19 siblings, 2 replies; 44+ messages in thread
From: Sean Christopherson @ 2021-03-26  2:19 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Avoid taking mmu_lock for unrelated .invalidate_range_{start,end}()
notifications.  Because mmu_notifier_count must be modified while holding
mmu_lock for write, and must always be paired across start->end to stay
balanced, lock elision must happen in both or none.  To meet that
requirement, add a rwsem to prevent memslot updates across range_start()
and range_end().

For notifiers that disallow blocking, e.g. OOM reaping, simply go down
the slow path of unconditionally acquiring mmu_lock.  The sane
alternative would be to try to acquire the lock and force the notifier
to retry on failure.  But since OOM is currently the _only_ scenario
where blocking is disallowed, attempting to optimize a guest that has been
marked for death is pointless.

Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.

Based heavily on code from Ben Gardon.

Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_host.h |   8 +-
 virt/kvm/kvm_main.c      | 174 ++++++++++++++++++++++++++++++---------
 2 files changed, 142 insertions(+), 40 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 40ac2d40bb5a..2cc0f87d936e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -523,6 +523,7 @@ struct kvm {
 	long mmu_notifier_count;
 	unsigned long mmu_notifier_range_start;
 	unsigned long mmu_notifier_range_end;
+	struct rw_semaphore mmu_notifier_slots_lock;
 #endif
 	long tlbs_dirty;
 	struct list_head devices;
@@ -660,8 +661,11 @@ static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
 {
 	as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
 	return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
-			lockdep_is_held(&kvm->slots_lock) ||
-			!refcount_read(&kvm->users_count));
+				      lockdep_is_held(&kvm->slots_lock) ||
+#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+				      lockdep_is_held(&kvm->mmu_notifier_slots_lock) ||
+#endif
+				      !refcount_read(&kvm->users_count));
 }
 
 static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0c2aff8a4aa1..9ebc6d3e4a21 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -453,20 +453,56 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 
 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
+typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
+			     unsigned long end);
+
 struct kvm_hva_range {
 	unsigned long start;
 	unsigned long end;
 	pte_t pte;
 	hva_handler_t handler;
-	bool caller_locked;
+	on_lock_fn_t on_lock;
+	bool must_lock;
 	bool flush_on_ret;
 	bool may_block;
 };
 
+/*
+ * Use a dedicated stub instead of NULL to indicate that there is no callback
+ * function/handler.  The compiler technically can't guarantee that a real
+ * function will have a non-zero address, and so it will generate code to
+ * check for !NULL, whereas comparing against a stub will be elided at compile
+ * time (unless the compiler is getting long in the tooth, e.g. gcc 4.9).
+ */
+static void kvm_null_fn(void)
+{
+
+}
+#define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
+
+
+/* Acquire mmu_lock if necessary.  Returns %true if @handler is "null" */
+static __always_inline bool kvm_mmu_lock_and_check_handler(struct kvm *kvm,
+							   const struct kvm_hva_range *range,
+							   bool *locked)
+{
+	if (*locked)
+		return false;
+
+	*locked = true;
+
+	KVM_MMU_LOCK(kvm);
+
+	if (!IS_KVM_NULL_FN(range->on_lock))
+		range->on_lock(kvm, range->start, range->end);
+
+	return IS_KVM_NULL_FN(range->handler);
+}
+
 static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 						  const struct kvm_hva_range *range)
 {
-	bool ret = false, locked = range->caller_locked;
+	bool ret = false, locked = false;
 	struct kvm_gfn_range gfn_range;
 	struct kvm_memory_slot *slot;
 	struct kvm_memslots *slots;
@@ -474,6 +510,10 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 
 	idx = srcu_read_lock(&kvm->srcu);
 
+	if (range->must_lock &&
+	    kvm_mmu_lock_and_check_handler(kvm, range, &locked))
+		goto out_unlock;
+
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 		slots = __kvm_memslots(kvm, i);
 		kvm_for_each_memslot(slot, slots) {
@@ -502,10 +542,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 			gfn_range.slot = slot;
 
-			if (!locked) {
-				locked = true;
-				KVM_MMU_LOCK(kvm);
-			}
+			if (kvm_mmu_lock_and_check_handler(kvm, range, &locked))
+				goto out_unlock;
+
 			ret |= range->handler(kvm, &gfn_range);
 		}
 	}
@@ -513,7 +552,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 	if (range->flush_on_ret && (ret || kvm->tlbs_dirty))
 		kvm_flush_remote_tlbs(kvm);
 
-	if (locked && !range->caller_locked)
+out_unlock:
+	if (locked)
 		KVM_MMU_UNLOCK(kvm);
 
 	srcu_read_unlock(&kvm->srcu, idx);
@@ -534,10 +574,12 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 		.end		= end,
 		.pte		= pte,
 		.handler	= handler,
-		.caller_locked	= false,
+		.on_lock	= (void *)kvm_null_fn,
+		.must_lock	= false,
 		.flush_on_ret	= true,
 		.may_block	= false,
 	};
+
 	return __kvm_handle_hva_range(kvm, &range);
 }
 
@@ -552,7 +594,8 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 		.end		= end,
 		.pte		= __pte(0),
 		.handler	= handler,
-		.caller_locked	= false,
+		.on_lock	= (void *)kvm_null_fn,
+		.must_lock	= false,
 		.flush_on_ret	= false,
 		.may_block	= false,
 	};
@@ -569,23 +612,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }
 
-static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
-					const struct mmu_notifier_range *range)
+static void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
+				   unsigned long end)
 {
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range hva_range = {
-		.start		= range->start,
-		.end		= range->end,
-		.pte		= __pte(0),
-		.handler	= kvm_unmap_gfn_range,
-		.caller_locked	= true,
-		.flush_on_ret	= true,
-		.may_block	= mmu_notifier_range_blockable(range),
-	};
-
-	trace_kvm_unmap_hva_range(range->start, range->end);
-
-	KVM_MMU_LOCK(kvm);
 	/*
 	 * The count increase must become visible at unlock time as no
 	 * spte can be established without taking the mmu_lock and
@@ -593,8 +622,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 */
 	kvm->mmu_notifier_count++;
 	if (likely(kvm->mmu_notifier_count == 1)) {
-		kvm->mmu_notifier_range_start = range->start;
-		kvm->mmu_notifier_range_end = range->end;
+		kvm->mmu_notifier_range_start = start;
+		kvm->mmu_notifier_range_end = end;
 	} else {
 		/*
 		 * Fully tracking multiple concurrent ranges has dimishing
@@ -606,24 +635,54 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		 * complete.
 		 */
 		kvm->mmu_notifier_range_start =
-			min(kvm->mmu_notifier_range_start, range->start);
+			min(kvm->mmu_notifier_range_start, start);
 		kvm->mmu_notifier_range_end =
-			max(kvm->mmu_notifier_range_end, range->end);
+			max(kvm->mmu_notifier_range_end, end);
 	}
-
-	__kvm_handle_hva_range(kvm, &hva_range);
-
-	KVM_MMU_UNLOCK(kvm);
-
-	return 0;
 }
 
-static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
+static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
+	bool blockable = mmu_notifier_range_blockable(range);
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_hva_range hva_range = {
+		.start		= range->start,
+		.end		= range->end,
+		.pte		= __pte(0),
+		.handler	= kvm_unmap_gfn_range,
+		.on_lock	= kvm_inc_notifier_count,
+		.must_lock	= !blockable,
+		.flush_on_ret	= true,
+		.may_block	= blockable,
+	};
 
-	KVM_MMU_LOCK(kvm);
+	trace_kvm_unmap_hva_range(range->start, range->end);
+
+	/*
+	 * Prevent memslot modification between range_start() and range_end()
+	 * so that conditional locking provides the same result in both
+	 * functions.  Without that guarantee, the mmu_notifier_count
+	 * adjustments will be imbalanced.
+	 *
+	 * Skip the memslot-lookup lock elision (set @must_lock above) to avoid
+	 * having to take the semaphore on non-blockable calls, e.g. OOM kill.
+	 * The complexity required to handle conditional locking for this case
+	 * is not worth the marginal benefits; the VM is likely doomed anyway.
+	 *
+	 * Pairs with the unlock in range_end().
+	 */
+	if (blockable)
+		down_read(&kvm->mmu_notifier_slots_lock);
+
+	__kvm_handle_hva_range(kvm, &hva_range);
+
+	return 0;
+}
+
+static void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
+				   unsigned long end)
+{
 	/*
 	 * This sequence increase will notify the kvm page fault that
 	 * the page that is going to be mapped in the spte could have
@@ -637,7 +696,29 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	 * in conjunction with the smp_rmb in mmu_notifier_retry().
 	 */
 	kvm->mmu_notifier_count--;
-	KVM_MMU_UNLOCK(kvm);
+}
+
+static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
+					const struct mmu_notifier_range *range)
+{
+	bool blockable = mmu_notifier_range_blockable(range);
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_hva_range hva_range = {
+		.start		= range->start,
+		.end		= range->end,
+		.pte		= __pte(0),
+		.handler	= (void *)kvm_null_fn,
+		.on_lock	= kvm_dec_notifier_count,
+		.must_lock	= !blockable,
+		.flush_on_ret	= true,
+		.may_block	= blockable,
+	};
+
+	__kvm_handle_hva_range(kvm, &hva_range);
+
+	/* Pairs with the lock in range_start(). */
+	if (blockable)
+		up_read(&kvm->mmu_notifier_slots_lock);
 
 	BUG_ON(kvm->mmu_notifier_count < 0);
 }
@@ -709,6 +790,8 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 
 static int kvm_init_mmu_notifier(struct kvm *kvm)
 {
+	init_rwsem(&kvm->mmu_notifier_slots_lock);
+
 	kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;
 	return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
 }
@@ -971,6 +1054,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	kvm_coalesced_mmio_free(kvm);
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+	/*
+	 * Reset the lock used to prevent memslot updates between MMU notifier
+	 * range_start and range_end.  At this point no more MMU notifiers will
+	 * run, but the lock could still be held if KVM's notifier was removed
+	 * between range_start and range_end.  No threads can be waiting on the
+	 * lock as the last reference on KVM has been dropped.  If the lock is
+	 * still held, freeing memslots will deadlock.
+	 */
+	init_rwsem(&kvm->mmu_notifier_slots_lock);
 #else
 	kvm_arch_flush_shadow_all(kvm);
 #endif
@@ -1222,7 +1314,13 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
 	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
 	slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
 
+#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+	down_write(&kvm->mmu_notifier_slots_lock);
+#endif
 	rcu_assign_pointer(kvm->memslots[as_id], slots);
+#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+	up_write(&kvm->mmu_notifier_slots_lock);
+#endif
 	synchronize_srcu_expedited(&kvm->srcu);
 
 	/*
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread
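
The subtle part of patch 16 is why the rwsem exists at all: the elision
decision must come out identically in range_start() and range_end(), which
is only guaranteed if the memslots cannot change in between.  A
stripped-down model of the locking relationship; the conditional mmu_lock
elision and the non-blockable slow path are omitted, and pthread
primitives stand in for the kernel's:

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t notifier_slots_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER;
static long mmu_notifier_count;

static void range_start(void)
{
	/* Held across start..end; blocks memslot updates meanwhile. */
	pthread_rwlock_rdlock(&notifier_slots_lock);

	pthread_mutex_lock(&mmu_lock);
	mmu_notifier_count++;		/* the on_lock callback in the patch */
	pthread_mutex_unlock(&mmu_lock);
}

static void range_end(void)
{
	pthread_mutex_lock(&mmu_lock);
	mmu_notifier_count--;
	pthread_mutex_unlock(&mmu_lock);

	pthread_rwlock_unlock(&notifier_slots_lock);
}

static void install_new_memslots(void)
{
	/* Writer side: waits out all in-flight start/end pairs. */
	pthread_rwlock_wrlock(&notifier_slots_lock);
	printf("memslots swapped with no notifier mid-flight\n");
	pthread_rwlock_unlock(&notifier_slots_lock);
}

int main(void)
{
	range_start();
	range_end();
	install_new_memslots();
	printf("count balanced: %ld\n", mmu_notifier_count);
	return 0;
}

The write side is what install_new_memslots() takes in the real patch,
which is why memslot updates stall while any start/end pair is in flight.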

* [PATCH 17/18] KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible
  2021-03-26  2:19 [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (15 preceding siblings ...)
  2021-03-26  2:19 ` [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary Sean Christopherson
@ 2021-03-26  2:19 ` Sean Christopherson
  2021-03-26  2:19 ` [PATCH 18/18] KVM: x86/mmu: Drop trace_kvm_age_page() tracepoint Sean Christopherson
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2021-03-26  2:19 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Let the TDP MMU yield when unmapping a range in response to an MMU
notification, if yielding is allowed by said notification.  There is no
reason to disallow yielding in this case, and in theory the range being
invalidated could be quite large.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f6a82230baa5..21bb8b68a553 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -885,7 +885,7 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 
 	for_each_tdp_mmu_root(kvm, root, range->slot->as_id)
 		flush |= zap_gfn_range(kvm, root, range->start, range->end,
-				       false, flush);
+				       range->may_block, flush);
 
 	return flush;
 }
@@ -903,6 +903,10 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 
 	rcu_read_lock();
 
+	/*
+	 * Don't support rescheduling, none of the MMU notifiers that funnel
+	 * into this helper allow blocking; it'd be dead, wasteful code.
+	 */
 	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
 		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
 			ret |= handler(kvm, &iter, range);
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread
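
A rough user-space analogue of what passing range->may_block through to
the zap loop buys: long-running invalidations get periodic yield points
instead of monopolizing the CPU.  sched_yield() stands in for the kernel's
cond_resched()-based yielding, and the batch size is an arbitrary choice:

#include <sched.h>
#include <stdbool.h>
#include <stdio.h>

static void zap_range(unsigned long start, unsigned long end, bool may_block)
{
	unsigned long gfn, done = 0;

	for (gfn = start; gfn < end; gfn++) {
		/* ... zap one mapping ... */
		if (may_block && ++done % 512 == 0)
			sched_yield();	/* don't hog the CPU for huge ranges */
	}
	printf("zapped %lu gfns (%s)\n", end - start,
	       may_block ? "with yield points" : "atomically");
}

int main(void)
{
	zap_range(0, 1 << 20, true);	/* blockable: may yield */
	zap_range(0, 256, false);	/* e.g. OOM reaper: must not */
	return 0;
}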

* [PATCH 18/18] KVM: x86/mmu: Drop trace_kvm_age_page() tracepoint
  2021-03-26  2:19 [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (16 preceding siblings ...)
  2021-03-26  2:19 ` [PATCH 17/18] KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible Sean Christopherson
@ 2021-03-26  2:19 ` Sean Christopherson
  2021-03-30 18:32 ` [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Ben Gardon
  2021-03-31  7:57 ` Paolo Bonzini
  19 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2021-03-26  2:19 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini
  Cc: James Morse, Julien Thierry, Suzuki K Poulose,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc,
	linux-kernel, Ben Gardon

Remove x86's trace_kvm_age_page() tracepoint.  It's mostly redundant with
the common trace_kvm_age_hva() tracepoint, and if there is a need for the
extra details, e.g. gfn, referenced, etc... those details should be added
to the common tracepoint so that all architectures and MMUs benefit from
the info.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c     |  1 -
 arch/x86/kvm/mmu/tdp_mmu.c |  2 --
 include/trace/events/kvm.h | 24 ------------------------
 3 files changed, 27 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2070c7a91fdd..9eded46785b5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1467,7 +1467,6 @@ static bool kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	for_each_rmap_spte(rmap_head, &iter, sptep)
 		young |= mmu_spte_age(sptep);
 
-	trace_kvm_age_page(gfn, level, slot, young);
 	return young;
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 21bb8b68a553..2941beb6559a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -949,8 +949,6 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 
 	tdp_mmu_set_spte_no_acc_track(kvm, iter, new_spte);
 
-	trace_kvm_age_page(iter->gfn, iter->level, range->slot, true);
-
 	return true;
 }
 
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 200eb7465530..37e1e1a2d67d 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -255,30 +255,6 @@ TRACE_EVENT(kvm_fpu,
 	TP_printk("%s", __print_symbolic(__entry->load, kvm_fpu_load_symbol))
 );
 
-TRACE_EVENT(kvm_age_page,
-	TP_PROTO(ulong gfn, int level, struct kvm_memory_slot *slot, int ref),
-	TP_ARGS(gfn, level, slot, ref),
-
-	TP_STRUCT__entry(
-		__field(	u64,	hva		)
-		__field(	u64,	gfn		)
-		__field(	u8,	level		)
-		__field(	u8,	referenced	)
-	),
-
-	TP_fast_assign(
-		__entry->gfn		= gfn;
-		__entry->level		= level;
-		__entry->hva		= ((gfn - slot->base_gfn) <<
-					    PAGE_SHIFT) + slot->userspace_addr;
-		__entry->referenced	= ref;
-	),
-
-	TP_printk("hva %llx gfn %llx level %u %s",
-		  __entry->hva, __entry->gfn, __entry->level,
-		  __entry->referenced ? "YOUNG" : "OLD")
-);
-
 #ifdef CONFIG_KVM_ASYNC_PF
 DECLARE_EVENT_CLASS(kvm_async_get_page_class,
 
-- 
2.31.0.291.g576ba9dcdaf-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers
  2021-03-26  2:19 [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (17 preceding siblings ...)
  2021-03-26  2:19 ` [PATCH 18/18] KVM: x86/mmu: Drop trace_kvm_age_page() tracepoint Sean Christopherson
@ 2021-03-30 18:32 ` Ben Gardon
  2021-03-30 19:48   ` Paolo Bonzini
  2021-03-30 19:58   ` Sean Christopherson
  2021-03-31  7:57 ` Paolo Bonzini
  19 siblings, 2 replies; 44+ messages in thread
From: Ben Gardon @ 2021-03-30 18:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini, James Morse, Julien Thierry, Suzuki K Poulose,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc, LKML

On Thu, Mar 25, 2021 at 7:20 PM Sean Christopherson <seanjc@google.com> wrote:
>
> The end goal of this series is to optimize the MMU notifiers to take
> mmu_lock if and only if the notification is relevant to KVM, i.e. the hva
> range overlaps a memslot.   Large VMs (hundreds of vCPUs) are very
> sensitive to mmu_lock being taken for write at inopportune times, and
> such VMs also tend to be "static", e.g. backed by HugeTLB with minimal
> page shenanigans.  The vast majority of notifications for these VMs will
> be spurious (for KVM), and eliding mmu_lock for spurious notifications
> avoids an otherwise unacceptable disruption to the guest.
>
> To get there without potentially degrading performance, e.g. due to
> multiple memslot lookups, especially on non-x86 where the use cases are
> largely unknown (from my perspective), first consolidate the MMU notifier
> logic by moving the hva->gfn lookups into common KVM.
>
> Applies on my TDP MMU TLB flushing bug fixes[*], which conflict horribly
> with the TDP MMU changes in this series.  That code applies on kvm/queue
> (commit 4a98623d5d90, "KVM: x86/mmu: Mark the PAE roots as decrypted for
> shadow paging").
>
> Speaking of conflicts, Ben will soon be posting a series to convert a
> bunch of TDP MMU flows to take mmu_lock only for read.  Presumably there
> will be an absurd number of conflicts; Ben and I will sort out the
> conflicts in whichever series loses the race.
>
> Well tested on Intel and AMD.  Compile tested for arm64, MIPS, PPC,
> PPC e500, and s390.  Absolutely needs to be tested for real on non-x86,
> I give it even odds that I introduced an off-by-one bug somewhere.
>
> [*] https://lkml.kernel.org/r/20210325200119.1359384-1-seanjc@google.com
>
>
> Patches 1-7 are x86 specific prep patches to play nice with moving
> the hva->gfn memslot lookups into common code.  There ended up being waaay
> more of these than I expected/wanted, but I had a hell of a time getting
> the flushing logic right when shuffling the memslot and address space
> loops.  In the end, I was more confident I got things correct by batching
> the flushes.
>
> Patch 8 moves the existing API prototypes into common code.  It could
> technically be dropped since the old APIs are gone in the end, but I
> thought the switch to the new APIs would suck a bit less this way.

Patches 1-8 look good to me. Feel free to add my Reviewed-by tag to those.
I appreciate the care you took to make all those changes tiny and reviewable.

>
> Patch 9 moves arm64's MMU notifier tracepoints into common code so that
> they are not lost when arm64 is converted to the new APIs, and so that all
> architectures can benefit.
>
> Patch 10 moves x86's memslot walkers into common KVM.  I chose x86 purely
> because I could actually test it.  All architectures use nearly identical
> code, so I don't think it actually matters in the end.

I'm still reviewing 10 and 14-18. 10 is a huge change and the diff is
pretty hard to parse.

>
> Patches 11-13 move arm64, MIPS, and PPC to the new APIs.
>
> Patch 14 yanks out the old APIs.
>
> Patch 15 adds the mmu_lock elision, but only for unpaired notifications.

Reading through all this code and considering the changes I'm
preparing for the TDP MMU have me wondering if it might help to have a
more general purpose MMU lock context struct which could be embedded
in the structs added in this patch. I'm thinking something like:
enum kvm_mmu_lock_mode {
    KVM_MMU_LOCK_NONE,
    KVM_MMU_LOCK_READ,
    KVM_MMU_LOCK_WRITE,
};

struct kvm_mmu_lock_context {
    enum kvm_mmu_lock_mode lock_mode;
    bool can_block;
    bool can_yield;
    bool flush;
};

This could yield some grossly long lines, but it would also have
potential to unify a bunch of ad-hoc handling.
The above struct could also fit into a single byte, so it'd be pretty
easy to pass it around.

>
> Patch 16 adds mmu_lock elision for paired .invalidate_range_{start,end}().
> This is quite nasty and no small part of me thinks the patch should be
> burned with fire (I won't spoil it any further), but it's also the most
> problematic scenario for our particular use case.  :-/
>
> Patches 17-18 are additional x86 cleanups.
>
> Sean Christopherson (18):
>   KVM: x86/mmu: Coalesce TDP MMU TLB flushes when zapping collapsible
>     SPTEs
>   KVM: x86/mmu: Move flushing for "slot" handlers to caller for legacy
>     MMU
>   KVM: x86/mmu: Coalesce TLB flushes when zapping collapsible SPTEs
>   KVM: x86/mmu: Coalesce TLB flushes across address spaces for gfn range
>     zap
>   KVM: x86/mmu: Pass address space ID to __kvm_tdp_mmu_zap_gfn_range()
>   KVM: x86/mmu: Pass address space ID to TDP MMU root walkers
>   KVM: x86/mmu: Use leaf-only loop for walking TDP SPTEs when changing
>     SPTE
>   KVM: Move prototypes for MMU notifier callbacks to generic code
>   KVM: Move arm64's MMU notifier trace events to generic code
>   KVM: Move x86's MMU notifier memslot walkers to generic code
>   KVM: arm64: Convert to the gfn-based MMU notifier callbacks
>   KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks
>   KVM: PPC: Convert to the gfn-based MMU notifier callbacks
>   KVM: Kill off the old hva-based MMU notifier callbacks
>   KVM: Take mmu_lock when handling MMU notifier iff the hva hits a
>     memslot
>   KVM: Don't take mmu_lock for range invalidation unless necessary
>   KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if
>     possible
>   KVM: x86/mmu: Drop trace_kvm_age_page() tracepoint
>
>  arch/arm64/include/asm/kvm_host.h             |   5 -
>  arch/arm64/kvm/mmu.c                          | 118 ++----
>  arch/arm64/kvm/trace_arm.h                    |  66 ----
>  arch/mips/include/asm/kvm_host.h              |   5 -
>  arch/mips/kvm/mmu.c                           |  97 +----
>  arch/powerpc/include/asm/kvm_book3s.h         |  12 +-
>  arch/powerpc/include/asm/kvm_host.h           |   7 -
>  arch/powerpc/include/asm/kvm_ppc.h            |   9 +-
>  arch/powerpc/kvm/book3s.c                     |  18 +-
>  arch/powerpc/kvm/book3s.h                     |  10 +-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c           |  98 ++---
>  arch/powerpc/kvm/book3s_64_mmu_radix.c        |  25 +-
>  arch/powerpc/kvm/book3s_hv.c                  |  12 +-
>  arch/powerpc/kvm/book3s_pr.c                  |  56 +--
>  arch/powerpc/kvm/e500_mmu_host.c              |  29 +-
>  arch/powerpc/kvm/trace_booke.h                |  15 -
>  arch/x86/include/asm/kvm_host.h               |   6 +-
>  arch/x86/kvm/mmu/mmu.c                        | 180 ++++-----
>  arch/x86/kvm/mmu/mmu_internal.h               |  10 +
>  arch/x86/kvm/mmu/tdp_mmu.c                    | 344 +++++++-----------
>  arch/x86/kvm/mmu/tdp_mmu.h                    |  31 +-
>  include/linux/kvm_host.h                      |  22 +-
>  include/trace/events/kvm.h                    |  90 +++--
>  tools/testing/selftests/kvm/lib/kvm_util.c    |   4 -
>  .../selftests/kvm/lib/x86_64/processor.c      |   2 +
>  virt/kvm/kvm_main.c                           | 312 ++++++++++++----
>  26 files changed, 697 insertions(+), 886 deletions(-)
>
> --
> 2.31.0.291.g576ba9dcdaf-goog
>

^ permalink raw reply	[flat|nested] 44+ messages in thread
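
Ben's claim above that the proposed context struct "could also fit into a
single byte" is easy to check with bitfields.  Nothing like this exists in
the tree; the layout is purely illustrative:

#include <stdio.h>

enum kvm_mmu_lock_mode {
	KVM_MMU_LOCK_NONE,
	KVM_MMU_LOCK_READ,
	KVM_MMU_LOCK_WRITE,
};

struct kvm_mmu_lock_context {
	unsigned char lock_mode : 2;	/* three values fit in two bits */
	unsigned char can_block : 1;
	unsigned char can_yield : 1;
	unsigned char flush     : 1;
};

int main(void)
{
	struct kvm_mmu_lock_context ctx = {
		.lock_mode = KVM_MMU_LOCK_READ,
		.can_block = 1,
	};

	printf("sizeof(ctx) = %zu\n", sizeof(ctx));	/* 1 on typical ABIs */
	printf("mode=%u block=%u\n", (unsigned)ctx.lock_mode,
	       (unsigned)ctx.can_block);
	return 0;
}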

* Re: [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers
  2021-03-30 18:32 ` [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Ben Gardon
@ 2021-03-30 19:48   ` Paolo Bonzini
  2021-03-30 19:58   ` Sean Christopherson
  1 sibling, 0 replies; 44+ messages in thread
From: Paolo Bonzini @ 2021-03-30 19:48 UTC (permalink / raw)
  To: Ben Gardon, Sean Christopherson
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, LKML

On 30/03/21 20:32, Ben Gardon wrote:
>> Patches 1-7 are x86 specific prep patches to play nice with moving
>> the hva->gfn memslot lookups into common code.  There ended up being waaay
>> more of these than I expected/wanted, but I had a hell of a time getting
>> the flushing logic right when shuffling the memslot and address space
>> loops.  In the end, I was more confident I got things correct by batching
>> the flushes.
>>
>> Patch 8 moves the existing API prototypes into common code.  It could
>> technically be dropped since the old APIs are gone in the end, but I
>> thought the switch to the new APIs would suck a bit less this way.
> Patches 1-8 look good to me. Feel free to add my Reviewed-by tag to those.
> I appreciate the care you took to make all those changes tiny and reviewable.
> 

Just finished reviewing that part too, they were very nice and I've 
queued them.  I'll continue tomorrow with the rest.

Paolo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers
  2021-03-30 18:32 ` [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Ben Gardon
  2021-03-30 19:48   ` Paolo Bonzini
@ 2021-03-30 19:58   ` Sean Christopherson
  1 sibling, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2021-03-30 19:58 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini, James Morse, Julien Thierry, Suzuki K Poulose,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc, LKML

On Tue, Mar 30, 2021, Ben Gardon wrote:
> On Thu, Mar 25, 2021 at 7:20 PM Sean Christopherson <seanjc@google.com> wrote:
> > Patch 10 moves x86's memslot walkers into common KVM.  I chose x86 purely
> > because I could actually test it.  All architectures use nearly identical
> > code, so I don't think it actually matters in the end.
> 
> I'm still reviewing 10 and 14-18. 10 is a huge change and the diff is
> pretty hard to parse.

Ya :-/  I don't see an easy way to break it up without creating a massive diff,
e.g. it could be staged in x86 and moved to common, but I don't think that would
fundamentally change the diff.  Although I admittedly didn't spend _that_ much
time thinking about how to break it up.

> > Patches 11-13 move arm64, MIPS, and PPC to the new APIs.
> >
> > Patch 14 yanks out the old APIs.
> >
> > Patch 15 adds the mmu_lock elision, but only for unpaired notifications.
> 
> Reading through all this code and considering the changes I'm
> preparing for the TDP MMU have me wondering if it might help to have a
> more general purpose MMU lock context struct which could be embedded
> in the structs added in this patch. I'm thinking something like:
> enum kvm_mmu_lock_mode {
>     KVM_MMU_LOCK_NONE,
>     KVM_MMU_LOCK_READ,
>     KVM_MMU_LOCK_WRITE,
> };
> 
> struct kvm_mmu_lock_context {
>     enum kvm_mmu_lock_mode lock_mode;
>     bool can_block;
>     bool can_yield;

Not that it matters right now, but can_block and can_yield are the same thing.
I considered s/can_yield/can_block to make it all consistent, but that felt like
unnecessary thrash.

>     bool flush;

Drat.  This made me realize that the 'struct kvm_gfn_range' passed to arch code
isn't tagged 'const'.  I thought I had done that, but obviously not.

Anyways, what I was going to say before that realization is that the downside to
putting flush into kvm_gfn_range is that it would have to lose its 'const'
qualifier.  That's all a moot point if it's not easily constified though.

Const aside, my gut reaction is that it will probably be cleaner to keep the
flush stuff in arch code, separate from the kvm_gfn_range passed in by common
KVM.  Looping 'flush' back into the helpers is x86 specific at this point, and
AFAICT that's not likely to change any time soon.

For rwlock support, if we get to the point where kvm_age_gfn() and/or
kvm_test_age_gfn() can take mmu_lock for read, then it definitely makes sense to
track locking in kvm_gfn_range, assuming it isn't the sole entity that prevents
constifying kvm_gfn_range.

> };
> 
> This could yield some grossly long lines, but it would also have
> potential to unify a bunch of ad-hoc handling.
> The above struct could also fit into a single byte, so it'd be pretty
> easy to pass it around.

^ permalink raw reply	[flat|nested] 44+ messages in thread
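
The const-versus-flush trade-off Sean describes comes down to two possible
handler shapes.  A toy sketch of both, with made-up names (each path
prints flush = 1):

#include <stdbool.h>
#include <stdio.h>

/* (a) the range stays const and "flush" is the return value */
struct gfn_range { unsigned long start, end; };

static bool unmap_ret_flush(const struct gfn_range *range)
{
	return range->end > range->start;	/* pretend something was zapped */
}

/* (b) "flush" lives in the range, which then cannot be const */
struct gfn_range_rw { unsigned long start, end; bool flush; };

static void unmap_inout_flush(struct gfn_range_rw *range)
{
	range->flush = range->end > range->start;
}

int main(void)
{
	const struct gfn_range r1 = { 0, 4 };
	struct gfn_range_rw r2 = { 0, 4, false };

	printf("(a) flush = %d\n", unmap_ret_flush(&r1));
	unmap_inout_flush(&r2);
	printf("(b) flush = %d\n", r2.flush);
	return 0;
}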

* Re: [PATCH 12/18] KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks
  2021-03-26  2:19 ` [PATCH 12/18] KVM: MIPS/MMU: " Sean Christopherson
@ 2021-03-31  7:41   ` Paolo Bonzini
  0 siblings, 0 replies; 44+ messages in thread
From: Paolo Bonzini @ 2021-03-31  7:41 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Paul Mackerras
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 26/03/21 03:19, Sean Christopherson wrote:
> Move MIPS to the gfn-based MMU notifier APIs, which do the hva->gfn
> lookup in common code, and whose code is nearly identical to MIPS'
> lookup.
> 
> No meaningful functional change intended, though the exact order of
> operations is slightly different since the memslot lookups occur before
> calling into arch code.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

I'll post a couple patches to enable more coalescing of the flushes, but 
this particular patch is okay.

Paolo

> ---
>   arch/mips/include/asm/kvm_host.h |  1 +
>   arch/mips/kvm/mmu.c              | 97 ++++++--------------------------
>   2 files changed, 17 insertions(+), 81 deletions(-)
> 
> diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> index feaa77036b67..374a3c8806e8 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -967,6 +967,7 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct kvm_vcpu *vcpu,
>   						   bool write);
>   
>   #define KVM_ARCH_WANT_MMU_NOTIFIER
> +#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
>   
>   /* Emulation */
>   int kvm_get_inst(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
> diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
> index 3dabeda82458..3dc885df2e32 100644
> --- a/arch/mips/kvm/mmu.c
> +++ b/arch/mips/kvm/mmu.c
> @@ -439,85 +439,36 @@ static int kvm_mips_mkold_gpa_pt(struct kvm *kvm, gfn_t start_gfn,
>   				  end_gfn << PAGE_SHIFT);
>   }
>   
> -static int handle_hva_to_gpa(struct kvm *kvm,
> -			     unsigned long start,
> -			     unsigned long end,
> -			     int (*handler)(struct kvm *kvm, gfn_t gfn,
> -					    gpa_t gfn_end,
> -					    struct kvm_memory_slot *memslot,
> -					    void *data),
> -			     void *data)
> +bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>   {
> -	struct kvm_memslots *slots;
> -	struct kvm_memory_slot *memslot;
> -	int ret = 0;
> -
> -	slots = kvm_memslots(kvm);
> -
> -	/* we only care about the pages that the guest sees */
> -	kvm_for_each_memslot(memslot, slots) {
> -		unsigned long hva_start, hva_end;
> -		gfn_t gfn, gfn_end;
> -
> -		hva_start = max(start, memslot->userspace_addr);
> -		hva_end = min(end, memslot->userspace_addr +
> -					(memslot->npages << PAGE_SHIFT));
> -		if (hva_start >= hva_end)
> -			continue;
> -
> -		/*
> -		 * {gfn(page) | page intersects with [hva_start, hva_end)} =
> -		 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
> -		 */
> -		gfn = hva_to_gfn_memslot(hva_start, memslot);
> -		gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
> -
> -		ret |= handler(kvm, gfn, gfn_end, memslot, data);
> -	}
> -
> -	return ret;
> -}
> -
> -
> -static int kvm_unmap_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
> -				 struct kvm_memory_slot *memslot, void *data)
> -{
> -	kvm_mips_flush_gpa_pt(kvm, gfn, gfn_end);
> -	return 1;
> -}
> -
> -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
> -			unsigned flags)
> -{
> -	handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, NULL);
> +	kvm_mips_flush_gpa_pt(kvm, range->start, range->end);
>   
>   	kvm_mips_callbacks->flush_shadow_all(kvm);
>   	return 0;
>   }
>   
> -static int kvm_set_spte_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
> -				struct kvm_memory_slot *memslot, void *data)
> +static bool __kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   {
> -	gpa_t gpa = gfn << PAGE_SHIFT;
> -	pte_t hva_pte = *(pte_t *)data;
> +	gpa_t gpa = range->start << PAGE_SHIFT;
> +	pte_t hva_pte = range->pte;
>   	pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
>   	pte_t old_pte;
>   
>   	if (!gpa_pte)
> -		return 0;
> +		return false;
>   
>   	/* Mapping may need adjusting depending on memslot flags */
>   	old_pte = *gpa_pte;
> -	if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte))
> +	if (range->slot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte))
>   		hva_pte = pte_mkclean(hva_pte);
> -	else if (memslot->flags & KVM_MEM_READONLY)
> +	else if (range->slot->flags & KVM_MEM_READONLY)
>   		hva_pte = pte_wrprotect(hva_pte);
>   
>   	set_pte(gpa_pte, hva_pte);
>   
>   	/* Replacing an absent or old page doesn't need flushes */
>   	if (!pte_present(old_pte) || !pte_young(old_pte))
> -		return 0;
> +		return false;
>   
>   	/* Pages swapped, aged, moved, or cleaned require flushes */
>   	return !pte_present(hva_pte) ||
> @@ -526,27 +477,21 @@ static int kvm_set_spte_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
>   	       (pte_dirty(old_pte) && !pte_dirty(hva_pte));
>   }
>   
> -int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
> +bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   {
> -	unsigned long end = hva + PAGE_SIZE;
> -	int ret;
> -
> -	ret = handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pte);
> -	if (ret)
> +	if (__kvm_set_spte_gfn(kvm, range))
>   		kvm_mips_callbacks->flush_shadow_all(kvm);
> -	return 0;
> +	return false;
>   }
>   
> -static int kvm_age_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
> -			       struct kvm_memory_slot *memslot, void *data)
> +bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   {
> -	return kvm_mips_mkold_gpa_pt(kvm, gfn, gfn_end);
> +	return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end);
>   }
>   
> -static int kvm_test_age_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
> -				    struct kvm_memory_slot *memslot, void *data)
> +bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   {
> -	gpa_t gpa = gfn << PAGE_SHIFT;
> +	gpa_t gpa = range->start << PAGE_SHIFT;
>   	pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
>   
>   	if (!gpa_pte)
> @@ -554,16 +499,6 @@ static int kvm_test_age_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
>   	return pte_young(*gpa_pte);
>   }
>   
> -int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
> -{
> -	return handle_hva_to_gpa(kvm, start, end, kvm_age_hva_handler, NULL);
> -}
> -
> -int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
> -{
> -	return handle_hva_to_gpa(kvm, hva, hva, kvm_test_age_hva_handler, NULL);
> -}
> -
>   /**
>    * _kvm_mips_map_page_fast() - Fast path GPA fault handler.
>    * @vcpu:		VCPU pointer.
> 



* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-26  2:19 ` [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary Sean Christopherson
@ 2021-03-31  7:52   ` Paolo Bonzini
  2021-03-31  8:35   ` Paolo Bonzini
  1 sibling, 0 replies; 44+ messages in thread
From: Paolo Bonzini @ 2021-03-31  7:52 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Paul Mackerras
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 26/03/21 03:19, Sean Christopherson wrote:
> Avoid taking mmu_lock for unrelated .invalidate_range_{start,end}()
> notifications.  Because mmu_notifier_count must be modified while holding
> mmu_lock for write, and must always be paired across start->end to stay
> balanced, lock elision must happen in both or none.  To meet that
> requirement, add a rwsem to prevent memslot updates across range_start()
> and range_end().
> 
> For notifiers that disallow blocking, e.g. OOM reaping, simply go down
> the slow path of unconditionally acquiring mmu_lock.  The sane
> alternative would be to try to acquire the lock and force the notifier
> to retry on failure.  But since OOM is currently the _only_ scenario
> where blocking is disallowed attempting to optimize a guest that has been
> marked for death is pointless.
> 
> Note, technically flag-only memslot updates could be allowed in parallel,
> but stalling a memslot update for a relatively short amount of time is
> not a scalability issue, and this is all more than complex enough.
> 
> Based heavily on code from Ben Gardon.
> 
> Suggested-by: Ben Gardon <bgardon@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Please submit this as a separate patch.

Paolo

> ---
>   include/linux/kvm_host.h |   8 +-
>   virt/kvm/kvm_main.c      | 174 ++++++++++++++++++++++++++++++---------
>   2 files changed, 142 insertions(+), 40 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 40ac2d40bb5a..2cc0f87d936e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -523,6 +523,7 @@ struct kvm {
>   	long mmu_notifier_count;
>   	unsigned long mmu_notifier_range_start;
>   	unsigned long mmu_notifier_range_end;
> +	struct rw_semaphore mmu_notifier_slots_lock;
>   #endif
>   	long tlbs_dirty;
>   	struct list_head devices;
> @@ -660,8 +661,11 @@ static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
>   {
>   	as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
>   	return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
> -			lockdep_is_held(&kvm->slots_lock) ||
> -			!refcount_read(&kvm->users_count));
> +				      lockdep_is_held(&kvm->slots_lock) ||
> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +				      lockdep_is_held(&kvm->mmu_notifier_slots_lock) ||
> +#endif
> +				      !refcount_read(&kvm->users_count));
>   }
>   
>   static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0c2aff8a4aa1..9ebc6d3e4a21 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -453,20 +453,56 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
>   
>   typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
>   
> +typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> +			     unsigned long end);
> +
>   struct kvm_hva_range {
>   	unsigned long start;
>   	unsigned long end;
>   	pte_t pte;
>   	hva_handler_t handler;
> -	bool caller_locked;
> +	on_lock_fn_t on_lock;
> +	bool must_lock;
>   	bool flush_on_ret;
>   	bool may_block;
>   };
>   
> +/*
> + * Use a dedicated stub instead of NULL to indicate that there is no callback
> + * function/handler.  The compiler technically can't guarantee that a real
> + * function will have a non-zero address, and so it will generate code to
> + * check for !NULL, whereas comparing against a stub will be elided at compile
> + * time (unless the compiler is getting long in the tooth, e.g. gcc 4.9).
> + */
> +static void kvm_null_fn(void)
> +{
> +
> +}
> +#define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
> +
> +
> +/* Acquire mmu_lock if necessary.  Returns %true if @handler is "null" */
> +static __always_inline bool kvm_mmu_lock_and_check_handler(struct kvm *kvm,
> +							   const struct kvm_hva_range *range,
> +							   bool *locked)
> +{
> +	if (*locked)
> +		return false;
> +
> +	*locked = true;
> +
> +	KVM_MMU_LOCK(kvm);
> +
> +	if (!IS_KVM_NULL_FN(range->on_lock))
> +		range->on_lock(kvm, range->start, range->end);
> +
> +	return IS_KVM_NULL_FN(range->handler);
> +}
> +
>   static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>   						  const struct kvm_hva_range *range)
>   {
> -	bool ret = false, locked = range->caller_locked;
> +	bool ret = false, locked = false;
>   	struct kvm_gfn_range gfn_range;
>   	struct kvm_memory_slot *slot;
>   	struct kvm_memslots *slots;
> @@ -474,6 +510,10 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>   
>   	idx = srcu_read_lock(&kvm->srcu);
>   
> +	if (range->must_lock &&
> +	    kvm_mmu_lock_and_check_handler(kvm, range, &locked))
> +		goto out_unlock;
> +
>   	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
>   		slots = __kvm_memslots(kvm, i);
>   		kvm_for_each_memslot(slot, slots) {
> @@ -502,10 +542,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>   			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
>   			gfn_range.slot = slot;
>   
> -			if (!locked) {
> -				locked = true;
> -				KVM_MMU_LOCK(kvm);
> -			}
> +			if (kvm_mmu_lock_and_check_handler(kvm, range, &locked))
> +				goto out_unlock;
> +
>   			ret |= range->handler(kvm, &gfn_range);
>   		}
>   	}
> @@ -513,7 +552,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>   	if (range->flush_on_ret && (ret || kvm->tlbs_dirty))
>   		kvm_flush_remote_tlbs(kvm);
>   
> -	if (locked && !range->caller_locked)
> +out_unlock:
> +	if (locked)
>   		KVM_MMU_UNLOCK(kvm);
>   
>   	srcu_read_unlock(&kvm->srcu, idx);
> @@ -534,10 +574,12 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
>   		.end		= end,
>   		.pte		= pte,
>   		.handler	= handler,
> -		.caller_locked	= false,
> +		.on_lock	= (void *)kvm_null_fn,
> +		.must_lock	= false,
>   		.flush_on_ret	= true,
>   		.may_block	= false,
>   	};
> +
>   	return __kvm_handle_hva_range(kvm, &range);
>   }
>   
> @@ -552,7 +594,8 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
>   		.end		= end,
>   		.pte		= __pte(0),
>   		.handler	= handler,
> -		.caller_locked	= false,
> +		.on_lock	= (void *)kvm_null_fn,
> +		.must_lock	= false,
>   		.flush_on_ret	= false,
>   		.may_block	= false,
>   	};
> @@ -569,23 +612,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>   	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>   }
>   
> -static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> -					const struct mmu_notifier_range *range)
> +static void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
> +				   unsigned long end)
>   {
> -	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> -	const struct kvm_hva_range hva_range = {
> -		.start		= range->start,
> -		.end		= range->end,
> -		.pte		= __pte(0),
> -		.handler	= kvm_unmap_gfn_range,
> -		.caller_locked	= true,
> -		.flush_on_ret	= true,
> -		.may_block	= mmu_notifier_range_blockable(range),
> -	};
> -
> -	trace_kvm_unmap_hva_range(range->start, range->end);
> -
> -	KVM_MMU_LOCK(kvm);
>   	/*
>   	 * The count increase must become visible at unlock time as no
>   	 * spte can be established without taking the mmu_lock and
> @@ -593,8 +622,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   	 */
>   	kvm->mmu_notifier_count++;
>   	if (likely(kvm->mmu_notifier_count == 1)) {
> -		kvm->mmu_notifier_range_start = range->start;
> -		kvm->mmu_notifier_range_end = range->end;
> +		kvm->mmu_notifier_range_start = start;
> +		kvm->mmu_notifier_range_end = end;
>   	} else {
>   		/*
> 		 * Fully tracking multiple concurrent ranges has diminishing
> @@ -606,24 +635,54 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   		 * complete.
>   		 */
>   		kvm->mmu_notifier_range_start =
> -			min(kvm->mmu_notifier_range_start, range->start);
> +			min(kvm->mmu_notifier_range_start, start);
>   		kvm->mmu_notifier_range_end =
> -			max(kvm->mmu_notifier_range_end, range->end);
> +			max(kvm->mmu_notifier_range_end, end);
>   	}
> -
> -	__kvm_handle_hva_range(kvm, &hva_range);
> -
> -	KVM_MMU_UNLOCK(kvm);
> -
> -	return 0;
>   }
>   
> -static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> +static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   					const struct mmu_notifier_range *range)
>   {
> +	bool blockable = mmu_notifier_range_blockable(range);
>   	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> +	const struct kvm_hva_range hva_range = {
> +		.start		= range->start,
> +		.end		= range->end,
> +		.pte		= __pte(0),
> +		.handler	= kvm_unmap_gfn_range,
> +		.on_lock	= kvm_inc_notifier_count,
> +		.must_lock	= !blockable,
> +		.flush_on_ret	= true,
> +		.may_block	= blockable,
> +	};
>   
> -	KVM_MMU_LOCK(kvm);
> +	trace_kvm_unmap_hva_range(range->start, range->end);
> +
> +	/*
> +	 * Prevent memslot modification between range_start() and range_end()
> +	 * so that conditionally locking provides the same result in both
> +	 * functions.  Without that guarantee, the mmu_notifier_count
> +	 * adjustments will be imbalanced.
> +	 *
> +	 * Skip the memslot-lookup lock elision (set @must_lock above) to avoid
> +	 * having to take the semaphore on non-blockable calls, e.g. OOM kill.
> +	 * The complexity required to handle conditional locking for this case
> +	 * is not worth the marginal benefits, the VM is likely doomed anyways.
> +	 *
> +	 * Pairs with the unlock in range_end().
> +	 */
> +	if (blockable)
> +		down_read(&kvm->mmu_notifier_slots_lock);
> +
> +	__kvm_handle_hva_range(kvm, &hva_range);
> +
> +	return 0;
> +}
> +
> +static void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
> +				   unsigned long end)
> +{
>   	/*
>   	 * This sequence increase will notify the kvm page fault that
>   	 * the page that is going to be mapped in the spte could have
> @@ -637,7 +696,29 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>   	 * in conjunction with the smp_rmb in mmu_notifier_retry().
>   	 */
>   	kvm->mmu_notifier_count--;
> -	KVM_MMU_UNLOCK(kvm);
> +}
> +
> +static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> +					const struct mmu_notifier_range *range)
> +{
> +	bool blockable = mmu_notifier_range_blockable(range);
> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> +	const struct kvm_hva_range hva_range = {
> +		.start		= range->start,
> +		.end		= range->end,
> +		.pte		= __pte(0),
> +		.handler	= (void *)kvm_null_fn,
> +		.on_lock	= kvm_dec_notifier_count,
> +		.must_lock	= !blockable,
> +		.flush_on_ret	= true,
> +		.may_block	= blockable,
> +	};
> +
> +	__kvm_handle_hva_range(kvm, &hva_range);
> +
> +	/* Pairs with the lock in range_start(). */
> +	if (blockable)
> +		up_read(&kvm->mmu_notifier_slots_lock);
>   
>   	BUG_ON(kvm->mmu_notifier_count < 0);
>   }
> @@ -709,6 +790,8 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
>   
>   static int kvm_init_mmu_notifier(struct kvm *kvm)
>   {
> +	init_rwsem(&kvm->mmu_notifier_slots_lock);
> +
>   	kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;
>   	return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
>   }
> @@ -971,6 +1054,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
>   	kvm_coalesced_mmio_free(kvm);
>   #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>   	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
> +	/*
> +	 * Reset the lock used to prevent memslot updates between MMU notifier
> +	 * range_start and range_end.  At this point no more MMU notifiers will
> +	 * run, but the lock could still be held if KVM's notifier was removed
> +	 * between range_start and range_end.  No threads can be waiting on the
> +	 * lock as the last reference on KVM has been dropped.  If the lock is
> +	 * still held, freeing memslots will deadlock.
> +	 */
> +	init_rwsem(&kvm->mmu_notifier_slots_lock);
>   #else
>   	kvm_arch_flush_shadow_all(kvm);
>   #endif
> @@ -1222,7 +1314,13 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
>   	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
>   	slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
>   
> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +	down_write(&kvm->mmu_notifier_slots_lock);
> +#endif
>   	rcu_assign_pointer(kvm->memslots[as_id], slots);
> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +	up_write(&kvm->mmu_notifier_slots_lock);
> +#endif
>   	synchronize_srcu_expedited(&kvm->srcu);
>   
>   	/*
> 



* Re: [PATCH 10/18] KVM: Move x86's MMU notifier memslot walkers to generic code
  2021-03-26  2:19 ` [PATCH 10/18] KVM: Move x86's MMU notifier memslot walkers " Sean Christopherson
@ 2021-03-31  7:52   ` Paolo Bonzini
  2021-03-31 16:20     ` Sean Christopherson
  0 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2021-03-31  7:52 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Paul Mackerras
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 26/03/21 03:19, Sean Christopherson wrote:
> +#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
> +	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> +#else
>   	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>   	int idx;
>  	trace_kvm_set_spte_hva(address);
>  
> 	idx = srcu_read_lock(&kvm->srcu);
> 
> 	KVM_MMU_LOCK(kvm);
> 
> 	kvm->mmu_notifier_seq++;
> 
> 	if (kvm_set_spte_hva(kvm, address, pte))
> 		kvm_flush_remote_tlbs(kvm);
> 
>   	KVM_MMU_UNLOCK(kvm);
>   	srcu_read_unlock(&kvm->srcu, idx);
> +#endif

The kvm->mmu_notifier_seq update is missing on the new API side.  I guess you
can add an argument to __kvm_handle_hva_range and handle it also in 
patch 15 ("KVM: Take mmu_lock when handling MMU notifier iff the hva 
hits a memslot").
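
For illustration, that could be as small as one more field in the
walker's control struct plus an increment under mmu_lock (a sketch
only; the field name is an assumption, not part of the posted patch):

	bool inc_seq;	/* assumed field in struct kvm_hva_range */

	/* in __kvm_handle_hva_range(), once mmu_lock is held: */
	if (range->inc_seq)
		kvm->mmu_notifier_seq++;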

Paolo



* Re: [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers
  2021-03-26  2:19 [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
                   ` (18 preceding siblings ...)
  2021-03-30 18:32 ` [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Ben Gardon
@ 2021-03-31  7:57 ` Paolo Bonzini
  2021-03-31  9:34   ` Marc Zyngier
  19 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2021-03-31  7:57 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Paul Mackerras
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 26/03/21 03:19, Sean Christopherson wrote:
> The end goal of this series is to optimize the MMU notifiers to take
> mmu_lock if and only if the notification is relevant to KVM, i.e. the hva
> range overlaps a memslot.   Large VMs (hundreds of vCPUs) are very
> sensitive to mmu_lock being taken for write at inopportune times, and
> such VMs also tend to be "static", e.g. backed by HugeTLB with minimal
> page shenanigans.  The vast majority of notifications for these VMs will
> be spurious (for KVM), and eliding mmu_lock for spurious notifications
> avoids an otherwise unacceptable disruption to the guest.
> 
> To get there without potentially degrading performance, e.g. due to
> multiple memslot lookups, especially on non-x86 where the use cases are
> largely unknown (from my perspective), first consolidate the MMU notifier
> logic by moving the hva->gfn lookups into common KVM.
> 
> Applies on my TDP MMU TLB flushing bug fixes[*], which conflict horribly
> with the TDP MMU changes in this series.  That code applies on kvm/queue
> (commit 4a98623d5d90, "KVM: x86/mmu: Mark the PAE roots as decrypted for
> shadow paging").
> 
> Speaking of conflicts, Ben will soon be posting a series to convert a
> bunch of TDP MMU flows to take mmu_lock only for read.  Presumably there
> will be an absurd number of conflicts; Ben and I will sort out the
> conflicts in whichever series loses the race.
> 
> Well tested on Intel and AMD.  Compile tested for arm64, MIPS, PPC,
> PPC e500, and s390.  Absolutely needs to be tested for real on non-x86,
> I give it even odds that I introduced an off-by-one bug somewhere.
> 
> [*] https://lkml.kernel.org/r/20210325200119.1359384-1-seanjc@google.com
> 
> 
> Patches 1-7 are x86 specific prep patches to play nice with moving
> the hva->gfn memslot lookups into common code.  There ended up being waaay
> more of these than I expected/wanted, but I had a hell of a time getting
> the flushing logic right when shuffling the memslot and address space
> loops.  In the end, I was more confident I got things correct by batching
> the flushes.
> 
> Patch 8 moves the existing API prototypes into common code.  It could
> technically be dropped since the old APIs are gone in the end, but I
> thought the switch to the new APIs would suck a bit less this way.
> 
> Patch 9 moves arm64's MMU notifier tracepoints into common code so that
> they are not lost when arm64 is converted to the new APIs, and so that all
> architectures can benefit.
> 
> Patch 10 moves x86's memslot walkers into common KVM.  I chose x86 purely
> because I could actually test it.  All architectures use nearly identical
> code, so I don't think it actually matters in the end.
> 
> Patches 11-13 move arm64, MIPS, and PPC to the new APIs.
> 
> Patch 14 yanks out the old APIs.
> 
> Patch 15 adds the mmu_lock elision, but only for unpaired notifications.
> 
> Patch 16 adds mmu_lock elision for paired .invalidate_range_{start,end}().
> This is quite nasty and no small part of me thinks the patch should be
> burned with fire (I won't spoil it any further), but it's also the most
> problematic scenario for our particular use case.  :-/
> 
> Patches 17-18 are additional x86 cleanups.

Queued 1-9 and 18, thanks.  There's a small issue in patch 10 that 
prevented me from committing 10-15, but they mostly look good.

Paolo



* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-26  2:19 ` [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary Sean Christopherson
  2021-03-31  7:52   ` Paolo Bonzini
@ 2021-03-31  8:35   ` Paolo Bonzini
  2021-03-31 16:41     ` Sean Christopherson
                       ` (2 more replies)
  1 sibling, 3 replies; 44+ messages in thread
From: Paolo Bonzini @ 2021-03-31  8:35 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Paul Mackerras
  Cc: James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 26/03/21 03:19, Sean Christopherson wrote:
> +	/*
> +	 * Reset the lock used to prevent memslot updates between MMU notifier
> +	 * range_start and range_end.  At this point no more MMU notifiers will
> +	 * run, but the lock could still be held if KVM's notifier was removed
> +	 * between range_start and range_end.  No threads can be waiting on the
> +	 * lock as the last reference on KVM has been dropped.  If the lock is
> +	 * still held, freeing memslots will deadlock.
> +	 */
> +	init_rwsem(&kvm->mmu_notifier_slots_lock);

I was going to say that this is nasty, then I noticed that 
mmu_notifier_unregister uses SRCU to ensure completion of concurrent 
calls to the MMU notifier.  So I guess it's fine, but it's better to 
point it out:

	/*
	 * At this point no more MMU notifiers will run and pending
	 * calls to range_start have completed, but the lock would
	 * still be held and never released if the MMU notifier was
	 * removed between range_start and range_end.  Since the last
	 * reference to the struct kvm has been dropped, no threads can
	 * be waiting on the lock, but we might still end up taking it
	 * when freeing memslots in kvm_arch_destroy_vm.  Reset the lock
	 * to avoid deadlocks.
	 */

That said, the easiest way to avoid this would be to always update 
mmu_notifier_count.  I don't mind the rwsem, but at least I suggest that 
you split the patch in two---the first one keeping the 
mmu_notifier_count update unconditional, and the second one introducing 
the rwsem and the on_lock function kvm_inc_notifier_count.  Please 
document the new lock in Documentation/virt/kvm/locking.rst too.

Also, related to the first part of the series, perhaps you could 
structure the series in a slightly different way:

1) introduce the HVA walking API in common code, complete with on_lock 
and patch 15, so that you can use on_lock to increase mmu_notifier_seq

2) then migrate all architectures including x86 to the new API

IOW, first half of patch 10 and all of patch 15; then the second half of 
patch 10; then patches 11-14.

> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +	down_write(&kvm->mmu_notifier_slots_lock);
> +#endif
>  	rcu_assign_pointer(kvm->memslots[as_id], slots);
> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> +	up_write(&kvm->mmu_notifier_slots_lock);
> +#endif

Please do this unconditionally, the cost is minimal if the rwsem is not 
contended (as is the case if the architecture doesn't use MMU notifiers 
at all).

Paolo



* Re: [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers
  2021-03-31  7:57 ` Paolo Bonzini
@ 2021-03-31  9:34   ` Marc Zyngier
  2021-03-31  9:41     ` Paolo Bonzini
  0 siblings, 1 reply; 44+ messages in thread
From: Marc Zyngier @ 2021-03-31  9:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, James Morse, Julien Thierry, Suzuki K Poulose,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc, linux-kernel,
	Ben Gardon

On 2021-03-31 08:57, Paolo Bonzini wrote:

> Queued 1-9 and 18, thanks.  There's a small issue in patch 10 that
> prevented me from committing 10-15, but they mostly look good.

Can you please push the resulting merge somewhere?

I'm concerned that it will conflict in interesting ways with other stuff
that is on its way on the arm64 side, not to mention that this hasn't
been tested at all on anything but x86 (and given the series was posted
on Friday, that's a bit of a short notice).

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...


* Re: [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers
  2021-03-31  9:34   ` Marc Zyngier
@ 2021-03-31  9:41     ` Paolo Bonzini
  0 siblings, 0 replies; 44+ messages in thread
From: Paolo Bonzini @ 2021-03-31  9:41 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Sean Christopherson, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, James Morse, Julien Thierry, Suzuki K Poulose,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	linux-arm-kernel, kvmarm, linux-mips, kvm, kvm-ppc, linux-kernel,
	Ben Gardon

On 31/03/21 11:34, Marc Zyngier wrote:
> 
>> Queued 1-9 and 18, thanks.  There's a small issue in patch 10 that
>> prevented me from committing 10-15, but they mostly look good.
> 
> Can you please push the resulting merge somewhere?
> 
> I'm concerned that it will conflict in interesting ways with other stuff
> that is on its way on the arm64 side, not to mention that this hasn't
> been tested at all on anything but x86 (and given the series was posted
> on Friday, that's a bit of a short notice).

Yes, I will push it shortly to kvm/queue.  Note that the patches I have 
pushed are x86 only apart from changes to tracepoints.  The rest will 
certainly need a lot more baking, which is also why I got rid quickly of 
the easy ones.

Paolo



* Re: [PATCH 10/18] KVM: Move x86's MMU notifier memslot walkers to generic code
  2021-03-31  7:52   ` Paolo Bonzini
@ 2021-03-31 16:20     ` Sean Christopherson
  2021-03-31 16:36       ` Paolo Bonzini
  0 siblings, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2021-03-31 16:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On Wed, Mar 31, 2021, Paolo Bonzini wrote:
> On 26/03/21 03:19, Sean Christopherson wrote:
> > +#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
> > +	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> > +#else
> >   	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> >   	int idx;
> >  	trace_kvm_set_spte_hva(address);
> > 	idx = srcu_read_lock(&kvm->srcu);
> > 
> > 	KVM_MMU_LOCK(kvm);
> > 
> > 	kvm->mmu_notifier_seq++;
> > 
> > 	if (kvm_set_spte_hva(kvm, address, pte))
> > 		kvm_flush_remote_tlbs(kvm);
> > 
> >   	KVM_MMU_UNLOCK(kvm);
> >   	srcu_read_unlock(&kvm->srcu, idx);
> > +#endif
> 
> The kvm->mmu_notifier_seq update is missing on the new API side.  I guess you can
> add an argument to __kvm_handle_hva_range and handle it also in patch 15
> ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a
> memslot").

Yikes.  Superb eyes!

That does bring up an oddity I discovered when digging into this.  Every call
to .change_pte() is bookended by .invalidate_range_{start,end}(), i.e. the above
missing kvm->mmu_notifier_seq++ is benign because kvm->mmu_notifier_count is
guaranteed to be non-zero.

I'm also fairly certain it means kvm_set_spte_gfn() is effectively dead code on
_all_ architectures.  x86 and MIPS are clearcut nops if the old SPTE is
not-present, and that's guaranteed due to the prior invalidation.  PPC simply
unmaps the SPTE, which again should be a nop due to the invalidation.  arm64 is
a bit murky, but if I'm reading the code correctly, it's also a nop because
kvm_pgtable_stage2_map() is called without a cache pointer, which I think means
it will map an entry if and only if an existing PTE was found.

I haven't actually tested the above analysis, e.g. by asserting that
kvm->mmu_notifier_count is indeed non-zero.  I'll do that sooner than later.
But, given the shortlog of commit:

  6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with invalidate_range_start
                 and invalidate_range_end")

I'm fairly confident my analysis is correct.  And if so, it also means that the
whole point of adding .change_pte() in the first place (for KSM, commit
828502d30073, "ksm: add mmu_notifier set_pte_at_notify()"), has since been lost.

When it was originally added, .change_pte() was a pure alternative to
invalidating the entry.

  void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
                               pte_t pte)
  {
        struct mmu_notifier *mn;
        struct hlist_node *n;

        rcu_read_lock();
        hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
                if (mn->ops->change_pte)
                        mn->ops->change_pte(mn, mm, address, pte);
                /*
                 * Some drivers don't have change_pte,
                 * so we must call invalidate_page in that case.
                 */
                else if (mn->ops->invalidate_page)
                        mn->ops->invalidate_page(mn, mm, address);
        }
        rcu_read_unlock();
  }

The aforementioned commit 6bdb913f0a70 wrapped set_pte_at_notify() with
invalidate_range_{start,end}() so that .invalidate_page() implementations could
sleep.  But, no one noticed that in doing so, .change_pte() was completely
neutered.

Assuming all of the above is correct, I'm very tempted to rip out .change_pte()
entirely.  It's been dead weight for 8+ years and no one has complained about
KSM+KVM performance (I'd also be curious to know how much performance was gained
by shaving VM-Exits).  As KVM is the only user of .change_pte(), dropping it in
KVM would mean the entire MMU notifier could also go away.


* Re: [PATCH 10/18] KVM: Move x86's MMU notifier memslot walkers to generic code
  2021-03-31 16:20     ` Sean Christopherson
@ 2021-03-31 16:36       ` Paolo Bonzini
  0 siblings, 0 replies; 44+ messages in thread
From: Paolo Bonzini @ 2021-03-31 16:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 31/03/21 18:20, Sean Christopherson wrote:
> Every call
> to .change_pte() is bookended by .invalidate_range_{start,end}(), i.e. the above
> missing kvm->mmu_notifier_seq++ is benign because kvm->mmu_notifier_count is
> guaranteed to be non-zero.

In fact ARM even relies on invalidate wrapping the change_pte handler.

         /*
          * The MMU notifiers will have unmapped a huge PMD before calling
          * ->change_pte() (which in turn calls kvm_set_spte_hva()) and
          * therefore we never need to clear out a huge PMD through this
          * calling path and a memcache is not required.
          */

> Assuming all of the above is correct, I'm very tempted to rip out .change_pte()
> entirely.

There is still the performance benefit from immediately remapping the 
page to the new destination without waiting for a fault.  Yes, it's 
hypothetical, but I would prefer to leave that change for later.

The fact that the count is nonzero means that you will not even have to 
complicate kvm_mmu_notifier_change_pte to handle the removal of 
mmu_notifier_seq; just add a patch before this one to WARN if it is 
zero.  (The rest of my review to patch 16 still holds).
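
Something along these lines (a sketch of the suggested assertion):

	/* e.g. near the top of kvm_mmu_notifier_change_pte() */
	WARN_ON_ONCE(!kvm->mmu_notifier_count);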

Paolo

> It's been dead weight for 8+ years and no one has complained about
> KSM+KVM performance (I'd also be curious to know how much performance was gained
> by shaving VM-Exits).  As KVM is the only user of .change_pte(), dropping it in
> KVM would mean the entire MMU notifier could also go away.
> 



* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-31  8:35   ` Paolo Bonzini
@ 2021-03-31 16:41     ` Sean Christopherson
  2021-03-31 16:47       ` Paolo Bonzini
  2021-03-31 20:15     ` Sean Christopherson
  2021-03-31 20:52     ` Sean Christopherson
  2 siblings, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2021-03-31 16:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On Wed, Mar 31, 2021, Paolo Bonzini wrote:
> On 26/03/21 03:19, Sean Christopherson wrote:
> > +	/*
> > +	 * Reset the lock used to prevent memslot updates between MMU notifier
> > +	 * range_start and range_end.  At this point no more MMU notifiers will
> > +	 * run, but the lock could still be held if KVM's notifier was removed
> > +	 * between range_start and range_end.  No threads can be waiting on the
> > +	 * lock as the last reference on KVM has been dropped.  If the lock is
> > +	 * still held, freeing memslots will deadlock.
> > +	 */
> > +	init_rwsem(&kvm->mmu_notifier_slots_lock);
> 
> I was going to say that this is nasty,

Heh, I still think it's nasty.

> then I noticed that
> mmu_notifier_unregister uses SRCU to ensure completion of concurrent calls
> to the MMU notifier.  So I guess it's fine, but it's better to point it out:
> 
> 	/*
> 	 * At this point no more MMU notifiers will run and pending
> 	 * calls to range_start have completed, but the lock would
> 	 * still be held and never released if the MMU notifier was
> 	 * removed between range_start and range_end.  Since the last
> 	 * reference to the struct kvm has been dropped, no threads can
> 	 * be waiting on the lock, but we might still end up taking it
> 	 * when freeing memslots in kvm_arch_destroy_vm.  Reset the lock
> 	 * to avoid deadlocks.
> 	 */
> 
> That said, the easiest way to avoid this would be to always update
> mmu_notifier_count.

Updating mmu_notifier_count requires taking mmu_lock, which would defeat the
purpose of these shenanigans.  I think it could be made atomic, since mmu_lock
would be taken before the elevated count _must_ be visible, but that would
break the mmu_notifier_range_{start,end} optimization that was recently added.

Or did I completely misunderstand what you're suggesting?

> I don't mind the rwsem, but at least I suggest that you
> split the patch in two---the first one keeping the mmu_notifier_count update
> unconditional, and the second one introducing the rwsem and the on_lock
> function kvm_inc_notifier_count.  Please document the new lock in
> Documentation/virt/kvm/locking.rst too.

Noted, will update docs.

> Also, related to the first part of the series, perhaps you could structure
> the series in a slightly different way:
> 
> 1) introduce the HVA walking API in common code, complete with on_lock and
> patch 15, so that you can use on_lock to increase mmu_notifier_seq
> 
> 2) then migrate all architectures including x86 to the new API
> 
> IOW, first half of patch 10 and all of patch 15; then the second half of
> patch 10; then patches 11-14.
> 
> > +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > +	down_write(&kvm->mmu_notifier_slots_lock);
> > +#endif
> >  	rcu_assign_pointer(kvm->memslots[as_id], slots);
> > +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > +	up_write(&kvm->mmu_notifier_slots_lock);
> > +#endif
> 
> Please do this unconditionally, the cost is minimal if the rwsem is not
> contended (as is the case if the architecture doesn't use MMU notifiers at
> all).

It's not the cost, it's that mmu_notifier_slots_lock doesn't exist.  That's an
easily solved problem, but then the lock wouldn't be initialized since
kvm_init_mmu_notifier() is a nop.  That's again easy to solve, but IMO would
look rather weird.  I guess the counter argument is that __kvm_memslots()
wouldn't need #ifdeffery.

These are the two ideas I've come up with:

Option 1:
	static int kvm_init_mmu_notifier(struct kvm *kvm)
	{
		init_rwsem(&kvm->mmu_notifier_slots_lock);

	#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
		kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;
		return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
	#else
		return 0;
	#endif
	}


Option 2:
	kvm_mmu_notifier_lock(kvm);
	rcu_assign_pointer(kvm->memslots[as_id], slots);
	kvm_mmu_notifier_unlock(kvm);
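
For Option 2, the wrappers could hide all of the #ifdeffery (a sketch;
the helper bodies are assumed, only the names come from the snippet
above):

	#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
	static inline void kvm_mmu_notifier_lock(struct kvm *kvm)
	{
		down_write(&kvm->mmu_notifier_slots_lock);
	}

	static inline void kvm_mmu_notifier_unlock(struct kvm *kvm)
	{
		up_write(&kvm->mmu_notifier_slots_lock);
	}
	#else
	static inline void kvm_mmu_notifier_lock(struct kvm *kvm) {}
	static inline void kvm_mmu_notifier_unlock(struct kvm *kvm) {}
	#endif

That would keep install_new_memslots() itself #ifdef-free.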






* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-31 16:41     ` Sean Christopherson
@ 2021-03-31 16:47       ` Paolo Bonzini
  2021-03-31 19:47         ` Sean Christopherson
  0 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2021-03-31 16:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 31/03/21 18:41, Sean Christopherson wrote:
>> That said, the easiest way to avoid this would be to always update
>> mmu_notifier_count.
> Updating mmu_notifier_count requires taking mmu_lock, which would defeat the
> purpose of these shenanigans.

Okay; I wasn't sure if the problem was contention with page faults in 
general, or just the long critical sections from the MMU notifier 
callbacks.  Still, updating mmu_notifier_count unconditionally is a good 
way to break up the patch in two and keep one commit just for the rwsem 
nastiness.

>>> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>>> +	down_write(&kvm->mmu_notifier_slots_lock);
>>> +#endif
>>>   	rcu_assign_pointer(kvm->memslots[as_id], slots);
>>> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>>> +	up_write(&kvm->mmu_notifier_slots_lock);
>>> +#endif
>> Please do this unconditionally, the cost is minimal if the rwsem is not
>> contended (as is the case if the architecture doesn't use MMU notifiers at
>> all).
> It's not the cost, it's that mmu_notifier_slots_lock doesn't exist.  That's an
> easily solved problem, but then the lock wouldn't be initialized since
> kvm_init_mmu_notifier() is a nop.  That's again easy to solve, but IMO would
> look rather weird.  I guess the counter argument is that __kvm_memslots()
> wouldn't need #ifdeffery.

Yep.  Less #ifdefs usually wins. :)

> These are the two ideas I've come up with:
> 
> Option 1:
> 	static int kvm_init_mmu_notifier(struct kvm *kvm)
> 	{
> 		init_rwsem(&kvm->mmu_notifier_slots_lock);
> 
> 	#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> 		kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;
> 		return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
> 	#else
> 		return 0;
> 	#endif
> 	}

Option 2 is also okay I guess, but the simplest is option 1 + just init 
it in kvm_create_vm.

Paolo



* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-31 16:47       ` Paolo Bonzini
@ 2021-03-31 19:47         ` Sean Christopherson
  2021-03-31 20:42           ` Paolo Bonzini
  0 siblings, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2021-03-31 19:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On Wed, Mar 31, 2021, Paolo Bonzini wrote:
> On 31/03/21 18:41, Sean Christopherson wrote:
> > > That said, the easiest way to avoid this would be to always update
> > > mmu_notifier_count.
> > Updating mmu_notifier_count requires taking mmu_lock, which would defeat the
> > purpose of these shenanigans.
> 
> Okay; I wasn't sure if the problem was contention with page faults in
> general, or just the long critical sections from the MMU notifier callbacks.
> Still, updating mmu_notifier_count unconditionally is a good way to break up
> the patch in two and keep one commit just for the rwsem nastiness.

Rereading things, a small chunk of the rwsem nastiness can go away.  I don't see
any reason to use rw_semaphore instead of rwlock_t.  install_new_memslots() only
holds the lock for a handful of instructions.  Readers could get queued up
behind a writer, but since install_new_memslots() is serialized by slots_lock
(the existing mutex), there is no chance of multiple writers, i.e. the worst
case wait duration is bounded at the length of an in-flight notification.  And
that's _already_ the worst case since notifications are currently serialized by
mmu_lock.  In practice, the existing worst case is probably far worse since
there can be far more writers trying to acquire mmu_lock.

In other words, there's no strong argument for sleeping instead of busy waiting
in the notifiers.

By switching to rwlock_t, taking mmu_notifier_slots_lock doesn't have to depend
on mmu_notifier_range_blockable(), and the must_lock path also goes away.
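
Concretely, the writer side in install_new_memslots() would collapse to
something like this (a sketch):

	write_lock(&kvm->mmu_notifier_slots_lock);
	rcu_assign_pointer(kvm->memslots[as_id], slots);
	write_unlock(&kvm->mmu_notifier_slots_lock);

with the notifiers using read_lock()/read_unlock() unconditionally
instead of down_read()/up_read().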

> > > > +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > > +	down_write(&kvm->mmu_notifier_slots_lock);
> > > > +#endif
> > > >   	rcu_assign_pointer(kvm->memslots[as_id], slots);
> > > > +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > > +	up_write(&kvm->mmu_notifier_slots_lock);
> > > > +#endif
> > > Please do this unconditionally, the cost is minimal if the rwsem is not
> > > contended (as is the case if the architecture doesn't use MMU notifiers at
> > > all).
> > It's not the cost, it's that mmu_notifier_slots_lock doesn't exist.  That's an
> > easily solved problem, but then the lock wouldn't be initialized since
> > kvm_init_mmu_notifier() is a nop.  That's again easy to solve, but IMO would
> > look rather weird.  I guess the counter argument is that __kvm_memslots()
> > wouldn't need #ifdeffery.
> 
> Yep.  Less #ifdefs usually wins. :)
> 
> > These are the two ideas I've come up with:
> > 
> > Option 1:
> > 	static int kvm_init_mmu_notifier(struct kvm *kvm)
> > 	{
> > 		init_rwsem(&kvm->mmu_notifier_slots_lock);
> > 
> > 	#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > 		kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;
> > 		return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
> > 	#else
> > 		return 0;
> > 	#endif
> > 	}
> 
> Option 2 is also okay I guess, but the simplest is option 1 + just init it
> in kvm_create_vm.

Arr.  I'll play around with it to try and purge the #ifdefs.


* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-31  8:35   ` Paolo Bonzini
  2021-03-31 16:41     ` Sean Christopherson
@ 2021-03-31 20:15     ` Sean Christopherson
  2021-03-31 20:30       ` Paolo Bonzini
  2021-03-31 20:52     ` Sean Christopherson
  2 siblings, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2021-03-31 20:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On Wed, Mar 31, 2021, Paolo Bonzini wrote:
> On 26/03/21 03:19, Sean Christopherson wrote:
> > +	/*
> > +	 * Reset the lock used to prevent memslot updates between MMU notifier
> > +	 * range_start and range_end.  At this point no more MMU notifiers will
> > +	 * run, but the lock could still be held if KVM's notifier was removed
> > +	 * between range_start and range_end.  No threads can be waiting on the
> > +	 * lock as the last reference on KVM has been dropped.  If the lock is
> > +	 * still held, freeing memslots will deadlock.
> > +	 */
> > +	init_rwsem(&kvm->mmu_notifier_slots_lock);
> 
> I was going to say that this is nasty, then I noticed that
> mmu_notifier_unregister uses SRCU to ensure completion of concurrent calls
> to the MMU notifier.  So I guess it's fine, but it's better to point it out:
> 
> 	/*
> 	 * At this point no more MMU notifiers will run and pending
> 	 * calls to range_start have completed, but the lock would
> 	 * still be held and never released if the MMU notifier was
> 	 * removed between range_start and range_end.  Since the last
> 	 * reference to the struct kvm has been dropped, no threads can
> 	 * be waiting on the lock, but we might still end up taking it
> 	 * when freeing memslots in kvm_arch_destroy_vm.  Reset the lock
> 	 * to avoid deadlocks.
> 	 */

An alternative would be to not take the lock in install_new_memslots() if
kvm->users_count == 0.  It'd be weirder to document, and the conditional locking
would still be quite ugly.  Not sure if that's better than blasting a lock
during destruction?
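
I.e. something like this in install_new_memslots() (a sketch of the
conditional-locking alternative):

	if (refcount_read(&kvm->users_count))
		down_write(&kvm->mmu_notifier_slots_lock);
	rcu_assign_pointer(kvm->memslots[as_id], slots);
	if (refcount_read(&kvm->users_count))
		up_write(&kvm->mmu_notifier_slots_lock);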


* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-31 20:15     ` Sean Christopherson
@ 2021-03-31 20:30       ` Paolo Bonzini
  0 siblings, 0 replies; 44+ messages in thread
From: Paolo Bonzini @ 2021-03-31 20:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 31/03/21 22:15, Sean Christopherson wrote:
> On Wed, Mar 31, 2021, Paolo Bonzini wrote:
>> On 26/03/21 03:19, Sean Christopherson wrote:
>>> +	/*
>>> +	 * Reset the lock used to prevent memslot updates between MMU notifier
>>> +	 * range_start and range_end.  At this point no more MMU notifiers will
>>> +	 * run, but the lock could still be held if KVM's notifier was removed
>>> +	 * between range_start and range_end.  No threads can be waiting on the
>>> +	 * lock as the last reference on KVM has been dropped.  If the lock is
>>> +	 * still held, freeing memslots will deadlock.
>>> +	 */
>>> +	init_rwsem(&kvm->mmu_notifier_slots_lock);
>>
>> I was going to say that this is nasty, then I noticed that
>> mmu_notifier_unregister uses SRCU to ensure completion of concurrent calls
>> to the MMU notifier.  So I guess it's fine, but it's better to point it out:
>>
>> 	/*
>> 	 * At this point no more MMU notifiers will run and pending
>> 	 * calls to range_start have completed, but the lock would
>> 	 * still be held and never released if the MMU notifier was
>> 	 * removed between range_start and range_end.  Since the last
>> 	 * reference to the struct kvm has been dropped, no threads can
>> 	 * be waiting on the lock, but we might still end up taking it
>> 	 * when freeing memslots in kvm_arch_destroy_vm.  Reset the lock
>> 	 * to avoid deadlocks.
>> 	 */
> 
> An alternative would be to not take the lock in install_new_memslots() if
> kvm->users_count == 0.  It'd be weirder to document, and the conditional locking
> would still be quite ugly.  Not sure if that's better than blasting a lock
> during destruction?

No, that's worse...

Paolo



* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-31 19:47         ` Sean Christopherson
@ 2021-03-31 20:42           ` Paolo Bonzini
  2021-03-31 21:05             ` Sean Christopherson
  0 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2021-03-31 20:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 31/03/21 21:47, Sean Christopherson wrote:
> Rereading things, a small chunk of the rwsem nastiness can go away.  I don't see
> any reason to use rw_semaphore instead of rwlock_t.

Wouldn't it be incorrect to lock a mutex (e.g. inside *another* MMU 
notifier's invalidate callback) while holding an rwlock_t?  That makes 
sense because anybody that's busy waiting in write_lock potentially 
cannot be preempted until the other task gets the mutex.  This is a 
potential deadlock.

I also thought of busy waiting on down_read_trylock if the MMU notifier 
cannot block, but that would also be invalid for the opposite reason 
(the down_write task might be asleep, waiting for other readers to 
release the lock, and the down_read_trylock busy loop might not let that 
task run).

> And that's _already_ the worst case since notifications are currently
> serialized by mmu_lock.

But right now notifications are not a single critical section, they're 
two, aren't they?

Paolo



* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-31  8:35   ` Paolo Bonzini
  2021-03-31 16:41     ` Sean Christopherson
  2021-03-31 20:15     ` Sean Christopherson
@ 2021-03-31 20:52     ` Sean Christopherson
  2021-03-31 21:00       ` Paolo Bonzini
  2 siblings, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2021-03-31 20:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On Wed, Mar 31, 2021, Paolo Bonzini wrote:
> On 26/03/21 03:19, Sean Christopherson wrote:
> Also, related to the first part of the series, perhaps you could structure
> the series in a slightly different way:
> 
> 1) introduce the HVA walking API in common code, complete with on_lock and
> patch 15, so that you can use on_lock to increase mmu_notifier_seq
> 
> 2) then migrate all architectures including x86 to the new API
> 
> IOW, first half of patch 10 and all of patch 15; then the second half of
> patch 10; then patches 11-14.

100% agree with introducing on_lock separately from the conditional locking.

Not so sure about introducing conditional locking and then converting non-x86
archs.  I'd prefer to keep the conditional locking after arch conversion.
If something does go awry, it would be nice to be able to precisely bisect to
the conditional locking.  Ditto if it needs to be reverted because it breaks an
arch.


* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-31 20:52     ` Sean Christopherson
@ 2021-03-31 21:00       ` Paolo Bonzini
  0 siblings, 0 replies; 44+ messages in thread
From: Paolo Bonzini @ 2021-03-31 21:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 31/03/21 22:52, Sean Christopherson wrote:
> 100% agree with introducing on_lock separately from the conditional locking.
> 
> Not so sure about introducing conditional locking and then converting non-x86
> archs.  I'd prefer to keep the conditional locking after arch conversion.
> If something does go awry, it would be nice to be able to precisely bisect to
> the conditional locking.  Ditto if it needs to be reverted because it breaks an
> arch.

Ok, that sounds good too.

Paolo



* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-31 20:42           ` Paolo Bonzini
@ 2021-03-31 21:05             ` Sean Christopherson
  2021-03-31 21:22               ` Sean Christopherson
  2021-03-31 21:35               ` Paolo Bonzini
  0 siblings, 2 replies; 44+ messages in thread
From: Sean Christopherson @ 2021-03-31 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On Wed, Mar 31, 2021, Paolo Bonzini wrote:
> On 31/03/21 21:47, Sean Christopherson wrote:
> > Rereading things, a small chunk of the rwsem nastiness can go away.  I don't see
> > any reason to use rw_semaphore instead of rwlock_t.
> 
> Wouldn't it be incorrect to lock a mutex (e.g. inside *another* MMU
> notifier's invalidate callback) while holding an rwlock_t?  That makes sense
> because anybody that's busy waiting in write_lock potentially cannot be
> preempted until the other task gets the mutex.  This is a potential
> deadlock.

Yes?  I don't think I follow your point though.  Nesting a spinlock or rwlock
inside a rwlock is ok, so long as the locks are always taken in the same order,
i.e. it's never mmu_lock -> mmu_notifier_slots_lock.

> I also thought of busy waiting on down_read_trylock if the MMU notifier
> cannot block, but that would also be invalid for the opposite reason (the
> down_write task might be asleep, waiting for other readers to release the
> lock, and the down_read_trylock busy loop might not let that task run).
> 
> > And that's _already_ the worst case since notifications are currently
> > serialized by mmu_lock.
> 
> But right now notifications are not a single critical section, they're two,
> aren't they?

Ah, crud, yes.  Holding a spinlock across the entire start() ... end() would be
bad, especially when the notifier can block since that opens up the possibility
of the task sleeping/blocking/yielding while the spinlock is held.  Bummer.


* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-31 21:05             ` Sean Christopherson
@ 2021-03-31 21:22               ` Sean Christopherson
  2021-03-31 21:36                 ` Paolo Bonzini
  2021-03-31 21:35               ` Paolo Bonzini
  1 sibling, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2021-03-31 21:22 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On Wed, Mar 31, 2021, Sean Christopherson wrote:
> On Wed, Mar 31, 2021, Paolo Bonzini wrote:
> > On 31/03/21 21:47, Sean Christopherson wrote:
> > I also thought of busy waiting on down_read_trylock if the MMU notifier
> > cannot block, but that would also be invalid for the opposite reason (the
> > down_write task might be asleep, waiting for other readers to release the
> > lock, and the down_read_trylock busy loop might not let that task run).
> > 
> > > And that's _already_ the worst case since notifications are currently
> > > serialized by mmu_lock.
> > 
> > But right now notifications are not a single critical section, they're two,
> > aren't they?
> 
> Ah, crud, yes.  Holding a spinlock across the entire start() ... end() would
> be bad, especially when the notifier can block, since that opens up the
> possibility of the task sleeping/blocking/yielding while the spinlock is
> held.  Bummer.

On a related topic, any preference on whether to have an explicit "must_lock"
flag (what I posted), or derive the logic based on other params?

The helper I posted does:

	if (range->must_lock &&
	    kvm_mmu_lock_and_check_handler(kvm, range, &locked))
		goto out_unlock;

but it could be:

	if (!IS_KVM_NULL_FN(range->on_lock) && !range->may_block &&
	    kvm_mmu_lock_and_check_handler(kvm, range, &locked))
		goto out_unlock;

The generated code should be nearly identical on a modern compiler, so it's
purely a question of aesthetics.  I slightly prefer the explicit "must_lock" to
avoid spreading out the logic too much, but it also feels a bit superfluous.
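
(For reference, a sketch of how the two variants could look side by side;
the struct and names below are illustrative only, not the posted patch, and
IS_KVM_NULL_FN is assumed to behave as in the helper above:)

	struct kvm;

	/* Hypothetical range descriptor mirroring the snippets above. */
	struct demo_hva_range {
		unsigned long start;
		unsigned long end;
		bool may_block;
		bool must_lock;		/* option 1: set once at the call site */
		void (*on_lock)(struct kvm *kvm, unsigned long start,
				unsigned long end);
	};

	/* Option 2: derive the same condition at the point of use. */
	static bool demo_must_lock(const struct demo_hva_range *range)
	{
		return !IS_KVM_NULL_FN(range->on_lock) && !range->may_block;
	}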

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-31 21:05             ` Sean Christopherson
  2021-03-31 21:22               ` Sean Christopherson
@ 2021-03-31 21:35               ` Paolo Bonzini
  2021-03-31 21:47                 ` Sean Christopherson
  1 sibling, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2021-03-31 21:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 31/03/21 23:05, Sean Christopherson wrote:
>> Wouldn't it be incorrect to lock a mutex (e.g. inside *another* MMU
>> notifier's invalidate callback) while holding an rwlock_t?  That makes sense
>> because anybody that's busy waiting in write_lock potentially cannot be
>> preempted until the other task gets the mutex.  This is a potential
>> deadlock.
>
> Yes?  I don't think I follow your point though.  Nesting a spinlock or rwlock
> inside a rwlock is ok, so long as the locks are always taken in the same order,
> i.e. it's never mmu_lock -> mmu_notifier_slots_lock.

*Another* MMU notifier could nest a mutex inside KVM's rwlock.

But... is it correct that the MMU notifier invalidate callbacks are
always called with the mmap_sem taken (sometimes for reading, e.g.
try_to_merge_with_ksm_page->try_to_merge_one_page->write_protect_page)?
We could take it temporarily in install_memslots, since the MMU
notifier's mm is stored in kvm->mm.

In this case, a pair of kvm_mmu_notifier_lock/unlock functions would be 
the best way to abstract it.
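
(Purely as a sketch, and assuming the mmap_sem premise above held: the
helper names come from this suggestion, mmap_write_lock() is the stock
mm API:)

	#include <linux/kvm_host.h>
	#include <linux/mmap_lock.h>

	/*
	 * Hypothetical: serialize memslot updates against invalidate
	 * callbacks by piggybacking on the mm's mmap_lock.
	 */
	static void kvm_mmu_notifier_lock(struct kvm *kvm)
	{
		mmap_write_lock(kvm->mm);
	}

	static void kvm_mmu_notifier_unlock(struct kvm *kvm)
	{
		mmap_write_unlock(kvm->mm);
	}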

Paolo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-31 21:22               ` Sean Christopherson
@ 2021-03-31 21:36                 ` Paolo Bonzini
  0 siblings, 0 replies; 44+ messages in thread
From: Paolo Bonzini @ 2021-03-31 21:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On 31/03/21 23:22, Sean Christopherson wrote:
> On a related topic, any preference on whether to have an explicit "must_lock"
> flag (what I posted), or derive the logic based on other params?
> 
> The helper I posted does:
> 
> 	if (range->must_lock &&
> 	    kvm_mmu_lock_and_check_handler(kvm, range, &locked))
> 		goto out_unlock;
> 
> but it could be:
> 
> 	if (!IS_KVM_NULL_FN(range->on_lock) && !range->may_block &&
> 	    kvm_mmu_lock_and_check_handler(kvm, range, &locked))
> 		goto out_unlock;
> 
> The generated code should be nearly identical on a modern compiler, so it's
> purely a question of aesthetics.  I slightly prefer the explicit "must_lock" to
> avoid spreading out the logic too much, but it also feels a bit superfluous.

I do as well, but I hope we don't need any lock after all, as suggested in
the email I've just sent.

Paolo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary
  2021-03-31 21:35               ` Paolo Bonzini
@ 2021-03-31 21:47                 ` Sean Christopherson
  0 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2021-03-31 21:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	James Morse, Julien Thierry, Suzuki K Poulose, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, linux-arm-kernel, kvmarm,
	linux-mips, kvm, kvm-ppc, linux-kernel, Ben Gardon

On Wed, Mar 31, 2021, Paolo Bonzini wrote:
> On 31/03/21 23:05, Sean Christopherson wrote:
> > > Wouldn't it be incorrect to lock a mutex (e.g. inside *another* MMU
> > > notifier's invalidate callback) while holding an rwlock_t?  That makes sense
> > > because anybody that's busy waiting in write_lock potentially cannot be
> > > preempted until the other task gets the mutex.  This is a potential
> > > deadlock.
> > 
> > Yes?  I don't think I follow your point though.  Nesting a spinlock or rwlock
> > inside a rwlock is ok, so long as the locks are always taken in the same order,
> > i.e. it's never mmu_lock -> mmu_notifier_slots_lock.
> 
> *Another* MMU notifier could nest a mutex inside KVM's rwlock.
> 
> But... is it correct that the MMU notifier invalidate callbacks are always
> called with the mmap_sem taken (sometimes for reading, e.g.
> try_to_merge_with_ksm_page->try_to_merge_one_page->write_protect_page)?

No :-(

File-based invalidations through the rmaps do not take mmap_sem.  They get at
the VMAs via the address_space's interval tree, which is protected by its own
i_mmap_rwsem.

E.g. try_to_unmap() -> rmap_walk_file() -> try_to_unmap_one() 
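
(A simplified sketch of that path, loosely modeled on rmap_walk_file();
the walk body is elided, the point is which lock is taken:)

	#include <linux/fs.h>
	#include <linux/mm.h>

	static void demo_walk_file_rmap(struct address_space *mapping,
					pgoff_t first, pgoff_t last)
	{
		struct vm_area_struct *vma;

		i_mmap_lock_read(mapping);	/* i_mmap_rwsem, not mmap_sem */
		vma_interval_tree_foreach(vma, &mapping->i_mmap, first, last) {
			/* ... unmap and/or notify on this VMA ... */
		}
		i_mmap_unlock_read(mapping);
	}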

> We could take it temporarily in install_memslots, since the MMU notifier's mm
> is stored in kvm->mm.
> 
> In this case, a pair of kvm_mmu_notifier_lock/unlock functions would be the
> best way to abstract it.
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2021-03-31 21:48 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-26  2:19 [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Sean Christopherson
2021-03-26  2:19 ` [PATCH 01/18] KVM: x86/mmu: Coalesce TDP MMU TLB flushes when zapping collapsible SPTEs Sean Christopherson
2021-03-26  2:19 ` [PATCH 02/18] KVM: x86/mmu: Move flushing for "slot" handlers to caller for legacy MMU Sean Christopherson
2021-03-26  2:19 ` [PATCH 03/18] KVM: x86/mmu: Coalesce TLB flushes when zapping collapsible SPTEs Sean Christopherson
2021-03-26  2:19 ` [PATCH 04/18] KVM: x86/mmu: Coalesce TLB flushes across address spaces for gfn range zap Sean Christopherson
2021-03-26  2:19 ` [PATCH 05/18] KVM: x86/mmu: Pass address space ID to __kvm_tdp_mmu_zap_gfn_range() Sean Christopherson
2021-03-26  2:19 ` [PATCH 06/18] KVM: x86/mmu: Pass address space ID to TDP MMU root walkers Sean Christopherson
2021-03-26  2:19 ` [PATCH 07/18] KVM: x86/mmu: Use leaf-only loop for walking TDP SPTEs when changing SPTE Sean Christopherson
2021-03-26  2:19 ` [PATCH 08/18] KVM: Move prototypes for MMU notifier callbacks to generic code Sean Christopherson
2021-03-26  2:19 ` [PATCH 09/18] KVM: Move arm64's MMU notifier trace events " Sean Christopherson
2021-03-26  2:19 ` [PATCH 10/18] KVM: Move x86's MMU notifier memslot walkers " Sean Christopherson
2021-03-31  7:52   ` Paolo Bonzini
2021-03-31 16:20     ` Sean Christopherson
2021-03-31 16:36       ` Paolo Bonzini
2021-03-26  2:19 ` [PATCH 11/18] KVM: arm64: Convert to the gfn-based MMU notifier callbacks Sean Christopherson
2021-03-26  2:19 ` [PATCH 12/18] KVM: MIPS/MMU: " Sean Christopherson
2021-03-31  7:41   ` Paolo Bonzini
2021-03-26  2:19 ` [PATCH 13/18] KVM: PPC: " Sean Christopherson
2021-03-26  2:19 ` [PATCH 14/18] KVM: Kill off the old hva-based " Sean Christopherson
2021-03-26  2:19 ` [PATCH 15/18] KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot Sean Christopherson
2021-03-26  2:19 ` [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary Sean Christopherson
2021-03-31  7:52   ` Paolo Bonzini
2021-03-31  8:35   ` Paolo Bonzini
2021-03-31 16:41     ` Sean Christopherson
2021-03-31 16:47       ` Paolo Bonzini
2021-03-31 19:47         ` Sean Christopherson
2021-03-31 20:42           ` Paolo Bonzini
2021-03-31 21:05             ` Sean Christopherson
2021-03-31 21:22               ` Sean Christopherson
2021-03-31 21:36                 ` Paolo Bonzini
2021-03-31 21:35               ` Paolo Bonzini
2021-03-31 21:47                 ` Sean Christopherson
2021-03-31 20:15     ` Sean Christopherson
2021-03-31 20:30       ` Paolo Bonzini
2021-03-31 20:52     ` Sean Christopherson
2021-03-31 21:00       ` Paolo Bonzini
2021-03-26  2:19 ` [PATCH 17/18] KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible Sean Christopherson
2021-03-26  2:19 ` [PATCH 18/18] KVM: x86/mmu: Drop trace_kvm_age_page() tracepoint Sean Christopherson
2021-03-30 18:32 ` [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers Ben Gardon
2021-03-30 19:48   ` Paolo Bonzini
2021-03-30 19:58   ` Sean Christopherson
2021-03-31  7:57 ` Paolo Bonzini
2021-03-31  9:34   ` Marc Zyngier
2021-03-31  9:41     ` Paolo Bonzini
