* [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify()
@ 2024-04-05 11:58 Paolo Bonzini
  2024-04-05 11:58 ` [PATCH 1/4] KVM: delete .change_pte MMU notifier callback Paolo Bonzini
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Paolo Bonzini @ 2024-04-05 11:58 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Marc Zyngier, Oliver Upton, Tianrui Zhao, Bibo Mao,
	Thomas Bogendoerfer, Nicholas Piggin, Anup Patel, Atish Patra,
	Sean Christopherson, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

The .change_pte() MMU notifier callback was intended as an optimization
and for this reason it was initially called without a surrounding
mmu_notifier_invalidate_range_{start,end}() pair.  It was only ever
implemented by KVM (which was also the original user of MMU notifiers)
and the rules on when to call set_pte_at_notify() rather than set_pte_at()
have always been pretty obscure.

It may seem a miracle that it has never caused any hard-to-trigger
bugs, but there's a good reason for that: KVM's implementation has
been nonfunctional for a good part of its existence.  Already in
2012, commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end", 2012-10-09) changed the
.change_pte() callback to occur within an invalidate_range_start/end()
pair; and because KVM unmaps the sPTEs during .invalidate_range_start(),
.change_pte() has no hope of finding a sPTE to change.
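
As a purely illustrative toy model (userspace C, invented names, not the
kernel API), the ordering that makes .change_pte() dead looks like this:

  #include <stdbool.h>
  #include <stdio.h>

  /* One "secondary PTE" tracked by a KVM-like secondary MMU. */
  static bool spte_present = true;

  static void invalidate_range_start(void)
  {
          /* KVM zaps the secondary mapping here, before the PTE changes. */
          spte_present = false;
  }

  static void change_pte(void)
  {
          /* Called inside the start/end pair, so nothing is left to update. */
          if (!spte_present)
                  printf("change_pte: no sPTE to update\n");
  }

  static void invalidate_range_end(void)
  {
          /* The next guest fault re-creates the mapping lazily. */
  }

  int main(void)
  {
          invalidate_range_start();
          change_pte();                /* always a no-op in practice */
          invalidate_range_end();
          return 0;
  }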

Therefore, all the code for .change_pte() can be removed from both KVM
and mm/, and set_pte_at_notify() can be replaced with just set_pte_at().

Please review!  Also feel free to take the KVM patches through the mm
tree, as I don't expect any conflicts.

Thanks,

Paolo

Paolo Bonzini (4):
  KVM: delete .change_pte MMU notifier callback
  KVM: remove unused argument of kvm_handle_hva_range()
  mmu_notifier: remove the .change_pte() callback
  mm: replace set_pte_at_notify() with just set_pte_at()

 arch/arm64/kvm/mmu.c                  | 34 -----------------
 arch/loongarch/include/asm/kvm_host.h |  1 -
 arch/loongarch/kvm/mmu.c              | 32 ----------------
 arch/mips/kvm/mmu.c                   | 30 ---------------
 arch/powerpc/include/asm/kvm_ppc.h    |  1 -
 arch/powerpc/kvm/book3s.c             |  5 ---
 arch/powerpc/kvm/book3s.h             |  1 -
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 ------
 arch/powerpc/kvm/book3s_hv.c          |  1 -
 arch/powerpc/kvm/book3s_pr.c          |  7 ----
 arch/powerpc/kvm/e500_mmu_host.c      |  6 ---
 arch/riscv/kvm/mmu.c                  | 20 ----------
 arch/x86/kvm/mmu/mmu.c                | 54 +--------------------------
 arch/x86/kvm/mmu/spte.c               | 16 --------
 arch/x86/kvm/mmu/spte.h               |  2 -
 arch/x86/kvm/mmu/tdp_mmu.c            | 46 -----------------------
 arch/x86/kvm/mmu/tdp_mmu.h            |  1 -
 include/linux/kvm_host.h              |  2 -
 include/linux/mmu_notifier.h          | 44 ----------------------
 include/trace/events/kvm.h            | 15 --------
 kernel/events/uprobes.c               |  5 +--
 mm/ksm.c                              |  4 +-
 mm/memory.c                           |  7 +---
 mm/migrate_device.c                   |  8 +---
 mm/mmu_notifier.c                     | 17 ---------
 virt/kvm/kvm_main.c                   | 50 +------------------------
 26 files changed, 10 insertions(+), 411 deletions(-)

-- 
2.43.0



* [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-05 11:58 [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify() Paolo Bonzini
@ 2024-04-05 11:58 ` Paolo Bonzini
  2024-04-07  4:50   ` Anup Patel
                     ` (4 more replies)
  2024-04-05 11:58 ` [PATCH 2/4] KVM: remove unused argument of kvm_handle_hva_range() Paolo Bonzini
                   ` (4 subsequent siblings)
  5 siblings, 5 replies; 28+ messages in thread
From: Paolo Bonzini @ 2024-04-05 11:58 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Marc Zyngier, Oliver Upton, Tianrui Zhao, Bibo Mao,
	Thomas Bogendoerfer, Nicholas Piggin, Anup Patel, Atish Patra,
	Sean Christopherson, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().
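
As a toy model of that original intent (userspace C, invented names, not
the real KSM/KVM code), the point was that the secondary MMU could follow
the primary PTE to its new target without a zap-and-refault cycle:

  #include <stdio.h>

  static unsigned long primary_pfn = 100;    /* host page table */
  static unsigned long secondary_pfn = 100;  /* KVM-like sPTE */

  static void change_pte(unsigned long new_pfn)
  {
          /* Update the secondary mapping in place; no zap, no refault. */
          secondary_pfn = new_pfn;
  }

  static void set_pte_at_notify(unsigned long new_pfn)
  {
          change_pte(new_pfn);      /* notify the secondary MMU first */
          primary_pfn = new_pfn;    /* then update the primary PTE */
  }

  int main(void)
  {
          /* e.g. KSM merging a page into a shared, write-protected copy */
          set_pte_at_notify(42);
          printf("primary=%lu secondary=%lu\n", primary_pfn, secondary_pfn);
          return 0;
  }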

Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.

In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.

This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/arm64/kvm/mmu.c                  | 34 -----------------
 arch/loongarch/include/asm/kvm_host.h |  1 -
 arch/loongarch/kvm/mmu.c              | 32 ----------------
 arch/mips/kvm/mmu.c                   | 30 ---------------
 arch/powerpc/include/asm/kvm_ppc.h    |  1 -
 arch/powerpc/kvm/book3s.c             |  5 ---
 arch/powerpc/kvm/book3s.h             |  1 -
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 ------
 arch/powerpc/kvm/book3s_hv.c          |  1 -
 arch/powerpc/kvm/book3s_pr.c          |  7 ----
 arch/powerpc/kvm/e500_mmu_host.c      |  6 ---
 arch/riscv/kvm/mmu.c                  | 20 ----------
 arch/x86/kvm/mmu/mmu.c                | 54 +--------------------------
 arch/x86/kvm/mmu/spte.c               | 16 --------
 arch/x86/kvm/mmu/spte.h               |  2 -
 arch/x86/kvm/mmu/tdp_mmu.c            | 46 -----------------------
 arch/x86/kvm/mmu/tdp_mmu.h            |  1 -
 include/linux/kvm_host.h              |  2 -
 include/trace/events/kvm.h            | 15 --------
 virt/kvm/kvm_main.c                   | 43 ---------------------
 20 files changed, 2 insertions(+), 327 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index dc04bc767865..ff17849be9f4 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return false;
 }
 
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-	kvm_pfn_t pfn = pte_pfn(range->arg.pte);
-
-	if (!kvm->arch.mmu.pgt)
-		return false;
-
-	WARN_ON(range->end - range->start != 1);
-
-	/*
-	 * If the page isn't tagged, defer to user_mem_abort() for sanitising
-	 * the MTE tags. The S2 pte should have been unmapped by
-	 * mmu_notifier_invalidate_range_end().
-	 */
-	if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
-		return false;
-
-	/*
-	 * We've moved a page around, probably through CoW, so let's treat
-	 * it just like a translation fault and the map handler will clean
-	 * the cache to the PoC.
-	 *
-	 * The MMU notifiers will have unmapped a huge PMD before calling
-	 * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
-	 * therefore we never need to clear out a huge PMD through this
-	 * calling path and a memcache is not required.
-	 */
-	kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
-			       PAGE_SIZE, __pfn_to_phys(pfn),
-			       KVM_PGTABLE_PROT_R, NULL, 0);
-
-	return false;
-}
-
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
diff --git a/arch/loongarch/include/asm/kvm_host.h b/arch/loongarch/include/asm/kvm_host.h
index 2d62f7b0d377..69305441f40d 100644
--- a/arch/loongarch/include/asm/kvm_host.h
+++ b/arch/loongarch/include/asm/kvm_host.h
@@ -203,7 +203,6 @@ void kvm_flush_tlb_all(void);
 void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa);
 int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool write);
 
-void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end, bool blockable);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
index a556cff35740..98883aa23ab8 100644
--- a/arch/loongarch/kvm/mmu.c
+++ b/arch/loongarch/kvm/mmu.c
@@ -494,38 +494,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 			range->end << PAGE_SHIFT, &ctx);
 }
 
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-	unsigned long prot_bits;
-	kvm_pte_t *ptep;
-	kvm_pfn_t pfn = pte_pfn(range->arg.pte);
-	gpa_t gpa = range->start << PAGE_SHIFT;
-
-	ptep = kvm_populate_gpa(kvm, NULL, gpa, 0);
-	if (!ptep)
-		return false;
-
-	/* Replacing an absent or old page doesn't need flushes */
-	if (!kvm_pte_present(NULL, ptep) || !kvm_pte_young(*ptep)) {
-		kvm_set_pte(ptep, 0);
-		return false;
-	}
-
-	/* Fill new pte if write protected or page migrated */
-	prot_bits = _PAGE_PRESENT | __READABLE;
-	prot_bits |= _CACHE_MASK & pte_val(range->arg.pte);
-
-	/*
-	 * Set _PAGE_WRITE or _PAGE_DIRTY iff old and new pte both support
-	 * _PAGE_WRITE for map_page_fast if next page write fault
-	 * _PAGE_DIRTY since gpa has already recorded as dirty page
-	 */
-	prot_bits |= __WRITEABLE & *ptep & pte_val(range->arg.pte);
-	kvm_set_pte(ptep, kvm_pfn_pte(pfn, __pgprot(prot_bits)));
-
-	return true;
-}
-
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	kvm_ptw_ctx ctx;
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index 467ee6b95ae1..c17157e700c0 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -444,36 +444,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return true;
 }
 
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-	gpa_t gpa = range->start << PAGE_SHIFT;
-	pte_t hva_pte = range->arg.pte;
-	pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
-	pte_t old_pte;
-
-	if (!gpa_pte)
-		return false;
-
-	/* Mapping may need adjusting depending on memslot flags */
-	old_pte = *gpa_pte;
-	if (range->slot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte))
-		hva_pte = pte_mkclean(hva_pte);
-	else if (range->slot->flags & KVM_MEM_READONLY)
-		hva_pte = pte_wrprotect(hva_pte);
-
-	set_pte(gpa_pte, hva_pte);
-
-	/* Replacing an absent or old page doesn't need flushes */
-	if (!pte_present(old_pte) || !pte_young(old_pte))
-		return false;
-
-	/* Pages swapped, aged, moved, or cleaned require flushes */
-	return !pte_present(hva_pte) ||
-	       !pte_young(hva_pte) ||
-	       pte_pfn(old_pte) != pte_pfn(hva_pte) ||
-	       (pte_dirty(old_pte) && !pte_dirty(hva_pte));
-}
-
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end);
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 3281215097cc..ca3829d47ab7 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -287,7 +287,6 @@ struct kvmppc_ops {
 	bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
 	bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
 	bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
-	bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
 	void (*free_memslot)(struct kvm_memory_slot *slot);
 	int (*init_vm)(struct kvm *kvm);
 	void (*destroy_vm)(struct kvm *kvm);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 8acec144120e..0d0624088e6b 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -899,11 +899,6 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
 }
 
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-	return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
-}
-
 int kvmppc_core_init_vm(struct kvm *kvm)
 {
 
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 58391b4b32ed..4aa2ab89afbc 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -12,7 +12,6 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
-extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 
 extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
 extern void kvmppc_mmu_destroy_pr(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 2b1f0cdd8c18..1b51b1c4713b 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -1010,18 +1010,6 @@ bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range)
 		return kvm_test_age_rmapp(kvm, range->slot, range->start);
 }
 
-bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-	WARN_ON(range->start + 1 != range->end);
-
-	if (kvm_is_radix(kvm))
-		kvm_unmap_radix(kvm, range->slot, range->start);
-	else
-		kvm_unmap_rmapp(kvm, range->slot, range->start);
-
-	return false;
-}
-
 static int vcpus_running(struct kvm *kvm)
 {
 	return atomic_read(&kvm->arch.vcpus_running) != 0;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8e86eb577eb8..35cb014a0c51 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -6364,7 +6364,6 @@ static struct kvmppc_ops kvm_ops_hv = {
 	.unmap_gfn_range = kvm_unmap_gfn_range_hv,
 	.age_gfn = kvm_age_gfn_hv,
 	.test_age_gfn = kvm_test_age_gfn_hv,
-	.set_spte_gfn = kvm_set_spte_gfn_hv,
 	.free_memslot = kvmppc_core_free_memslot_hv,
 	.init_vm =  kvmppc_core_init_vm_hv,
 	.destroy_vm = kvmppc_core_destroy_vm_hv,
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 5b92619a05fd..a7d7137ea0c8 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -461,12 +461,6 @@ static bool kvm_test_age_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range)
 	return false;
 }
 
-static bool kvm_set_spte_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-	/* The page will get remapped properly on its next fault */
-	return do_kvm_unmap_gfn(kvm, range);
-}
-
 /*****************************************/
 
 static void kvmppc_set_msr_pr(struct kvm_vcpu *vcpu, u64 msr)
@@ -2071,7 +2065,6 @@ static struct kvmppc_ops kvm_ops_pr = {
 	.unmap_gfn_range = kvm_unmap_gfn_range_pr,
 	.age_gfn  = kvm_age_gfn_pr,
 	.test_age_gfn = kvm_test_age_gfn_pr,
-	.set_spte_gfn = kvm_set_spte_gfn_pr,
 	.free_memslot = kvmppc_core_free_memslot_pr,
 	.init_vm = kvmppc_core_init_vm_pr,
 	.destroy_vm = kvmppc_core_destroy_vm_pr,
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index ccb8f16ffe41..c664fdec75b1 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -747,12 +747,6 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	return false;
 }
 
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-	/* The page will get remapped properly on its next fault */
-	return kvm_e500_mmu_unmap_gfn(kvm, range);
-}
-
 /*****************************************/
 
 int e500_mmu_host_init(struct kvmppc_vcpu_e500 *vcpu_e500)
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index a9e2fd7245e1..b63650f9b966 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -550,26 +550,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return false;
 }
 
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-	int ret;
-	kvm_pfn_t pfn = pte_pfn(range->arg.pte);
-
-	if (!kvm->arch.pgd)
-		return false;
-
-	WARN_ON(range->end - range->start != 1);
-
-	ret = gstage_map_page(kvm, NULL, range->start << PAGE_SHIFT,
-			      __pfn_to_phys(pfn), PAGE_SIZE, true, true);
-	if (ret) {
-		kvm_debug("Failed to map G-stage page (error %d)\n", ret);
-		return true;
-	}
-
-	return false;
-}
-
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	pte_t *ptep;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0049d49aa913..87ba2a9da196 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -432,8 +432,8 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
  * The idea using the light way get the spte on x86_32 guest is from
  * gup_get_pte (mm/gup.c).
  *
- * An spte tlb flush may be pending, because kvm_set_pte_rmap
- * coalesces them and we are running out of the MMU lock.  Therefore
+ * An spte tlb flush may be pending, because they are coalesced and
+ * we are running out of the MMU lock.  Therefore
  * we need to protect against in-progress updates of the spte.
  *
  * Reading the spte while an update is in progress may get the old value
@@ -1454,43 +1454,6 @@ static bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	return __kvm_zap_rmap(kvm, rmap_head, slot);
 }
 
-static bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			     struct kvm_memory_slot *slot, gfn_t gfn, int level,
-			     pte_t pte)
-{
-	u64 *sptep;
-	struct rmap_iterator iter;
-	bool need_flush = false;
-	u64 new_spte;
-	kvm_pfn_t new_pfn;
-
-	WARN_ON_ONCE(pte_huge(pte));
-	new_pfn = pte_pfn(pte);
-
-restart:
-	for_each_rmap_spte(rmap_head, &iter, sptep) {
-		need_flush = true;
-
-		if (pte_write(pte)) {
-			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
-			goto restart;
-		} else {
-			new_spte = kvm_mmu_changed_pte_notifier_make_spte(
-					*sptep, new_pfn);
-
-			mmu_spte_clear_track_bits(kvm, sptep);
-			mmu_spte_set(sptep, new_spte);
-		}
-	}
-
-	if (need_flush && kvm_available_flush_remote_tlbs_range()) {
-		kvm_flush_remote_tlbs_gfn(kvm, gfn, level);
-		return false;
-	}
-
-	return need_flush;
-}
-
 struct slot_rmap_walk_iterator {
 	/* input fields. */
 	const struct kvm_memory_slot *slot;
@@ -1596,19 +1559,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return flush;
 }
 
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-	bool flush = false;
-
-	if (kvm_memslots_have_rmaps(kvm))
-		flush = kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmap);
-
-	if (tdp_mmu_enabled)
-		flush |= kvm_tdp_mmu_set_spte_gfn(kvm, range);
-
-	return flush;
-}
-
 static bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 			 struct kvm_memory_slot *slot, gfn_t gfn, int level,
 			 pte_t unused)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 318135daf685..283af5b90016 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -322,22 +322,6 @@ u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
 	return spte;
 }
 
-u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn)
-{
-	u64 new_spte;
-
-	new_spte = old_spte & ~SPTE_BASE_ADDR_MASK;
-	new_spte |= (u64)new_pfn << PAGE_SHIFT;
-
-	new_spte &= ~PT_WRITABLE_MASK;
-	new_spte &= ~shadow_host_writable_mask;
-	new_spte &= ~shadow_mmu_writable_mask;
-
-	new_spte = mark_spte_for_access_track(new_spte);
-
-	return new_spte;
-}
-
 u64 mark_spte_for_access_track(u64 spte)
 {
 	if (spte_ad_enabled(spte))
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1a163aee9ec6..92da4c419171 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -511,8 +511,6 @@ static inline u64 restore_acc_track_spte(u64 spte)
 	return spte;
 }
 
-u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
-
 void __init kvm_mmu_spte_module_init(void);
 void kvm_mmu_reset_all_pte_masks(void);
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3627744fcab6..fbb86932b766 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1250,52 +1250,6 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
 }
 
-static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
-			 struct kvm_gfn_range *range)
-{
-	u64 new_spte;
-
-	/* Huge pages aren't expected to be modified without first being zapped. */
-	WARN_ON_ONCE(pte_huge(range->arg.pte) || range->start + 1 != range->end);
-
-	if (iter->level != PG_LEVEL_4K ||
-	    !is_shadow_present_pte(iter->old_spte))
-		return false;
-
-	/*
-	 * Note, when changing a read-only SPTE, it's not strictly necessary to
-	 * zero the SPTE before setting the new PFN, but doing so preserves the
-	 * invariant that the PFN of a present * leaf SPTE can never change.
-	 * See handle_changed_spte().
-	 */
-	tdp_mmu_iter_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);
-
-	if (!pte_write(range->arg.pte)) {
-		new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
-								  pte_pfn(range->arg.pte));
-
-		tdp_mmu_iter_set_spte(kvm, iter, new_spte);
-	}
-
-	return true;
-}
-
-/*
- * Handle the changed_pte MMU notifier for the TDP MMU.
- * data is a pointer to the new pte_t mapping the HVA specified by the MMU
- * notifier.
- * Returns non-zero if a flush is needed before releasing the MMU lock.
- */
-bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-	/*
-	 * No need to handle the remote TLB flush under RCU protection, the
-	 * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
-	 * shadow page. See the WARN on pfn_changed in handle_changed_spte().
-	 */
-	return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
-}
-
 /*
  * Remove write access from all SPTEs at or above min_level that map GFNs
  * [start, end). Returns true if an SPTE has been changed and the TLBs need to
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 6e1ea04ca885..58b55e61bd33 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -31,7 +31,6 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 				 bool flush);
 bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 
 bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 			     const struct kvm_memory_slot *slot, int min_level);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ab439706ea2f..8dea11701ab2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -259,7 +259,6 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 union kvm_mmu_notifier_arg {
-	pte_t pte;
 	unsigned long attributes;
 };
 
@@ -273,7 +272,6 @@ struct kvm_gfn_range {
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 #endif
 
 enum {
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 011fba6b5552..74e40d5d4af4 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -456,21 +456,6 @@ TRACE_EVENT(kvm_unmap_hva_range,
 		  __entry->start, __entry->end)
 );
 
-TRACE_EVENT(kvm_set_spte_hva,
-	TP_PROTO(unsigned long hva),
-	TP_ARGS(hva),
-
-	TP_STRUCT__entry(
-		__field(	unsigned long,	hva		)
-	),
-
-	TP_fast_assign(
-		__entry->hva		= hva;
-	),
-
-	TP_printk("mmu notifier set pte hva: %#016lx", __entry->hva)
-);
-
 TRACE_EVENT(kvm_age_hva,
 	TP_PROTO(unsigned long start, unsigned long end),
 	TP_ARGS(start, end),
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4eb8afd0b961..2fcd9979752a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -717,48 +717,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 	return __kvm_handle_hva_range(kvm, &range).ret;
 }
 
-static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-	/*
-	 * Skipping invalid memslots is correct if and only change_pte() is
-	 * surrounded by invalidate_range_{start,end}(), which is currently
-	 * guaranteed by the primary MMU.  If that ever changes, KVM needs to
-	 * unmap the memslot instead of skipping the memslot to ensure that KVM
-	 * doesn't hold references to the old PFN.
-	 */
-	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
-
-	if (range->slot->flags & KVM_MEMSLOT_INVALID)
-		return false;
-
-	return kvm_set_spte_gfn(kvm, range);
-}
-
-static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
-					struct mm_struct *mm,
-					unsigned long address,
-					pte_t pte)
-{
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const union kvm_mmu_notifier_arg arg = { .pte = pte };
-
-	trace_kvm_set_spte_hva(address);
-
-	/*
-	 * .change_pte() must be surrounded by .invalidate_range_{start,end}().
-	 * If mmu_invalidate_in_progress is zero, then no in-progress
-	 * invalidations, including this one, found a relevant memslot at
-	 * start(); rechecking memslots here is unnecessary.  Note, a false
-	 * positive (count elevated by a different invalidation) is sub-optimal
-	 * but functionally ok.
-	 */
-	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
-	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
-		return;
-
-	kvm_handle_hva_range(mn, address, address + 1, arg, kvm_change_spte_gfn);
-}
-
 void kvm_mmu_invalidate_begin(struct kvm *kvm)
 {
 	lockdep_assert_held_write(&kvm->mmu_lock);
@@ -976,7 +934,6 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
 	.clear_young		= kvm_mmu_notifier_clear_young,
 	.test_young		= kvm_mmu_notifier_test_young,
-	.change_pte		= kvm_mmu_notifier_change_pte,
 	.release		= kvm_mmu_notifier_release,
 };
 
-- 
2.43.0




* [PATCH 2/4] KVM: remove unused argument of kvm_handle_hva_range()
  2024-04-05 11:58 [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify() Paolo Bonzini
  2024-04-05 11:58 ` [PATCH 1/4] KVM: delete .change_pte MMU notifier callback Paolo Bonzini
@ 2024-04-05 11:58 ` Paolo Bonzini
  2024-04-08  6:31   ` Philippe Mathieu-Daudé
  2024-04-05 11:58 ` [PATCH 3/4] mmu_notifier: remove the .change_pte() callback Paolo Bonzini
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Paolo Bonzini @ 2024-04-05 11:58 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Marc Zyngier, Oliver Upton, Tianrui Zhao, Bibo Mao,
	Thomas Bogendoerfer, Nicholas Piggin, Anup Patel, Atish Patra,
	Sean Christopherson, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

The only user of the argument was kvm_mmu_notifier_change_pte(), which
is now gone.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 virt/kvm/kvm_main.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2fcd9979752a..9701888811ad 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -595,8 +595,6 @@ static void kvm_null_fn(void)
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
 
-static const union kvm_mmu_notifier_arg KVM_MMU_NOTIFIER_NO_ARG;
-
 /* Iterate over each memslot intersecting [start, last] (inclusive) range */
 #define kvm_for_each_memslot_in_hva_range(node, slots, start, last)	     \
 	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
@@ -682,14 +680,12 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 						unsigned long start,
 						unsigned long end,
-						union kvm_mmu_notifier_arg arg,
 						gfn_handler_t handler)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	const struct kvm_mmu_notifier_range range = {
 		.start		= start,
 		.end		= end,
-		.arg		= arg,
 		.handler	= handler,
 		.on_lock	= (void *)kvm_null_fn,
 		.flush_on_ret	= true,
@@ -880,8 +876,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 {
 	trace_kvm_age_hva(start, end);
 
-	return kvm_handle_hva_range(mn, start, end, KVM_MMU_NOTIFIER_NO_ARG,
-				    kvm_age_gfn);
+	return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
-- 
2.43.0




* [PATCH 3/4] mmu_notifier: remove the .change_pte() callback
  2024-04-05 11:58 [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify() Paolo Bonzini
  2024-04-05 11:58 ` [PATCH 1/4] KVM: delete .change_pte MMU notifier callback Paolo Bonzini
  2024-04-05 11:58 ` [PATCH 2/4] KVM: remove unused argument of kvm_handle_hva_range() Paolo Bonzini
@ 2024-04-05 11:58 ` Paolo Bonzini
  2024-04-08  7:35   ` David Hildenbrand
  2024-04-05 11:58 ` [PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at() Paolo Bonzini
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Paolo Bonzini @ 2024-04-05 11:58 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Marc Zyngier, Oliver Upton, Tianrui Zhao, Bibo Mao,
	Thomas Bogendoerfer, Nicholas Piggin, Anup Patel, Atish Patra,
	Sean Christopherson, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

The scope of set_pte_at_notify() has shrunk more and more through the
years.  Initially, it was meant for when the change to the PTE was
not bracketed by mmu_notifier_invalidate_range_{start,end}().  However,
that has not been the case for over ten years.  Throughout this period
the only implementation of .change_pte() was KVM's, and it had no
actual functionality, because it was called after
mmu_notifier_invalidate_range_start() had already zapped the secondary PTE.

Now that this (nonfunctional) user of the .change_pte() callback is
gone, the whole callback can be removed.  For now, leave in place
set_pte_at_notify() even though it is just a synonym for set_pte_at().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/linux/mmu_notifier.h | 46 ++----------------------------------
 mm/mmu_notifier.c            | 17 -------------
 2 files changed, 2 insertions(+), 61 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f349e08a9dfe..8c72bf651606 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -122,15 +122,6 @@ struct mmu_notifier_ops {
 			  struct mm_struct *mm,
 			  unsigned long address);
 
-	/*
-	 * change_pte is called in cases that pte mapping to page is changed:
-	 * for example, when ksm remaps pte to point to a new shared page.
-	 */
-	void (*change_pte)(struct mmu_notifier *subscription,
-			   struct mm_struct *mm,
-			   unsigned long address,
-			   pte_t pte);
-
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
 	 * paired and are called only when the mmap_lock and/or the
@@ -392,8 +383,6 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
 				      unsigned long end);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
-extern void __mmu_notifier_change_pte(struct mm_struct *mm,
-				      unsigned long address, pte_t pte);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
 extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r);
 extern void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
@@ -439,13 +428,6 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 	return 0;
 }
 
-static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
-{
-	if (mm_has_notifiers(mm))
-		__mmu_notifier_change_pte(mm, address, pte);
-}
-
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
@@ -581,26 +563,6 @@ static inline void mmu_notifier_range_init_owner(
 	__young;							\
 })
 
-/*
- * set_pte_at_notify() sets the pte _after_ running the notifier.
- * This is safe to start by updating the secondary MMUs, because the primary MMU
- * pte invalidate must have already happened with a ptep_clear_flush() before
- * set_pte_at_notify() has been invoked.  Updating the secondary MMUs first is
- * required when we change both the protection of the mapping from read-only to
- * read-write and the pfn (like during copy on write page faults). Otherwise the
- * old page would remain mapped readonly in the secondary MMUs after the new
- * page is already writable by some CPU through the primary MMU.
- */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
-({									\
-	struct mm_struct *___mm = __mm;					\
-	unsigned long ___address = __address;				\
-	pte_t ___pte = __pte;						\
-									\
-	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
-	set_pte_at(___mm, ___address, __ptep, ___pte);			\
-})
-
 #else /* CONFIG_MMU_NOTIFIER */
 
 struct mmu_notifier_range {
@@ -650,11 +612,6 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 	return 0;
 }
 
-static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
-{
-}
-
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
@@ -693,7 +650,6 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
 #define	ptep_clear_flush_notify ptep_clear_flush
 #define pmdp_huge_clear_flush_notify pmdp_huge_clear_flush
 #define pudp_huge_clear_flush_notify pudp_huge_clear_flush
-#define set_pte_at_notify set_pte_at
 
 static inline void mmu_notifier_synchronize(void)
 {
@@ -701,4 +657,6 @@ static inline void mmu_notifier_synchronize(void)
 
 #endif /* CONFIG_MMU_NOTIFIER */
 
+#define set_pte_at_notify set_pte_at
+
 #endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index ec3b068cbbe6..8982e6139d07 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -424,23 +424,6 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	return young;
 }
 
-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
-			       pte_t pte)
-{
-	struct mmu_notifier *subscription;
-	int id;
-
-	id = srcu_read_lock(&srcu);
-	hlist_for_each_entry_rcu(subscription,
-				 &mm->notifier_subscriptions->list, hlist,
-				 srcu_read_lock_held(&srcu)) {
-		if (subscription->ops->change_pte)
-			subscription->ops->change_pte(subscription, mm, address,
-						      pte);
-	}
-	srcu_read_unlock(&srcu, id);
-}
-
 static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
 			       const struct mmu_notifier_range *range)
 {
-- 
2.43.0




* [PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at()
  2024-04-05 11:58 [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify() Paolo Bonzini
                   ` (2 preceding siblings ...)
  2024-04-05 11:58 ` [PATCH 3/4] mmu_notifier: remove the .change_pte() callback Paolo Bonzini
@ 2024-04-05 11:58 ` Paolo Bonzini
  2024-04-08  6:28   ` Philippe Mathieu-Daudé
  2024-04-08  7:36   ` David Hildenbrand
  2024-04-10 21:30 ` [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify() Andrew Morton
  2024-04-12 13:07 ` Marc Zyngier
  5 siblings, 2 replies; 28+ messages in thread
From: Paolo Bonzini @ 2024-04-05 11:58 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Marc Zyngier, Oliver Upton, Tianrui Zhao, Bibo Mao,
	Thomas Bogendoerfer, Nicholas Piggin, Anup Patel, Atish Patra,
	Sean Christopherson, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

With the demise of the .change_pte() MMU notifier callback, there is no
notification happening in set_pte_at_notify().  It is a synonym of
set_pte_at() and can be replaced with it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/linux/mmu_notifier.h | 2 --
 kernel/events/uprobes.c      | 5 ++---
 mm/ksm.c                     | 4 ++--
 mm/memory.c                  | 7 +------
 mm/migrate_device.c          | 8 ++------
 5 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 8c72bf651606..d39ebb10caeb 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -657,6 +657,4 @@ static inline void mmu_notifier_synchronize(void)
 
 #endif /* CONFIG_MMU_NOTIFIER */
 
-#define set_pte_at_notify set_pte_at
-
 #endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e4834d23e1d1..f4523b95c945 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -18,7 +18,6 @@
 #include <linux/sched/coredump.h>
 #include <linux/export.h>
 #include <linux/rmap.h>		/* anon_vma_prepare */
-#include <linux/mmu_notifier.h>	/* set_pte_at_notify */
 #include <linux/swap.h>		/* folio_free_swap */
 #include <linux/ptrace.h>	/* user_enable_single_step */
 #include <linux/kdebug.h>	/* notifier mechanism */
@@ -195,8 +194,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	flush_cache_page(vma, addr, pte_pfn(ptep_get(pvmw.pte)));
 	ptep_clear_flush(vma, addr, pvmw.pte);
 	if (new_page)
-		set_pte_at_notify(mm, addr, pvmw.pte,
-				  mk_pte(new_page, vma->vm_page_prot));
+		set_pte_at(mm, addr, pvmw.pte,
+			   mk_pte(new_page, vma->vm_page_prot));
 
 	folio_remove_rmap_pte(old_folio, old_page, vma);
 	if (!folio_mapped(old_folio))
diff --git a/mm/ksm.c b/mm/ksm.c
index 8c001819cf10..108a4d167824 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1345,7 +1345,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		if (pte_write(entry))
 			entry = pte_wrprotect(entry);
 
-		set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);
+		set_pte_at(mm, pvmw.address, pvmw.pte, entry);
 	}
 	*orig_pte = entry;
 	err = 0;
@@ -1447,7 +1447,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	 * See Documentation/mm/mmu_notifier.rst
 	 */
 	ptep_clear_flush(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, newpte);
+	set_pte_at(mm, addr, ptep, newpte);
 
 	folio = page_folio(page);
 	folio_remove_rmap_pte(folio, page, vma);
diff --git a/mm/memory.c b/mm/memory.c
index f2bc6dd15eb8..9a6f4d8aa379 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3327,13 +3327,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		ptep_clear_flush(vma, vmf->address, vmf->pte);
 		folio_add_new_anon_rmap(new_folio, vma, vmf->address);
 		folio_add_lru_vma(new_folio, vma);
-		/*
-		 * We call the notify macro here because, when using secondary
-		 * mmu page tables (such as kvm shadow page tables), we want the
-		 * new page to be mapped directly into the secondary page table.
-		 */
 		BUG_ON(unshare && pte_write(entry));
-		set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
+		set_pte_at(mm, vmf->address, vmf->pte, entry);
 		update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
 		if (old_folio) {
 			/*
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index b6c27c76e1a0..66206734b1b9 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -664,13 +664,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	if (flush) {
 		flush_cache_page(vma, addr, pte_pfn(orig_pte));
 		ptep_clear_flush(vma, addr, ptep);
-		set_pte_at_notify(mm, addr, ptep, entry);
-		update_mmu_cache(vma, addr, ptep);
-	} else {
-		/* No need to invalidate - it was non-present before */
-		set_pte_at(mm, addr, ptep, entry);
-		update_mmu_cache(vma, addr, ptep);
 	}
+	set_pte_at(mm, addr, ptep, entry);
+	update_mmu_cache(vma, addr, ptep);
 
 	pte_unmap_unlock(ptep, ptl);
 	*src = MIGRATE_PFN_MIGRATE;
-- 
2.43.0



* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-05 11:58 ` [PATCH 1/4] KVM: delete .change_pte MMU notifier callback Paolo Bonzini
@ 2024-04-07  4:50   ` Anup Patel
  2024-04-08  7:23   ` maobibo
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Anup Patel @ 2024-04-07  4:50 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Marc Zyngier, Oliver Upton, Tianrui Zhao,
	Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin, Atish Patra,
	Sean Christopherson, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

On Fri, Apr 5, 2024 at 5:28 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> The .change_pte() MMU notifier callback was intended as an
> optimization. The original point of it was that KSM could tell KVM to flip
> its secondary PTE to a new location without having to first zap it. At
> the time there was also an .invalidate_page() callback; both of them were
> *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
> and .invalidate_page() also doubled as a fallback implementation of
> .change_pte().
>
> Later on, however, both callbacks were changed to occur within an
> invalidate_range_start/end() block.
>
> In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
> set_pte_at_notify with invalidate_range_start and invalidate_range_end",
> 2012-10-09) did so to remove the fallback from .invalidate_page() to
> .change_pte() and allow sleepable .invalidate_page() hooks.
>
> This however made KVM's usage of the .change_pte() callback completely
> moot, because KVM unmaps the sPTEs during .invalidate_range_start()
> and therefore .change_pte() has no hope of finding a sPTE to change.
> Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
> well as all the architecture specific implementations.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

For KVM RISC-V:
Acked-by: Anup Patel <anup@brainfault.org>

Regards,
Anup

> ---
>  arch/arm64/kvm/mmu.c                  | 34 -----------------
>  arch/loongarch/include/asm/kvm_host.h |  1 -
>  arch/loongarch/kvm/mmu.c              | 32 ----------------
>  arch/mips/kvm/mmu.c                   | 30 ---------------
>  arch/powerpc/include/asm/kvm_ppc.h    |  1 -
>  arch/powerpc/kvm/book3s.c             |  5 ---
>  arch/powerpc/kvm/book3s.h             |  1 -
>  arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 ------
>  arch/powerpc/kvm/book3s_hv.c          |  1 -
>  arch/powerpc/kvm/book3s_pr.c          |  7 ----
>  arch/powerpc/kvm/e500_mmu_host.c      |  6 ---
>  arch/riscv/kvm/mmu.c                  | 20 ----------
>  arch/x86/kvm/mmu/mmu.c                | 54 +--------------------------
>  arch/x86/kvm/mmu/spte.c               | 16 --------
>  arch/x86/kvm/mmu/spte.h               |  2 -
>  arch/x86/kvm/mmu/tdp_mmu.c            | 46 -----------------------
>  arch/x86/kvm/mmu/tdp_mmu.h            |  1 -
>  include/linux/kvm_host.h              |  2 -
>  include/trace/events/kvm.h            | 15 --------
>  virt/kvm/kvm_main.c                   | 43 ---------------------
>  20 files changed, 2 insertions(+), 327 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index dc04bc767865..ff17849be9f4 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>         return false;
>  }
>
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -       kvm_pfn_t pfn = pte_pfn(range->arg.pte);
> -
> -       if (!kvm->arch.mmu.pgt)
> -               return false;
> -
> -       WARN_ON(range->end - range->start != 1);
> -
> -       /*
> -        * If the page isn't tagged, defer to user_mem_abort() for sanitising
> -        * the MTE tags. The S2 pte should have been unmapped by
> -        * mmu_notifier_invalidate_range_end().
> -        */
> -       if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
> -               return false;
> -
> -       /*
> -        * We've moved a page around, probably through CoW, so let's treat
> -        * it just like a translation fault and the map handler will clean
> -        * the cache to the PoC.
> -        *
> -        * The MMU notifiers will have unmapped a huge PMD before calling
> -        * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
> -        * therefore we never need to clear out a huge PMD through this
> -        * calling path and a memcache is not required.
> -        */
> -       kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
> -                              PAGE_SIZE, __pfn_to_phys(pfn),
> -                              KVM_PGTABLE_PROT_R, NULL, 0);
> -
> -       return false;
> -}
> -
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>         u64 size = (range->end - range->start) << PAGE_SHIFT;
> diff --git a/arch/loongarch/include/asm/kvm_host.h b/arch/loongarch/include/asm/kvm_host.h
> index 2d62f7b0d377..69305441f40d 100644
> --- a/arch/loongarch/include/asm/kvm_host.h
> +++ b/arch/loongarch/include/asm/kvm_host.h
> @@ -203,7 +203,6 @@ void kvm_flush_tlb_all(void);
>  void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa);
>  int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool write);
>
> -void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
>  int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end, bool blockable);
>  int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
>  int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
> index a556cff35740..98883aa23ab8 100644
> --- a/arch/loongarch/kvm/mmu.c
> +++ b/arch/loongarch/kvm/mmu.c
> @@ -494,38 +494,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>                         range->end << PAGE_SHIFT, &ctx);
>  }
>
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -       unsigned long prot_bits;
> -       kvm_pte_t *ptep;
> -       kvm_pfn_t pfn = pte_pfn(range->arg.pte);
> -       gpa_t gpa = range->start << PAGE_SHIFT;
> -
> -       ptep = kvm_populate_gpa(kvm, NULL, gpa, 0);
> -       if (!ptep)
> -               return false;
> -
> -       /* Replacing an absent or old page doesn't need flushes */
> -       if (!kvm_pte_present(NULL, ptep) || !kvm_pte_young(*ptep)) {
> -               kvm_set_pte(ptep, 0);
> -               return false;
> -       }
> -
> -       /* Fill new pte if write protected or page migrated */
> -       prot_bits = _PAGE_PRESENT | __READABLE;
> -       prot_bits |= _CACHE_MASK & pte_val(range->arg.pte);
> -
> -       /*
> -        * Set _PAGE_WRITE or _PAGE_DIRTY iff old and new pte both support
> -        * _PAGE_WRITE for map_page_fast if next page write fault
> -        * _PAGE_DIRTY since gpa has already recorded as dirty page
> -        */
> -       prot_bits |= __WRITEABLE & *ptep & pte_val(range->arg.pte);
> -       kvm_set_pte(ptep, kvm_pfn_pte(pfn, __pgprot(prot_bits)));
> -
> -       return true;
> -}
> -
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>         kvm_ptw_ctx ctx;
> diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
> index 467ee6b95ae1..c17157e700c0 100644
> --- a/arch/mips/kvm/mmu.c
> +++ b/arch/mips/kvm/mmu.c
> @@ -444,36 +444,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>         return true;
>  }
>
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -       gpa_t gpa = range->start << PAGE_SHIFT;
> -       pte_t hva_pte = range->arg.pte;
> -       pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
> -       pte_t old_pte;
> -
> -       if (!gpa_pte)
> -               return false;
> -
> -       /* Mapping may need adjusting depending on memslot flags */
> -       old_pte = *gpa_pte;
> -       if (range->slot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte))
> -               hva_pte = pte_mkclean(hva_pte);
> -       else if (range->slot->flags & KVM_MEM_READONLY)
> -               hva_pte = pte_wrprotect(hva_pte);
> -
> -       set_pte(gpa_pte, hva_pte);
> -
> -       /* Replacing an absent or old page doesn't need flushes */
> -       if (!pte_present(old_pte) || !pte_young(old_pte))
> -               return false;
> -
> -       /* Pages swapped, aged, moved, or cleaned require flushes */
> -       return !pte_present(hva_pte) ||
> -              !pte_young(hva_pte) ||
> -              pte_pfn(old_pte) != pte_pfn(hva_pte) ||
> -              (pte_dirty(old_pte) && !pte_dirty(hva_pte));
> -}
> -
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>         return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end);
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 3281215097cc..ca3829d47ab7 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -287,7 +287,6 @@ struct kvmppc_ops {
>         bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
>         bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>         bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> -       bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>         void (*free_memslot)(struct kvm_memory_slot *slot);
>         int (*init_vm)(struct kvm *kvm);
>         void (*destroy_vm)(struct kvm *kvm);
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 8acec144120e..0d0624088e6b 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -899,11 +899,6 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>         return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
>  }
>
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -       return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
> -}
> -
>  int kvmppc_core_init_vm(struct kvm *kvm)
>  {
>
> diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
> index 58391b4b32ed..4aa2ab89afbc 100644
> --- a/arch/powerpc/kvm/book3s.h
> +++ b/arch/powerpc/kvm/book3s.h
> @@ -12,7 +12,6 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
>  extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
>  extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
>  extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> -extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
>
>  extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
>  extern void kvmppc_mmu_destroy_pr(struct kvm_vcpu *vcpu);
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 2b1f0cdd8c18..1b51b1c4713b 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -1010,18 +1010,6 @@ bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range)
>                 return kvm_test_age_rmapp(kvm, range->slot, range->start);
>  }
>
> -bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -       WARN_ON(range->start + 1 != range->end);
> -
> -       if (kvm_is_radix(kvm))
> -               kvm_unmap_radix(kvm, range->slot, range->start);
> -       else
> -               kvm_unmap_rmapp(kvm, range->slot, range->start);
> -
> -       return false;
> -}
> -
>  static int vcpus_running(struct kvm *kvm)
>  {
>         return atomic_read(&kvm->arch.vcpus_running) != 0;
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 8e86eb577eb8..35cb014a0c51 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -6364,7 +6364,6 @@ static struct kvmppc_ops kvm_ops_hv = {
>         .unmap_gfn_range = kvm_unmap_gfn_range_hv,
>         .age_gfn = kvm_age_gfn_hv,
>         .test_age_gfn = kvm_test_age_gfn_hv,
> -       .set_spte_gfn = kvm_set_spte_gfn_hv,
>         .free_memslot = kvmppc_core_free_memslot_hv,
>         .init_vm =  kvmppc_core_init_vm_hv,
>         .destroy_vm = kvmppc_core_destroy_vm_hv,
> diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
> index 5b92619a05fd..a7d7137ea0c8 100644
> --- a/arch/powerpc/kvm/book3s_pr.c
> +++ b/arch/powerpc/kvm/book3s_pr.c
> @@ -461,12 +461,6 @@ static bool kvm_test_age_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range)
>         return false;
>  }
>
> -static bool kvm_set_spte_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -       /* The page will get remapped properly on its next fault */
> -       return do_kvm_unmap_gfn(kvm, range);
> -}
> -
>  /*****************************************/
>
>  static void kvmppc_set_msr_pr(struct kvm_vcpu *vcpu, u64 msr)
> @@ -2071,7 +2065,6 @@ static struct kvmppc_ops kvm_ops_pr = {
>         .unmap_gfn_range = kvm_unmap_gfn_range_pr,
>         .age_gfn  = kvm_age_gfn_pr,
>         .test_age_gfn = kvm_test_age_gfn_pr,
> -       .set_spte_gfn = kvm_set_spte_gfn_pr,
>         .free_memslot = kvmppc_core_free_memslot_pr,
>         .init_vm = kvmppc_core_init_vm_pr,
>         .destroy_vm = kvmppc_core_destroy_vm_pr,
> diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
> index ccb8f16ffe41..c664fdec75b1 100644
> --- a/arch/powerpc/kvm/e500_mmu_host.c
> +++ b/arch/powerpc/kvm/e500_mmu_host.c
> @@ -747,12 +747,6 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>         return false;
>  }
>
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -       /* The page will get remapped properly on its next fault */
> -       return kvm_e500_mmu_unmap_gfn(kvm, range);
> -}
> -
>  /*****************************************/
>
>  int e500_mmu_host_init(struct kvmppc_vcpu_e500 *vcpu_e500)
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index a9e2fd7245e1..b63650f9b966 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -550,26 +550,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>         return false;
>  }
>
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -       int ret;
> -       kvm_pfn_t pfn = pte_pfn(range->arg.pte);
> -
> -       if (!kvm->arch.pgd)
> -               return false;
> -
> -       WARN_ON(range->end - range->start != 1);
> -
> -       ret = gstage_map_page(kvm, NULL, range->start << PAGE_SHIFT,
> -                             __pfn_to_phys(pfn), PAGE_SIZE, true, true);
> -       if (ret) {
> -               kvm_debug("Failed to map G-stage page (error %d)\n", ret);
> -               return true;
> -       }
> -
> -       return false;
> -}
> -
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>         pte_t *ptep;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0049d49aa913..87ba2a9da196 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -432,8 +432,8 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
>   * The idea using the light way get the spte on x86_32 guest is from
>   * gup_get_pte (mm/gup.c).
>   *
> - * An spte tlb flush may be pending, because kvm_set_pte_rmap
> - * coalesces them and we are running out of the MMU lock.  Therefore
> + * An spte tlb flush may be pending, because they are coalesced and
> + * we are running out of the MMU lock.  Therefore
>   * we need to protect against in-progress updates of the spte.
>   *
>   * Reading the spte while an update is in progress may get the old value
> @@ -1454,43 +1454,6 @@ static bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>         return __kvm_zap_rmap(kvm, rmap_head, slot);
>  }
>
> -static bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
> -                            struct kvm_memory_slot *slot, gfn_t gfn, int level,
> -                            pte_t pte)
> -{
> -       u64 *sptep;
> -       struct rmap_iterator iter;
> -       bool need_flush = false;
> -       u64 new_spte;
> -       kvm_pfn_t new_pfn;
> -
> -       WARN_ON_ONCE(pte_huge(pte));
> -       new_pfn = pte_pfn(pte);
> -
> -restart:
> -       for_each_rmap_spte(rmap_head, &iter, sptep) {
> -               need_flush = true;
> -
> -               if (pte_write(pte)) {
> -                       kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
> -                       goto restart;
> -               } else {
> -                       new_spte = kvm_mmu_changed_pte_notifier_make_spte(
> -                                       *sptep, new_pfn);
> -
> -                       mmu_spte_clear_track_bits(kvm, sptep);
> -                       mmu_spte_set(sptep, new_spte);
> -               }
> -       }
> -
> -       if (need_flush && kvm_available_flush_remote_tlbs_range()) {
> -               kvm_flush_remote_tlbs_gfn(kvm, gfn, level);
> -               return false;
> -       }
> -
> -       return need_flush;
> -}
> -
>  struct slot_rmap_walk_iterator {
>         /* input fields. */
>         const struct kvm_memory_slot *slot;
> @@ -1596,19 +1559,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>         return flush;
>  }
>
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -       bool flush = false;
> -
> -       if (kvm_memslots_have_rmaps(kvm))
> -               flush = kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmap);
> -
> -       if (tdp_mmu_enabled)
> -               flush |= kvm_tdp_mmu_set_spte_gfn(kvm, range);
> -
> -       return flush;
> -}
> -
>  static bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>                          struct kvm_memory_slot *slot, gfn_t gfn, int level,
>                          pte_t unused)
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 318135daf685..283af5b90016 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -322,22 +322,6 @@ u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
>         return spte;
>  }
>
> -u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn)
> -{
> -       u64 new_spte;
> -
> -       new_spte = old_spte & ~SPTE_BASE_ADDR_MASK;
> -       new_spte |= (u64)new_pfn << PAGE_SHIFT;
> -
> -       new_spte &= ~PT_WRITABLE_MASK;
> -       new_spte &= ~shadow_host_writable_mask;
> -       new_spte &= ~shadow_mmu_writable_mask;
> -
> -       new_spte = mark_spte_for_access_track(new_spte);
> -
> -       return new_spte;
> -}
> -
>  u64 mark_spte_for_access_track(u64 spte)
>  {
>         if (spte_ad_enabled(spte))
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 1a163aee9ec6..92da4c419171 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -511,8 +511,6 @@ static inline u64 restore_acc_track_spte(u64 spte)
>         return spte;
>  }
>
> -u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
> -
>  void __init kvm_mmu_spte_module_init(void);
>  void kvm_mmu_reset_all_pte_masks(void);
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 3627744fcab6..fbb86932b766 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1250,52 +1250,6 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>         return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
>  }
>
> -static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
> -                        struct kvm_gfn_range *range)
> -{
> -       u64 new_spte;
> -
> -       /* Huge pages aren't expected to be modified without first being zapped. */
> -       WARN_ON_ONCE(pte_huge(range->arg.pte) || range->start + 1 != range->end);
> -
> -       if (iter->level != PG_LEVEL_4K ||
> -           !is_shadow_present_pte(iter->old_spte))
> -               return false;
> -
> -       /*
> -        * Note, when changing a read-only SPTE, it's not strictly necessary to
> -        * zero the SPTE before setting the new PFN, but doing so preserves the
> -        * invariant that the PFN of a present * leaf SPTE can never change.
> -        * See handle_changed_spte().
> -        */
> -       tdp_mmu_iter_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);
> -
> -       if (!pte_write(range->arg.pte)) {
> -               new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
> -                                                                 pte_pfn(range->arg.pte));
> -
> -               tdp_mmu_iter_set_spte(kvm, iter, new_spte);
> -       }
> -
> -       return true;
> -}
> -
> -/*
> - * Handle the changed_pte MMU notifier for the TDP MMU.
> - * data is a pointer to the new pte_t mapping the HVA specified by the MMU
> - * notifier.
> - * Returns non-zero if a flush is needed before releasing the MMU lock.
> - */
> -bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -       /*
> -        * No need to handle the remote TLB flush under RCU protection, the
> -        * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
> -        * shadow page. See the WARN on pfn_changed in handle_changed_spte().
> -        */
> -       return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
> -}
> -
>  /*
>   * Remove write access from all SPTEs at or above min_level that map GFNs
>   * [start, end). Returns true if an SPTE has been changed and the TLBs need to
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 6e1ea04ca885..58b55e61bd33 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -31,7 +31,6 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
>                                  bool flush);
>  bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> -bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>
>  bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
>                              const struct kvm_memory_slot *slot, int min_level);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ab439706ea2f..8dea11701ab2 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -259,7 +259,6 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>
>  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>  union kvm_mmu_notifier_arg {
> -       pte_t pte;
>         unsigned long attributes;
>  };
>
> @@ -273,7 +272,6 @@ struct kvm_gfn_range {
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  #endif
>
>  enum {
> diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
> index 011fba6b5552..74e40d5d4af4 100644
> --- a/include/trace/events/kvm.h
> +++ b/include/trace/events/kvm.h
> @@ -456,21 +456,6 @@ TRACE_EVENT(kvm_unmap_hva_range,
>                   __entry->start, __entry->end)
>  );
>
> -TRACE_EVENT(kvm_set_spte_hva,
> -       TP_PROTO(unsigned long hva),
> -       TP_ARGS(hva),
> -
> -       TP_STRUCT__entry(
> -               __field(        unsigned long,  hva             )
> -       ),
> -
> -       TP_fast_assign(
> -               __entry->hva            = hva;
> -       ),
> -
> -       TP_printk("mmu notifier set pte hva: %#016lx", __entry->hva)
> -);
> -
>  TRACE_EVENT(kvm_age_hva,
>         TP_PROTO(unsigned long start, unsigned long end),
>         TP_ARGS(start, end),
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4eb8afd0b961..2fcd9979752a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -717,48 +717,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
>         return __kvm_handle_hva_range(kvm, &range).ret;
>  }
>
> -static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -       /*
> -        * Skipping invalid memslots is correct if and only change_pte() is
> -        * surrounded by invalidate_range_{start,end}(), which is currently
> -        * guaranteed by the primary MMU.  If that ever changes, KVM needs to
> -        * unmap the memslot instead of skipping the memslot to ensure that KVM
> -        * doesn't hold references to the old PFN.
> -        */
> -       WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> -
> -       if (range->slot->flags & KVM_MEMSLOT_INVALID)
> -               return false;
> -
> -       return kvm_set_spte_gfn(kvm, range);
> -}
> -
> -static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> -                                       struct mm_struct *mm,
> -                                       unsigned long address,
> -                                       pte_t pte)
> -{
> -       struct kvm *kvm = mmu_notifier_to_kvm(mn);
> -       const union kvm_mmu_notifier_arg arg = { .pte = pte };
> -
> -       trace_kvm_set_spte_hva(address);
> -
> -       /*
> -        * .change_pte() must be surrounded by .invalidate_range_{start,end}().
> -        * If mmu_invalidate_in_progress is zero, then no in-progress
> -        * invalidations, including this one, found a relevant memslot at
> -        * start(); rechecking memslots here is unnecessary.  Note, a false
> -        * positive (count elevated by a different invalidation) is sub-optimal
> -        * but functionally ok.
> -        */
> -       WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> -       if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> -               return;
> -
> -       kvm_handle_hva_range(mn, address, address + 1, arg, kvm_change_spte_gfn);
> -}
> -
>  void kvm_mmu_invalidate_begin(struct kvm *kvm)
>  {
>         lockdep_assert_held_write(&kvm->mmu_lock);
> @@ -976,7 +934,6 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
>         .clear_flush_young      = kvm_mmu_notifier_clear_flush_young,
>         .clear_young            = kvm_mmu_notifier_clear_young,
>         .test_young             = kvm_mmu_notifier_test_young,
> -       .change_pte             = kvm_mmu_notifier_change_pte,
>         .release                = kvm_mmu_notifier_release,
>  };
>
> --
> 2.43.0
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at()
  2024-04-05 11:58 ` [PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at() Paolo Bonzini
@ 2024-04-08  6:28   ` Philippe Mathieu-Daudé
  2024-04-08  7:36   ` David Hildenbrand
  1 sibling, 0 replies; 28+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-04-08  6:28 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel, kvm
  Cc: Marc Zyngier, Oliver Upton, Tianrui Zhao, Bibo Mao,
	Thomas Bogendoerfer, Nicholas Piggin, Anup Patel, Atish Patra,
	Sean Christopherson, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

On 5/4/24 13:58, Paolo Bonzini wrote:
> With the demise of the .change_pte() MMU notifier callback, there is no
> notification happening in set_pte_at_notify().  It is a synonym of
> set_pte_at() and can be replaced with it.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   include/linux/mmu_notifier.h | 2 --
>   kernel/events/uprobes.c      | 5 ++---
>   mm/ksm.c                     | 4 ++--
>   mm/memory.c                  | 7 +------
>   mm/migrate_device.c          | 8 ++------
>   5 files changed, 7 insertions(+), 19 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/4] KVM: remove unused argument of kvm_handle_hva_range()
  2024-04-05 11:58 ` [PATCH 2/4] KVM: remove unused argument of kvm_handle_hva_range() Paolo Bonzini
@ 2024-04-08  6:31   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 28+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-04-08  6:31 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel, kvm
  Cc: Marc Zyngier, Oliver Upton, Tianrui Zhao, Bibo Mao,
	Thomas Bogendoerfer, Nicholas Piggin, Anup Patel, Atish Patra,
	Sean Christopherson, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

On 5/4/24 13:58, Paolo Bonzini wrote:
> The only user was kvm_mmu_notifier_change_pte(), which is now gone.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   virt/kvm/kvm_main.c | 7 +------
>   1 file changed, 1 insertion(+), 6 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-05 11:58 ` [PATCH 1/4] KVM: delete .change_pte MMU notifier callback Paolo Bonzini
  2024-04-07  4:50   ` Anup Patel
@ 2024-04-08  7:23   ` maobibo
  2024-04-08 11:45   ` Michael Ellerman
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: maobibo @ 2024-04-08  7:23 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel, kvm
  Cc: Marc Zyngier, Oliver Upton, Tianrui Zhao, Thomas Bogendoerfer,
	Nicholas Piggin, Anup Patel, Atish Patra, Sean Christopherson,
	Andrew Morton, David Hildenbrand, linux-arm-kernel, kvmarm,
	loongarch, linux-mips, linuxppc-dev, kvm-riscv, linux-mm,
	linux-trace-kernel, linux-perf-users



On 2024/4/5 7:58 PM, Paolo Bonzini wrote:
> The .change_pte() MMU notifier callback was intended as an
> optimization. The original point of it was that KSM could tell KVM to flip
> its secondary PTE to a new location without having to first zap it. At
> the time there was also an .invalidate_page() callback; both of them were
> *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
> and .invalidate_page() also doubled as a fallback implementation of
> change_pte().
> 
> Later on, however, both callbacks were changed to occur within an
> invalidate_range_start/end() block.
> 
> In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
> set_pte_at_notify with invalidate_range_start and invalidate_range_end",
> 2012-10-09) did so to remove the fallback from .invalidate_page() to
> change_pte() and allow sleepable .invalidate_page() hooks.
> 
> This however made KVM's usage of the .change_pte() callback completely
> moot, because KVM unmaps the sPTEs during .invalidate_range_start()
> and therefore .change_pte() has no hope of finding a sPTE to change.
> Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
> well as all the architecture specific implementations.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   arch/arm64/kvm/mmu.c                  | 34 -----------------
>   arch/loongarch/include/asm/kvm_host.h |  1 -
>   arch/loongarch/kvm/mmu.c              | 32 ----------------
>   arch/mips/kvm/mmu.c                   | 30 ---------------
>   arch/powerpc/include/asm/kvm_ppc.h    |  1 -
>   arch/powerpc/kvm/book3s.c             |  5 ---
>   arch/powerpc/kvm/book3s.h             |  1 -
>   arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 ------
>   arch/powerpc/kvm/book3s_hv.c          |  1 -
>   arch/powerpc/kvm/book3s_pr.c          |  7 ----
>   arch/powerpc/kvm/e500_mmu_host.c      |  6 ---
>   arch/riscv/kvm/mmu.c                  | 20 ----------
>   arch/x86/kvm/mmu/mmu.c                | 54 +--------------------------
>   arch/x86/kvm/mmu/spte.c               | 16 --------
>   arch/x86/kvm/mmu/spte.h               |  2 -
>   arch/x86/kvm/mmu/tdp_mmu.c            | 46 -----------------------
>   arch/x86/kvm/mmu/tdp_mmu.h            |  1 -
>   include/linux/kvm_host.h              |  2 -
>   include/trace/events/kvm.h            | 15 --------
>   virt/kvm/kvm_main.c                   | 43 ---------------------
>   20 files changed, 2 insertions(+), 327 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index dc04bc767865..ff17849be9f4 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>   	return false;
>   }
>   
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -	kvm_pfn_t pfn = pte_pfn(range->arg.pte);
> -
> -	if (!kvm->arch.mmu.pgt)
> -		return false;
> -
> -	WARN_ON(range->end - range->start != 1);
> -
> -	/*
> -	 * If the page isn't tagged, defer to user_mem_abort() for sanitising
> -	 * the MTE tags. The S2 pte should have been unmapped by
> -	 * mmu_notifier_invalidate_range_end().
> -	 */
> -	if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
> -		return false;
> -
> -	/*
> -	 * We've moved a page around, probably through CoW, so let's treat
> -	 * it just like a translation fault and the map handler will clean
> -	 * the cache to the PoC.
> -	 *
> -	 * The MMU notifiers will have unmapped a huge PMD before calling
> -	 * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
> -	 * therefore we never need to clear out a huge PMD through this
> -	 * calling path and a memcache is not required.
> -	 */
> -	kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
> -			       PAGE_SIZE, __pfn_to_phys(pfn),
> -			       KVM_PGTABLE_PROT_R, NULL, 0);
> -
> -	return false;
> -}
> -
>   bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   {
>   	u64 size = (range->end - range->start) << PAGE_SHIFT;
> diff --git a/arch/loongarch/include/asm/kvm_host.h b/arch/loongarch/include/asm/kvm_host.h
> index 2d62f7b0d377..69305441f40d 100644
> --- a/arch/loongarch/include/asm/kvm_host.h
> +++ b/arch/loongarch/include/asm/kvm_host.h
> @@ -203,7 +203,6 @@ void kvm_flush_tlb_all(void);
>   void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa);
>   int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool write);
>   
> -void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
>   int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end, bool blockable);
>   int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
>   int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
> index a556cff35740..98883aa23ab8 100644
> --- a/arch/loongarch/kvm/mmu.c
> +++ b/arch/loongarch/kvm/mmu.c
> @@ -494,38 +494,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>   			range->end << PAGE_SHIFT, &ctx);
>   }
>   
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -	unsigned long prot_bits;
> -	kvm_pte_t *ptep;
> -	kvm_pfn_t pfn = pte_pfn(range->arg.pte);
> -	gpa_t gpa = range->start << PAGE_SHIFT;
> -
> -	ptep = kvm_populate_gpa(kvm, NULL, gpa, 0);
> -	if (!ptep)
> -		return false;
> -
> -	/* Replacing an absent or old page doesn't need flushes */
> -	if (!kvm_pte_present(NULL, ptep) || !kvm_pte_young(*ptep)) {
> -		kvm_set_pte(ptep, 0);
> -		return false;
> -	}
> -
> -	/* Fill new pte if write protected or page migrated */
> -	prot_bits = _PAGE_PRESENT | __READABLE;
> -	prot_bits |= _CACHE_MASK & pte_val(range->arg.pte);
> -
> -	/*
> -	 * Set _PAGE_WRITE or _PAGE_DIRTY iff old and new pte both support
> -	 * _PAGE_WRITE for map_page_fast if next page write fault
> -	 * _PAGE_DIRTY since gpa has already recorded as dirty page
> -	 */
> -	prot_bits |= __WRITEABLE & *ptep & pte_val(range->arg.pte);
> -	kvm_set_pte(ptep, kvm_pfn_pte(pfn, __pgprot(prot_bits)));
> -
> -	return true;
> -}
> -

Thanks for the cleanup.  For the KVM LoongArch part:
Reviewed-by: Bibo Mao <maobibo@loongson.cn>

>   bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   {
>   	kvm_ptw_ctx ctx;
> diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
> index 467ee6b95ae1..c17157e700c0 100644
> --- a/arch/mips/kvm/mmu.c
> +++ b/arch/mips/kvm/mmu.c
> @@ -444,36 +444,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>   	return true;
>   }
>   
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -	gpa_t gpa = range->start << PAGE_SHIFT;
> -	pte_t hva_pte = range->arg.pte;
> -	pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
> -	pte_t old_pte;
> -
> -	if (!gpa_pte)
> -		return false;
> -
> -	/* Mapping may need adjusting depending on memslot flags */
> -	old_pte = *gpa_pte;
> -	if (range->slot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte))
> -		hva_pte = pte_mkclean(hva_pte);
> -	else if (range->slot->flags & KVM_MEM_READONLY)
> -		hva_pte = pte_wrprotect(hva_pte);
> -
> -	set_pte(gpa_pte, hva_pte);
> -
> -	/* Replacing an absent or old page doesn't need flushes */
> -	if (!pte_present(old_pte) || !pte_young(old_pte))
> -		return false;
> -
> -	/* Pages swapped, aged, moved, or cleaned require flushes */
> -	return !pte_present(hva_pte) ||
> -	       !pte_young(hva_pte) ||
> -	       pte_pfn(old_pte) != pte_pfn(hva_pte) ||
> -	       (pte_dirty(old_pte) && !pte_dirty(hva_pte));
> -}
> -
>   bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   {
>   	return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end);
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 3281215097cc..ca3829d47ab7 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -287,7 +287,6 @@ struct kvmppc_ops {
>   	bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
>   	bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>   	bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> -	bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>   	void (*free_memslot)(struct kvm_memory_slot *slot);
>   	int (*init_vm)(struct kvm *kvm);
>   	void (*destroy_vm)(struct kvm *kvm);
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 8acec144120e..0d0624088e6b 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -899,11 +899,6 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   	return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
>   }
>   
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -	return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
> -}
> -
>   int kvmppc_core_init_vm(struct kvm *kvm)
>   {
>   
> diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
> index 58391b4b32ed..4aa2ab89afbc 100644
> --- a/arch/powerpc/kvm/book3s.h
> +++ b/arch/powerpc/kvm/book3s.h
> @@ -12,7 +12,6 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
>   extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
>   extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
>   extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> -extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
>   
>   extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
>   extern void kvmppc_mmu_destroy_pr(struct kvm_vcpu *vcpu);
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 2b1f0cdd8c18..1b51b1c4713b 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -1010,18 +1010,6 @@ bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range)
>   		return kvm_test_age_rmapp(kvm, range->slot, range->start);
>   }
>   
> -bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -	WARN_ON(range->start + 1 != range->end);
> -
> -	if (kvm_is_radix(kvm))
> -		kvm_unmap_radix(kvm, range->slot, range->start);
> -	else
> -		kvm_unmap_rmapp(kvm, range->slot, range->start);
> -
> -	return false;
> -}
> -
>   static int vcpus_running(struct kvm *kvm)
>   {
>   	return atomic_read(&kvm->arch.vcpus_running) != 0;
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 8e86eb577eb8..35cb014a0c51 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -6364,7 +6364,6 @@ static struct kvmppc_ops kvm_ops_hv = {
>   	.unmap_gfn_range = kvm_unmap_gfn_range_hv,
>   	.age_gfn = kvm_age_gfn_hv,
>   	.test_age_gfn = kvm_test_age_gfn_hv,
> -	.set_spte_gfn = kvm_set_spte_gfn_hv,
>   	.free_memslot = kvmppc_core_free_memslot_hv,
>   	.init_vm =  kvmppc_core_init_vm_hv,
>   	.destroy_vm = kvmppc_core_destroy_vm_hv,
> diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
> index 5b92619a05fd..a7d7137ea0c8 100644
> --- a/arch/powerpc/kvm/book3s_pr.c
> +++ b/arch/powerpc/kvm/book3s_pr.c
> @@ -461,12 +461,6 @@ static bool kvm_test_age_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range)
>   	return false;
>   }
>   
> -static bool kvm_set_spte_gfn_pr(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -	/* The page will get remapped properly on its next fault */
> -	return do_kvm_unmap_gfn(kvm, range);
> -}
> -
>   /*****************************************/
>   
>   static void kvmppc_set_msr_pr(struct kvm_vcpu *vcpu, u64 msr)
> @@ -2071,7 +2065,6 @@ static struct kvmppc_ops kvm_ops_pr = {
>   	.unmap_gfn_range = kvm_unmap_gfn_range_pr,
>   	.age_gfn  = kvm_age_gfn_pr,
>   	.test_age_gfn = kvm_test_age_gfn_pr,
> -	.set_spte_gfn = kvm_set_spte_gfn_pr,
>   	.free_memslot = kvmppc_core_free_memslot_pr,
>   	.init_vm = kvmppc_core_init_vm_pr,
>   	.destroy_vm = kvmppc_core_destroy_vm_pr,
> diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
> index ccb8f16ffe41..c664fdec75b1 100644
> --- a/arch/powerpc/kvm/e500_mmu_host.c
> +++ b/arch/powerpc/kvm/e500_mmu_host.c
> @@ -747,12 +747,6 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   	return false;
>   }
>   
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -	/* The page will get remapped properly on its next fault */
> -	return kvm_e500_mmu_unmap_gfn(kvm, range);
> -}
> -
>   /*****************************************/
>   
>   int e500_mmu_host_init(struct kvmppc_vcpu_e500 *vcpu_e500)
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index a9e2fd7245e1..b63650f9b966 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -550,26 +550,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>   	return false;
>   }
>   
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -	int ret;
> -	kvm_pfn_t pfn = pte_pfn(range->arg.pte);
> -
> -	if (!kvm->arch.pgd)
> -		return false;
> -
> -	WARN_ON(range->end - range->start != 1);
> -
> -	ret = gstage_map_page(kvm, NULL, range->start << PAGE_SHIFT,
> -			      __pfn_to_phys(pfn), PAGE_SIZE, true, true);
> -	if (ret) {
> -		kvm_debug("Failed to map G-stage page (error %d)\n", ret);
> -		return true;
> -	}
> -
> -	return false;
> -}
> -
>   bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   {
>   	pte_t *ptep;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0049d49aa913..87ba2a9da196 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -432,8 +432,8 @@ static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
>    * The idea using the light way get the spte on x86_32 guest is from
>    * gup_get_pte (mm/gup.c).
>    *
> - * An spte tlb flush may be pending, because kvm_set_pte_rmap
> - * coalesces them and we are running out of the MMU lock.  Therefore
> + * An spte tlb flush may be pending, because they are coalesced and
> + * we are running out of the MMU lock.  Therefore
>    * we need to protect against in-progress updates of the spte.
>    *
>    * Reading the spte while an update is in progress may get the old value
> @@ -1454,43 +1454,6 @@ static bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>   	return __kvm_zap_rmap(kvm, rmap_head, slot);
>   }
>   
> -static bool kvm_set_pte_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
> -			     struct kvm_memory_slot *slot, gfn_t gfn, int level,
> -			     pte_t pte)
> -{
> -	u64 *sptep;
> -	struct rmap_iterator iter;
> -	bool need_flush = false;
> -	u64 new_spte;
> -	kvm_pfn_t new_pfn;
> -
> -	WARN_ON_ONCE(pte_huge(pte));
> -	new_pfn = pte_pfn(pte);
> -
> -restart:
> -	for_each_rmap_spte(rmap_head, &iter, sptep) {
> -		need_flush = true;
> -
> -		if (pte_write(pte)) {
> -			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
> -			goto restart;
> -		} else {
> -			new_spte = kvm_mmu_changed_pte_notifier_make_spte(
> -					*sptep, new_pfn);
> -
> -			mmu_spte_clear_track_bits(kvm, sptep);
> -			mmu_spte_set(sptep, new_spte);
> -		}
> -	}
> -
> -	if (need_flush && kvm_available_flush_remote_tlbs_range()) {
> -		kvm_flush_remote_tlbs_gfn(kvm, gfn, level);
> -		return false;
> -	}
> -
> -	return need_flush;
> -}
> -
>   struct slot_rmap_walk_iterator {
>   	/* input fields. */
>   	const struct kvm_memory_slot *slot;
> @@ -1596,19 +1559,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>   	return flush;
>   }
>   
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -	bool flush = false;
> -
> -	if (kvm_memslots_have_rmaps(kvm))
> -		flush = kvm_handle_gfn_range(kvm, range, kvm_set_pte_rmap);
> -
> -	if (tdp_mmu_enabled)
> -		flush |= kvm_tdp_mmu_set_spte_gfn(kvm, range);
> -
> -	return flush;
> -}
> -
>   static bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>   			 struct kvm_memory_slot *slot, gfn_t gfn, int level,
>   			 pte_t unused)
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 318135daf685..283af5b90016 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -322,22 +322,6 @@ u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
>   	return spte;
>   }
>   
> -u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn)
> -{
> -	u64 new_spte;
> -
> -	new_spte = old_spte & ~SPTE_BASE_ADDR_MASK;
> -	new_spte |= (u64)new_pfn << PAGE_SHIFT;
> -
> -	new_spte &= ~PT_WRITABLE_MASK;
> -	new_spte &= ~shadow_host_writable_mask;
> -	new_spte &= ~shadow_mmu_writable_mask;
> -
> -	new_spte = mark_spte_for_access_track(new_spte);
> -
> -	return new_spte;
> -}
> -
>   u64 mark_spte_for_access_track(u64 spte)
>   {
>   	if (spte_ad_enabled(spte))
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 1a163aee9ec6..92da4c419171 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -511,8 +511,6 @@ static inline u64 restore_acc_track_spte(u64 spte)
>   	return spte;
>   }
>   
> -u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
> -
>   void __init kvm_mmu_spte_module_init(void);
>   void kvm_mmu_reset_all_pte_masks(void);
>   
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 3627744fcab6..fbb86932b766 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1250,52 +1250,6 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>   	return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
>   }
>   
> -static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
> -			 struct kvm_gfn_range *range)
> -{
> -	u64 new_spte;
> -
> -	/* Huge pages aren't expected to be modified without first being zapped. */
> -	WARN_ON_ONCE(pte_huge(range->arg.pte) || range->start + 1 != range->end);
> -
> -	if (iter->level != PG_LEVEL_4K ||
> -	    !is_shadow_present_pte(iter->old_spte))
> -		return false;
> -
> -	/*
> -	 * Note, when changing a read-only SPTE, it's not strictly necessary to
> -	 * zero the SPTE before setting the new PFN, but doing so preserves the
> -	 * invariant that the PFN of a present * leaf SPTE can never change.
> -	 * See handle_changed_spte().
> -	 */
> -	tdp_mmu_iter_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);
> -
> -	if (!pte_write(range->arg.pte)) {
> -		new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
> -								  pte_pfn(range->arg.pte));
> -
> -		tdp_mmu_iter_set_spte(kvm, iter, new_spte);
> -	}
> -
> -	return true;
> -}
> -
> -/*
> - * Handle the changed_pte MMU notifier for the TDP MMU.
> - * data is a pointer to the new pte_t mapping the HVA specified by the MMU
> - * notifier.
> - * Returns non-zero if a flush is needed before releasing the MMU lock.
> - */
> -bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -	/*
> -	 * No need to handle the remote TLB flush under RCU protection, the
> -	 * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
> -	 * shadow page. See the WARN on pfn_changed in handle_changed_spte().
> -	 */
> -	return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
> -}
> -
>   /*
>    * Remove write access from all SPTEs at or above min_level that map GFNs
>    * [start, end). Returns true if an SPTE has been changed and the TLBs need to
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 6e1ea04ca885..58b55e61bd33 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -31,7 +31,6 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
>   				 bool flush);
>   bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
>   bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> -bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>   
>   bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
>   			     const struct kvm_memory_slot *slot, int min_level);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ab439706ea2f..8dea11701ab2 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -259,7 +259,6 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>   
>   #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>   union kvm_mmu_notifier_arg {
> -	pte_t pte;
>   	unsigned long attributes;
>   };
>   
> @@ -273,7 +272,6 @@ struct kvm_gfn_range {
>   bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
>   bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>   bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>   #endif
>   
>   enum {
> diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
> index 011fba6b5552..74e40d5d4af4 100644
> --- a/include/trace/events/kvm.h
> +++ b/include/trace/events/kvm.h
> @@ -456,21 +456,6 @@ TRACE_EVENT(kvm_unmap_hva_range,
>   		  __entry->start, __entry->end)
>   );
>   
> -TRACE_EVENT(kvm_set_spte_hva,
> -	TP_PROTO(unsigned long hva),
> -	TP_ARGS(hva),
> -
> -	TP_STRUCT__entry(
> -		__field(	unsigned long,	hva		)
> -	),
> -
> -	TP_fast_assign(
> -		__entry->hva		= hva;
> -	),
> -
> -	TP_printk("mmu notifier set pte hva: %#016lx", __entry->hva)
> -);
> -
>   TRACE_EVENT(kvm_age_hva,
>   	TP_PROTO(unsigned long start, unsigned long end),
>   	TP_ARGS(start, end),
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4eb8afd0b961..2fcd9979752a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -717,48 +717,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
>   	return __kvm_handle_hva_range(kvm, &range).ret;
>   }
>   
> -static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -	/*
> -	 * Skipping invalid memslots is correct if and only change_pte() is
> -	 * surrounded by invalidate_range_{start,end}(), which is currently
> -	 * guaranteed by the primary MMU.  If that ever changes, KVM needs to
> -	 * unmap the memslot instead of skipping the memslot to ensure that KVM
> -	 * doesn't hold references to the old PFN.
> -	 */
> -	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> -
> -	if (range->slot->flags & KVM_MEMSLOT_INVALID)
> -		return false;
> -
> -	return kvm_set_spte_gfn(kvm, range);
> -}
> -
> -static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> -					struct mm_struct *mm,
> -					unsigned long address,
> -					pte_t pte)
> -{
> -	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> -	const union kvm_mmu_notifier_arg arg = { .pte = pte };
> -
> -	trace_kvm_set_spte_hva(address);
> -
> -	/*
> -	 * .change_pte() must be surrounded by .invalidate_range_{start,end}().
> -	 * If mmu_invalidate_in_progress is zero, then no in-progress
> -	 * invalidations, including this one, found a relevant memslot at
> -	 * start(); rechecking memslots here is unnecessary.  Note, a false
> -	 * positive (count elevated by a different invalidation) is sub-optimal
> -	 * but functionally ok.
> -	 */
> -	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> -	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> -		return;
> -
> -	kvm_handle_hva_range(mn, address, address + 1, arg, kvm_change_spte_gfn);
> -}
> -
>   void kvm_mmu_invalidate_begin(struct kvm *kvm)
>   {
>   	lockdep_assert_held_write(&kvm->mmu_lock);
> @@ -976,7 +934,6 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
>   	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
>   	.clear_young		= kvm_mmu_notifier_clear_young,
>   	.test_young		= kvm_mmu_notifier_test_young,
> -	.change_pte		= kvm_mmu_notifier_change_pte,
>   	.release		= kvm_mmu_notifier_release,
>   };
>   
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] mmu_notifier: remove the .change_pte() callback
  2024-04-05 11:58 ` [PATCH 3/4] mmu_notifier: remove the .change_pte() callback Paolo Bonzini
@ 2024-04-08  7:35   ` David Hildenbrand
  0 siblings, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-04-08  7:35 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel, kvm
  Cc: Marc Zyngier, Oliver Upton, Tianrui Zhao, Bibo Mao,
	Thomas Bogendoerfer, Nicholas Piggin, Anup Patel, Atish Patra,
	Sean Christopherson, Andrew Morton, linux-arm-kernel, kvmarm,
	loongarch, linux-mips, linuxppc-dev, kvm-riscv, linux-mm,
	linux-trace-kernel, linux-perf-users

On 05.04.24 13:58, Paolo Bonzini wrote:
> The scope of set_pte_at_notify() has reduced more and more through the
> years.  Initially, it was meant for when the change to the PTE was
> not bracketed by mmu_notifier_invalidate_range_{start,end}().  However,
> that has not been so for over ten years.  During all this period
> the only implementation of .change_pte() was KVM and it
> had no actual functionality, because it was called after
> mmu_notifier_invalidate_range_start() zapped the secondary PTE.
> 
> Now that this (nonfunctional) user of the .change_pte() callback is
> gone, the whole callback can be removed.  For now, leave in place
> set_pte_at_notify() even though it is just a synonym for set_pte_at().
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at()
  2024-04-05 11:58 ` [PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at() Paolo Bonzini
  2024-04-08  6:28   ` Philippe Mathieu-Daudé
@ 2024-04-08  7:36   ` David Hildenbrand
  1 sibling, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-04-08  7:36 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel, kvm
  Cc: Marc Zyngier, Oliver Upton, Tianrui Zhao, Bibo Mao,
	Thomas Bogendoerfer, Nicholas Piggin, Anup Patel, Atish Patra,
	Sean Christopherson, Andrew Morton, linux-arm-kernel, kvmarm,
	loongarch, linux-mips, linuxppc-dev, kvm-riscv, linux-mm,
	linux-trace-kernel, linux-perf-users

On 05.04.24 13:58, Paolo Bonzini wrote:
> With the demise of the .change_pte() MMU notifier callback, there is no
> notification happening in set_pte_at_notify().  It is a synonym of
> set_pte_at() and can be replaced with it.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

A real joy seeing that gone.

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-05 11:58 ` [PATCH 1/4] KVM: delete .change_pte MMU notifier callback Paolo Bonzini
  2024-04-07  4:50   ` Anup Patel
  2024-04-08  7:23   ` maobibo
@ 2024-04-08 11:45   ` Michael Ellerman
  2024-04-08 13:56   ` Peter Xu
  2024-04-12 10:44   ` Will Deacon
  4 siblings, 0 replies; 28+ messages in thread
From: Michael Ellerman @ 2024-04-08 11:45 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel, kvm
  Cc: Marc Zyngier, Oliver Upton, Tianrui Zhao, Bibo Mao,
	Thomas Bogendoerfer, Nicholas Piggin, Anup Patel, Atish Patra,
	Sean Christopherson, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

Paolo Bonzini <pbonzini@redhat.com> writes:
> The .change_pte() MMU notifier callback was intended as an
> optimization. The original point of it was that KSM could tell KVM to flip
> its secondary PTE to a new location without having to first zap it. At
> the time there was also an .invalidate_page() callback; both of them were
> *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
> and .invalidate_page() also doubled as a fallback implementation of
> .change_pte().
>
> Later on, however, both callbacks were changed to occur within an
> invalidate_range_start/end() block.
>
> In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
> set_pte_at_notify with invalidate_range_start and invalidate_range_end",
> 2012-10-09) did so to remove the fallback from .invalidate_page() to
> .change_pte() and allow sleepable .invalidate_page() hooks.
>
> This however made KVM's usage of the .change_pte() callback completely
> moot, because KVM unmaps the sPTEs during .invalidate_range_start()
> and therefore .change_pte() has no hope of finding a sPTE to change.
> Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
> well as all the architecture specific implementations.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/arm64/kvm/mmu.c                  | 34 -----------------
>  arch/loongarch/include/asm/kvm_host.h |  1 -
>  arch/loongarch/kvm/mmu.c              | 32 ----------------
>  arch/mips/kvm/mmu.c                   | 30 ---------------
>  arch/powerpc/include/asm/kvm_ppc.h    |  1 -
>  arch/powerpc/kvm/book3s.c             |  5 ---
>  arch/powerpc/kvm/book3s.h             |  1 -
>  arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 ------
>  arch/powerpc/kvm/book3s_hv.c          |  1 -
>  arch/powerpc/kvm/book3s_pr.c          |  7 ----
>  arch/powerpc/kvm/e500_mmu_host.c      |  6 ---

LGTM.

Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)

cheers

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-05 11:58 ` [PATCH 1/4] KVM: delete .change_pte MMU notifier callback Paolo Bonzini
                     ` (2 preceding siblings ...)
  2024-04-08 11:45   ` Michael Ellerman
@ 2024-04-08 13:56   ` Peter Xu
  2024-04-11 16:55     ` Paolo Bonzini
  2024-04-12 10:44   ` Will Deacon
  4 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2024-04-08 13:56 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Marc Zyngier, Oliver Upton, Tianrui Zhao,
	Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin, Anup Patel,
	Atish Patra, Sean Christopherson, Andrew Morton,
	David Hildenbrand, linux-arm-kernel, kvmarm, loongarch,
	linux-mips, linuxppc-dev, kvm-riscv, linux-mm,
	linux-trace-kernel, linux-perf-users, Andrea Arcangeli

On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> The .change_pte() MMU notifier callback was intended as an
> optimization. The original point of it was that KSM could tell KVM to flip
> its secondary PTE to a new location without having to first zap it. At
> the time there was also an .invalidate_page() callback; both of them were
> *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
> and .invalidate_page() also doubled as a fallback implementation of
> .change_pte().
> 
> Later on, however, both callbacks were changed to occur within an
> invalidate_range_start/end() block.
> 
> In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
> set_pte_at_notify with invalidate_range_start and invalidate_range_end",
> 2012-10-09) did so to remove the fallback from .invalidate_page() to
> .change_pte() and allow sleepable .invalidate_page() hooks.
> 
> This however made KVM's usage of the .change_pte() callback completely
> moot, because KVM unmaps the sPTEs during .invalidate_range_start()
> and therefore .change_pte() has no hope of finding a sPTE to change.
> Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
> well as all the architecture specific implementations.

Paolo,

I may miss a bunch of details here (as I still remember some change_pte
patches previously on the list..), however not sure whether we considered
enable it?  Asked because I remember Andrea used to have a custom tree
maintaining that part:

https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968

Maybe it can't be enabled for some reason that I overlooked in the current
tree, or we just decided to not to?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify()
  2024-04-05 11:58 [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify() Paolo Bonzini
                   ` (3 preceding siblings ...)
  2024-04-05 11:58 ` [PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at() Paolo Bonzini
@ 2024-04-10 21:30 ` Andrew Morton
  2024-04-11 16:57   ` Paolo Bonzini
  2024-04-12 13:07 ` Marc Zyngier
  5 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2024-04-10 21:30 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Marc Zyngier, Oliver Upton, Tianrui Zhao,
	Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin, Anup Patel,
	Atish Patra, Sean Christopherson, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

On Fri,  5 Apr 2024 07:58:11 -0400 Paolo Bonzini <pbonzini@redhat.com> wrote:

> Please review!  Also feel free to take the KVM patches through the mm
> tree, as I don't expect any conflicts.

It's mainly a KVM thing and the MM changes are small and simple.
I'd say that the KVM tree would be a better home?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-08 13:56   ` Peter Xu
@ 2024-04-11 16:55     ` Paolo Bonzini
  2024-04-11 18:47       ` Peter Xu
  2024-04-12 20:01       ` David Hildenbrand
  0 siblings, 2 replies; 28+ messages in thread
From: Paolo Bonzini @ 2024-04-11 16:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, kvm, Marc Zyngier, Oliver Upton, Tianrui Zhao,
	Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin, Anup Patel,
	Atish Patra, Sean Christopherson, Andrew Morton,
	David Hildenbrand, linux-arm-kernel, kvmarm, loongarch,
	linux-mips, linuxppc-dev, kvm-riscv, linux-mm,
	linux-trace-kernel, linux-perf-users, Andrea Arcangeli

On Mon, Apr 8, 2024 at 3:56 PM Peter Xu <peterx@redhat.com> wrote:
> Paolo,
>
> I may miss a bunch of details here (as I still remember some change_pte
> patches previously on the list..), however not sure whether we considered
> enable it?  Asked because I remember Andrea used to have a custom tree
> maintaining that part:
>
> https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968

The patch enables it only for KSM, so it would still require a bunch
of cleanups, for example I also would still use set_pte_at() in all
the places that are not KSM. This would at least fix the issue with
the poor documentation of where to use set_pte_at_notify() vs
set_pte_at().
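
Roughly, the call-site split I have in mind looks like this (just a sketch
of the direction, not an actual diff):

	/* mm/ksm.c:replace_page() -- the one caller that would keep the
	 * notifying variant, since KSM is what the optimization is for: */
	set_pte_at_notify(mm, addr, ptep, newpte);

	/* mm/memory.c, kernel/events/uprobes.c, mm/migrate_device.c --
	 * everything else would go back to the plain helper: */
	set_pte_at(mm, addr, ptep, newpte);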

With regard to the implementation, I like the idea of disabling the
invalidation on the MMU notifier side, but I would rather have
MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of
overloading the event field.
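
Something like this is the shape I have in mind (only a sketch: the new
field and its name are invented, and the existing fields are shown roughly
as they are today):

	struct mmu_notifier_range {
		struct mm_struct *mm;
		unsigned long start;
		unsigned long end;
		unsigned flags;
		enum mmu_notifier_event event;
		void *owner;
		/*
		 * Hypothetical: set by KSM when the invalidation only
		 * replaces the PTE with an equivalent read-only one, so
		 * that secondary MMUs may update their mapping in place
		 * instead of zapping it.
		 */
		bool change_pte;
	};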

> Maybe it can't be enabled for some reason that I overlooked in the current
> tree, or we just decided to not to?

I have just learnt about the patch, nobody had ever mentioned it even
though it's almost 2 years old... It's a lot of code though and no one
has ever reported an issue for over 10 years, so I think it's easiest
to just rip the code out.

Paolo

> Thanks,
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify()
  2024-04-10 21:30 ` [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify() Andrew Morton
@ 2024-04-11 16:57   ` Paolo Bonzini
  0 siblings, 0 replies; 28+ messages in thread
From: Paolo Bonzini @ 2024-04-11 16:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, kvm, Marc Zyngier, Oliver Upton, Tianrui Zhao,
	Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin, Anup Patel,
	Atish Patra, Sean Christopherson, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

On Wed, Apr 10, 2024 at 11:30 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Fri,  5 Apr 2024 07:58:11 -0400 Paolo Bonzini <pbonzini@redhat.com> wrote:
> > Please review!  Also feel free to take the KVM patches through the mm
> > tree, as I don't expect any conflicts.
>
> It's mainly a KVM thing and the MM changes are small and simple.
> I'd say that the KVM tree would be a better home?

Sure! I'll queue them on my side then.

Paolo


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-11 16:55     ` Paolo Bonzini
@ 2024-04-11 18:47       ` Peter Xu
  2024-04-12 20:01       ` David Hildenbrand
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Xu @ 2024-04-11 18:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Marc Zyngier, Oliver Upton, Tianrui Zhao,
	Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin, Anup Patel,
	Atish Patra, Sean Christopherson, Andrew Morton,
	David Hildenbrand, linux-arm-kernel, kvmarm, loongarch,
	linux-mips, linuxppc-dev, kvm-riscv, linux-mm,
	linux-trace-kernel, linux-perf-users, Andrea Arcangeli

On Thu, Apr 11, 2024 at 06:55:44PM +0200, Paolo Bonzini wrote:
> On Mon, Apr 8, 2024 at 3:56 PM Peter Xu <peterx@redhat.com> wrote:
> > Paolo,
> >
> > I may miss a bunch of details here (as I still remember some change_pte
> > patches previously on the list..), however not sure whether we considered
> > enable it?  Asked because I remember Andrea used to have a custom tree
> > maintaining that part:
> >
> > https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968
> 
> The patch enables it only for KSM, so it would still require a bunch
> of cleanups, for example I also would still use set_pte_at() in all
> the places that are not KSM. This would at least fix the issue with
> the poor documentation of where to use set_pte_at_notify() vs
> set_pte_at().
> 
> With regard to the implementation, I like the idea of disabling the
> invalidation on the MMU notifier side, but I would rather have
> MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of
> overloading the event field.
> 
> > Maybe it can't be enabled for some reason that I overlooked in the current
> > tree, or we just decided to not to?
> 
> I have just learnt about the patch, nobody had ever mentioned it even
> though it's almost 2 years old... It's a lot of code though and no one
> has ever reported an issue for over 10 years, so I think it's easiest
> to just rip the code out.

Right, it was pretty old and I have no idea whether it was ever discussed or
published before.  It would have been better to discuss this earlier.

As long as we make the decision with that work in mind, either way looks
fine to me, and I agree that either one is better than keeping the status
quo.

I also copied Andrea when I replied, so he should be aware of this and can
chime in anytime.

Thanks!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-05 11:58 ` [PATCH 1/4] KVM: delete .change_pte MMU notifier callback Paolo Bonzini
                     ` (3 preceding siblings ...)
  2024-04-08 13:56   ` Peter Xu
@ 2024-04-12 10:44   ` Will Deacon
  2024-04-12 13:15     ` Marc Zyngier
  4 siblings, 1 reply; 28+ messages in thread
From: Will Deacon @ 2024-04-12 10:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Marc Zyngier, Oliver Upton, Tianrui Zhao,
	Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin, Anup Patel,
	Atish Patra, Sean Christopherson, Andrew Morton,
	David Hildenbrand, linux-arm-kernel, kvmarm, loongarch,
	linux-mips, linuxppc-dev, kvm-riscv, linux-mm,
	linux-trace-kernel, linux-perf-users

On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index dc04bc767865..ff17849be9f4 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  	return false;
>  }
>  
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -	kvm_pfn_t pfn = pte_pfn(range->arg.pte);
> -
> -	if (!kvm->arch.mmu.pgt)
> -		return false;
> -
> -	WARN_ON(range->end - range->start != 1);
> -
> -	/*
> -	 * If the page isn't tagged, defer to user_mem_abort() for sanitising
> -	 * the MTE tags. The S2 pte should have been unmapped by
> -	 * mmu_notifier_invalidate_range_end().
> -	 */
> -	if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
> -		return false;
> -
> -	/*
> -	 * We've moved a page around, probably through CoW, so let's treat
> -	 * it just like a translation fault and the map handler will clean
> -	 * the cache to the PoC.
> -	 *
> -	 * The MMU notifiers will have unmapped a huge PMD before calling
> -	 * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
> -	 * therefore we never need to clear out a huge PMD through this
> -	 * calling path and a memcache is not required.
> -	 */
> -	kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
> -			       PAGE_SIZE, __pfn_to_phys(pfn),
> -			       KVM_PGTABLE_PROT_R, NULL, 0);
> -
> -	return false;
> -}
> -
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	u64 size = (range->end - range->start) << PAGE_SHIFT;

Thanks. It's nice to see this code retire:

Acked-by: Will Deacon <will@kernel.org>

Also, if you're in the business of hacking the MMU notifier code, it
would be really great to change the .clear_flush_young() callback so
that the architecture could handle the TLB invalidation. At the moment,
the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
being set by kvm_handle_hva_range(), whereas we could do a much
lighter-weight and targeted TLBI in the architecture page-table code
when we actually update the ptes for small ranges.
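
Something along these lines, maybe (purely illustrative: the stage2_*()
helpers and the arch_flush_done flag are made up, the point is only where
the ranged TLBI and the "skip the big flush" decision would live):

	bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
	{
		u64 size = (range->end - range->start) << PAGE_SHIFT;
		bool young;

		if (!kvm->arch.mmu.pgt)
			return false;

		/* Clear the access flag for just this range ... */
		young = stage2_test_clear_young(kvm, range->start << PAGE_SHIFT,
						size);

		/* ... and do the targeted, ranged TLBI right here ... */
		if (young)
			stage2_flush_range_tlbi(kvm, range->start, range->end);

		/*
		 * ... while still reporting "young" to the caller.  The
		 * missing piece is a way to tell the generic code that the
		 * flush already happened, e.g. a hypothetical flag, so that
		 * flush_on_ret doesn't turn this into a VMID-wide flush.
		 */
		range->arch_flush_done = true;

		return young;
	}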

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify()
  2024-04-05 11:58 [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify() Paolo Bonzini
                   ` (4 preceding siblings ...)
  2024-04-10 21:30 ` [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify() Andrew Morton
@ 2024-04-12 13:07 ` Marc Zyngier
  5 siblings, 0 replies; 28+ messages in thread
From: Marc Zyngier @ 2024-04-12 13:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Oliver Upton, Tianrui Zhao, Bibo Mao,
	Thomas Bogendoerfer, Nicholas Piggin, Anup Patel, Atish Patra,
	Sean Christopherson, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

On Fri, 05 Apr 2024 12:58:11 +0100,
Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> The .change_pte() MMU notifier callback was intended as an optimization
> and for this reason it was initially called without a surrounding
> mmu_notifier_invalidate_range_{start,end}() pair.  It was only ever
> implemented by KVM (which was also the original user of MMU notifiers)
> and the rules on when to call set_pte_at_notify() rather than set_pte_at()
> have always been pretty obscure.
> 
> It may seem a miracle that it has never caused any hard to trigger
> bugs, but there's a good reason for that: KVM's implementation has
> been nonfunctional for a good part of its existence.  Already in
> 2012, commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
> invalidate_range_start and invalidate_range_end", 2012-10-09) changed the
> .change_pte() callback to occur within an invalidate_range_start/end()
> pair; and because KVM unmaps the sPTEs during .invalidate_range_start(),
> .change_pte() has no hope of finding a sPTE to change.
> 
> Therefore, all the code for .change_pte() can be removed from both KVM
> and mm/, and set_pte_at_notify() can be replaced with just set_pte_at().
> 
> Please review!  Also feel free to take the KVM patches through the mm
> tree, as I don't expect any conflicts.
> 
> Thanks,
> 
> Paolo
> 
> Paolo Bonzini (4):
>   KVM: delete .change_pte MMU notifier callback
>   KVM: remove unused argument of kvm_handle_hva_range()
>   mmu_notifier: remove the .change_pte() callback
>   mm: replace set_pte_at_notify() with just set_pte_at()
> 
>  arch/arm64/kvm/mmu.c                  | 34 -----------------
>  arch/loongarch/include/asm/kvm_host.h |  1 -
>  arch/loongarch/kvm/mmu.c              | 32 ----------------
>  arch/mips/kvm/mmu.c                   | 30 ---------------
>  arch/powerpc/include/asm/kvm_ppc.h    |  1 -
>  arch/powerpc/kvm/book3s.c             |  5 ---
>  arch/powerpc/kvm/book3s.h             |  1 -
>  arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 ------
>  arch/powerpc/kvm/book3s_hv.c          |  1 -
>  arch/powerpc/kvm/book3s_pr.c          |  7 ----
>  arch/powerpc/kvm/e500_mmu_host.c      |  6 ---
>  arch/riscv/kvm/mmu.c                  | 20 ----------
>  arch/x86/kvm/mmu/mmu.c                | 54 +--------------------------
>  arch/x86/kvm/mmu/spte.c               | 16 --------
>  arch/x86/kvm/mmu/spte.h               |  2 -
>  arch/x86/kvm/mmu/tdp_mmu.c            | 46 -----------------------
>  arch/x86/kvm/mmu/tdp_mmu.h            |  1 -
>  include/linux/kvm_host.h              |  2 -
>  include/linux/mmu_notifier.h          | 44 ----------------------
>  include/trace/events/kvm.h            | 15 --------
>  kernel/events/uprobes.c               |  5 +--
>  mm/ksm.c                              |  4 +-
>  mm/memory.c                           |  7 +---
>  mm/migrate_device.c                   |  8 +---
>  mm/mmu_notifier.c                     | 17 ---------
>  virt/kvm/kvm_main.c                   | 50 +------------------------
>  26 files changed, 10 insertions(+), 411 deletions(-)
> 

Reviewed-by: Marc Zyngier <maz@kernel.org>

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-12 10:44   ` Will Deacon
@ 2024-04-12 13:15     ` Marc Zyngier
  2024-04-12 14:54       ` Sean Christopherson
  0 siblings, 1 reply; 28+ messages in thread
From: Marc Zyngier @ 2024-04-12 13:15 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paolo Bonzini, linux-kernel, kvm, Oliver Upton, Tianrui Zhao,
	Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin, Anup Patel,
	Atish Patra, Sean Christopherson, Andrew Morton,
	David Hildenbrand, linux-arm-kernel, kvmarm, loongarch,
	linux-mips, linuxppc-dev, kvm-riscv, linux-mm,
	linux-trace-kernel, linux-perf-users

On Fri, 12 Apr 2024 11:44:09 +0100,
Will Deacon <will@kernel.org> wrote:
> 
> On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index dc04bc767865..ff17849be9f4 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> >  	return false;
> >  }
> >  
> > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > -{
> > -	kvm_pfn_t pfn = pte_pfn(range->arg.pte);
> > -
> > -	if (!kvm->arch.mmu.pgt)
> > -		return false;
> > -
> > -	WARN_ON(range->end - range->start != 1);
> > -
> > -	/*
> > -	 * If the page isn't tagged, defer to user_mem_abort() for sanitising
> > -	 * the MTE tags. The S2 pte should have been unmapped by
> > -	 * mmu_notifier_invalidate_range_end().
> > -	 */
> > -	if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
> > -		return false;
> > -
> > -	/*
> > -	 * We've moved a page around, probably through CoW, so let's treat
> > -	 * it just like a translation fault and the map handler will clean
> > -	 * the cache to the PoC.
> > -	 *
> > -	 * The MMU notifiers will have unmapped a huge PMD before calling
> > -	 * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
> > -	 * therefore we never need to clear out a huge PMD through this
> > -	 * calling path and a memcache is not required.
> > -	 */
> > -	kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
> > -			       PAGE_SIZE, __pfn_to_phys(pfn),
> > -			       KVM_PGTABLE_PROT_R, NULL, 0);
> > -
> > -	return false;
> > -}
> > -
> >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> >  	u64 size = (range->end - range->start) << PAGE_SHIFT;
> 
> Thanks. It's nice to see this code retire:
> 
> Acked-by: Will Deacon <will@kernel.org>
> 
> Also, if you're in the business of hacking the MMU notifier code, it
> would be really great to change the .clear_flush_young() callback so
> that the architecture could handle the TLB invalidation. At the moment,
> the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> being set by kvm_handle_hva_range(), whereas we could do a much
> lighter-weight and targeted TLBI in the architecture page-table code
> when we actually update the ptes for small ranges.

Indeed, and I was looking at this earlier this week as it has a pretty
devastating effect with NV (it blows the shadow S2 for that VMID, with
costly consequences).

In general, it feels like the TLB invalidation should stay with the
code that deals with the page tables, as it has a pretty good idea of
what needs to be invalidated and how -- specially on architectures
that have a HW-broadcast facility like arm64.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-12 13:15     ` Marc Zyngier
@ 2024-04-12 14:54       ` Sean Christopherson
  2024-04-13  9:56         ` Marc Zyngier
  0 siblings, 1 reply; 28+ messages in thread
From: Sean Christopherson @ 2024-04-12 14:54 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Will Deacon, Paolo Bonzini, linux-kernel, kvm, Oliver Upton,
	Tianrui Zhao, Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin,
	Anup Patel, Atish Patra, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

On Fri, Apr 12, 2024, Marc Zyngier wrote:
> On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon <will@kernel.org> wrote:
> > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > Also, if you're in the business of hacking the MMU notifier code, it
> > would be really great to change the .clear_flush_young() callback so
> > that the architecture could handle the TLB invalidation. At the moment,
> > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> > being set by kvm_handle_hva_range(), whereas we could do a much
> > lighter-weight and targeted TLBI in the architecture page-table code
> > when we actually update the ptes for small ranges.
> 
> Indeed, and I was looking at this earlier this week as it has a pretty
> devastating effect with NV (it blows the shadow S2 for that VMID, with
> costly consequences).
> 
> In general, it feels like the TLB invalidation should stay with the
> code that deals with the page tables, as it has a pretty good idea of
> what needs to be invalidated and how -- specially on architectures
> that have a HW-broadcast facility like arm64.

Would this be roughly on par with an in-line flush on arm64?  The simpler, more
straightforward solution would be to let architectures override flush_on_ret,
but I would prefer something like the below as x86 can also utilize a range-based
flush when running as a nested hypervisor.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff0a20565f90..b65116294efe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -601,6 +601,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
        struct kvm_gfn_range gfn_range;
        struct kvm_memory_slot *slot;
        struct kvm_memslots *slots;
+       bool need_flush = false;
        int i, idx;
 
        if (WARN_ON_ONCE(range->end <= range->start))
@@ -653,10 +654,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
                                        break;
                        }
                        r.ret |= range->handler(kvm, &gfn_range);
+
+                       /*
+                        * Use a precise gfn-based TLB flush when possible, as
+                        * most mmu_notifier events affect a small-ish range.
+                        * Fall back to a full TLB flush if the gfn-based flush
+                        * fails, and don't bother trying the gfn-based flush
+                        * if a full flush is already pending.
+                        */
+                       if (range->flush_on_ret && !need_flush && r.ret &&
+                           kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
+                                                            gfn_range.end - gfn_range.start + 1))
+                               need_flush = true;
                }
        }
 
-       if (range->flush_on_ret && r.ret)
+       if (need_flush)
                kvm_flush_remote_tlbs(kvm);
 
        if (r.found_memslot)


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-11 16:55     ` Paolo Bonzini
  2024-04-11 18:47       ` Peter Xu
@ 2024-04-12 20:01       ` David Hildenbrand
  1 sibling, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-04-12 20:01 UTC (permalink / raw)
  To: Paolo Bonzini, Peter Xu
  Cc: linux-kernel, kvm, Marc Zyngier, Oliver Upton, Tianrui Zhao,
	Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin, Anup Patel,
	Atish Patra, Sean Christopherson, Andrew Morton,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users,
	Andrea Arcangeli

On 11.04.24 18:55, Paolo Bonzini wrote:
> On Mon, Apr 8, 2024 at 3:56 PM Peter Xu <peterx@redhat.com> wrote:
>> Paolo,
>>
>> I may miss a bunch of details here (as I still remember some change_pte
>> patches previously on the list..), however not sure whether we considered
>> enable it?  Asked because I remember Andrea used to have a custom tree
>> maintaining that part:
>>
>> https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968
> 
> The patch enables it only for KSM, so it would still require a bunch
> of cleanups, for example I also would still use set_pte_at() in all
> the places that are not KSM. This would at least fix the issue with
> the poor documentation of where to use set_pte_at_notify() vs
> set_pte_at().
> 
> With regard to the implementation, I like the idea of disabling the
> invalidation on the MMU notifier side, but I would rather have
> MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of
> overloading the event field.
> 
>> Maybe it can't be enabled for some reason that I overlooked in the current
>> tree, or we just decided to not to?
> 
> I have just learnt about the patch, nobody had ever mentioned it even
> though it's almost 2 years old... It's a lot of code though and no one

I assume Andrea used it on his tree where he also has a version of 
"randprotect" (even included in that commit subject) to mitigate a KSM 
security issue that was reported by some security researchers [1] a 
while ago. From what I recall, the industry did not end up caring about 
that security issue that much.

IIUC, with "randprotect" we get a lot more R/O protection even when not 
de-duplicating a page -- thus the name. Likely, the reporter mentioned 
in the commit is a researcher who played with Andrea's fix for the 
security issue. But I'm just speculating at this point :)

> has ever reported an issue for over 10 years, so I think it's easiest
> to just rip the code out.

Yes. It can always be re-added in a possibly cleaner fashion (as you note 
above), when deemed necessary and we are willing to support it.

[1] https://gruss.cc/files/remote_dedup.pdf

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-12 14:54       ` Sean Christopherson
@ 2024-04-13  9:56         ` Marc Zyngier
  2024-04-15 17:03           ` Sean Christopherson
  0 siblings, 1 reply; 28+ messages in thread
From: Marc Zyngier @ 2024-04-13  9:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Will Deacon, Paolo Bonzini, linux-kernel, kvm, Oliver Upton,
	Tianrui Zhao, Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin,
	Anup Patel, Atish Patra, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

On Fri, 12 Apr 2024 15:54:22 +0100,
Sean Christopherson <seanjc@google.com> wrote:
> 
> On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon <will@kernel.org> wrote:
> > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > Also, if you're in the business of hacking the MMU notifier code, it
> > > would be really great to change the .clear_flush_young() callback so
> > > that the architecture could handle the TLB invalidation. At the moment,
> > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > lighter-weight and targeted TLBI in the architecture page-table code
> > > when we actually update the ptes for small ranges.
> > 
> > Indeed, and I was looking at this earlier this week as it has a pretty
> > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > costly consequences).
> > 
> > In general, it feels like the TLB invalidation should stay with the
> > code that deals with the page tables, as it has a pretty good idea of
> > what needs to be invalidated and how -- specially on architectures
> > that have a HW-broadcast facility like arm64.
> 
> Would this be roughly on par with an in-line flush on arm64?  The simpler, more
> straightforward solution would be to let architectures override flush_on_ret,
> but I would prefer something like the below as x86 can also utilize a range-based
> flush when running as a nested hypervisor.
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ff0a20565f90..b65116294efe 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -601,6 +601,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>         struct kvm_gfn_range gfn_range;
>         struct kvm_memory_slot *slot;
>         struct kvm_memslots *slots;
> +       bool need_flush = false;
>         int i, idx;
>  
>         if (WARN_ON_ONCE(range->end <= range->start))
> @@ -653,10 +654,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>                                         break;
>                         }
>                         r.ret |= range->handler(kvm, &gfn_range);
> +
> +                       /*
> +                        * Use a precise gfn-based TLB flush when possible, as
> +                        * most mmu_notifier events affect a small-ish range.
> +                        * Fall back to a full TLB flush if the gfn-based flush
> +                        * fails, and don't bother trying the gfn-based flush
> +                        * if a full flush is already pending.
> +                        */
> +                       if (range->flush_on_ret && !need_flush && r.ret &&
> +                           kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
> +                                                            gfn_range.end - gfn_range.start + 1))
> +                               need_flush = true;
>                 }
>         }
>  
> -       if (range->flush_on_ret && r.ret)
> +       if (need_flush)
>                 kvm_flush_remote_tlbs(kvm);
>  
>         if (r.found_memslot)

I think this works for us on HW that has range invalidation, which
would already be a positive move.

For the lesser HW that isn't range capable, it also gives the
opportunity to perform the iteration ourselves or go for the nuclear
option if the range is larger than some arbitrary constant (though
this is additional work).

But this still considers the whole range as being affected by
range->handler(). It'd be interesting to try and see whether more
precise tracking is (or isn't) generally beneficial.
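Something along those lines, roughly (the threshold and the per-gfn helper
are invented here purely for illustration):

/* Sketch: iterate for small ranges, go nuclear for larger ones. */
static void flush_aged_range(struct kvm *kvm, gfn_t start, gfn_t end)
{
	gfn_t gfn;

	if (end - start > S2_TLBI_ITER_LIMIT) {		/* arbitrary constant */
		kvm_flush_remote_tlbs(kvm);		/* whole-VMID invalidation */
		return;
	}

	for (gfn = start; gfn < end; gfn++)
		flush_s2_tlb_gfn(kvm, gfn);		/* one page at a time */
}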

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-13  9:56         ` Marc Zyngier
@ 2024-04-15 17:03           ` Sean Christopherson
  2024-04-18 14:19             ` Will Deacon
  0 siblings, 1 reply; 28+ messages in thread
From: Sean Christopherson @ 2024-04-15 17:03 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Will Deacon, Paolo Bonzini, linux-kernel, kvm, Oliver Upton,
	Tianrui Zhao, Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin,
	Anup Patel, Atish Patra, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

On Sat, Apr 13, 2024, Marc Zyngier wrote:
> On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon <will@kernel.org> wrote:
> > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > would be really great to change the .clear_flush_young() callback so
> > > > that the architecture could handle the TLB invalidation. At the moment,
> > > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > lighter-weight and targeted TLBI in the architecture page-table code
> > > > when we actually update the ptes for small ranges.
> > > 
> > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > costly consequences).
> > > 
> > > In general, it feels like the TLB invalidation should stay with the
> > > code that deals with the page tables, as it has a pretty good idea of
> > > what needs to be invalidated and how -- specially on architectures
> > > that have a HW-broadcast facility like arm64.
> > 
> > Would this be roughly on par with an in-line flush on arm64?  The simpler, more
> > straightforward solution would be to let architectures override flush_on_ret,
> > but I would prefer something like the below as x86 can also utilize a range-based
> > flush when running as a nested hypervisor.

...

> I think this works for us on HW that has range invalidation, which
> would already be a positive move.
> 
> For the lesser HW that isn't range capable, it also gives the
> opportunity to perform the iteration ourselves or go for the nuclear
> option if the range is larger than some arbitrary constant (though
> this is additional work).
> 
> But this still considers the whole range as being affected by
> range->handler(). It'd be interesting to try and see whether more
> precise tracking is (or isn't) generally beneficial.

I assume the idea would be to let arch code do single-page invalidations of
stage-2 entries for each gfn?

Unless I'm having a brain fart, x86 can't make use of that functionality.  Intel
doesn't provide any way to do targeted invalidation of stage-2 mappings.  AMD
provides an instruction to do broadcast invalidations, but it takes a virtual
address, i.e. a stage-1 address.  I can't tell if it's a host virtual address or
a guest virtual address, but it's a moot point because KVM doesn't have the guest
virtual address, and if it's a host virtual address, there would need to be valid
mappings in the host page tables for it to work, which KVM can't guarantee.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-15 17:03           ` Sean Christopherson
@ 2024-04-18 14:19             ` Will Deacon
  2024-04-18 19:53               ` Sean Christopherson
  0 siblings, 1 reply; 28+ messages in thread
From: Will Deacon @ 2024-04-18 14:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Paolo Bonzini, linux-kernel, kvm, Oliver Upton,
	Tianrui Zhao, Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin,
	Anup Patel, Atish Patra, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon <will@kernel.org> wrote:
> > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > would be really great to change the .clear_flush_young() callback so
> > > > > that the architecture could handle the TLB invalidation. At the moment,
> > > > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > lighter-weight and targeted TLBI in the architecture page-table code
> > > > > when we actually update the ptes for small ranges.
> > > > 
> > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > costly consequences).
> > > > 
> > > > In general, it feels like the TLB invalidation should stay with the
> > > > code that deals with the page tables, as it has a pretty good idea of
> > > > what needs to be invalidated and how -- specially on architectures
> > > > that have a HW-broadcast facility like arm64.
> > > 
> > > Would this be roughly on par with an in-line flush on arm64?  The simpler, more
> > > straightforward solution would be to let architectures override flush_on_ret,
> > > but I would prefer something like the below as x86 can also utilize a range-based
> > > flush when running as a nested hypervisor.
> 
> ...
> 
> > I think this works for us on HW that has range invalidation, which
> > would already be a positive move.
> > 
> > For the lesser HW that isn't range capable, it also gives the
> > opportunity to perform the iteration ourselves or go for the nuclear
> > option if the range is larger than some arbitrary constant (though
> > this is additional work).
> > 
> > But this still considers the whole range as being affected by
> > range->handler(). It'd be interesting to try and see whether more
> > precise tracking is (or isn't) generally beneficial.
> 
> I assume the idea would be to let arch code do single-page invalidations of
> stage-2 entries for each gfn?

Right, as it's the only code which knows which ptes actually ended up
being aged.

> Unless I'm having a brain fart, x86 can't make use of that functionality.  Intel
> doesn't provide any way to do targeted invalidation of stage-2 mappings.  AMD
> provides an instruction to do broadcast invalidations, but it takes a virtual
> address, i.e. a stage-1 address.  I can't tell if it's a host virtual address or
> a guest virtual address, but it's a moot point because KVM doesn't have the guest
> virtual address, and if it's a host virtual address, there would need to be valid
> mappings in the host page tables for it to work, which KVM can't guarantee.

Ah, so it sounds like it would need to be an arch opt-in then.

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-18 14:19             ` Will Deacon
@ 2024-04-18 19:53               ` Sean Christopherson
  2024-04-19 11:24                 ` Will Deacon
  0 siblings, 1 reply; 28+ messages in thread
From: Sean Christopherson @ 2024-04-18 19:53 UTC (permalink / raw)
  To: Will Deacon
  Cc: Marc Zyngier, Paolo Bonzini, linux-kernel, kvm, Oliver Upton,
	Tianrui Zhao, Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin,
	Anup Patel, Atish Patra, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

On Thu, Apr 18, 2024, Will Deacon wrote:
> On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> > On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson <seanjc@google.com> wrote:
> > > > 
> > > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon <will@kernel.org> wrote:
> > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > > would be really great to change the .clear_flush_young() callback so
> > > > > > that the architecture could handle the TLB invalidation. At the moment,
> > > > > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> > > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > > lighter-weight and targeted TLBI in the architecture page-table code
> > > > > > when we actually update the ptes for small ranges.
> > > > > 
> > > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > > costly consequences).
> > > > > 
> > > > > In general, it feels like the TLB invalidation should stay with the
> > > > > code that deals with the page tables, as it has a pretty good idea of
> > > > > what needs to be invalidated and how -- specially on architectures
> > > > > that have a HW-broadcast facility like arm64.
> > > > 
> > > > Would this be roughly on par with an in-line flush on arm64?  The simpler, more
> > > > straightforward solution would be to let architectures override flush_on_ret,
> > > > but I would prefer something like the below as x86 can also utilize a range-based
> > > > flush when running as a nested hypervisor.
> > 
> > ...
> > 
> > > I think this works for us on HW that has range invalidation, which
> > > would already be a positive move.
> > > 
> > > For the lesser HW that isn't range capable, it also gives the
> > > opportunity to perform the iteration ourselves or go for the nuclear
> > > option if the range is larger than some arbitrary constant (though
> > > this is additional work).
> > > 
> > > But this still considers the whole range as being affected by
> > > range->handler(). It'd be interesting to try and see whether more
> > > precise tracking is (or isn't) generally beneficial.
> > 
> > I assume the idea would be to let arch code do single-page invalidations of
> > stage-2 entries for each gfn?
> 
> Right, as it's the only code which knows which ptes actually ended up
> being aged.
> 
> > Unless I'm having a brain fart, x86 can't make use of that functionality.  Intel
> > doesn't provide any way to do targeted invalidation of stage-2 mappings.  AMD
> > provides an instruction to do broadcast invalidations, but it takes a virtual
> > address, i.e. a stage-1 address.  I can't tell if it's a host virtual address or
> > a guest virtual address, but it's a moot point because KVM doesn't have the guest
> > virtual address, and if it's a host virtual address, there would need to be valid
> > mappings in the host page tables for it to work, which KVM can't guarantee.
> 
> Ah, so it sounds like it would need to be an arch opt-in then.

Even if x86 (or some other arch code) could use the precise tracking, I think it
would make sense to have the behavior be arch specific.  Adding infrastructure
to get information from arch code, only to turn around and give it back to arch
code would be odd.

Unless arm64 can't do the invalidation immediately after aging the stage-2 PTE,
the best/easiest solution would be to let arm64 opt out of the common TLB flush
when a SPTE is made young.

With the range-based flushing bundled in, this?

---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 40 +++++++++++++++++++++++++---------------
 2 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index afbc99264ffa..8fe5f5e16919 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2010,6 +2010,8 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header;
 extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
 
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
+int kvm_arch_flush_tlb_if_young(void);
+
 static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 {
 	if (unlikely(kvm->mmu_invalidate_in_progress))
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 38b498669ef9..5ebef8ef239c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -595,6 +595,11 @@ static void kvm_null_fn(void)
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
 
+int __weak kvm_arch_flush_tlb_if_young(void)
+{
+	return true;
+}
+
 /* Iterate over each memslot intersecting [start, last] (inclusive) range */
 #define kvm_for_each_memslot_in_hva_range(node, slots, start, last)	     \
 	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
@@ -611,6 +616,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 	struct kvm_gfn_range gfn_range;
 	struct kvm_memory_slot *slot;
 	struct kvm_memslots *slots;
+	bool need_flush = false;
 	int i, idx;
 
 	if (WARN_ON_ONCE(range->end <= range->start))
@@ -663,10 +669,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 					break;
 			}
 			r.ret |= range->handler(kvm, &gfn_range);
+
+		       /*
+			* Use a precise gfn-based TLB flush when possible, as
+			* most mmu_notifier events affect a small-ish range.
+			* Fall back to a full TLB flush if the gfn-based flush
+			* fails, and don't bother trying the gfn-based flush
+			* if a full flush is already pending.
+			*/
+		       if (range->flush_on_ret && !need_flush && r.ret &&
+			   kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
+							    gfn_range.end - gfn_range.start + 1))
+			       need_flush = true;
 		}
 	}
 
-	if (range->flush_on_ret && r.ret)
+	if (need_flush)
 		kvm_flush_remote_tlbs(kvm);
 
 	if (r.found_memslot)
@@ -680,7 +698,8 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 						unsigned long start,
 						unsigned long end,
-						gfn_handler_t handler)
+						gfn_handler_t handler,
+						bool flush_on_ret)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	const struct kvm_mmu_notifier_range range = {
@@ -688,7 +707,7 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 		.end		= end,
 		.handler	= handler,
 		.on_lock	= (void *)kvm_null_fn,
-		.flush_on_ret	= true,
+		.flush_on_ret	= flush_on_ret,
 		.may_block	= false,
 	};
 
@@ -700,17 +719,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 							 unsigned long end,
 							 gfn_handler_t handler)
 {
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_mmu_notifier_range range = {
-		.start		= start,
-		.end		= end,
-		.handler	= handler,
-		.on_lock	= (void *)kvm_null_fn,
-		.flush_on_ret	= false,
-		.may_block	= false,
-	};
-
-	return __kvm_handle_hva_range(kvm, &range).ret;
+	return kvm_handle_hva_range(mn, start, end, handler, false);
 }
 
 void kvm_mmu_invalidate_begin(struct kvm *kvm)
@@ -876,7 +885,8 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 {
 	trace_kvm_age_hva(start, end);
 
-	return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
+	return kvm_handle_hva_range(mn, start, end, kvm_age_gfn,
+				    kvm_arch_flush_tlb_if_young());
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,

base-commit: eae53272c8ad4e7ed2bbb11bd0456eb5b0484f0c
-- 


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-18 19:53               ` Sean Christopherson
@ 2024-04-19 11:24                 ` Will Deacon
  2024-04-19 13:58                   ` Sean Christopherson
  0 siblings, 1 reply; 28+ messages in thread
From: Will Deacon @ 2024-04-19 11:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Paolo Bonzini, linux-kernel, kvm, Oliver Upton,
	Tianrui Zhao, Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin,
	Anup Patel, Atish Patra, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

On Thu, Apr 18, 2024 at 12:53:26PM -0700, Sean Christopherson wrote:
> On Thu, Apr 18, 2024, Will Deacon wrote:
> > On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> > > On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson <seanjc@google.com> wrote:
> > > > > 
> > > > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon <will@kernel.org> wrote:
> > > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > > > would be really great to change the .clear_flush_young() callback so
> > > > > > > that the architecture could handle the TLB invalidation. At the moment,
> > > > > > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> > > > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > > > lighter-weight and targeted TLBI in the architecture page-table code
> > > > > > > when we actually update the ptes for small ranges.
> > > > > > 
> > > > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > > > costly consequences).
> > > > > > 
> > > > > > In general, it feels like the TLB invalidation should stay with the
> > > > > > code that deals with the page tables, as it has a pretty good idea of
> > > > > > what needs to be invalidated and how -- specially on architectures
> > > > > > that have a HW-broadcast facility like arm64.
> > > > > 
> > > > > Would this be roughly on par with an in-line flush on arm64?  The simpler, more
> > > > > straightforward solution would be to let architectures override flush_on_ret,
> > > > > but I would prefer something like the below as x86 can also utilize a range-based
> > > > > flush when running as a nested hypervisor.
> > > 
> > > ...
> > > 
> > > > I think this works for us on HW that has range invalidation, which
> > > > would already be a positive move.
> > > > 
> > > > For the lesser HW that isn't range capable, it also gives the
> > > > opportunity to perform the iteration ourselves or go for the nuclear
> > > > option if the range is larger than some arbitrary constant (though
> > > > this is additional work).
> > > > 
> > > > But this still considers the whole range as being affected by
> > > > range->handler(). It'd be interesting to try and see whether more
> > > > precise tracking is (or isn't) generally beneficial.
> > > 
> > > I assume the idea would be to let arch code do single-page invalidations of
> > > stage-2 entries for each gfn?
> > 
> > Right, as it's the only code which knows which ptes actually ended up
> > being aged.
> > 
> > > Unless I'm having a brain fart, x86 can't make use of that functionality.  Intel
> > > doesn't provide any way to do targeted invalidation of stage-2 mappings.  AMD
> > > provides an instruction to do broadcast invalidations, but it takes a virtual
> > > address, i.e. a stage-1 address.  I can't tell if it's a host virtual address or
> > > a guest virtual address, but it's a moot point because KVM doesn't have the guest
> > > virtual address, and if it's a host virtual address, there would need to be valid
> > > mappings in the host page tables for it to work, which KVM can't guarantee.
> > 
> > Ah, so it sounds like it would need to be an arch opt-in then.
> 
> Even if x86 (or some other arch code) could use the precise tracking, I think it
> would make sense to have the behavior be arch specific.  Adding infrastructure
> to get information from arch code, only to turn around and give it back to arch
> code would be odd.

Sorry, yes, that's what I had in mind. Basically, a way for the arch code
to say "I've handled the TLBI, don't worry about it."

> Unless arm64 can't do the invalidation immediately after aging the stage-2 PTE,
> the best/easiest solution would be to let arm64 opt out of the common TLB flush
> when a SPTE is made young.
> 
> With the range-based flushing bundled in, this?
> 
> ---
>  include/linux/kvm_host.h |  2 ++
>  virt/kvm/kvm_main.c      | 40 +++++++++++++++++++++++++---------------
>  2 files changed, 27 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index afbc99264ffa..8fe5f5e16919 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2010,6 +2010,8 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header;
>  extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
>  
>  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> +int kvm_arch_flush_tlb_if_young(void);
> +
>  static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
>  {
>  	if (unlikely(kvm->mmu_invalidate_in_progress))
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 38b498669ef9..5ebef8ef239c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -595,6 +595,11 @@ static void kvm_null_fn(void)
>  }
>  #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
>  
> +int __weak kvm_arch_flush_tlb_if_young(void)
> +{
> +	return true;
> +}

I tend to find __weak functions a little ugly, but I think the gist of the
diff looks good to me. Thanks for putting it together!
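For the record, the usual alternative to a __weak symbol would be a static
inline that an arch header can override (sketch only; header locations are
examples and bool is used here rather than the int in the diff):

/* common code, e.g. include/linux/kvm_host.h */
#ifndef kvm_arch_flush_tlb_if_young
static inline bool kvm_arch_flush_tlb_if_young(void)
{
	return true;		/* default: keep the common flush */
}
#endif

/* an arch opting out, e.g. arch/arm64/include/asm/kvm_host.h */
#define kvm_arch_flush_tlb_if_young kvm_arch_flush_tlb_if_young
static inline bool kvm_arch_flush_tlb_if_young(void)
{
	return false;		/* arch issues its own targeted TLBI */
}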

> +
>  /* Iterate over each memslot intersecting [start, last] (inclusive) range */
>  #define kvm_for_each_memslot_in_hva_range(node, slots, start, last)	     \
>  	for (node = interval_tree_iter_first(&slots->hva_tree, start, last); \
> @@ -611,6 +616,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>  	struct kvm_gfn_range gfn_range;
>  	struct kvm_memory_slot *slot;
>  	struct kvm_memslots *slots;
> +	bool need_flush = false;
>  	int i, idx;
>  
>  	if (WARN_ON_ONCE(range->end <= range->start))
> @@ -663,10 +669,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>  					break;
>  			}
>  			r.ret |= range->handler(kvm, &gfn_range);
> +
> +		       /*
> +			* Use a precise gfn-based TLB flush when possible, as
> +			* most mmu_notifier events affect a small-ish range.
> +			* Fall back to a full TLB flush if the gfn-based flush
> +			* fails, and don't bother trying the gfn-based flush
> +			* if a full flush is already pending.
> +			*/
> +		       if (range->flush_on_ret && !need_flush && r.ret &&
> +			   kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
> +							    gfn_range.end - gfn_range.start + 1))

What's that '+ 1' needed for here?

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
  2024-04-19 11:24                 ` Will Deacon
@ 2024-04-19 13:58                   ` Sean Christopherson
  0 siblings, 0 replies; 28+ messages in thread
From: Sean Christopherson @ 2024-04-19 13:58 UTC (permalink / raw)
  To: Will Deacon
  Cc: Marc Zyngier, Paolo Bonzini, linux-kernel, kvm, Oliver Upton,
	Tianrui Zhao, Bibo Mao, Thomas Bogendoerfer, Nicholas Piggin,
	Anup Patel, Atish Patra, Andrew Morton, David Hildenbrand,
	linux-arm-kernel, kvmarm, loongarch, linux-mips, linuxppc-dev,
	kvm-riscv, linux-mm, linux-trace-kernel, linux-perf-users

On Fri, Apr 19, 2024, Will Deacon wrote:
> > @@ -663,10 +669,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> >  					break;
> >  			}
> >  			r.ret |= range->handler(kvm, &gfn_range);
> > +
> > +		       /*
> > +			* Use a precise gfn-based TLB flush when possible, as
> > +			* most mmu_notifier events affect a small-ish range.
> > +			* Fall back to a full TLB flush if the gfn-based flush
> > +			* fails, and don't bother trying the gfn-based flush
> > +			* if a full flush is already pending.
> > +			*/
> > +		       if (range->flush_on_ret && !need_flush && r.ret &&
> > +			   kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
> > +							    gfn_range.end - gfn_range.start + 1))
> 
> What's that '+ 1' needed for here?

 (a) To see if you're paying attention.
 (b) Because more is always better.
 (c) Because math is hard.
 (d) Because I haven't tested this.
 (e) Both (c) and (d).

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-04-19 13:58 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-05 11:58 [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify() Paolo Bonzini
2024-04-05 11:58 ` [PATCH 1/4] KVM: delete .change_pte MMU notifier callback Paolo Bonzini
2024-04-07  4:50   ` Anup Patel
2024-04-08  7:23   ` maobibo
2024-04-08 11:45   ` Michael Ellerman
2024-04-08 13:56   ` Peter Xu
2024-04-11 16:55     ` Paolo Bonzini
2024-04-11 18:47       ` Peter Xu
2024-04-12 20:01       ` David Hildenbrand
2024-04-12 10:44   ` Will Deacon
2024-04-12 13:15     ` Marc Zyngier
2024-04-12 14:54       ` Sean Christopherson
2024-04-13  9:56         ` Marc Zyngier
2024-04-15 17:03           ` Sean Christopherson
2024-04-18 14:19             ` Will Deacon
2024-04-18 19:53               ` Sean Christopherson
2024-04-19 11:24                 ` Will Deacon
2024-04-19 13:58                   ` Sean Christopherson
2024-04-05 11:58 ` [PATCH 2/4] KVM: remove unused argument of kvm_handle_hva_range() Paolo Bonzini
2024-04-08  6:31   ` Philippe Mathieu-Daudé
2024-04-05 11:58 ` [PATCH 3/4] mmu_notifier: remove the .change_pte() callback Paolo Bonzini
2024-04-08  7:35   ` David Hildenbrand
2024-04-05 11:58 ` [PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at() Paolo Bonzini
2024-04-08  6:28   ` Philippe Mathieu-Daudé
2024-04-08  7:36   ` David Hildenbrand
2024-04-10 21:30 ` [PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify() Andrew Morton
2024-04-11 16:57   ` Paolo Bonzini
2024-04-12 13:07 ` Marc Zyngier
