* [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit
@ 2023-05-26 23:44 Yu Zhao
  2023-05-26 23:44 ` [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young() Yu Zhao
                   ` (13 more replies)
  0 siblings, 14 replies; 40+ messages in thread
From: Yu Zhao @ 2023-05-26 23:44 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm, Yu Zhao

TLDR
====
This patchset adds a fast path to clear the accessed bit without
taking kvm->mmu_lock. It can significantly improve the performance of
guests when the host is under heavy memory pressure.

ChromeOS has been using a similar approach [1] since mid-2021, and it
has proven successful on tens of millions of devices.

This v2 addresses the previous review feedback [2], e.g., by refactoring
code and removing inaccurate/redundant text.

[1] https://crrev.com/c/2987928
[2] https://lore.kernel.org/r/20230217041230.2417228-1-yuzhao@google.com/

Overview
========
The goal of this patchset is to optimize the performance of guests
when the host memory is overcommitted. It focuses on a simple yet
common case where hardware sets the accessed bit in KVM PTEs and VMs
are not nested. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

The fast path relies on two techniques to safely clear the accessed
bit: RCU and CAS. The former protects KVM page tables from being
freed while the latter clears the accessed bit atomically against
both the hardware and other software page table walkers.
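
As a rough sketch of that pattern (demo_clear_accessed() is a
hypothetical helper, not part of the series; the real per-architecture
walkers are in patches 5, 7 and 9):

  /*
   * Sketch only: clear the accessed bit in a single KVM PTE without
   * taking kvm->mmu_lock. The caller is assumed to hold rcu_read_lock(),
   * which keeps the page table itself from being freed.
   */
  static bool demo_clear_accessed(u64 *sptep, u64 accessed_mask)
  {
          u64 old = READ_ONCE(*sptep);

          if (!(old & accessed_mask))
                  return false;

          /*
           * CAS against both the hardware, which may set the bit
           * concurrently, and other software walkers; on failure, leave
           * the PTE as is and let the next scan retry.
           */
          return cmpxchg64(sptep, old, old & ~accessed_mask) == old;
  }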

A new mmu_notifier_ops member, test_clear_young(), supersedes the
existing clear_young() and test_young(). This extended callback can
operate on a range of KVM PTEs individually according to a bitmap, if
the caller provides it.
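
A minimal caller-side sketch of these semantics, assuming mm, start and
end are in scope and the range spans 64 pages (bit offsets are relative
to end, i.e., the bit for addr is (end-addr)/PAGE_SIZE-1, as documented
in patch 1):

  unsigned long bitmap[BITS_TO_LONGS(64)] = {};

  /* pass 1: record which KVM PTEs in [start, end) are young */
  if (mmu_notifier_test_clear_young(mm, start, end, false, bitmap) ==
      MMU_NOTIFIER_RANGE_LOCKLESS) {
          /* ... clear bits for pages that should keep their accessed bit ... */

          /* pass 2: clear the accessed bit only where a bit is still set */
          mmu_notifier_test_clear_young(mm, start, end, true, bitmap);
  }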

Evaluation
==========
An existing selftest can quickly demonstrate the effectiveness of
this patchset. On a generic workstation equipped with 128 CPUs and
256GB DRAM:

  $ sudo max_guest_memory_test -c 64 -m 250 -s 250
  
  MGLRU        run 2 time
  -----------------------
  Before [1]         ~64s
  After              ~51s
  
  kswapd (MGLRU before)
    100.00%  balance_pgdat
      100.00%  shrink_node
        100.00%  shrink_one
          99.99%  try_to_shrink_lruvec
            99.71%  evict_folios
              97.29%  shrink_folio_list
  ==>>          13.05%  folio_referenced
                  12.83%  rmap_walk_file
                    12.31%  folio_referenced_one
                      7.90%  __mmu_notifier_clear_young
                        7.72%  kvm_mmu_notifier_clear_young
                          7.34%  _raw_write_lock
  
  kswapd (MGLRU after)
    100.00%  balance_pgdat
      100.00%  shrink_node
        100.00%  shrink_one
          99.99%  try_to_shrink_lruvec
            99.59%  evict_folios
              80.37%  shrink_folio_list
  ==>>          3.74%  folio_referenced
                  3.59%  rmap_walk_file
                    3.19%  folio_referenced_one
                      2.53%  lru_gen_look_around
                        1.06%  __mmu_notifier_test_clear_young

Comprehensive benchmarks are coming soon.

[1] "mm: rmap: Don't flush TLB after checking PTE young for page
     reference" was included so that the comparison is apples to
     apples.
    https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/

Yu Zhao (10):
  mm/kvm: add mmu_notifier_ops->test_clear_young()
  mm/kvm: use mmu_notifier_ops->test_clear_young()
  kvm/arm64: export stage2_try_set_pte() and macros
  kvm/arm64: make stage2 page tables RCU safe
  kvm/arm64: add kvm_arch_test_clear_young()
  kvm/powerpc: make radix page tables RCU safe
  kvm/powerpc: add kvm_arch_test_clear_young()
  kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask
  kvm/x86: add kvm_arch_test_clear_young()
  mm: multi-gen LRU: use mmu_notifier_test_clear_young()

 Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
 arch/arm64/include/asm/kvm_host.h             |   6 +
 arch/arm64/include/asm/kvm_pgtable.h          |  55 +++++++
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/hyp/pgtable.c                  |  61 +-------
 arch/arm64/kvm/mmu.c                          |  53 ++++++-
 arch/powerpc/include/asm/kvm_host.h           |   8 +
 arch/powerpc/include/asm/kvm_ppc.h            |   1 +
 arch/powerpc/kvm/book3s.c                     |   6 +
 arch/powerpc/kvm/book3s.h                     |   1 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c        |  65 +++++++-
 arch/powerpc/kvm/book3s_hv.c                  |   5 +
 arch/x86/include/asm/kvm_host.h               |  13 ++
 arch/x86/kvm/mmu.h                            |   6 -
 arch/x86/kvm/mmu/spte.h                       |   1 -
 arch/x86/kvm/mmu/tdp_mmu.c                    |  34 +++++
 include/linux/kvm_host.h                      |  22 +++
 include/linux/mmu_notifier.h                  |  79 ++++++----
 include/linux/mmzone.h                        |   6 +-
 include/trace/events/kvm.h                    |  15 --
 mm/mmu_notifier.c                             |  48 ++----
 mm/rmap.c                                     |   8 +-
 mm/vmscan.c                                   | 139 ++++++++++++++++--
 virt/kvm/kvm_main.c                           | 114 ++++++++------
 24 files changed, 546 insertions(+), 207 deletions(-)

-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()
  2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
@ 2023-05-26 23:44 ` Yu Zhao
  2023-06-06  8:34   ` Tzung-Bi Shih
                     ` (3 more replies)
  2023-05-26 23:44 ` [PATCH mm-unstable v2 02/10] mm/kvm: use mmu_notifier_ops->test_clear_young() Yu Zhao
                   ` (12 subsequent siblings)
  13 siblings, 4 replies; 40+ messages in thread
From: Yu Zhao @ 2023-05-26 23:44 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm, Yu Zhao

Add mmu_notifier_ops->test_clear_young() to supersede test_young()
and clear_young().

test_clear_young() has a fast path which, if supported, allows its
callers to safely clear the accessed bit without taking
kvm->mmu_lock.

The fast path requires arch-specific code that generally relies on
RCU and CAS: the former protects KVM page tables from being freed
while the latter clears the accessed bit atomically against both the
hardware and other software page table walkers. If the fast path is
unsupported, test_clear_young() falls back to the existing slow path
where kvm->mmu_lock is then taken.

test_clear_young() can also operate on a range of KVM PTEs
individually according to a bitmap, if the caller provides it.
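
As an illustrative sketch of the resulting contract (the demo_* helpers
are hypothetical stand-ins for a per-architecture page-table walk; the
real implementations follow in patches 5, 7 and 9):

  bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
  {
          gfn_t gfn;

          for (gfn = range->start; gfn < range->end; gfn++) {
                  /* hypothetical lockless lookup under RCU */
                  u64 *sptep = demo_lookup_spte(kvm, gfn);

                  if (!sptep || !demo_spte_young(*sptep))
                          continue;

                  /* common code decides, per the caller's bitmap, whether to clear */
                  if (kvm_should_clear_young(range, gfn))
                          demo_ptep_clear_young(sptep);
          }

          /* false: the fast path handled the range; true falls back to the slow path */
          return false;
  }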

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/kvm_host.h     | 22 +++++++++++
 include/linux/mmu_notifier.h | 52 ++++++++++++++++++++++++
 mm/mmu_notifier.c            | 24 ++++++++++++
 virt/kvm/kvm_main.c          | 76 +++++++++++++++++++++++++++++++++++-
 4 files changed, 173 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0e571e973bc2..374262545f96 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
 struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
+	void *args;
 	gfn_t start;
 	gfn_t end;
 	pte_t pte;
@@ -267,6 +268,27 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn);
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range);
+#endif
+
+/*
+ * Architectures that implement kvm_arch_test_clear_young() should override
+ * kvm_arch_has_test_clear_young().
+ *
+ * kvm_arch_has_test_clear_young() is allowed to return false positives, i.e., it
+ * can return true if kvm_arch_test_clear_young() is supported but disabled due
+ * to some runtime constraint. In this case, kvm_arch_test_clear_young() should
+ * return true; otherwise, it should return false.
+ *
+ * For each young KVM PTE, kvm_arch_test_clear_young() should call
+ * kvm_should_clear_young() to decide whether to clear the accessed bit.
+ */
+#ifndef kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+	return false;
+}
 #endif
 
 enum {
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 64a3e051c3c4..dfdbb370682d 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -60,6 +60,8 @@ enum mmu_notifier_event {
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
+#define MMU_NOTIFIER_RANGE_LOCKLESS	(1 << 1)
+#define MMU_NOTIFIER_RANGE_YOUNG	(1 << 2)
 
 struct mmu_notifier_ops {
 	/*
@@ -122,6 +124,10 @@ struct mmu_notifier_ops {
 			  struct mm_struct *mm,
 			  unsigned long address);
 
+	int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
+				unsigned long start, unsigned long end,
+				bool clear, unsigned long *bitmap);
+
 	/*
 	 * change_pte is called in cases that pte mapping to page is changed:
 	 * for example, when ksm remaps pte to point to a new shared page.
@@ -392,6 +398,9 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
 				      unsigned long end);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
+extern int __mmu_notifier_test_clear_young(struct mm_struct *mm,
+					   unsigned long start, unsigned long end,
+					   bool clear, unsigned long *bitmap);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
@@ -440,6 +449,35 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 	return 0;
 }
 
+/*
+ * mmu_notifier_test_clear_young() returns nonzero if any of the KVM PTEs within
+ * a given range was young. Specifically, it returns MMU_NOTIFIER_RANGE_LOCKLESS
+ * if the fast path was successful, MMU_NOTIFIER_RANGE_YOUNG otherwise.
+ *
+ * The last parameter is a bitmap and only the fast path supports it: if the
+ * bitmap is NULL, the function falls back to the slow path when the fast path
+ * is unsuccessful; if the bitmap is non-NULL, the function bails out instead.
+ *
+ * The bitmap has the following specifications:
+ * 1. The number of bits should be at least (end-start)/PAGE_SIZE.
+ * 2. The offset of each bit should be relative to the end, i.e., the offset
+ *    corresponding to addr should be (end-addr)/PAGE_SIZE-1. This is convenient
+ *    for batching while forward looping.
+ *
+ * When testing, this function sets the corresponding bit in the bitmap for each
+ * young KVM PTE. When clearing, this function clears the accessed bit for each
+ * young KVM PTE whose corresponding bit in the bitmap is set.
+ */
+static inline int mmu_notifier_test_clear_young(struct mm_struct *mm,
+						unsigned long start, unsigned long end,
+						bool clear, unsigned long *bitmap)
+{
+	if (mm_has_notifiers(mm))
+		return __mmu_notifier_test_clear_young(mm, start, end, clear, bitmap);
+
+	return 0;
+}
+
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 					   unsigned long address, pte_t pte)
 {
@@ -684,12 +722,26 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
 	return 0;
 }
 
+static inline int mmu_notifier_clear_young(struct mm_struct *mm,
+					   unsigned long start,
+					   unsigned long end)
+{
+	return 0;
+}
+
 static inline int mmu_notifier_test_young(struct mm_struct *mm,
 					  unsigned long address)
 {
 	return 0;
 }
 
+static inline int mmu_notifier_test_clear_young(struct mm_struct *mm,
+						unsigned long start, unsigned long end,
+						bool clear, unsigned long *bitmap)
+{
+	return 0;
+}
+
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 					   unsigned long address, pte_t pte)
 {
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 50c0dde1354f..7e6aba4bddcb 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -424,6 +424,30 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	return young;
 }
 
+int __mmu_notifier_test_clear_young(struct mm_struct *mm,
+				    unsigned long start, unsigned long end,
+				    bool clear, unsigned long *bitmap)
+{
+	int idx;
+	struct mmu_notifier *mn;
+	int young = 0;
+
+	idx = srcu_read_lock(&srcu);
+
+	hlist_for_each_entry_srcu(mn, &mm->notifier_subscriptions->list, hlist,
+				  srcu_read_lock_held(&srcu)) {
+		if (mn->ops->test_clear_young)
+			young |= mn->ops->test_clear_young(mn, mm, start, end, clear, bitmap);
+
+		if (young && !clear)
+			break;
+	}
+
+	srcu_read_unlock(&srcu, idx);
+
+	return young;
+}
+
 void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 			       pte_t pte)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 51e4882d0873..31ee58754b19 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -541,6 +541,7 @@ typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
 struct kvm_hva_range {
+	void *args;
 	unsigned long start;
 	unsigned long end;
 	pte_t pte;
@@ -549,6 +550,7 @@ struct kvm_hva_range {
 	on_unlock_fn_t on_unlock;
 	bool flush_on_ret;
 	bool may_block;
+	bool lockless;
 };
 
 /*
@@ -602,6 +604,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			hva_end = min(range->end, slot->userspace_addr +
 						  (slot->npages << PAGE_SHIFT));
 
+			gfn_range.args = range->args;
+
 			/*
 			 * To optimize for the likely case where the address
 			 * range is covered by zero or one memslots, don't
@@ -619,7 +623,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 			gfn_range.slot = slot;
 
-			if (!locked) {
+			if (!range->lockless && !locked) {
 				locked = true;
 				KVM_MMU_LOCK(kvm);
 				if (!IS_KVM_NULL_FN(range->on_lock))
@@ -628,6 +632,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 					break;
 			}
 			ret |= range->handler(kvm, &gfn_range);
+
+			if (range->lockless && ret)
+				break;
 		}
 	}
 
@@ -880,6 +887,72 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 					     kvm_test_age_gfn);
 }
 
+struct test_clear_young_args {
+	unsigned long *bitmap;
+	unsigned long end;
+	bool clear;
+	bool young;
+};
+
+bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn)
+{
+	struct test_clear_young_args *args = range->args;
+
+	VM_WARN_ON_ONCE(gfn < range->start || gfn >= range->end);
+
+	args->young = true;
+
+	if (args->bitmap) {
+		int offset = hva_to_gfn_memslot(args->end - 1, range->slot) - gfn;
+
+		if (args->clear)
+			return test_bit(offset, args->bitmap);
+
+		__set_bit(offset, args->bitmap);
+	}
+
+	return args->clear;
+}
+
+static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_struct *mm,
+					     unsigned long start, unsigned long end,
+					     bool clear, unsigned long *bitmap)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	struct kvm_hva_range range = {
+		.start		= start,
+		.end		= end,
+		.on_lock	= (void *)kvm_null_fn,
+		.on_unlock	= (void *)kvm_null_fn,
+	};
+
+	trace_kvm_age_hva(start, end);
+
+	if (kvm_arch_has_test_clear_young()) {
+		struct test_clear_young_args args = {
+			.bitmap	= bitmap,
+			.end	= end,
+			.clear	= clear,
+		};
+
+		range.args = &args;
+		range.lockless = true;
+		range.handler = kvm_arch_test_clear_young;
+
+		if (!__kvm_handle_hva_range(kvm, &range))
+			return args.young ? MMU_NOTIFIER_RANGE_LOCKLESS : 0;
+	}
+
+	if (bitmap)
+		return 0;
+
+	range.args = NULL;
+	range.lockless = false;
+	range.handler = clear ? kvm_age_gfn : kvm_test_age_gfn;
+
+	return __kvm_handle_hva_range(kvm, &range) ? MMU_NOTIFIER_RANGE_YOUNG : 0;
+}
+
 static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 				     struct mm_struct *mm)
 {
@@ -898,6 +971,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
 	.clear_young		= kvm_mmu_notifier_clear_young,
 	.test_young		= kvm_mmu_notifier_test_young,
+	.test_clear_young	= kvm_mmu_notifier_test_clear_young,
 	.change_pte		= kvm_mmu_notifier_change_pte,
 	.release		= kvm_mmu_notifier_release,
 };
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH mm-unstable v2 02/10] mm/kvm: use mmu_notifier_ops->test_clear_young()
  2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
  2023-05-26 23:44 ` [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young() Yu Zhao
@ 2023-05-26 23:44 ` Yu Zhao
  2023-05-26 23:44 ` [PATCH mm-unstable v2 03/10] kvm/arm64: export stage2_try_set_pte() and macros Yu Zhao
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: Yu Zhao @ 2023-05-26 23:44 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm, Yu Zhao

Replace test_young() and clear_young() with test_clear_young().

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/mmu_notifier.h | 29 ++-----------------
 include/trace/events/kvm.h   | 15 ----------
 mm/mmu_notifier.c            | 42 ----------------------------
 virt/kvm/kvm_main.c          | 54 ------------------------------------
 4 files changed, 2 insertions(+), 138 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index dfdbb370682d..c8f35fc08703 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -104,26 +104,6 @@ struct mmu_notifier_ops {
 				 unsigned long start,
 				 unsigned long end);
 
-	/*
-	 * clear_young is a lightweight version of clear_flush_young. Like the
-	 * latter, it is supposed to test-and-clear the young/accessed bitflag
-	 * in the secondary pte, but it may omit flushing the secondary tlb.
-	 */
-	int (*clear_young)(struct mmu_notifier *subscription,
-			   struct mm_struct *mm,
-			   unsigned long start,
-			   unsigned long end);
-
-	/*
-	 * test_young is called to check the young/accessed bitflag in
-	 * the secondary pte. This is used to know if the page is
-	 * frequently used without actually clearing the flag or tearing
-	 * down the secondary mapping on the page.
-	 */
-	int (*test_young)(struct mmu_notifier *subscription,
-			  struct mm_struct *mm,
-			  unsigned long address);
-
 	int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
 				unsigned long start, unsigned long end,
 				bool clear, unsigned long *bitmap);
@@ -393,11 +373,6 @@ extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long start,
 					  unsigned long end);
-extern int __mmu_notifier_clear_young(struct mm_struct *mm,
-				      unsigned long start,
-				      unsigned long end);
-extern int __mmu_notifier_test_young(struct mm_struct *mm,
-				     unsigned long address);
 extern int __mmu_notifier_test_clear_young(struct mm_struct *mm,
 					   unsigned long start, unsigned long end,
 					   bool clear, unsigned long *bitmap);
@@ -437,7 +412,7 @@ static inline int mmu_notifier_clear_young(struct mm_struct *mm,
 					   unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		return __mmu_notifier_clear_young(mm, start, end);
+		return __mmu_notifier_test_clear_young(mm, start, end, true, NULL);
 	return 0;
 }
 
@@ -445,7 +420,7 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 					  unsigned long address)
 {
 	if (mm_has_notifiers(mm))
-		return __mmu_notifier_test_young(mm, address);
+		return __mmu_notifier_test_clear_young(mm, address, address + 1, false, NULL);
 	return 0;
 }
 
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 3bd31ea23fee..46c347e56e60 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -489,21 +489,6 @@ TRACE_EVENT(kvm_age_hva,
 		  __entry->start, __entry->end)
 );
 
-TRACE_EVENT(kvm_test_age_hva,
-	TP_PROTO(unsigned long hva),
-	TP_ARGS(hva),
-
-	TP_STRUCT__entry(
-		__field(	unsigned long,	hva		)
-	),
-
-	TP_fast_assign(
-		__entry->hva		= hva;
-	),
-
-	TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
-);
-
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 7e6aba4bddcb..c7e9747c9920 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -382,48 +382,6 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 	return young;
 }
 
-int __mmu_notifier_clear_young(struct mm_struct *mm,
-			       unsigned long start,
-			       unsigned long end)
-{
-	struct mmu_notifier *subscription;
-	int young = 0, id;
-
-	id = srcu_read_lock(&srcu);
-	hlist_for_each_entry_rcu(subscription,
-				 &mm->notifier_subscriptions->list, hlist,
-				 srcu_read_lock_held(&srcu)) {
-		if (subscription->ops->clear_young)
-			young |= subscription->ops->clear_young(subscription,
-								mm, start, end);
-	}
-	srcu_read_unlock(&srcu, id);
-
-	return young;
-}
-
-int __mmu_notifier_test_young(struct mm_struct *mm,
-			      unsigned long address)
-{
-	struct mmu_notifier *subscription;
-	int young = 0, id;
-
-	id = srcu_read_lock(&srcu);
-	hlist_for_each_entry_rcu(subscription,
-				 &mm->notifier_subscriptions->list, hlist,
-				 srcu_read_lock_held(&srcu)) {
-		if (subscription->ops->test_young) {
-			young = subscription->ops->test_young(subscription, mm,
-							      address);
-			if (young)
-				break;
-		}
-	}
-	srcu_read_unlock(&srcu, id);
-
-	return young;
-}
-
 int __mmu_notifier_test_clear_young(struct mm_struct *mm,
 				    unsigned long start, unsigned long end,
 				    bool clear, unsigned long *bitmap)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 31ee58754b19..977baaf1b248 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -674,25 +674,6 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 	return __kvm_handle_hva_range(kvm, &range);
 }
 
-static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
-							 unsigned long start,
-							 unsigned long end,
-							 hva_handler_t handler)
-{
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_hva_range range = {
-		.start		= start,
-		.end		= end,
-		.pte		= __pte(0),
-		.handler	= handler,
-		.on_lock	= (void *)kvm_null_fn,
-		.on_unlock	= (void *)kvm_null_fn,
-		.flush_on_ret	= false,
-		.may_block	= false,
-	};
-
-	return __kvm_handle_hva_range(kvm, &range);
-}
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
@@ -854,39 +835,6 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 	return kvm_handle_hva_range(mn, start, end, __pte(0), kvm_age_gfn);
 }
 
-static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
-					struct mm_struct *mm,
-					unsigned long start,
-					unsigned long end)
-{
-	trace_kvm_age_hva(start, end);
-
-	/*
-	 * Even though we do not flush TLB, this will still adversely
-	 * affect performance on pre-Haswell Intel EPT, where there is
-	 * no EPT Access Bit to clear so that we have to tear down EPT
-	 * tables instead. If we find this unacceptable, we can always
-	 * add a parameter to kvm_age_hva so that it effectively doesn't
-	 * do anything on clear_young.
-	 *
-	 * Also note that currently we never issue secondary TLB flushes
-	 * from clear_young, leaving this job up to the regular system
-	 * cadence. If we find this inaccurate, we might come up with a
-	 * more sophisticated heuristic later.
-	 */
-	return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
-}
-
-static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
-				       struct mm_struct *mm,
-				       unsigned long address)
-{
-	trace_kvm_test_age_hva(address);
-
-	return kvm_handle_hva_range_no_flush(mn, address, address + 1,
-					     kvm_test_age_gfn);
-}
-
 struct test_clear_young_args {
 	unsigned long *bitmap;
 	unsigned long end;
@@ -969,8 +917,6 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
 	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
-	.clear_young		= kvm_mmu_notifier_clear_young,
-	.test_young		= kvm_mmu_notifier_test_young,
 	.test_clear_young	= kvm_mmu_notifier_test_clear_young,
 	.change_pte		= kvm_mmu_notifier_change_pte,
 	.release		= kvm_mmu_notifier_release,
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH mm-unstable v2 03/10] kvm/arm64: export stage2_try_set_pte() and macros
  2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
  2023-05-26 23:44 ` [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young() Yu Zhao
  2023-05-26 23:44 ` [PATCH mm-unstable v2 02/10] mm/kvm: use mmu_notifier_ops->test_clear_young() Yu Zhao
@ 2023-05-26 23:44 ` Yu Zhao
  2023-05-26 23:44 ` [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe Yu Zhao
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: Yu Zhao @ 2023-05-26 23:44 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm, Yu Zhao

stage2_try_set_pte() and KVM_PTE_LEAF_ATTR_LO_S2_AF are needed to
implement kvm_arch_test_clear_young().

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 arch/arm64/include/asm/kvm_pgtable.h | 53 ++++++++++++++++++++++++++++
 arch/arm64/kvm/hyp/pgtable.c         | 53 ----------------------------
 2 files changed, 53 insertions(+), 53 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index dc3c072e862f..ff520598b62c 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -44,6 +44,49 @@ typedef u64 kvm_pte_t;
 
 #define KVM_PHYS_INVALID		(-1ULL)
 
+#define KVM_PTE_TYPE			BIT(1)
+#define KVM_PTE_TYPE_BLOCK		0
+#define KVM_PTE_TYPE_PAGE		1
+#define KVM_PTE_TYPE_TABLE		1
+
+#define KVM_PTE_LEAF_ATTR_LO		GENMASK(11, 2)
+
+#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX	GENMASK(4, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP	GENMASK(7, 6)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO	3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW	1
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH	GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS	3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AF	BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR	GENMASK(5, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R	BIT(6)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W	BIT(7)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH	GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS	3
+#define KVM_PTE_LEAF_ATTR_LO_S2_AF	BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_HI		GENMASK(63, 51)
+
+#define KVM_PTE_LEAF_ATTR_HI_SW		GENMASK(58, 55)
+
+#define KVM_PTE_LEAF_ATTR_HI_S1_XN	BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_HI_S2_XN	BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_S2_PERMS	(KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
+					 KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
+					 KVM_PTE_LEAF_ATTR_HI_S2_XN)
+
+#define KVM_INVALID_PTE_OWNER_MASK	GENMASK(9, 2)
+#define KVM_MAX_OWNER_ID		1
+
+/*
+ * Used to indicate a pte for which a 'break-before-make' sequence is in
+ * progress.
+ */
+#define KVM_INVALID_PTE_LOCKED		BIT(10)
+
 static inline bool kvm_pte_valid(kvm_pte_t pte)
 {
 	return pte & KVM_PTE_VALID;
@@ -224,6 +267,16 @@ static inline bool kvm_pgtable_walk_shared(const struct kvm_pgtable_visit_ctx *c
 	return ctx->flags & KVM_PGTABLE_WALK_SHARED;
 }
 
+static inline bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new)
+{
+	if (!kvm_pgtable_walk_shared(ctx)) {
+		WRITE_ONCE(*ctx->ptep, new);
+		return true;
+	}
+
+	return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
+}
+
 /**
  * struct kvm_pgtable_walker - Hook into a page-table walk.
  * @cb:		Callback function to invoke during the walk.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 5282cb9ca4cf..24678ccba76a 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -12,49 +12,6 @@
 #include <asm/stage2_pgtable.h>
 
 
-#define KVM_PTE_TYPE			BIT(1)
-#define KVM_PTE_TYPE_BLOCK		0
-#define KVM_PTE_TYPE_PAGE		1
-#define KVM_PTE_TYPE_TABLE		1
-
-#define KVM_PTE_LEAF_ATTR_LO		GENMASK(11, 2)
-
-#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX	GENMASK(4, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP	GENMASK(7, 6)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO	3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW	1
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH	GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS	3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AF	BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR	GENMASK(5, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R	BIT(6)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W	BIT(7)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH	GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS	3
-#define KVM_PTE_LEAF_ATTR_LO_S2_AF	BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_HI		GENMASK(63, 51)
-
-#define KVM_PTE_LEAF_ATTR_HI_SW		GENMASK(58, 55)
-
-#define KVM_PTE_LEAF_ATTR_HI_S1_XN	BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_HI_S2_XN	BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_S2_PERMS	(KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
-					 KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
-					 KVM_PTE_LEAF_ATTR_HI_S2_XN)
-
-#define KVM_INVALID_PTE_OWNER_MASK	GENMASK(9, 2)
-#define KVM_MAX_OWNER_ID		1
-
-/*
- * Used to indicate a pte for which a 'break-before-make' sequence is in
- * progress.
- */
-#define KVM_INVALID_PTE_LOCKED		BIT(10)
-
 struct kvm_pgtable_walk_data {
 	struct kvm_pgtable_walker	*walker;
 
@@ -702,16 +659,6 @@ static bool stage2_pte_is_locked(kvm_pte_t pte)
 	return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED);
 }
 
-static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new)
-{
-	if (!kvm_pgtable_walk_shared(ctx)) {
-		WRITE_ONCE(*ctx->ptep, new);
-		return true;
-	}
-
-	return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
-}
-
 /**
  * stage2_try_break_pte() - Invalidates a pte according to the
  *			    'break-before-make' requirements of the
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe
  2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
                   ` (2 preceding siblings ...)
  2023-05-26 23:44 ` [PATCH mm-unstable v2 03/10] kvm/arm64: export stage2_try_set_pte() and macros Yu Zhao
@ 2023-05-26 23:44 ` Yu Zhao
  2023-05-27 18:08   ` Oliver Upton
  2023-05-26 23:44 ` [PATCH mm-unstable v2 05/10] kvm/arm64: add kvm_arch_test_clear_young() Yu Zhao
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Yu Zhao @ 2023-05-26 23:44 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm, Yu Zhao

Stage2 page tables are currently not RCU safe against unmapping or VM
destruction. The previous mmu_notifier_ops members rely on
kvm->mmu_lock to synchronize with those operations.

However, the new mmu_notifier_ops member test_clear_young() provides
a fast path that does not take kvm->mmu_lock. To implement
kvm_arch_test_clear_young() for that path, unmapped page tables need
to be freed by RCU, and kvm_free_stage2_pgd() needs to be called after
mmu_notifier_unregister().

Remapping, specifically stage2_free_removed_table(), is already RCU
safe.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  2 ++
 arch/arm64/kvm/arm.c                 |  1 +
 arch/arm64/kvm/hyp/pgtable.c         |  8 ++++++--
 arch/arm64/kvm/mmu.c                 | 17 ++++++++++++++++-
 4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index ff520598b62c..5cab52e3a35f 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
  * @put_page:			Decrement the refcount on a page. When the
  *				refcount reaches 0 the page is automatically
  *				freed.
+ * @put_page_rcu:		RCU variant of the above.
  * @page_count:			Return the refcount of a page.
  * @phys_to_virt:		Convert a physical address into a virtual
  *				address	mapped in the current context.
@@ -170,6 +171,7 @@ struct kvm_pgtable_mm_ops {
 	void		(*free_removed_table)(void *addr, u32 level);
 	void		(*get_page)(void *addr);
 	void		(*put_page)(void *addr);
+	void		(*put_page_rcu)(void *addr);
 	int		(*page_count)(void *addr);
 	void*		(*phys_to_virt)(phys_addr_t phys);
 	phys_addr_t	(*virt_to_phys)(void *addr);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 14391826241c..ee93271035d9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -191,6 +191,7 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
  */
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+	kvm_free_stage2_pgd(&kvm->arch.mmu);
 	bitmap_free(kvm->arch.pmu_filter);
 	free_cpumask_var(kvm->arch.supported_cpus);
 
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 24678ccba76a..dbace4c6a841 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -988,8 +988,12 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
 					       kvm_granule_size(ctx->level));
 
-	if (childp)
-		mm_ops->put_page(childp);
+	if (childp) {
+		if (mm_ops->put_page_rcu)
+			mm_ops->put_page_rcu(childp);
+		else
+			mm_ops->put_page(childp);
+	}
 
 	return 0;
 }
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3b9d4d24c361..c3b3e2afe26f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -172,6 +172,21 @@ static int kvm_host_page_count(void *addr)
 	return page_count(virt_to_page(addr));
 }
 
+static void kvm_s2_rcu_put_page(struct rcu_head *head)
+{
+	put_page(container_of(head, struct page, rcu_head));
+}
+
+static void kvm_s2_put_page_rcu(void *addr)
+{
+	struct page *page = virt_to_page(addr);
+
+	if (kvm_host_page_count(addr) == 1)
+		kvm_account_pgtable_pages(addr, -1);
+
+	call_rcu(&page->rcu_head, kvm_s2_rcu_put_page);
+}
+
 static phys_addr_t kvm_host_pa(void *addr)
 {
 	return __pa(addr);
@@ -704,6 +719,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
 	.free_removed_table	= stage2_free_removed_table,
 	.get_page		= kvm_host_get_page,
 	.put_page		= kvm_s2_put_page,
+	.put_page_rcu		= kvm_s2_put_page_rcu,
 	.page_count		= kvm_host_page_count,
 	.phys_to_virt		= kvm_host_va,
 	.virt_to_phys		= kvm_host_pa,
@@ -1877,7 +1893,6 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-	kvm_free_stage2_pgd(&kvm->arch.mmu);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH mm-unstable v2 05/10] kvm/arm64: add kvm_arch_test_clear_young()
  2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
                   ` (3 preceding siblings ...)
  2023-05-26 23:44 ` [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe Yu Zhao
@ 2023-05-26 23:44 ` Yu Zhao
  2023-05-26 23:44 ` [PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe Yu Zhao
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: Yu Zhao @ 2023-05-26 23:44 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm, Yu Zhao

Implement kvm_arch_test_clear_young() to support the fast path in
mmu_notifier_ops->test_clear_young().

It focuses on a simple case, i.e., hardware sets the accessed bit in
KVM PTEs and VMs are not protected, where it can rely on RCU and
cmpxchg to safely clear the accessed bit without taking
kvm->mmu_lock. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  6 ++++++
 arch/arm64/kvm/mmu.c              | 36 +++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 7e7e19ef6993..da32b0890716 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1113,4 +1113,10 @@ static inline void kvm_hyp_reserve(void) { }
 void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
 bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
 
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+	return cpu_has_hw_af() && !is_protected_kvm_enabled();
+}
+
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c3b3e2afe26f..26a8d955b49c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1678,6 +1678,42 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 					   range->start << PAGE_SHIFT);
 }
 
+static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
+				   enum kvm_pgtable_walk_flags flags)
+{
+	kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
+
+	VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));
+
+	if (!kvm_pte_valid(new))
+		return 0;
+
+	if (new == ctx->old)
+		return 0;
+
+	if (kvm_should_clear_young(ctx->arg, ctx->addr / PAGE_SIZE))
+		stage2_try_set_pte(ctx, new);
+
+	return 0;
+}
+
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	u64 start = range->start * PAGE_SIZE;
+	u64 end = range->end * PAGE_SIZE;
+	struct kvm_pgtable_walker walker = {
+		.cb	= stage2_test_clear_young,
+		.arg	= range,
+		.flags	= KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_SHARED,
+	};
+
+	BUILD_BUG_ON(is_hyp_code());
+
+	kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, &walker);
+
+	return false;
+}
+
 phys_addr_t kvm_mmu_get_httbr(void)
 {
 	return __pa(hyp_pgtable->pgd);
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe
  2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
                   ` (4 preceding siblings ...)
  2023-05-26 23:44 ` [PATCH mm-unstable v2 05/10] kvm/arm64: add kvm_arch_test_clear_young() Yu Zhao
@ 2023-05-26 23:44 ` Yu Zhao
  2023-06-20  6:32   ` Nicholas Piggin
  2023-05-26 23:44 ` [PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young() Yu Zhao
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Yu Zhao @ 2023-05-26 23:44 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm, Yu Zhao

KVM page tables are currently not RCU safe against remapping, i.e.,
kvmppc_unmap_free_pmd_entry_table() et al. The previous
mmu_notifier_ops members rely on kvm->mmu_lock to synchronize with
that operation.

However, the new mmu_notifier_ops member test_clear_young() provides
a fast path that does not take kvm->mmu_lock. To implement
kvm_arch_test_clear_young() for that path, orphan page tables need to
be freed by RCU: with SLAB_TYPESAFE_BY_RCU, their pages are not
returned to the page allocator until an RCU grace period has elapsed,
so lockless walkers under rcu_read_lock() never dereference freed
memory.

Unmapping, specifically kvm_unmap_radix(), does not free page tables
and is therefore not a concern.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 461307b89c3a..3b65b3b11041 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1469,13 +1469,15 @@ int kvmppc_radix_init(void)
 {
 	unsigned long size = sizeof(void *) << RADIX_PTE_INDEX_SIZE;
 
-	kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, 0, pte_ctor);
+	kvm_pte_cache = kmem_cache_create("kvm-pte", size, size,
+					  SLAB_TYPESAFE_BY_RCU, pte_ctor);
 	if (!kvm_pte_cache)
 		return -ENOMEM;
 
 	size = sizeof(void *) << RADIX_PMD_INDEX_SIZE;
 
-	kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, pmd_ctor);
+	kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size,
+					  SLAB_TYPESAFE_BY_RCU, pmd_ctor);
 	if (!kvm_pmd_cache) {
 		kmem_cache_destroy(kvm_pte_cache);
 		return -ENOMEM;
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young()
  2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
                   ` (5 preceding siblings ...)
  2023-05-26 23:44 ` [PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe Yu Zhao
@ 2023-05-26 23:44 ` Yu Zhao
  2023-06-20  7:47   ` Nicholas Piggin
  2023-05-26 23:44 ` [PATCH mm-unstable v2 08/10] kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask Yu Zhao
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Yu Zhao @ 2023-05-26 23:44 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm, Yu Zhao

Implement kvm_arch_test_clear_young() to support the fast path in
mmu_notifier_ops->test_clear_young().

It focuses on a simple case, i.e., radix MMU sets the accessed bit in
KVM PTEs and VMs are not nested, where it can rely on RCU and
pte_xchg() to safely clear the accessed bit without taking
kvm->mmu_lock. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 arch/powerpc/include/asm/kvm_host.h    |  8 ++++
 arch/powerpc/include/asm/kvm_ppc.h     |  1 +
 arch/powerpc/kvm/book3s.c              |  6 +++
 arch/powerpc/kvm/book3s.h              |  1 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 59 ++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_hv.c           |  5 +++
 6 files changed, 80 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..75c260ea8a9e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -883,4 +883,12 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+	return IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
+	       cpu_has_feature(CPU_FTR_HVMODE) && cpu_has_feature(CPU_FTR_ARCH_300) &&
+	       radix_enabled();
+}
+
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 79a9c0bb8bba..ff1af6a7b44f 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -287,6 +287,7 @@ struct kvmppc_ops {
 	bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
 	bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
 	bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
+	bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range *range);
 	bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
 	void (*free_memslot)(struct kvm_memory_slot *slot);
 	int (*init_vm)(struct kvm *kvm);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 686d8d9eda3e..37bf40b0c4ff 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -899,6 +899,12 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
 }
 
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	return !kvm->arch.kvm_ops->test_clear_young ||
+	       kvm->arch.kvm_ops->test_clear_young(kvm, range);
+}
+
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 58391b4b32ed..fa2659e21ccc 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -12,6 +12,7 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
+extern bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
 
 extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 3b65b3b11041..0a392e9a100a 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1088,6 +1088,65 @@ bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	return ref;
 }
 
+bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	bool err;
+	gfn_t gfn = range->start;
+
+	rcu_read_lock();
+
+	err = !kvm_is_radix(kvm);
+	if (err)
+		goto unlock;
+
+	/*
+	 * Case 1:  This function          kvmppc_switch_mmu_to_hpt()
+	 *
+	 *          rcu_read_lock()
+	 *          Test kvm_is_radix()    kvm->arch.radix = 0
+	 *          Use kvm->arch.pgtable  synchronize_rcu()
+	 *          rcu_read_unlock()
+	 *                                 kvmppc_free_radix()
+	 *
+	 *
+	 * Case 2:  This function          kvmppc_switch_mmu_to_radix()
+	 *
+	 *                                 kvmppc_init_vm_radix()
+	 *                                 smp_wmb()
+	 *          Test kvm_is_radix()    kvm->arch.radix = 1
+	 *          smp_rmb()
+	 *          Use kvm->arch.pgtable
+	 */
+	smp_rmb();
+
+	while (gfn < range->end) {
+		pte_t *ptep;
+		pte_t old, new;
+		unsigned int shift;
+
+		ptep = find_kvm_secondary_pte_unlocked(kvm, gfn * PAGE_SIZE, &shift);
+		if (!ptep)
+			goto next;
+
+		VM_WARN_ON_ONCE(!page_count(virt_to_page(ptep)));
+
+		old = READ_ONCE(*ptep);
+		if (!pte_present(old) || !pte_young(old))
+			goto next;
+
+		new = pte_mkold(old);
+
+		if (kvm_should_clear_young(range, gfn))
+			pte_xchg(ptep, old, new);
+next:
+		gfn += shift ? BIT(shift - PAGE_SHIFT) : 1;
+	}
+unlock:
+	rcu_read_unlock();
+
+	return err;
+}
+
 /* Returns the number of PAGE_SIZE pages that are dirty */
 static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 				struct kvm_memory_slot *memslot, int pagenum)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 130bafdb1430..20a81ec9fde8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5262,6 +5262,8 @@ int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
 	spin_lock(&kvm->mmu_lock);
 	kvm->arch.radix = 0;
 	spin_unlock(&kvm->mmu_lock);
+	/* see the comments in kvm_test_clear_young_hv() */
+	synchronize_rcu();
 	kvmppc_free_radix(kvm);
 
 	lpcr = LPCR_VPM1;
@@ -5286,6 +5288,8 @@ int kvmppc_switch_mmu_to_radix(struct kvm *kvm)
 	if (err)
 		return err;
 	kvmppc_rmap_reset(kvm);
+	/* see the comments in kvm_test_clear_young_hv() */
+	smp_wmb();
 	/* Mutual exclusion with kvm_unmap_gfn_range etc. */
 	spin_lock(&kvm->mmu_lock);
 	kvm->arch.radix = 1;
@@ -6185,6 +6189,7 @@ static struct kvmppc_ops kvm_ops_hv = {
 	.unmap_gfn_range = kvm_unmap_gfn_range_hv,
 	.age_gfn = kvm_age_gfn_hv,
 	.test_age_gfn = kvm_test_age_gfn_hv,
+	.test_clear_young = kvm_test_clear_young_hv,
 	.set_spte_gfn = kvm_set_spte_gfn_hv,
 	.free_memslot = kvmppc_core_free_memslot_hv,
 	.init_vm =  kvmppc_core_init_vm_hv,
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH mm-unstable v2 08/10] kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask
  2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
                   ` (6 preceding siblings ...)
  2023-05-26 23:44 ` [PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young() Yu Zhao
@ 2023-05-26 23:44 ` Yu Zhao
  2023-06-15 16:59   ` Sean Christopherson
  2023-05-26 23:44 ` [PATCH mm-unstable v2 09/10] kvm/x86: add kvm_arch_test_clear_young() Yu Zhao
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Yu Zhao @ 2023-05-26 23:44 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm, Yu Zhao

tdp_mmu_enabled and shadow_accessed_mask are needed to implement
kvm_arch_has_test_clear_young().

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 arch/x86/include/asm/kvm_host.h | 6 ++++++
 arch/x86/kvm/mmu.h              | 6 ------
 arch/x86/kvm/mmu/spte.h         | 1 -
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fb9d1f2d6136..753c67072c47 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1772,6 +1772,7 @@ struct kvm_arch_async_pf {
 
 extern u32 __read_mostly kvm_nr_uret_msrs;
 extern u64 __read_mostly host_efer;
+extern u64 __read_mostly shadow_accessed_mask;
 extern bool __read_mostly allow_smaller_maxphyaddr;
 extern bool __read_mostly enable_apicv;
 extern struct kvm_x86_ops kvm_x86_ops;
@@ -1855,6 +1856,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
 			     bool mask);
 
 extern bool tdp_enabled;
+#ifdef CONFIG_X86_64
+extern bool tdp_mmu_enabled;
+#else
+#define tdp_mmu_enabled false
+#endif
 
 u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 92d5a1924fc1..84aedb2671ef 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -253,12 +253,6 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
 	return smp_load_acquire(&kvm->arch.shadow_root_allocated);
 }
 
-#ifdef CONFIG_X86_64
-extern bool tdp_mmu_enabled;
-#else
-#define tdp_mmu_enabled false
-#endif
-
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
 	return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1279db2eab44..a82c4fa1c47b 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -153,7 +153,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
 extern u64 __read_mostly shadow_nx_mask;
 extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
 extern u64 __read_mostly shadow_user_mask;
-extern u64 __read_mostly shadow_accessed_mask;
 extern u64 __read_mostly shadow_dirty_mask;
 extern u64 __read_mostly shadow_mmio_value;
 extern u64 __read_mostly shadow_mmio_mask;
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH mm-unstable v2 09/10] kvm/x86: add kvm_arch_test_clear_young()
  2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
                   ` (7 preceding siblings ...)
  2023-05-26 23:44 ` [PATCH mm-unstable v2 08/10] kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask Yu Zhao
@ 2023-05-26 23:44 ` Yu Zhao
  2023-06-09  9:06   ` Paolo Bonzini
  2023-06-15 18:26   ` Sean Christopherson
  2023-05-26 23:44 ` [PATCH mm-unstable v2 10/10] mm: multi-gen LRU: use mmu_notifier_test_clear_young() Yu Zhao
                   ` (4 subsequent siblings)
  13 siblings, 2 replies; 40+ messages in thread
From: Yu Zhao @ 2023-05-26 23:44 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm, Yu Zhao

Implement kvm_arch_test_clear_young() to support the fast path in
mmu_notifier_ops->test_clear_young().

It focuses on a simple case, i.e., TDP MMU sets the accessed bit in
KVM PTEs and VMs are not nested, where it can rely on RCU and
clear_bit() to safely clear the accessed bit without taking
kvm->mmu_lock. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 arch/x86/include/asm/kvm_host.h |  7 +++++++
 arch/x86/kvm/mmu/tdp_mmu.c      | 34 +++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 753c67072c47..d6dfdebe3d94 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2223,4 +2223,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
  */
 #define KVM_EXIT_HYPERCALL_MBZ		GENMASK_ULL(31, 1)
 
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+	return IS_ENABLED(CONFIG_X86_64) &&
+	       (!IS_REACHABLE(CONFIG_KVM) || (tdp_mmu_enabled && shadow_accessed_mask));
+}
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 08340219c35a..6875a819e007 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1232,6 +1232,40 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
 }
 
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	struct kvm_mmu_page *root;
+	int offset = ffs(shadow_accessed_mask) - 1;
+
+	if (kvm_shadow_root_allocated(kvm))
+		return true;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
+		struct tdp_iter iter;
+
+		if (kvm_mmu_page_as_id(root) != range->slot->as_id)
+			continue;
+
+		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) {
+			u64 *sptep = rcu_dereference(iter.sptep);
+
+			VM_WARN_ON_ONCE(!page_count(virt_to_page(sptep)));
+
+			if (!(iter.old_spte & shadow_accessed_mask))
+				continue;
+
+			if (kvm_should_clear_young(range, iter.gfn))
+				clear_bit(offset, (unsigned long *)sptep);
+		}
+	}
+
+	rcu_read_unlock();
+
+	return false;
+}
+
 static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
 			 struct kvm_gfn_range *range)
 {
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH mm-unstable v2 10/10] mm: multi-gen LRU: use mmu_notifier_test_clear_young()
  2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
                   ` (8 preceding siblings ...)
  2023-05-26 23:44 ` [PATCH mm-unstable v2 09/10] kvm/x86: add kvm_arch_test_clear_young() Yu Zhao
@ 2023-05-26 23:44 ` Yu Zhao
  2023-06-09  0:59 ` kvm/arm64: Spark benchmark Yu Zhao
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: Yu Zhao @ 2023-05-26 23:44 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm, Yu Zhao

Use mmu_notifier_test_clear_young() to handle KVM PTEs in batches
when the fast path is supported. This reduces the contention on
kvm->mmu_lock when the host is under heavy memory pressure.
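
For readers following the diff below: the bitmap argument of
mmu_notifier_test_clear_young() lets a caller test up to MIN_LRU_BATCH
PTEs in a single notifier call and then clear the accessed bit only for
the pages found young. The snippet below is a simplified, illustrative
sketch of that pattern; it is not part of the patch, it assumes a full
batch, and the helper name spte_batch_is_young() is made up for
illustration. The real helpers added to mm/vmscan.c, i.e.
test_spte_young(), clear_spte_young() and skip_spte_young(),
additionally handle partial batches and defer the clearing until the
batch has been walked.

/*
 * Illustrative sketch only, not part of the patch. As the callers below
 * use the bitmap, bit (last - addr) / PAGE_SIZE covers the page at addr,
 * i.e. bit 0 corresponds to the last page of the batch.
 */
static bool spte_batch_is_young(struct mm_struct *mm, unsigned long addr)
{
	DECLARE_BITMAP(bitmap, MIN_LRU_BATCH);
	unsigned long last = addr + MIN_LRU_BATCH * PAGE_SIZE - 1;

	bitmap_zero(bitmap, MIN_LRU_BATCH);

	/* First call: record which SPTEs in [addr, last] are young. */
	mmu_notifier_test_clear_young(mm, addr, last + 1, false, bitmap);

	/* Second call: clear the accessed bit for the pages marked above. */
	mmu_notifier_test_clear_young(mm, addr, last + 1, true, bitmap);

	/* Was the SPTE mapping addr itself young? */
	return test_bit((last - addr) / PAGE_SIZE, bitmap);
}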

An existing selftest can quickly demonstrate the effectiveness of
this patch. On a generic workstation equipped with 128 CPUs and 256GB
DRAM:

  $ sudo max_guest_memory_test -c 64 -m 250 -s 250

  MGLRU         run2
  ------------------
  Before [1]    ~64s
  After         ~51s

  kswapd (MGLRU before)
    100.00%  balance_pgdat
      100.00%  shrink_node
        100.00%  shrink_one
          99.99%  try_to_shrink_lruvec
            99.71%  evict_folios
              97.29%  shrink_folio_list
  ==>>          13.05%  folio_referenced
                  12.83%  rmap_walk_file
                    12.31%  folio_referenced_one
                      7.90%  __mmu_notifier_clear_young
                        7.72%  kvm_mmu_notifier_clear_young
                          7.34%  _raw_write_lock

  kswapd (MGLRU after)
    100.00%  balance_pgdat
      100.00%  shrink_node
        100.00%  shrink_one
          99.99%  try_to_shrink_lruvec
            99.59%  evict_folios
              80.37%  shrink_folio_list
  ==>>          3.74%  folio_referenced
                  3.59%  rmap_walk_file
                    3.19%  folio_referenced_one
                      2.53%  lru_gen_look_around
                        1.06%  __mmu_notifier_test_clear_young

[1] "mm: rmap: Don't flush TLB after checking PTE young for page
    reference" was included so that the comparison is apples to
    apples.
    https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
 include/linux/mmzone.h                        |   6 +-
 mm/rmap.c                                     |   8 +-
 mm/vmscan.c                                   | 139 ++++++++++++++++--
 4 files changed, 138 insertions(+), 21 deletions(-)

diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
index 33e068830497..0ae2a6d4d94c 100644
--- a/Documentation/admin-guide/mm/multigen_lru.rst
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -48,6 +48,10 @@ Values Components
        verified on x86 varieties other than Intel and AMD. If it is
        disabled, the multi-gen LRU will suffer a negligible
        performance degradation.
+0x0008 Clearing the accessed bit in KVM page table entries in large
+       batches, when KVM MMU sets it (e.g., on x86_64). This can
+       improve the performance of guests when the host is under memory
+       pressure.
 [yYnN] Apply to all the components above.
 ====== ===============================================================
 
@@ -56,7 +60,7 @@ E.g.,
 
     echo y >/sys/kernel/mm/lru_gen/enabled
     cat /sys/kernel/mm/lru_gen/enabled
-    0x0007
+    0x000f
     echo 5 >/sys/kernel/mm/lru_gen/enabled
     cat /sys/kernel/mm/lru_gen/enabled
     0x0005
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5a7ada0413da..1b148f39fabc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -369,6 +369,7 @@ enum {
 	LRU_GEN_CORE,
 	LRU_GEN_MM_WALK,
 	LRU_GEN_NONLEAF_YOUNG,
+	LRU_GEN_KVM_MMU_WALK,
 	NR_LRU_GEN_CAPS
 };
 
@@ -471,7 +472,7 @@ struct lru_gen_mm_walk {
 };
 
 void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
 #ifdef CONFIG_MEMCG
 
@@ -559,8 +560,9 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
 {
 }
 
-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
+	return false;
 }
 
 #ifdef CONFIG_MEMCG
diff --git a/mm/rmap.c b/mm/rmap.c
index ae127f60a4fb..3a2c4e6a0887 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -825,12 +825,10 @@ static bool folio_referenced_one(struct folio *folio,
 			return false; /* To break the loop */
 		}
 
-		if (pvmw.pte) {
-			if (lru_gen_enabled() && pte_young(*pvmw.pte)) {
-				lru_gen_look_around(&pvmw);
+		if (lru_gen_enabled() && pvmw.pte) {
+			if (lru_gen_look_around(&pvmw))
 				referenced++;
-			}
-
+		} else if (pvmw.pte) {
 			if (ptep_clear_flush_young_notify(vma, address,
 						pvmw.pte))
 				referenced++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ef687f9be13c..3f734588b600 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -58,6 +58,7 @@
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
 #include <linux/srcu.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -3244,6 +3245,20 @@ static bool should_clear_pmd_young(void)
 	return arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG);
 }
 
+#if IS_ENABLED(CONFIG_KVM)
+#include <linux/kvm_host.h>
+
+static bool should_walk_kvm_mmu(void)
+{
+	return kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_KVM_MMU_WALK);
+}
+#else
+static bool should_walk_kvm_mmu(void)
+{
+	return false;
+}
+#endif
+
 /******************************************************************************
  *                          shorthand helpers
  ******************************************************************************/
@@ -3982,6 +3997,55 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
 	return folio;
 }
 
+static bool test_spte_young(struct mm_struct *mm, unsigned long addr, unsigned long end,
+			    unsigned long *bitmap, unsigned long *last)
+{
+	if (!should_walk_kvm_mmu())
+		return false;
+
+	if (*last > addr)
+		goto done;
+
+	*last = end - addr > MIN_LRU_BATCH * PAGE_SIZE ?
+		addr + MIN_LRU_BATCH * PAGE_SIZE - 1 : end - 1;
+	bitmap_zero(bitmap, MIN_LRU_BATCH);
+
+	mmu_notifier_test_clear_young(mm, addr, *last + 1, false, bitmap);
+done:
+	return test_bit((*last - addr) / PAGE_SIZE, bitmap);
+}
+
+static void clear_spte_young(struct mm_struct *mm, unsigned long addr,
+			     unsigned long *bitmap, unsigned long *last)
+{
+	int i;
+	unsigned long start, end = *last + 1;
+
+	if (addr + PAGE_SIZE != end)
+		return;
+
+	i = find_last_bit(bitmap, MIN_LRU_BATCH);
+	if (i == MIN_LRU_BATCH)
+		return;
+
+	start = end - (i + 1) * PAGE_SIZE;
+
+	i = find_first_bit(bitmap, MIN_LRU_BATCH);
+
+	end -= i * PAGE_SIZE;
+
+	mmu_notifier_test_clear_young(mm, start, end, true, bitmap);
+}
+
+static void skip_spte_young(struct mm_struct *mm, unsigned long addr,
+			    unsigned long *bitmap, unsigned long *last)
+{
+	if (*last > addr)
+		__clear_bit((*last - addr) / PAGE_SIZE, bitmap);
+
+	clear_spte_young(mm, addr, bitmap, last);
+}
+
 static bool suitable_to_scan(int total, int young)
 {
 	int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8);
@@ -3997,6 +4061,8 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	pte_t *pte;
 	spinlock_t *ptl;
 	unsigned long addr;
+	DECLARE_BITMAP(bitmap, MIN_LRU_BATCH);
+	unsigned long last = 0;
 	int total = 0;
 	int young = 0;
 	struct lru_gen_mm_walk *walk = args->private;
@@ -4015,6 +4081,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	pte = pte_offset_map(pmd, start & PMD_MASK);
 restart:
 	for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		bool ret;
 		unsigned long pfn;
 		struct folio *folio;
 
@@ -4022,20 +4089,27 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		walk->mm_stats[MM_LEAF_TOTAL]++;
 
 		pfn = get_pte_pfn(pte[i], args->vma, addr);
-		if (pfn == -1)
+		if (pfn == -1) {
+			skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
 			continue;
+		}
 
-		if (!pte_young(pte[i])) {
+		ret = test_spte_young(args->vma->vm_mm, addr, end, bitmap, &last);
+		if (!ret && !pte_young(pte[i])) {
+			skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
 			walk->mm_stats[MM_LEAF_OLD]++;
 			continue;
 		}
 
 		folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
-		if (!folio)
+		if (!folio) {
+			skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
 			continue;
+		}
 
-		if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
-			VM_WARN_ON_ONCE(true);
+		clear_spte_young(args->vma->vm_mm, addr, bitmap, &last);
+		if (pte_young(pte[i]))
+			ptep_test_and_clear_young(args->vma, addr, pte + i);
 
 		young++;
 		walk->mm_stats[MM_LEAF_YOUNG]++;
@@ -4629,6 +4703,23 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
  *                          rmap/PT walk feedback
  ******************************************************************************/
 
+static bool should_look_around(struct vm_area_struct *vma, unsigned long addr,
+			       pte_t *pte, int *young)
+{
+	int ret = mmu_notifier_clear_young(vma->vm_mm, addr, addr + PAGE_SIZE);
+
+	if (pte_young(*pte)) {
+		ptep_test_and_clear_young(vma, addr, pte);
+		*young = true;
+		return true;
+	}
+
+	if (ret)
+		*young = true;
+
+	return ret & MMU_NOTIFIER_RANGE_LOCKLESS;
+}
+
 /*
  * This function exploits spatial locality when shrink_folio_list() walks the
  * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If
@@ -4636,12 +4727,14 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
  * the PTE table to the Bloom filter. This forms a feedback loop between the
  * eviction and the aging.
  */
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
 	int i;
 	unsigned long start;
 	unsigned long end;
 	struct lru_gen_mm_walk *walk;
+	DECLARE_BITMAP(bitmap, MIN_LRU_BATCH);
+	unsigned long last = 0;
 	int young = 0;
 	pte_t *pte = pvmw->pte;
 	unsigned long addr = pvmw->address;
@@ -4655,8 +4748,11 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	lockdep_assert_held(pvmw->ptl);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
 
+	if (!should_look_around(pvmw->vma, addr, pte, &young))
+		return young;
+
 	if (spin_is_contended(pvmw->ptl))
-		return;
+		return young;
 
 	/* avoid taking the LRU lock under the PTL when possible */
 	walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
@@ -4664,6 +4760,9 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	start = max(addr & PMD_MASK, pvmw->vma->vm_start);
 	end = min(addr | ~PMD_MASK, pvmw->vma->vm_end - 1) + 1;
 
+	if (end - start == PAGE_SIZE)
+		return young;
+
 	if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
 		if (addr - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
 			end = start + MIN_LRU_BATCH * PAGE_SIZE;
@@ -4677,28 +4776,37 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 
 	/* folio_update_gen() requires stable folio_memcg() */
 	if (!mem_cgroup_trylock_pages(memcg))
-		return;
+		return young;
 
 	arch_enter_lazy_mmu_mode();
 
 	pte -= (addr - start) / PAGE_SIZE;
 
 	for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		bool ret;
 		unsigned long pfn;
 
 		pfn = get_pte_pfn(pte[i], pvmw->vma, addr);
-		if (pfn == -1)
+		if (pfn == -1) {
+			skip_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
 			continue;
+		}
 
-		if (!pte_young(pte[i]))
+		ret = test_spte_young(pvmw->vma->vm_mm, addr, end, bitmap, &last);
+		if (!ret && !pte_young(pte[i])) {
+			skip_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
 			continue;
+		}
 
 		folio = get_pfn_folio(pfn, memcg, pgdat, !walk || walk->can_swap);
-		if (!folio)
+		if (!folio) {
+			skip_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
 			continue;
+		}
 
-		if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
-			VM_WARN_ON_ONCE(true);
+		clear_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
+		if (pte_young(pte[i]))
+			ptep_test_and_clear_young(pvmw->vma, addr, pte + i);
 
 		young++;
 
@@ -4728,6 +4836,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	/* feedback from rmap walkers to page table walkers */
 	if (suitable_to_scan(i, young))
 		update_bloom_filter(lruvec, max_seq, pvmw->pmd);
+
+	return young;
 }
 
 /******************************************************************************
@@ -5745,6 +5855,9 @@ static ssize_t enabled_show(struct kobject *kobj, struct kobj_attribute *attr, c
 	if (should_clear_pmd_young())
 		caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
 
+	if (should_walk_kvm_mmu())
+		caps |= BIT(LRU_GEN_KVM_MMU_WALK);
+
 	return sysfs_emit(buf, "0x%04x\n", caps);
 }
 
-- 
2.41.0.rc0.172.g3f132b7071-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe
  2023-05-26 23:44 ` [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe Yu Zhao
@ 2023-05-27 18:08   ` Oliver Upton
  2023-05-27 20:13     ` Yu Zhao
  0 siblings, 1 reply; 40+ messages in thread
From: Oliver Upton @ 2023-05-27 18:08 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Nicholas Piggin, Paul Mackerras,
	Peter Xu, Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

Yu,

On Fri, May 26, 2023 at 05:44:29PM -0600, Yu Zhao wrote:
> Stage2 page tables are currently not RCU safe against unmapping or VM
> destruction. The previous mmu_notifier_ops members rely on
> kvm->mmu_lock to synchronize with those operations.
> 
> However, the new mmu_notifier_ops member test_clear_young() provides
> a fast path that does not take kvm->mmu_lock. To implement
> kvm_arch_test_clear_young() for that path, unmapped page tables need
> to be freed by RCU and kvm_free_stage2_pgd() needs to be after
> mmu_notifier_unregister().
> 
> Remapping, specifically stage2_free_removed_table(), is already RCU
> safe.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  2 ++
>  arch/arm64/kvm/arm.c                 |  1 +
>  arch/arm64/kvm/hyp/pgtable.c         |  8 ++++++--
>  arch/arm64/kvm/mmu.c                 | 17 ++++++++++++++++-
>  4 files changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index ff520598b62c..5cab52e3a35f 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
>   * @put_page:			Decrement the refcount on a page. When the
>   *				refcount reaches 0 the page is automatically
>   *				freed.
> + * @put_page_rcu:		RCU variant of the above.

You don't need to add yet another hook to implement this. I was working
on lock-free walks in a separate context and arrived at the following:

commit f82d264a37745e07ee28e116c336f139f681fd7f
Author: Oliver Upton <oliver.upton@linux.dev>
Date:   Mon May 1 08:53:37 2023 +0000

    KVM: arm64: Consistently use free_removed_table() for stage-2
    
    free_removed_table() is essential to the RCU-protected parallel walking
    scheme, as behind the scenes the cleanup is deferred until an RCU grace
    period. Nonetheless, the stage-2 unmap path calls put_page() directly,
    which leads to table memory being freed inline with the table walk.
    
    This is safe for the time being, as the stage-2 unmap walker is called
    while holding the write lock. A future change to KVM will further relax
    the locking mechanics around the stage-2 page tables to allow lock-free
    walkers protected only by RCU. As such, switch to the RCU-safe mechanism
    for freeing table memory.
    
    Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3d61bd3e591d..bfbebdcb4ef0 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 					       kvm_granule_size(ctx->level));
 
 	if (childp)
-		mm_ops->put_page(childp);
+		mm_ops->free_removed_table(childp, ctx->level);
 
 	return 0;
 }

-- 
Thanks,
Oliver

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe
  2023-05-27 18:08   ` Oliver Upton
@ 2023-05-27 20:13     ` Yu Zhao
  2023-05-30 19:37       ` Oliver Upton
  0 siblings, 1 reply; 40+ messages in thread
From: Yu Zhao @ 2023-05-27 20:13 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Nicholas Piggin, Paul Mackerras,
	Peter Xu, Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Sat, May 27, 2023 at 12:08 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Yu,
>
> On Fri, May 26, 2023 at 05:44:29PM -0600, Yu Zhao wrote:
> > Stage2 page tables are currently not RCU safe against unmapping or VM
> > destruction. The previous mmu_notifier_ops members rely on
> > kvm->mmu_lock to synchronize with those operations.
> >
> > However, the new mmu_notifier_ops member test_clear_young() provides
> > a fast path that does not take kvm->mmu_lock. To implement
> > kvm_arch_test_clear_young() for that path, unmapped page tables need
> > to be freed by RCU and kvm_free_stage2_pgd() needs to be after
> > mmu_notifier_unregister().
> >
> > Remapping, specifically stage2_free_removed_table(), is already RCU
> > safe.
> >
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_pgtable.h |  2 ++
> >  arch/arm64/kvm/arm.c                 |  1 +
> >  arch/arm64/kvm/hyp/pgtable.c         |  8 ++++++--
> >  arch/arm64/kvm/mmu.c                 | 17 ++++++++++++++++-
> >  4 files changed, 25 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index ff520598b62c..5cab52e3a35f 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
> >   * @put_page:                        Decrement the refcount on a page. When the
> >   *                           refcount reaches 0 the page is automatically
> >   *                           freed.
> > + * @put_page_rcu:            RCU variant of the above.
>
> You don't need to add yet another hook to implement this. I was working
> on lock-free walks in a separate context and arrived at the following:
>
> commit f82d264a37745e07ee28e116c336f139f681fd7f
> Author: Oliver Upton <oliver.upton@linux.dev>
> Date:   Mon May 1 08:53:37 2023 +0000
>
>     KVM: arm64: Consistently use free_removed_table() for stage-2
>
>     free_removed_table() is essential to the RCU-protected parallel walking
>     scheme, as behind the scenes the cleanup is deferred until an RCU grace
>     period. Nonetheless, the stage-2 unmap path calls put_page() directly,
>     which leads to table memory being freed inline with the table walk.
>
>     This is safe for the time being, as the stage-2 unmap walker is called
>     while holding the write lock. A future change to KVM will further relax
>     the locking mechanics around the stage-2 page tables to allow lock-free
>     walkers protected only by RCU. As such, switch to the RCU-safe mechanism
>     for freeing table memory.
>
>     Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 3d61bd3e591d..bfbebdcb4ef0 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
>                                                kvm_granule_size(ctx->level));
>
>         if (childp)
> -               mm_ops->put_page(childp);
> +               mm_ops->free_removed_table(childp, ctx->level);

Thanks, Oliver.

A couple of things I haven't had the chance to verify -- I'm hoping
you could help clarify:
1. For unmapping, with free_removed_table(), wouldn't we have to look
into the table we know is empty unnecessarily?
2. For remapping and unmapping, how does free_removed_table() put the
final refcnt on the table passed in? (Previously we had
put_page(childp) in stage2_map_walk_table_post(). So I'm assuming we'd
have to do something equivalent with free_removed_table().)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe
  2023-05-27 20:13     ` Yu Zhao
@ 2023-05-30 19:37       ` Oliver Upton
  2023-05-30 20:06         ` Yu Zhao
  0 siblings, 1 reply; 40+ messages in thread
From: Oliver Upton @ 2023-05-30 19:37 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Nicholas Piggin, Paul Mackerras,
	Peter Xu, Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

Hi Yu,

On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> On Sat, May 27, 2023 at 12:08 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 3d61bd3e591d..bfbebdcb4ef0 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> >                                                kvm_granule_size(ctx->level));
> >
> >         if (childp)
> > -               mm_ops->put_page(childp);
> > +               mm_ops->free_removed_table(childp, ctx->level);
> 
> Thanks, Oliver.
> 
> A couple of things I haven't had the chance to verify -- I'm hoping
> you could help clarify:
> 1. For unmapping, with free_removed_table(), wouldn't we have to look
> into the table we know is empty unnecessarily?

As it is currently implemented, yes. But, there's potential to fast-path
the implementation by checking page_count() before starting the walk.

> 2. For remapping and unmapping, how does free_removed_table() put the
> final refcnt on the table passed in? (Previously we had
> put_page(childp) in stage2_map_walk_table_post(). So I'm assuming we'd
> have to do something equivalent with free_removed_table().)

Heh, that's a bug, and an embarrassing one at that!

Sent out a fix for that, since it would appear we leak memory on
table->block transitions. PTAL if you have a chance.

https://lore.kernel.org/all/20230530193213.1663411-1-oliver.upton@linux.dev/

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe
  2023-05-30 19:37       ` Oliver Upton
@ 2023-05-30 20:06         ` Yu Zhao
       [not found]           ` <ZHef0VsZvZ1Vnz0u@linux.dev>
  0 siblings, 1 reply; 40+ messages in thread
From: Yu Zhao @ 2023-05-30 20:06 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Nicholas Piggin, Paul Mackerras,
	Peter Xu, Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Tue, May 30, 2023 at 1:37 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Hi Yu,
>
> On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> > On Sat, May 27, 2023 at 12:08 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 3d61bd3e591d..bfbebdcb4ef0 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > >                                                kvm_granule_size(ctx->level));
> > >
> > >         if (childp)
> > > -               mm_ops->put_page(childp);
> > > +               mm_ops->free_removed_table(childp, ctx->level);
> >
> > Thanks, Oliver.
> >
> > A couple of things I haven't had the chance to verify -- I'm hoping
> > you could help clarify:
> > 1. For unmapping, with free_removed_table(), wouldn't we have to look
> > into the table we know is empty unnecessarily?
>
> As it is currently implemented, yes. But, there's potential to fast-path
> the implementation by checking page_count() before starting the walk.

Do you mind posting another patch? I'd be happy to ack it, as well as
the one you suggested above.

> > 2. For remapping and unmapping, how does free_removed_table() put the
> > final refcnt on the table passed in? (Previously we had
> > put_page(childp) in stage2_map_walk_table_post(). So I'm assuming we'd
> > have to do something equivalent with free_removed_table().)
>
> Heh, that's a bug, and an embarrassing one at that!
>
> Sent out a fix for that, since it would appear we leak memory on
> table->block transitions. PTAL if you have a chance.
>
> https://lore.kernel.org/all/20230530193213.1663411-1-oliver.upton@linux.dev/

Awesome.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe
       [not found]           ` <ZHef0VsZvZ1Vnz0u@linux.dev>
@ 2023-05-31 23:10             ` Yu Zhao
  2023-05-31 23:22               ` Oliver Upton
  0 siblings, 1 reply; 40+ messages in thread
From: Yu Zhao @ 2023-05-31 23:10 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Nicholas Piggin, Paul Mackerras,
	Peter Xu, Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Wed, May 31, 2023 at 1:28 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> > On Tue, May 30, 2023 at 1:37 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > >
> > > Hi Yu,
> > >
> > > On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> > > > On Sat, May 27, 2023 at 12:08 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > > index 3d61bd3e591d..bfbebdcb4ef0 100644
> > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > > @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > >                                                kvm_granule_size(ctx->level));
> > > > >
> > > > >         if (childp)
> > > > > -               mm_ops->put_page(childp);
> > > > > +               mm_ops->free_removed_table(childp, ctx->level);
> > > >
> > > > Thanks, Oliver.
> > > >
> > > > A couple of things I haven't had the chance to verify -- I'm hoping
> > > > you could help clarify:
> > > > 1. For unmapping, with free_removed_table(), wouldn't we have to look
> > > > into the table we know is empty unnecessarily?
> > >
> > > As it is currently implemented, yes. But, there's potential to fast-path
> > > the implementation by checking page_count() before starting the walk.
> >
> > Do you mind posting another patch? I'd be happy to ack it, as well as
> > the one you suggested above.
>
> I'd rather not take such a patch independent of the test_clear_young
> series if you're OK with that. Do you mind implementing something
> similar to the above patch w/ the proposed optimization if you need it?

No worries. I can take the above together with the following, which
would form a new series with its own merits, since apparently you
think the !AF case is important.

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 26a8d955b49c..6ce73ce9f146 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1453,10 +1453,10 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)

        trace_kvm_access_fault(fault_ipa);

-       read_lock(&vcpu->kvm->mmu_lock);
+       rcu_read_lock();
        mmu = vcpu->arch.hw_mmu;
        pte = kvm_pgtable_stage2_mkyoung(mmu->pgt, fault_ipa);
-       read_unlock(&vcpu->kvm->mmu_lock);
+       rcu_read_unlock();

        if (kvm_pte_valid(pte))
                kvm_set_pfn_accessed(kvm_pte_to_pfn(pte));

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe
  2023-05-31 23:10             ` Yu Zhao
@ 2023-05-31 23:22               ` Oliver Upton
  2023-05-31 23:41                 ` Yu Zhao
  0 siblings, 1 reply; 40+ messages in thread
From: Oliver Upton @ 2023-05-31 23:22 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Nicholas Piggin, Paul Mackerras,
	Peter Xu, Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Wed, May 31, 2023 at 05:10:52PM -0600, Yu Zhao wrote:
> On Wed, May 31, 2023 at 1:28 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> > > On Tue, May 30, 2023 at 1:37 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > As it is currently implemented, yes. But, there's potential to fast-path
> > > > the implementation by checking page_count() before starting the walk.
> > >
> > > Do you mind posting another patch? I'd be happy to ack it, as well as
> > > the one you suggested above.
> >
> > I'd rather not take such a patch independent of the test_clear_young
> > series if you're OK with that. Do you mind implementing something
> > similar to the above patch w/ the proposed optimization if you need it?
> 
> No worries. I can take the above together with the following, which
> would form a new series with its own merits, since apparently you
> think the !AF case is important.

Sorry if my suggestion was unclear.

I thought we were talking about ->free_removed_table() being called from
the stage-2 unmap path, in which case we wind up unnecessarily visiting
PTEs on a table known to be empty. You could fast-path that by only
initiating a walk if  page_count() > 1:

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 95dae02ccc2e..766563dc465c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1331,7 +1331,8 @@ void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pg
 		.end	= kvm_granule_size(level),
 	};
 
-	WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
+	if (mm_ops->page_count(pgtable) > 1)
+		WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
 
 	WARN_ON(mm_ops->page_count(pgtable) != 1);
 	mm_ops->put_page(pgtable);


A lock-free access fault walker is interesting, but in my testing it hasn't
led to any significant improvements over acquiring the MMU lock for
read. Because of that I hadn't bothered with posting the series upstream.

-- 
Thanks,
Oliver

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe
  2023-05-31 23:22               ` Oliver Upton
@ 2023-05-31 23:41                 ` Yu Zhao
  0 siblings, 0 replies; 40+ messages in thread
From: Yu Zhao @ 2023-05-31 23:41 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Nicholas Piggin, Paul Mackerras,
	Peter Xu, Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Wed, May 31, 2023 at 5:23 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Wed, May 31, 2023 at 05:10:52PM -0600, Yu Zhao wrote:
> > On Wed, May 31, 2023 at 1:28 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> > > > On Tue, May 30, 2023 at 1:37 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > > As it is currently implemented, yes. But, there's potential to fast-path
> > > > > the implementation by checking page_count() before starting the walk.
> > > >
> > > > Do you mind posting another patch? I'd be happy to ack it, as well as
> > > > the one you suggested above.
> > >
> > > I'd rather not take such a patch independent of the test_clear_young
> > > series if you're OK with that. Do you mind implementing something
> > > similar to the above patch w/ the proposed optimization if you need it?
> >
> > No worries. I can take the above together with the following, which
> > would form a new series with its own merits, since apparently you
> > think the !AF case is important.
>
> Sorry if my suggestion was unclear.
>
> I thought we were talking about ->free_removed_table() being called from
> the stage-2 unmap path

Yes, we were, or more generally, about how to make KVM page tables RCU safe on ARM.

So I'm thinking about taking 1) your patch above, 2) what I just
suggested and 3) what you suggested below to form a mini series, which
could land independently and would make my job here easier.

> in which case we wind up unnecessarily visiting
> PTEs on a table known to be empty. You could fast-path that by only
> initiating a walk if  page_count() > 1:

Yes, this is what I meant.

> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 95dae02ccc2e..766563dc465c 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1331,7 +1331,8 @@ void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pg
>                 .end    = kvm_granule_size(level),
>         };
>
> -       WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
> +       if (mm_ops->page_count(pgtable) > 1)
> +               WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
>
>         WARN_ON(mm_ops->page_count(pgtable) != 1);
>         mm_ops->put_page(pgtable);
>
>
> A lock-free access fault walker is interesting, but in my testing it hasn't
> led to any significant improvements over acquiring the MMU lock for
> read. Because of that I hadn't bothered with posting the series upstream.

It's hard to measure but we have perf benchmarks on ChromeOS which should help.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()
  2023-05-26 23:44 ` [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young() Yu Zhao
@ 2023-06-06  8:34   ` Tzung-Bi Shih
  2023-06-09  1:00     ` Yu Zhao
       [not found]   ` <ZHedMX470b7EMwbe@ziepe.ca>
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 40+ messages in thread
From: Tzung-Bi Shih @ 2023-06-06  8:34 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Nicholas Piggin, Oliver Upton,
	Paul Mackerras, Peter Xu, Sean Christopherson, Steven Rostedt,
	Suzuki K Poulose, Thomas Gleixner, Thomas Huth, Will Deacon,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linuxppc-dev, linux-trace-kernel, x86,
	linux-mm

On Fri, May 26, 2023 at 05:44:26PM -0600, Yu Zhao wrote:
> +/*
> + * Architectures that implement kvm_arch_test_clear_young() should override
> + * kvm_arch_has_test_clear_young().
> + *
> + * kvm_arch_has_test_clear_young() is allowed to return false positive, i.e., it
> + * can return true if kvm_arch_test_clear_young() is supported but disabled due
> + * to some runtime constraint. In this case, kvm_arch_test_clear_young() should

Is there a typo here?  s/kvm_arch_test_clear_young/kvm_arch_has_test_clear_young/.

> +static inline int mmu_notifier_clear_young(struct mm_struct *mm,
> +					   unsigned long start,
> +					   unsigned long end)
> +{
> +	return 0;
> +}
> +

This looks unrelated to this patch; it appears to be a fix for commit 1d7715c676a1
("mmu-notifier: add clear_young callback") instead.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* kvm/arm64: Spark benchmark
  2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
                   ` (9 preceding siblings ...)
  2023-05-26 23:44 ` [PATCH mm-unstable v2 10/10] mm: multi-gen LRU: use mmu_notifier_test_clear_young() Yu Zhao
@ 2023-06-09  0:59 ` Yu Zhao
  2023-06-09 13:04   ` Marc Zyngier
  2023-06-09  0:59 ` kvm/powerpc: memcached benchmark Yu Zhao
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Yu Zhao @ 2023-06-09  0:59 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

TLDR
====
Apache Spark spent 12% less time sorting four billion random integers twenty times (in ~4 hours) after this patchset [1].

Hardware
========
HOST $ lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 128
  On-line CPU(s) list:  0-127
Vendor ID:              ARM
  Model name:           Neoverse-N1
    Model:              1
    Thread(s) per core: 1
    Core(s) per socket: 64
    Socket(s):          2
    Stepping:           r3p1
    Frequency boost:    disabled
    CPU max MHz:        2800.0000
    CPU min MHz:        1000.0000
    BogoMIPS:           50.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
  L1d:                  8 MiB (128 instances)
  L1i:                  8 MiB (128 instances)
  L2:                   128 MiB (128 instances)
NUMA:
  NUMA node(s):         2
  NUMA node0 CPU(s):    0-63
  NUMA node1 CPU(s):    64-127
Vulnerabilities:
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Not affected
  Mmio stale data:      Not affected
  Retbleed:             Not affected
  Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Mitigation; CSV2, BHB
  Srbds:                Not affected
  Tsx async abort:      Not affected

HOST $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-63
node 0 size: 257730 MB
node 0 free: 1447 MB
node 1 cpus: 64-127
node 1 size: 256877 MB
node 1 free: 256093 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software
========
HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

HOST $ uname -a
Linux arm 6.4.0-rc4 #1 SMP Sat Jun  3 05:30:06 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

HOST $ cat /proc/swaps
Filename				Type		Size		Used		Priority
/dev/nvme0n1p2                          partition	466838356	116922112	-2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x000b

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

HOST $ qemu-system-aarch64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.6)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

GUEST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"

GUEST $ java --version
openjdk 17.0.7 2023-04-18
OpenJDK Runtime Environment (build 17.0.7+7-Ubuntu-0ubuntu122.04.2)
OpenJDK 64-Bit Server VM (build 17.0.7+7-Ubuntu-0ubuntu122.04.2, mixed mode, sharing)

GUEST $ spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/

Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, 17.0.7
Branch HEAD
Compiled by user xinrong.meng on 2023-04-07T02:18:01Z
Revision 87a5442f7ed96b11051d8a9333476d080054e5a0
Url https://github.com/apache/spark
Type --help for more information.

Procedure
=========
HOST $ sudo numactl -N 0 -m 0 qemu-system-aarch64 \
    -M virt,accel=kvm -cpu host -smp 64 -m 300g -nographic -nic user \
    -bios /usr/share/qemu-efi-aarch64/QEMU_EFI.fd \
    -drive if=virtio,format=raw,file=/dev/nvme0n1p1

GUEST $ cat gen.scala
import java.io._
import scala.collection.mutable.ArrayBuffer

object GenData {
    def main(args: Array[String]): Unit = {
        val file = new File("/dev/shm/dataset.txt")
        val writer = new BufferedWriter(new FileWriter(file))
        val buf = ArrayBuffer(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)
        for(_ <- 0 until 400000000) {
            for (i <- 0 until 10) {
                buf.update(i, scala.util.Random.nextLong())
            }
            writer.write(s"${buf.mkString(",")}\n")
        }
        writer.close()
    }
}
GenData.main(Array())

GUEST $ cat sort.scala
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.SparkSession

object SparkSort {
    def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().getOrCreate()
        val file = sc.textFile("/dev/shm/dataset.txt", 64)
        val results = file.flatMap(_.split(",")).map(x => (x, 1)).sortByKey().takeOrdered(10)
        results.foreach(println)
        spark.stop()
    }
}
SparkSort.main(Array())

GUEST $ cat run_spark.sh
export SPARK_LOCAL_DIRS=/dev/shm/

spark-shell <gen.scala

start=$SECONDS

for ((i=0; i<20; i++))
do
	spark-3.4.0-bin-hadoop3/bin/spark-shell --master "local[64]" --driver-memory 160g <sort.scala
done

echo "wall time: $((SECONDS - start))"

Results
=======
                       Before [1]    After    Change
----------------------------------------------------
Wall time (seconds)    14455         12865    -12%

Notes
=====
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
    reference" was included so that the comparison is apples to
    apples.
    https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/

^ permalink raw reply	[flat|nested] 40+ messages in thread

* kvm/powerpc: memcached benchmark
  2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
                   ` (10 preceding siblings ...)
  2023-06-09  0:59 ` kvm/arm64: Spark benchmark Yu Zhao
@ 2023-06-09  0:59 ` Yu Zhao
  2023-06-09  0:59 ` kvm/x86: multichase benchmark Yu Zhao
  2023-06-09  9:07 ` [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Paolo Bonzini
  13 siblings, 0 replies; 40+ messages in thread
From: Yu Zhao @ 2023-06-09  0:59 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

TLDR
====
Memcached achieved 10% more operations per second (in ~4 hours) after this patchset [1].

Hardware
========
HOST $ lscpu
Architecture:          ppc64le
  Byte Order:          Little Endian
CPU(s):                184
  On-line CPU(s) list: 0-183
Model name:            POWER9 (raw), altivec supported
  Model:               2.2 (pvr 004e 1202)
  Thread(s) per core:  4
  Core(s) per socket:  23
  Socket(s):           2
  CPU max MHz:         3000.0000
  CPU min MHz:         2300.0000
Caches (sum of all):
  L1d:                 1.4 MiB (46 instances)
  L1i:                 1.4 MiB (46 instances)
  L2:                  12 MiB (24 instances)
  L3:                  240 MiB (24 instances)
NUMA:
  NUMA node(s):        2
  NUMA node0 CPU(s):   0-91
  NUMA node1 CPU(s):   92-183
Vulnerabilities:
  Itlb multihit:       Not affected
  L1tf:                Mitigation; RFI Flush, L1D private per thread
  Mds:                 Not affected
  Meltdown:            Mitigation; RFI Flush, L1D private per thread
  Mmio stale data:     Not affected
  Retbleed:            Not affected
  Spec store bypass:   Mitigation; Kernel entry/exit barrier (eieio)
  Spectre v1:          Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
  Spectre v2:          Mitigation; Indirect branch serialisation (kernel only), Indirect branch cache disabled, Software link stack flush
  Srbds:               Not affected
  Tsx async abort:     Not affected

HOST $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-91
node 0 size: 261659 MB
node 0 free: 259152 MB
node 1 cpus: 92-183
node 1 size: 261713 MB
node 1 free: 261076 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software
========
HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04 LTS"

HOST $ uname -a
Linux ppc 6.3.0 #1 SMP Sun Jun  4 18:26:37 UTC 2023 ppc64le ppc64le ppc64le GNU/Linux

HOST $ cat /proc/swaps
Filename          Type         Size         Used    Priority
/dev/nvme0n1p2    partition	    466838272    0       -2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x0009

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

HOST $ qemu-system-ppc64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.6)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

GUEST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

GUEST $ cat /etc/memcached.conf
...
-t 92
-m 262144
-B binary
-s /var/run/memcached/memcached.sock
-a 0766

GUEST $ memtier_benchmark -v
memtier_benchmark 1.4.0
Copyright (C) 2011-2022 Redis Ltd.
This is free software.  You may redistribute copies of it under the terms of
the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

Procedure
=========
HOST $ sudo numactl -N 0 -m 0 qemu-system-ppc64 \
    -M pseries,accel=kvm,kvm-type=HV -cpu host -smp 92 -m 270g
    -nographic -nic user \
    -drive if=virtio,format=raw,file=/dev/nvme0n1p1

GUEST $ memtier_benchmark -S /var/run/memcached/memcached.sock \
    -P memcache_binary -c 1 -t 92 --pipeline 1 --ratio 1:0 \
    --key-minimum=1 --key-maximum=120000000 --key-pattern=P:P \
    -n allkeys -d 2000

GUEST $ memtier_benchmark -S /var/run/memcached/memcached.sock \
    -P memcache_binary -c 1 -t 92 --pipeline 1 --ratio 0:1 \
    --key-minimum=1 --key-maximum=120000000 --key-pattern=R:R \
    -n allkeys --randomize --distinct-client-seed

Results
=======
                Before [1]    After        Change
-------------------------------------------------
Ops/sec         721586.10     800210.12    +10%
Avg. Latency    0.12546       0.11260      -10%
p50 Latency     0.08700       0.08700      N/C
p99 Latency     0.28700       0.24700      -13%

Notes
=====
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
    reference" was included so that the comparison is apples to
    apples.
    https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/

^ permalink raw reply	[flat|nested] 40+ messages in thread

* kvm/x86: multichase benchmark
  2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
                   ` (11 preceding siblings ...)
  2023-06-09  0:59 ` kvm/powerpc: memcached benchmark Yu Zhao
@ 2023-06-09  0:59 ` Yu Zhao
  2023-06-18 19:19   ` Yu Zhao
  2023-06-09  9:07 ` [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Paolo Bonzini
  13 siblings, 1 reply; 40+ messages in thread
From: Yu Zhao @ 2023-06-09  0:59 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

TLDR
====
Multichase in 64 microVMs achieved 6% more total samples (in ~4 hours) after this patchset [1].

Hardware
========
HOST $ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  128
  On-line CPU(s) list:   0-127
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen Threadripper PRO 3995WX 64-Cores
    CPU family:          23
    Model:               49
    Thread(s) per core:  2
    Core(s) per socket:  64
    Socket(s):           1
    Stepping:            0
    Frequency boost:     disabled
    CPU max MHz:         4308.3979
    CPU min MHz:         2200.0000
    BogoMIPS:            5390.20
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2
                         ...
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   2 MiB (64 instances)
  L1i:                   2 MiB (64 instances)
  L2:                    32 MiB (64 instances)
  L3:                    256 MiB (16 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-127
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT enabled with STIBP protection
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

HOST $ numactl -H
available: 1 nodes (0)
node 0 cpus: 0-127
node 0 size: 257542 MB
node 0 free: 224855 MB
node distances:
node   0
  0:  10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software
========
HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

HOST $ uname -a
Linux x86 6.4.0-rc5+ #1 SMP PREEMPT_DYNAMIC Wed Jun  7 22:17:47 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

HOST $ cat /proc/swaps
Filename          Type         Size         Used    Priority
/dev/nvme0n1p2    partition    466838356    0       -2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x000f

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

Procedure
=========
HOST $ git clone https://github.com/google/multichase

HOST $ <Build multichase>
HOST $ <Unpack /boot/initrd.img into ./initrd/>

HOST $ cp multichase/multichase ./initrd/bin/
HOST $ sed -i \
    "/^maybe_break top$/i multichase -t 2 -m 4g -n 28800; poweroff" \
    ./initrd/init

HOST $ <Pack ./initrd/ into ./initrd.img>

HOST $ cat run_microvms.sh
memcgs=64

run() {
    path=/sys/fs/cgroup/memcg$1

    mkdir $path
    echo $BASHPID >$path/cgroup.procs

    qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \
        -nographic -kernel /boot/vmlinuz -initrd ./initrd.img \
        -append "console=ttyS0 loglevel=0"
}

for ((memcg = 0; memcg < $memcgs; memcg++)); do
    run $memcg &
done

wait

Results
=======
                 Before [1]    After    Change
----------------------------------------------
Total samples    6824          7237     +6%

Notes
=====
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
    reference" was included so that the comparison is apples to
    apples.
    https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()
  2023-06-06  8:34   ` Tzung-Bi Shih
@ 2023-06-09  1:00     ` Yu Zhao
  0 siblings, 0 replies; 40+ messages in thread
From: Yu Zhao @ 2023-06-09  1:00 UTC (permalink / raw)
  To: Tzung-Bi Shih
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Nicholas Piggin, Oliver Upton,
	Paul Mackerras, Peter Xu, Sean Christopherson, Steven Rostedt,
	Suzuki K Poulose, Thomas Gleixner, Thomas Huth, Will Deacon,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linuxppc-dev, linux-trace-kernel, x86,
	linux-mm

On Tue, Jun 6, 2023 at 2:34 AM Tzung-Bi Shih <tzungbi@kernel.org> wrote:
>
> On Fri, May 26, 2023 at 05:44:26PM -0600, Yu Zhao wrote:
> > +/*
> > + * Architectures that implement kvm_arch_test_clear_young() should override
> > + * kvm_arch_has_test_clear_young().
> > + *
> > + * kvm_arch_has_test_clear_young() is allowed to return false positive, i.e., it
> > + * can return true if kvm_arch_test_clear_young() is supported but disabled due
> > + * to some runtime constraint. In this case, kvm_arch_test_clear_young() should
>
> Is it a typo here?  s/kvm_arch_test_clear_young/kvm_arch_has_test_clear_young/.

Not a typo.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()
       [not found]   ` <ZHedMX470b7EMwbe@ziepe.ca>
@ 2023-06-09  9:04     ` Paolo Bonzini
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Bonzini @ 2023-06-09  9:04 UTC (permalink / raw)
  To: Jason Gunthorpe, Yu Zhao
  Cc: Andrew Morton, Alistair Popple, Anup Patel, Ben Gardon,
	Borislav Petkov, Catalin Marinas, Chao Peng, Christophe Leroy,
	Dave Hansen, Fabiano Rosas, Gaosheng Cui, Gavin Shan,
	H. Peter Anvin, Ingo Molnar, James Morse, Jason A. Donenfeld,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On 5/31/23 21:17, Jason Gunthorpe wrote:
>> +	int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
>> +				unsigned long start, unsigned long end,
>> +				bool clear, unsigned long *bitmap);
>> +
> Why leave clear_young behind? Just make a NULL bitmap mean
> clear_young?

It goes away in patch 2, together with:

@@ -437,7 +412,7 @@ static inline int mmu_notifier_clear_young(struct mm_struct *mm,
  					   unsigned long end)
  {
  	if (mm_has_notifiers(mm))
-		return __mmu_notifier_clear_young(mm, start, end);
+		return __mmu_notifier_test_clear_young(mm, start, end, true, NULL);
  	return 0;
  }
  
@@ -445,7 +420,7 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
  					  unsigned long address)
  {
  	if (mm_has_notifiers(mm))
-		return __mmu_notifier_test_young(mm, address);
+		return __mmu_notifier_test_clear_young(mm, address, address + 1, false, NULL);
  	return 0;
  }
  

Paolo


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 09/10] kvm/x86: add kvm_arch_test_clear_young()
  2023-05-26 23:44 ` [PATCH mm-unstable v2 09/10] kvm/x86: add kvm_arch_test_clear_young() Yu Zhao
@ 2023-06-09  9:06   ` Paolo Bonzini
  2023-06-15 18:26   ` Sean Christopherson
  1 sibling, 0 replies; 40+ messages in thread
From: Paolo Bonzini @ 2023-06-09  9:06 UTC (permalink / raw)
  To: Yu Zhao, Andrew Morton
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On 5/27/23 01:44, Yu Zhao wrote:
> +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> +static inline bool kvm_arch_has_test_clear_young(void)
> +{
> +	return IS_ENABLED(CONFIG_X86_64) &&
> +	       (!IS_REACHABLE(CONFIG_KVM) || (tdp_mmu_enabled && shadow_accessed_mask));
> +}

I don't think you need IS_REACHABLE(CONFIG_KVM) here; it would be a bug
if this is called from outside KVM code.

Maybe make it a BUILD_BUG_ON?
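
For illustration, a minimal sketch of one reading of that suggestion (the
BUILD_BUG_ON placement is an assumption about what was meant, not code from
the series):

#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
static inline bool kvm_arch_has_test_clear_young(void)
{
	/* Using this helper from outside KVM code would be a bug. */
	BUILD_BUG_ON(!IS_REACHABLE(CONFIG_KVM));

	return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled && shadow_accessed_mask;
}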

Paolo


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit
  2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
                   ` (12 preceding siblings ...)
  2023-06-09  0:59 ` kvm/x86: multichase benchmark Yu Zhao
@ 2023-06-09  9:07 ` Paolo Bonzini
  2023-06-20  2:19   ` Yu Zhao
  13 siblings, 1 reply; 40+ messages in thread
From: Paolo Bonzini @ 2023-06-09  9:07 UTC (permalink / raw)
  To: Yu Zhao, Andrew Morton
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On 5/27/23 01:44, Yu Zhao wrote:
> TLDR
> ====
> This patchset adds a fast path to clear the accessed bit without
> taking kvm->mmu_lock. It can significantly improve the performance of
> guests when the host is under heavy memory pressure.
> 
> ChromeOS has been using a similar approach [1] since mid 2021 and it
> was proven successful on tens of millions devices.
> 
> This v2 addressed previous requests [2] on refactoring code, removing
> inaccurate/redundant texts, etc.
> 
> [1]https://crrev.com/c/2987928
> [2]https://lore.kernel.org/r/20230217041230.2417228-1-yuzhao@google.com/

 From the KVM point of view the patches look good (though I wouldn't 
mind if Nicholas took a look at the ppc part).  Jason's comment on the 
MMU notifier side are promising as well.  Can you send v3 with Oliver's 
comments addressed?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: kvm/arm64: Spark benchmark
  2023-06-09  0:59 ` kvm/arm64: Spark benchmark Yu Zhao
@ 2023-06-09 13:04   ` Marc Zyngier
  2023-06-18 20:11     ` Yu Zhao
  0 siblings, 1 reply; 40+ messages in thread
From: Marc Zyngier @ 2023-06-09 13:04 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Masami Hiramatsu, Michael Ellerman, Michael Larabel,
	Mike Rapoport, Nicholas Piggin, Oliver Upton, Paul Mackerras,
	Peter Xu, Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Fri, 09 Jun 2023 01:59:35 +0100,
Yu Zhao <yuzhao@google.com> wrote:
> 
> TLDR
> ====
> Apache Spark spent 12% less time sorting four billion random integers twenty times (in ~4 hours) after this patchset [1].

Why are the 3 architectures you have considered being evaluated with 3
different benchmarks? I am not suggesting you have cherry-picked
the best results, but I'd really like to see a variety of benchmarks
that exercise this stuff differently.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 08/10] kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask
  2023-05-26 23:44 ` [PATCH mm-unstable v2 08/10] kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask Yu Zhao
@ 2023-06-15 16:59   ` Sean Christopherson
  0 siblings, 0 replies; 40+ messages in thread
From: Sean Christopherson @ 2023-06-15 16:59 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Nicholas Piggin, Oliver Upton,
	Paul Mackerras, Peter Xu, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Fri, May 26, 2023, Yu Zhao wrote:
> tdp_mmu_enabled and shadow_accessed_mask are needed to implement
> kvm_arch_has_test_clear_young().
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 6 ++++++
>  arch/x86/kvm/mmu.h              | 6 ------
>  arch/x86/kvm/mmu/spte.h         | 1 -
>  3 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fb9d1f2d6136..753c67072c47 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1772,6 +1772,7 @@ struct kvm_arch_async_pf {
>  
>  extern u32 __read_mostly kvm_nr_uret_msrs;
>  extern u64 __read_mostly host_efer;
> +extern u64 __read_mostly shadow_accessed_mask;
>  extern bool __read_mostly allow_smaller_maxphyaddr;
>  extern bool __read_mostly enable_apicv;
>  extern struct kvm_x86_ops kvm_x86_ops;
> @@ -1855,6 +1856,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
>  			     bool mask);
>  
>  extern bool tdp_enabled;
> +#ifdef CONFIG_X86_64
> +extern bool tdp_mmu_enabled;
> +#else
> +#define tdp_mmu_enabled false
> +#endif

I would much prefer that these be kept in kvm/mmu.h.  And looking at all the arch
code, there's no reason to make kvm_arch_has_test_clear_young() a runtime callback,
all of the logic is constant relative to when KVM is loaded.

So rather than have generic KVM pull from arch code, what if we have arch code
push info to generic KVM?  We could even avoid #ifdefs if arch code passed in its
handler.  That might result in an extra indirect branch though, so it might be
better to just use a flag?  E.g. the x86 conversion would be something like this.

---
 arch/x86/kvm/mmu/mmu.c     |  5 +++++
 arch/x86/kvm/mmu/tdp_mmu.c |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.h |  1 +
 include/linux/kvm_host.h   | 24 ++++--------------------
 virt/kvm/kvm_main.c        | 14 ++++++++++----
 5 files changed, 21 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8ebe542c565..84a4a83540f0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5809,6 +5809,11 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 		max_huge_page_level = PG_LEVEL_1G;
 	else
 		max_huge_page_level = PG_LEVEL_2M;
+
+	if (tdp_mmu_enabled && kvm_ad_enabled())
+		kvm_init_test_clear_young(kvm_tdp_mmu_test_clear_young);
+	else
+		kvm_init_test_clear_young(NULL);
 }
 EXPORT_SYMBOL_GPL(kvm_configure_mmu);
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f463d54228f8..e878c88f0e02 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1308,7 +1308,7 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
 }
 
-bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
+bool kvm_tdp_mmu_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	struct kvm_mmu_page *root;
 	int offset = ffs(shadow_accessed_mask) - 1;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 0a63b1afabd3..aaa0b75b3896 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -34,6 +34,7 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_tdp_mmu_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range);
 
 bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 			     const struct kvm_memory_slot *slot, int min_level);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1714f82a0c47..7a0922cbc36f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -264,31 +264,15 @@ struct kvm_gfn_range {
 	pte_t pte;
 	bool may_block;
 };
+
+typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn);
-bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range);
-#endif
-
-/*
- * Architectures that implement kvm_arch_test_clear_young() should override
- * kvm_arch_has_test_clear_young().
- *
- * kvm_arch_has_test_clear_young() is allowed to return false positive, i.e., it
- * can return true if kvm_arch_test_clear_young() is supported but disabled due
- * to some runtime constraint. In this case, kvm_arch_test_clear_young() should
- * return true; otherwise, it should return false.
- *
- * For each young KVM PTE, kvm_arch_test_clear_young() should call
- * kvm_should_clear_young() to decide whether to clear the accessed bit.
- */
-#ifndef kvm_arch_has_test_clear_young
-static inline bool kvm_arch_has_test_clear_young(void)
-{
-	return false;
-}
+void kvm_init_test_clear_young(hva_handler_t arch_test_clear_young);
 #endif
 
 enum {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ef2790469fda..ac83cfb30771 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -530,8 +530,6 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
-
 typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
 			     unsigned long end);
 
@@ -859,6 +857,14 @@ bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn)
 	return args->clear;
 }
 
+static hva_handler_t kvm_test_clear_young;
+
+void kvm_init_test_clear_young(hva_handler_t arch_test_clear_young)
+{
+	WARN_ON_ONCE(!list_empty(&vm_list));
+	kvm_test_clear_young = arch_test_clear_young;
+}
+
 static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_struct *mm,
 					     unsigned long start, unsigned long end,
 					     bool clear, unsigned long *bitmap)
@@ -873,7 +879,7 @@ static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_
 
 	trace_kvm_age_hva(start, end);
 
-	if (kvm_arch_has_test_clear_young()) {
+	if (kvm_test_clear_young) {
 		struct test_clear_young_args args = {
 			.bitmap	= bitmap,
 			.end	= end,
@@ -882,7 +888,7 @@ static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_
 
 		range.args = &args;
 		range.lockless = true;
-		range.handler = kvm_arch_test_clear_young;
+		range.handler = kvm_test_clear_young;
 
 		if (!__kvm_handle_hva_range(kvm, &range))
 			return args.young ? MMU_NOTIFIER_RANGE_LOCKLESS : 0;

base-commit: 39ca80f27cc0d2a37b4e3d07bbf763d4954934d7
-- 


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()
  2023-05-26 23:44 ` [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young() Yu Zhao
  2023-06-06  8:34   ` Tzung-Bi Shih
       [not found]   ` <ZHedMX470b7EMwbe@ziepe.ca>
@ 2023-06-15 17:42   ` Sean Christopherson
  2023-06-20  7:30   ` Nicholas Piggin
  3 siblings, 0 replies; 40+ messages in thread
From: Sean Christopherson @ 2023-06-15 17:42 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Nicholas Piggin, Oliver Upton,
	Paul Mackerras, Peter Xu, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Fri, May 26, 2023, Yu Zhao wrote:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0e571e973bc2..374262545f96 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
>  struct kvm_gfn_range {
>  	struct kvm_memory_slot *slot;
> +	void *args;

There's no reason to make this "void *", just declare "struct test_clear_young_args"
in the header.  Arch code won't be able to use it regardless.  And I vote for
something more like "test_clear_young_metadata", as there's far more information
in there than just function arguments.

And to stave off the argument that "void *" would allow reuse, take this opportunity
to unionize the test_clear_young field with the change_pte field, e.g.

	/* comment about these fields being callback specific. */
	union {
		struct test_clear_young_metadata *metadata;
		pte_t pte;
		unsigned long callback_arg; /* needs a better name */
	};

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 51e4882d0873..31ee58754b19 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -541,6 +541,7 @@ typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
>  typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>  
>  struct kvm_hva_range {
> +	void *args;

Same feedback as kvm_gfn_range.

>  	unsigned long start;
>  	unsigned long end;
>  	pte_t pte;
> @@ -549,6 +550,7 @@ struct kvm_hva_range {
>  	on_unlock_fn_t on_unlock;
>  	bool flush_on_ret;
>  	bool may_block;
> +	bool lockless;
>  };
>  
>  /*
> @@ -602,6 +604,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  			hva_end = min(range->end, slot->userspace_addr +
>  						  (slot->npages << PAGE_SHIFT));
>  
> +			gfn_range.args = range->args;

And this goes away because the generic callback_arg is what gets propagated.

> +
>  			/*
>  			 * To optimize for the likely case where the address
>  			 * range is covered by zero or one memslots, don't
> @@ -619,7 +623,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
>  			gfn_range.slot = slot;
>  
> -			if (!locked) {
> +			if (!range->lockless && !locked) {
>  				locked = true;
>  				KVM_MMU_LOCK(kvm);
>  				if (!IS_KVM_NULL_FN(range->on_lock))
> @@ -628,6 +632,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  					break;
>  			}
>  			ret |= range->handler(kvm, &gfn_range);
> +
> +			if (range->lockless && ret)

I don't like overloading "lockless" to also mean "stop on ret".  Just add another
flag; there's literally no cost for most callbacks as everything is constant at
compile time and gets optimized away.

> +		range.args = &args;
> +		range.lockless = true;

The lockless and stop_on_ret behavior needs comments.

> +		range.handler = kvm_arch_test_clear_young;
> +
> +		if (!__kvm_handle_hva_range(kvm, &range))
> +			return args.young ? MMU_NOTIFIER_RANGE_LOCKLESS : 0;
> +	}
> +
> +	if (bitmap)
> +		return 0;
> +
> +	range.args = NULL;
> +	range.lockless = false;

No need to manually clear these, they'll be zeroed by the initialization code.

E.g. all in all, something like so

---
 include/linux/kvm_host.h |  9 +++++++--
 virt/kvm/kvm_main.c      | 29 +++++++++++++++++------------
 2 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7a0922cbc36f..e04605061f5e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,12 +256,17 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+struct test_clear_young_metadata;
+
 struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
-	void *args;
 	gfn_t start;
 	gfn_t end;
-	pte_t pte;
+	union {
+		struct test_clear_young_metadata *metadata;
+		pte_t pte;
+		unsigned long callback_arg;
+	};
 	bool may_block;
 };
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ac83cfb30771..8cf4fee9cd8b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -536,16 +536,20 @@ typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
 struct kvm_hva_range {
-	void *args;
 	unsigned long start;
 	unsigned long end;
-	pte_t pte;
 	hva_handler_t handler;
+	union {
+		struct test_clear_young_metadata *metadata;
+		pte_t pte;
+		unsigned long callback_arg;
+	};
 	on_lock_fn_t on_lock;
 	on_unlock_fn_t on_unlock;
 	bool flush_on_ret;
 	bool may_block;
 	bool lockless;
+	bool stop_on_ret;
 };
 
 /*
@@ -576,6 +580,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 	struct kvm_memslots *slots;
 	int i, idx;
 
+	BUILD_BUG_ON(sizeof(gfn_range.callback_arg) != sizeof(gfn_range.pte) ||
+		     sizeof(gfn_range.callback_arg) != sizeof(gfn_range.metadata));
+
 	if (WARN_ON_ONCE(range->end <= range->start))
 		return 0;
 
@@ -599,16 +606,14 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			hva_end = min(range->end, slot->userspace_addr +
 						  (slot->npages << PAGE_SHIFT));
 
-			gfn_range.args = range->args;
-
 			/*
 			 * To optimize for the likely case where the address
 			 * range is covered by zero or one memslots, don't
 			 * bother making these conditional (to avoid writes on
 			 * the second or later invocation of the handler).
 			 */
-			gfn_range.pte = range->pte;
 			gfn_range.may_block = range->may_block;
+			gfn_range.callback_arg = range->callback_arg;
 
 			/*
 			 * {gfn(page) | page intersects with [hva_start, hva_end)} =
@@ -628,7 +633,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			}
 			ret |= range->handler(kvm, &gfn_range);
 
-			if (range->lockless && ret)
+			/* comment goes here. */
+			if (range->stop_on_ret && ret)
 				break;
 		}
 	}
@@ -830,7 +836,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 	return kvm_handle_hva_range(mn, start, end, __pte(0), kvm_age_gfn);
 }
 
-struct test_clear_young_args {
+struct test_clear_young_metadata {
 	unsigned long *bitmap;
 	unsigned long end;
 	bool clear;
@@ -839,7 +845,7 @@ struct test_clear_young_args {
 
 bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn)
 {
-	struct test_clear_young_args *args = range->args;
+	struct test_clear_young_metadata *args = range->metadata;
 
 	VM_WARN_ON_ONCE(gfn < range->start || gfn >= range->end);
 
@@ -880,14 +886,15 @@ static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_
 	trace_kvm_age_hva(start, end);
 
 	if (kvm_test_clear_young) {
-		struct test_clear_young_args args = {
+		struct test_clear_young_metadata args = {
 			.bitmap	= bitmap,
 			.end	= end,
 			.clear	= clear,
 		};
 
-		range.args = &args;
 		range.lockless = true;
+		range.stop_on_ret = true;
+		range.metadata = &args;
 		range.handler = kvm_test_clear_young;
 
 		if (!__kvm_handle_hva_range(kvm, &range))
@@ -897,8 +904,6 @@ static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_
 	if (bitmap)
 		return 0;
 
-	range.args = NULL;
-	range.lockless = false;
 	range.handler = clear ? kvm_age_gfn : kvm_test_age_gfn;
 
 	return __kvm_handle_hva_range(kvm, &range) ? MMU_NOTIFIER_RANGE_YOUNG : 0;

base-commit: 7a5d8be2c18415b73b9380741095f439d6983a40
-- 


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 09/10] kvm/x86: add kvm_arch_test_clear_young()
  2023-05-26 23:44 ` [PATCH mm-unstable v2 09/10] kvm/x86: add kvm_arch_test_clear_young() Yu Zhao
  2023-06-09  9:06   ` Paolo Bonzini
@ 2023-06-15 18:26   ` Sean Christopherson
  1 sibling, 0 replies; 40+ messages in thread
From: Sean Christopherson @ 2023-06-15 18:26 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Nicholas Piggin, Oliver Upton,
	Paul Mackerras, Peter Xu, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Fri, May 26, 2023, Yu Zhao wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 08340219c35a..6875a819e007 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1232,6 +1232,40 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  	return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
>  }
>  
> +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> +	struct kvm_mmu_page *root;
> +	int offset = ffs(shadow_accessed_mask) - 1;
> +
> +	if (kvm_shadow_root_allocated(kvm))

This needs a comment.

> +		return true;
> +
> +	rcu_read_lock();
> +
> +	list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {

As requested in v1[1], please add a macro for a lockless walk.

[1] https://lkml.kernel.org/r/Y%2Fed0XYAPx%2B7pukA%40google.com
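
A hypothetical shape of such a macro (the name is illustrative, not something
already in the tree):

/* Walk TDP MMU roots under RCU, i.e. without holding mmu_lock. */
#define for_each_tdp_mmu_root_rcu(_kvm, _root)				\
	list_for_each_entry_rcu(_root, &(_kvm)->arch.tdp_mmu_roots, link)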

> +		struct tdp_iter iter;
> +
> +		if (kvm_mmu_page_as_id(root) != range->slot->as_id)
> +			continue;
> +
> +		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) {
> +			u64 *sptep = rcu_dereference(iter.sptep);
> +
> +			VM_WARN_ON_ONCE(!page_count(virt_to_page(sptep)));

Hrm, I don't like adding this in KVM.  The primary MMU might guarantee that this
callback is invoked if and only if the SPTE is backed by struct page memory, but
there's no reason to assume that's true in KVM.  If we want the sanity check, then
this needs to use kvm_pfn_to_refcounted_page().

And it should use KVM's MMU_WARN_ON(), which is a mess and effectively dead code,
but I'm working on changing that[2], i.e. by the time this gets to Linus' tree,
the sanity check should have a much cleaner implementation.

[2] https://lore.kernel.org/all/20230511235917.639770-8-seanjc@google.com

> +
> +			if (!(iter.old_spte & shadow_accessed_mask))
> +				continue;
> +
> +			if (kvm_should_clear_young(range, iter.gfn))
> +				clear_bit(offset, (unsigned long *)sptep);

If/when you rebase on https://github.com/kvm-x86/linux/tree/next, can you pull
out the atomic bits of tdp_mmu_clear_spte_bits() and use that new helper? E.g.

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index fae559559a80..914c34518829 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -58,15 +58,18 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
        return old_spte;
 }
 
+static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
+{
+       atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
+
+       return (u64)atomic64_fetch_and(~mask, sptep_atomic);
+}
+
 static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
                                          u64 mask, int level)
 {
-       atomic64_t *sptep_atomic;
-
-       if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) {
-               sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
-               return (u64)atomic64_fetch_and(~mask, sptep_atomic);
-       }
+       if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level))
+               return tdp_mmu_clear_spte_bits_atomic(sptep, mask);
 
        __kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
        return old_spte;


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: kvm/x86: multichase benchmark
  2023-06-09  0:59 ` kvm/x86: multichase benchmark Yu Zhao
@ 2023-06-18 19:19   ` Yu Zhao
  0 siblings, 0 replies; 40+ messages in thread
From: Yu Zhao @ 2023-06-18 19:19 UTC (permalink / raw)
  To: Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Thu, Jun 8, 2023 at 6:59 PM Yu Zhao <yuzhao@google.com> wrote:
>
> TLDR
> ====
> Multichase in 64 microVMs achieved 6% more total samples (in ~4 hours) after this patchset [1].
>
> Hardware
> ========
> HOST $ lscpu
> Architecture:            x86_64
>   CPU op-mode(s):        32-bit, 64-bit
>   Address sizes:         43 bits physical, 48 bits virtual
>   Byte Order:            Little Endian
> CPU(s):                  128
>   On-line CPU(s) list:   0-127
> Vendor ID:               AuthenticAMD
>   Model name:            AMD Ryzen Threadripper PRO 3995WX 64-Cores
>     CPU family:          23
>     Model:               49
>     Thread(s) per core:  2
>     Core(s) per socket:  64
>     Socket(s):           1
>     Stepping:            0
>     Frequency boost:     disabled
>     CPU max MHz:         4308.3979
>     CPU min MHz:         2200.0000
>     BogoMIPS:            5390.20
>     Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2
>                          ...
> Virtualization features:
>   Virtualization:        AMD-V
> Caches (sum of all):
>   L1d:                   2 MiB (64 instances)
>   L1i:                   2 MiB (64 instances)
>   L2:                    32 MiB (64 instances)
>   L3:                    256 MiB (16 instances)
> NUMA:
>   NUMA node(s):          1
>   NUMA node0 CPU(s):     0-127
> Vulnerabilities:
>   Itlb multihit:         Not affected
>   L1tf:                  Not affected
>   Mds:                   Not affected
>   Meltdown:              Not affected
>   Mmio stale data:       Not affected
>   Retbleed:              Mitigation; untrained return thunk; SMT enabled with STIBP protection
>   Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
>   Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
>   Spectre v2:            Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
>   Srbds:                 Not affected
>   Tsx async abort:       Not affected
>
> HOST $ numactl -H
> available: 1 nodes (0)
> node 0 cpus: 0-127
> node 0 size: 257542 MB
> node 0 free: 224855 MB
> node distances:
> node   0
>   0:  10
>
> HOST $ cat /sys/class/nvme/nvme0/model
> INTEL SSDPF21Q800GB
>
> HOST $ cat /sys/class/nvme/nvme0/numa_node
> 0
>
> Software
> ========
> HOST $ cat /etc/lsb-release
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=22.04
> DISTRIB_CODENAME=jammy
> DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
>
> HOST $ uname -a
> Linux x86 6.4.0-rc5+ #1 SMP PREEMPT_DYNAMIC Wed Jun  7 22:17:47 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
>
> HOST $ cat /proc/swaps
> Filename          Type         Size         Used    Priority
> /dev/nvme0n1p2    partition    466838356    0       -2
>
> HOST $ cat /sys/kernel/mm/lru_gen/enabled
> 0x000f
>
> HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
>
> HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
> always defer defer+madvise madvise [never]
>
> Procedure
> =========
> HOST $ git clone https://github.com/google/multichase
>
> HOST $ <Build multichase>
> HOST $ <Unpack /boot/initrd.img into ./initrd/>
>
> HOST $ cp multichase/multichase ./initrd/bin/
> HOST $ sed -i \
>     "/^maybe_break top$/i multichase -t 2 -m 4g -n 28800; poweroff" \

I was reminded that I missed one parameter above, i.e.,

"/^maybe_break top$/i multichase -N -t 2 -m 4g -n 28800; poweroff" \
                                 ^^

>     ./initrd/init
>
> HOST $ <Pack ./initrd/ into ./initrd.img>
>
> HOST $ cat run_microvms.sh
> memcgs=64
>
> run() {
>     path=/sys/fs/cgroup/memcg$1
>
>     mkdir $path
>     echo $BASHPID >$path/cgroup.procs

And one line here:

echo 4000m >$path/memory.min # or the largest size that doesn't cause OOM kills

>     qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \
>         -nographic -kernel /boot/vmlinuz -initrd ./initrd.img \
>         -append "console=ttyS0 loglevel=0"
> }
>
> for ((memcg = 0; memcg < $memcgs; memcg++)); do
>     run $memcg &
> done
>
> wait
>
> Results
> =======
>                  Before [1]    After    Change
> ----------------------------------------------
> Total samples    6824          7237     +6%
>
> Notes
> =====
> [1] "mm: rmap: Don't flush TLB after checking PTE young for page
>     reference" was included so that the comparison is apples to
>     apples.
>     https://lore.kernel.org/r/20220706112041.3831-1-21cnbao@gmail.com/

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: kvm/arm64: Spark benchmark
  2023-06-09 13:04   ` Marc Zyngier
@ 2023-06-18 20:11     ` Yu Zhao
  0 siblings, 0 replies; 40+ messages in thread
From: Yu Zhao @ 2023-06-18 20:11 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Masami Hiramatsu, Michael Ellerman, Michael Larabel,
	Mike Rapoport, Nicholas Piggin, Oliver Upton, Paul Mackerras,
	Peter Xu, Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Fri, Jun 9, 2023 at 7:04 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 09 Jun 2023 01:59:35 +0100,
> Yu Zhao <yuzhao@google.com> wrote:
> >
> > TLDR
> > ====
> > Apache Spark spent 12% less time sorting four billion random integers twenty times (in ~4 hours) after this patchset [1].
>
> Why are the 3 architectures you have considered being evaluated with 3
> different benchmarks?

I was hoping people with a special interest in different archs might
try to reproduce the benchmarks that I didn't report (but did cover)
and see what happens.

> I am not suggesting you have cherry-picked
> the best results

I'm generally very conservative when reporting *synthetic* results.
For example, the same memcached benchmark used on powerpc yielded >50%
improvement on aarch64, because the default Ubuntu Kconfig uses 64KB
base page size for powerpc but 4KB for aarch64. (Before the series,
the reclaim (swap) path takes kvm->mmu_lock for *write* on O(nr of all
pages to consider); after the series, it becomes O(actual nr of pages
to swap), which is <10% given how the benchmark was set up.)

          Ops/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency
------------------------------------------------------------------------
Before  639511.40       0.09940      0.04700      0.27100       22.52700
After   974184.60       0.06471      0.04700      0.15900        3.75900

> but I'd really like to see a variety of benchmarks
> that exercise this stuff differently.

I'd be happy to try other synthetic workloads that people think
are relatively representative. Also, I've backported the series and
started an A/B experiment involving ~1 million devices (real-world
workloads). We should have the preliminary results by the time I post
the next version.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit
  2023-06-09  9:07 ` [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Paolo Bonzini
@ 2023-06-20  2:19   ` Yu Zhao
  0 siblings, 0 replies; 40+ messages in thread
From: Yu Zhao @ 2023-06-20  2:19 UTC (permalink / raw)
  To: Paolo Bonzini, Andrew Morton
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport,
	Nicholas Piggin, Oliver Upton, Paul Mackerras, Peter Xu,
	Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Fri, Jun 9, 2023 at 3:08 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 5/27/23 01:44, Yu Zhao wrote:
> > TLDR
> > ====
> > This patchset adds a fast path to clear the accessed bit without
> > taking kvm->mmu_lock. It can significantly improve the performance of
> > guests when the host is under heavy memory pressure.
> >
> > ChromeOS has been using a similar approach [1] since mid 2021 and it
> > was proven successful on tens of millions devices.
> >
> > This v2 addressed previous requests [2] on refactoring code, removing
> > inaccurate/redundant texts, etc.
> >
> > [1]https://crrev.com/c/2987928
> > [2]https://lore.kernel.org/r/20230217041230.2417228-1-yuzhao@google.com/
>
>  From the KVM point of view the patches look good (though I wouldn't
> mind if Nicholas took a look at the ppc part).  Jason's comment on the
> MMU notifier side are promising as well.  Can you send v3 with Oliver's
> comments addressed?

Thanks. I'll address all the comments in v3 and post it asap.

Meanwhile, some updates on the recent progress from my side:
1. I've asked some downstream kernels to pick up v2 for testing; the
Archlinux Zen kernel did. I don't really expect its enthusiastic
testers to find this series relevant to their use cases. But who
knows.
2. I've also asked openbenchmarking.org to run their popular highmem
benchmark suites with v2. Hopefully they'll have some independent
results soon.
3. I've backported v2 to v5.15 and v6.1 and started an A/B experiment
involving ~1 million devices, as I mentioned in another email in this
thread. I should have some results to share when posting v3.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe
  2023-05-26 23:44 ` [PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe Yu Zhao
@ 2023-06-20  6:32   ` Nicholas Piggin
  2023-06-20  8:00     ` Yu Zhao
  0 siblings, 1 reply; 40+ messages in thread
From: Nicholas Piggin @ 2023-06-20  6:32 UTC (permalink / raw)
  To: Yu Zhao, Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport, Oliver Upton,
	Paul Mackerras, Peter Xu, Sean Christopherson, Steven Rostedt,
	Suzuki K Poulose, Thomas Gleixner, Thomas Huth, Will Deacon,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linuxppc-dev, linux-trace-kernel, x86,
	linux-mm

On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> KVM page tables are currently not RCU safe against remapping, i.e.,
> kvmppc_unmap_free_pmd_entry_table() et al. The previous

Minor nit but the "page table" is not RCU-safe against something. It
is RCU-freed, and therefore some algorithm that accesses it can have
the existence guarantee provided by RCU (usually there still needs
to be more to it).

> mmu_notifier_ops members rely on kvm->mmu_lock to synchronize with
> that operation.
>
> However, the new mmu_notifier_ops member test_clear_young() provides
> a fast path that does not take kvm->mmu_lock. To implement
> kvm_arch_test_clear_young() for that path, orphan page tables need to
> be freed by RCU.

Short version: clear the referenced bit using RCU instead of MMU lock
to protect against page table freeing, and there is no problem with
clearing the bit in a table that has been freed.

Seems reasonable.
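
For readers following the thread, the lockless path being summarized is roughly
the following (an illustrative sketch, not code from the patch; pte_xchg() and
_PAGE_ACCESSED are the primitives the series relies on):

static bool test_clear_young_one(pte_t *ptep)
{
	pte_t old, new;
	bool young = false;

	rcu_read_lock();	/* the page table cannot be freed under us */
	do {
		old = READ_ONCE(*ptep);
		if (!(pte_val(old) & _PAGE_ACCESSED))
			break;
		young = true;
		new = __pte(pte_val(old) & ~_PAGE_ACCESSED);
	} while (!pte_xchg(ptep, old, new));	/* CAS vs hardware and other walkers */
	rcu_read_unlock();

	return young;
}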

>
> Unmapping, specifically kvm_unmap_radix(), does not free page tables,
> hence not a concern.

Not sure if you really need to make the distinction about why the page
table is freed; we might free them via unmapping. The point is just
anything that frees them while there can be concurrent access, right?

>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
>  arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 461307b89c3a..3b65b3b11041 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -1469,13 +1469,15 @@ int kvmppc_radix_init(void)
>  {
>  	unsigned long size = sizeof(void *) << RADIX_PTE_INDEX_SIZE;
>  
> -	kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, 0, pte_ctor);
> +	kvm_pte_cache = kmem_cache_create("kvm-pte", size, size,
> +					  SLAB_TYPESAFE_BY_RCU, pte_ctor);
>  	if (!kvm_pte_cache)
>  		return -ENOMEM;
>  
>  	size = sizeof(void *) << RADIX_PMD_INDEX_SIZE;
>  
> -	kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, pmd_ctor);
> +	kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size,
> +					  SLAB_TYPESAFE_BY_RCU, pmd_ctor);
>  	if (!kvm_pmd_cache) {
>  		kmem_cache_destroy(kvm_pte_cache);
>  		return -ENOMEM;

KVM PPC HV radix PUD level page tables use the arch/powerpc allocators
(for some reason), which are not RCU freed. I think you need them too?
I wouldn't mind if the kvm pud table slab was moved in here instead of
shared.
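
A sketch of what moving the PUD slab in here could look like (kvm_pud_cache and
pud_ctor are illustrative names; RADIX_PUD_INDEX_SIZE and the RCU-type-safe
treatment mirror the patch above):

	size = sizeof(void *) << RADIX_PUD_INDEX_SIZE;

	kvm_pud_cache = kmem_cache_create("kvm-pud", size, size,
					  SLAB_TYPESAFE_BY_RCU, pud_ctor);
	if (!kvm_pud_cache) {
		kmem_cache_destroy(kvm_pmd_cache);
		kmem_cache_destroy(kvm_pte_cache);
		return -ENOMEM;
	}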

Thanks,
Nick

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()
  2023-05-26 23:44 ` [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young() Yu Zhao
                     ` (2 preceding siblings ...)
  2023-06-15 17:42   ` Sean Christopherson
@ 2023-06-20  7:30   ` Nicholas Piggin
  3 siblings, 0 replies; 40+ messages in thread
From: Nicholas Piggin @ 2023-06-20  7:30 UTC (permalink / raw)
  To: Yu Zhao, Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport, Oliver Upton,
	Paul Mackerras, Peter Xu, Sean Christopherson, Steven Rostedt,
	Suzuki K Poulose, Thomas Gleixner, Thomas Huth, Will Deacon,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linuxppc-dev, linux-trace-kernel, x86,
	linux-mm

On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> Add mmu_notifier_ops->test_clear_young() to supersede test_young()
> and clear_young().
>
> test_clear_young() has a fast path, which if supported, allows its
> callers to safely clear the accessed bit without taking
> kvm->mmu_lock.
>
> The fast path requires arch-specific code that generally relies on
> RCU and CAS: the former protects KVM page tables from being freed
> while the latter clears the accessed bit atomically against both the
> hardware and other software page table walkers. If the fast path is
> unsupported, test_clear_young() falls back to the existing slow path
> where kvm->mmu_lock is then taken.
>
> test_clear_young() can also operate on a range of KVM PTEs
> individually according to a bitmap, if the caller provides it.

It would be better if you could do patch 1 that only touches the
mmu_notifier code and implements mmu_notifier_test_clear_young() in
terms of the existing callbacks, and the next patch swaps KVM to the new
callbacks and removes the old ones.

> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 64a3e051c3c4..dfdbb370682d 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -60,6 +60,8 @@ enum mmu_notifier_event {
>  };
>  
>  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> +#define MMU_NOTIFIER_RANGE_LOCKLESS	(1 << 1)
> +#define MMU_NOTIFIER_RANGE_YOUNG	(1 << 2)
>  
>  struct mmu_notifier_ops {
>  	/*
> @@ -122,6 +124,10 @@ struct mmu_notifier_ops {
>  			  struct mm_struct *mm,
>  			  unsigned long address);
>  
> +	int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
> +				unsigned long start, unsigned long end,
> +				bool clear, unsigned long *bitmap);

This should have a comment like the others. Callback wants to know how
to implement it.

Could add a _range on it as well while you're here, to correct that
inconsistency.
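
One possible shape of the commented, range-named callback (the name and wording
are illustrative, not from the posted series):

	/*
	 * test_clear_young_range() is called to check whether any secondary-MMU
	 * mapping in [start, end) has been accessed and, if @clear is set, to
	 * clear its accessed state.  A nonzero return value means at least one
	 * mapping in the range was young.
	 */
	int (*test_clear_young_range)(struct mmu_notifier *mn, struct mm_struct *mm,
				      unsigned long start, unsigned long end,
				      bool clear, unsigned long *bitmap);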

> +
>  	/*
>  	 * change_pte is called in cases that pte mapping to page is changed:
>  	 * for example, when ksm remaps pte to point to a new shared page.
> @@ -392,6 +398,9 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
>  				      unsigned long end);
>  extern int __mmu_notifier_test_young(struct mm_struct *mm,
>  				     unsigned long address);
> +extern int __mmu_notifier_test_clear_young(struct mm_struct *mm,
> +					   unsigned long start, unsigned long end,
> +					   bool clear, unsigned long *bitmap);
>  extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>  				      unsigned long address, pte_t pte);
>  extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
> @@ -440,6 +449,35 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
>  	return 0;
>  }
>  
> +/*
> + * mmu_notifier_test_clear_young() returns nonzero if any of the KVM PTEs within
> + * a given range was young. Specifically, it returns MMU_NOTIFIER_RANGE_LOCKLESS
> + * if the fast path was successful, MMU_NOTIFIER_RANGE_YOUNG otherwise.
> + *
> + * The last parameter to the function is a bitmap and only the fast path
> + * supports it: if it is NULL, the function falls back to the slow path if the
> + * fast path was unsuccessful; otherwise, the function bails out.

Then if it was NULL, you would just not populate it. Minimize differences
and cases for the caller/implementations.

> + *
> + * The bitmap has the following specifications:
> + * 1. The number of bits should be at least (end-start)/PAGE_SIZE.
> + * 2. The offset of each bit should be relative to the end, i.e., the offset
> + *    corresponding to addr should be (end-addr)/PAGE_SIZE-1. This is convenient
> + *    for batching while forward looping.
> + *
> + * When testing, this function sets the corresponding bit in the bitmap for each
> + * young KVM PTE. When clearing, this function clears the accessed bit for each
> + * young KVM PTE whose corresponding bit in the bitmap is set.

I think this is over-designed as a first pass. The secondary MMU should
just implement the call always. If it can't do it locklessly, then just
do individual lookups. If the benefit is in the batching as you say then
the locked version will get a similar benefit, possibly more, since
contended locks benefit from some amount of batching.

I think that would reduce some concerns about cases of secondary MMUs
that do not support the lockless version yet, and avoid
proliferation of code paths by platform.

_If_ that was insufficient then I would like to see numbers and profiles
and incremental patch to expose more complexity like this.

Also mmu notifier code should say nothing about KVM PTEs or use kvm
names in any code or comments either. "if the page was accessed via the
secondary MMU" or similar is mutually understandable to KVM and mm
developers.

> @@ -880,6 +887,72 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
>  					     kvm_test_age_gfn);
>  }
>  
> +struct test_clear_young_args {
> +	unsigned long *bitmap;
> +	unsigned long end;
> +	bool clear;
> +	bool young;
> +};
> +
> +bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn)
> +{
> +	struct test_clear_young_args *args = range->args;
> +
> +	VM_WARN_ON_ONCE(gfn < range->start || gfn >= range->end);
> +
> +	args->young = true;
> +
> +	if (args->bitmap) {
> +		int offset = hva_to_gfn_memslot(args->end - 1, range->slot) - gfn;
> +
> +		if (args->clear)
> +			return test_bit(offset, args->bitmap);
> +
> +		__set_bit(offset, args->bitmap);
> +	}
> +
> +	return args->clear;
> +}

I don't quite understand what's going on here. This is actually the
function that notes the young pte, despite its name suggesting it is
only a query.

Shouldn't it set the bitmap bit even in the clear case? And why is it
testing at all? Oh, it seems to be some strange mix of test *or* clear
young. With the bitmap being a predicate in some cases for the clear
case.

This is a fairly confusing multi-modal API then. I think it should
take 2 bitmaps, one is the young bitmap and the other is the predicate
bitmap, and either/or can be NULL.

Also this kvm_should_clear_young helper is clunky and misnamed. If you
just provided an inline helper to get test_clear_young bitmap offset
from gfn, then set/clear bit in the caller is quite trivial.
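
A possible shape of the two-bitmap alternative plus the offset helper (names
and exact semantics are illustrative, not something posted in this thread):

	/*
	 * @young_bitmap: output, set for entries found young; may be NULL.
	 * @clear_bitmap: input, restricts which entries get cleared; may be NULL.
	 */
	int (*test_clear_young_range)(struct mmu_notifier *mn, struct mm_struct *mm,
				      unsigned long start, unsigned long end,
				      unsigned long *young_bitmap,
				      unsigned long *clear_bitmap);

	/* Trivial helper so callers can set/test bitmap bits themselves. */
	static inline unsigned long test_clear_young_bitmap_offset(unsigned long end, gfn_t gfn,
								    struct kvm_memory_slot *slot)
	{
		return hva_to_gfn_memslot(end - 1, slot) - gfn;
	}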

> +
> +static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_struct *mm,
> +					     unsigned long start, unsigned long end,
> +					     bool clear, unsigned long *bitmap)
> +{
> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> +	struct kvm_hva_range range = {
> +		.start		= start,
> +		.end		= end,
> +		.on_lock	= (void *)kvm_null_fn,
> +		.on_unlock	= (void *)kvm_null_fn,
> +	};
> +
> +	trace_kvm_age_hva(start, end);
> +
> +	if (kvm_arch_has_test_clear_young()) {
> +		struct test_clear_young_args args = {
> +			.bitmap	= bitmap,
> +			.end	= end,
> +			.clear	= clear,
> +		};
> +
> +		range.args = &args;
> +		range.lockless = true;
> +		range.handler = kvm_arch_test_clear_young;
> +
> +		if (!__kvm_handle_hva_range(kvm, &range))
> +			return args.young ? MMU_NOTIFIER_RANGE_LOCKLESS : 0;
> +	}
> +
> +	if (bitmap)
> +		return 0;
> +
> +	range.args = NULL;
> +	range.lockless = false;
> +	range.handler = clear ? kvm_age_gfn : kvm_test_age_gfn;

Minor thing, but KVM's "young" handling has been called "age" until now.
Any reason not to stick with that theme?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young()
  2023-05-26 23:44 ` [PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young() Yu Zhao
@ 2023-06-20  7:47   ` Nicholas Piggin
  2023-06-21  0:38     ` Yu Zhao
  0 siblings, 1 reply; 40+ messages in thread
From: Nicholas Piggin @ 2023-06-20  7:47 UTC (permalink / raw)
  To: Yu Zhao, Andrew Morton, Paolo Bonzini
  Cc: Alistair Popple, Anup Patel, Ben Gardon, Borislav Petkov,
	Catalin Marinas, Chao Peng, Christophe Leroy, Dave Hansen,
	Fabiano Rosas, Gaosheng Cui, Gavin Shan, H. Peter Anvin,
	Ingo Molnar, James Morse, Jason A. Donenfeld, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Masami Hiramatsu,
	Michael Ellerman, Michael Larabel, Mike Rapoport, Oliver Upton,
	Paul Mackerras, Peter Xu, Sean Christopherson, Steven Rostedt,
	Suzuki K Poulose, Thomas Gleixner, Thomas Huth, Will Deacon,
	Zenghui Yu, kvmarm, kvm, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, linuxppc-dev, linux-trace-kernel, x86,
	linux-mm

On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> Implement kvm_arch_test_clear_young() to support the fast path in
> mmu_notifier_ops->test_clear_young().
>
> It focuses on a simple case, i.e., radix MMU sets the accessed bit in
> KVM PTEs and VMs are not nested, where it can rely on RCU and
> pte_xchg() to safely clear the accessed bit without taking
> kvm->mmu_lock. Complex cases fall back to the existing slow path
> where kvm->mmu_lock is then taken.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
>  arch/powerpc/include/asm/kvm_host.h    |  8 ++++
>  arch/powerpc/include/asm/kvm_ppc.h     |  1 +
>  arch/powerpc/kvm/book3s.c              |  6 +++
>  arch/powerpc/kvm/book3s.h              |  1 +
>  arch/powerpc/kvm/book3s_64_mmu_radix.c | 59 ++++++++++++++++++++++++++
>  arch/powerpc/kvm/book3s_hv.c           |  5 +++
>  6 files changed, 80 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 14ee0dece853..75c260ea8a9e 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -883,4 +883,12 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>  static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
>  static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
>  
> +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> +static inline bool kvm_arch_has_test_clear_young(void)
> +{
> +	return IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
> +	       cpu_has_feature(CPU_FTR_HVMODE) && cpu_has_feature(CPU_FTR_ARCH_300) &&
> +	       radix_enabled();

This could probably be radix_enabled() && !kvmhv_on_pseries(). Although
it's unclear why a nested hypervisor is excluded... I'd have to think
about it a bit more. Do you have any idea what might go wrong, or did
you just not have the time to consider it? (Not just powerpc nested but
any others.)

> +}
> +
>  #endif /* __POWERPC_KVM_HOST_H__ */
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 79a9c0bb8bba..ff1af6a7b44f 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -287,6 +287,7 @@ struct kvmppc_ops {
>  	bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
>  	bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>  	bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> +	bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range *range);
>  	bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
>  	void (*free_memslot)(struct kvm_memory_slot *slot);
>  	int (*init_vm)(struct kvm *kvm);
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 686d8d9eda3e..37bf40b0c4ff 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -899,6 +899,12 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  	return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
>  }
>  
> +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> +	return !kvm->arch.kvm_ops->test_clear_young ||
> +	       kvm->arch.kvm_ops->test_clear_young(kvm, range);
> +}
> +
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
> diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
> index 58391b4b32ed..fa2659e21ccc 100644
> --- a/arch/powerpc/kvm/book3s.h
> +++ b/arch/powerpc/kvm/book3s.h
> @@ -12,6 +12,7 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
>  extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
>  extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
>  extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> +extern bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range);
>  extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
>  
>  extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 3b65b3b11041..0a392e9a100a 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -1088,6 +1088,65 @@ bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
>  	return ref;
>  }
>  
> +bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> +	bool err;
> +	gfn_t gfn = range->start;
> +
> +	rcu_read_lock();
> +
> +	err = !kvm_is_radix(kvm);
> +	if (err)
> +		goto unlock;
> +
> +	/*
> +	 * Case 1:  This function          kvmppc_switch_mmu_to_hpt()
> +	 *
> +	 *          rcu_read_lock()
> +	 *          Test kvm_is_radix()    kvm->arch.radix = 0
> +	 *          Use kvm->arch.pgtable  synchronize_rcu()
> +	 *          rcu_read_unlock()
> +	 *                                 kvmppc_free_radix()
> +	 *
> +	 *
> +	 * Case 2:  This function          kvmppc_switch_mmu_to_radix()
> +	 *
> +	 *                                 kvmppc_init_vm_radix()
> +	 *                                 smp_wmb()
> +	 *          Test kvm_is_radix()    kvm->arch.radix = 1
> +	 *          smp_rmb()
> +	 *          Use kvm->arch.pgtable
> +	 */
> +	smp_rmb();

Comment could stand to expand slightly on what you are solving, in
words.

If you use synchronize_rcu() on both sides, you wouldn't need the
barrier, right?

> +	while (gfn < range->end) {
> +		pte_t *ptep;
> +		pte_t old, new;
> +		unsigned int shift;
> +
> +		ptep = find_kvm_secondary_pte_unlocked(kvm, gfn * PAGE_SIZE, &shift);
> +		if (!ptep)
> +			goto next;
> +
> +		VM_WARN_ON_ONCE(!page_count(virt_to_page(ptep)));

Not really appropriate at the KVM level. mm enforces this kind of
thing (with notifiers).

> +
> +		old = READ_ONCE(*ptep);
> +		if (!pte_present(old) || !pte_young(old))
> +			goto next;
> +
> +		new = pte_mkold(old);
> +
> +		if (kvm_should_clear_young(range, gfn))
> +			pte_xchg(ptep, old, new);

*Probably* will work. I can't think of a reason why not at the
moment anyway :)

Thanks,
Nick
> +next:
> +		gfn += shift ? BIT(shift - PAGE_SHIFT) : 1;
> +	}
> +unlock:
> +	rcu_read_unlock();
> +
> +	return err;
> +}
> +
>  /* Returns the number of PAGE_SIZE pages that are dirty */
>  static int kvm_radix_test_clear_dirty(struct kvm *kvm,
>  				struct kvm_memory_slot *memslot, int pagenum)
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 130bafdb1430..20a81ec9fde8 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -5262,6 +5262,8 @@ int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
>  	spin_lock(&kvm->mmu_lock);
>  	kvm->arch.radix = 0;
>  	spin_unlock(&kvm->mmu_lock);
> +	/* see the comments in kvm_test_clear_young_hv() */

I'm guilty of such comments at times, but it wouldn't hurt to say
	/* Finish concurrent kvm_test_clear_young_hv access to page tables */

Then you know where to look for more info and you have a vague
idea what it's for.

> +	synchronize_rcu();

>  	kvmppc_free_radix(kvm);
>  
>  	lpcr = LPCR_VPM1;
> @@ -5286,6 +5288,8 @@ int kvmppc_switch_mmu_to_radix(struct kvm *kvm)
>  	if (err)
>  		return err;
>  	kvmppc_rmap_reset(kvm);
> +	/* see the comments in kvm_test_clear_young_hv() */
> +	smp_wmb();
>  	/* Mutual exclusion with kvm_unmap_gfn_range etc. */
>  	spin_lock(&kvm->mmu_lock);
>  	kvm->arch.radix = 1;
> @@ -6185,6 +6189,7 @@ static struct kvmppc_ops kvm_ops_hv = {
>  	.unmap_gfn_range = kvm_unmap_gfn_range_hv,
>  	.age_gfn = kvm_age_gfn_hv,
>  	.test_age_gfn = kvm_test_age_gfn_hv,
> +	.test_clear_young = kvm_test_clear_young_hv,
>  	.set_spte_gfn = kvm_set_spte_gfn_hv,
>  	.free_memslot = kvmppc_core_free_memslot_hv,
>  	.init_vm =  kvmppc_core_init_vm_hv,

Thanks for looking at the powerpc conversion!

Thanks,
Nick

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe
  2023-06-20  6:32   ` Nicholas Piggin
@ 2023-06-20  8:00     ` Yu Zhao
  2023-06-20 10:49       ` Nicholas Piggin
  0 siblings, 1 reply; 40+ messages in thread
From: Yu Zhao @ 2023-06-20  8:00 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Oliver Upton, Paul Mackerras,
	Peter Xu, Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Tue, Jun 20, 2023 at 12:33 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> > KVM page tables are currently not RCU safe against remapping, i.e.,
> > kvmppc_unmap_free_pmd_entry_table() et al. The previous
>
> Minor nit but the "page table" is not RCU-safe against something. It
> is RCU-freed, and therefore some algorithm that accesses it can have
> the existence guarantee provided by RCU (usually there still needs
> to be more to it).
>
> > mmu_notifier_ops members rely on kvm->mmu_lock to synchronize with
> > that operation.
> >
> > However, the new mmu_notifier_ops member test_clear_young() provides
> > a fast path that does not take kvm->mmu_lock. To implement
> > kvm_arch_test_clear_young() for that path, orphan page tables need to
> > be freed by RCU.
>
> Short version: clear the referenced bit using RCU instead of MMU lock
> to protect against page table freeing, and there is no problem with
> clearing the bit in a table that has been freed.
>
> Seems reasonable.

Thanks. All above points taken.

> > Unmapping, specifically kvm_unmap_radix(), does not free page tables,
> > hence not a concern.
>
Not sure you really need to make the distinction about why the page
table is freed; we might free them via unmapping. The point is just
that anything that frees them while there can be concurrent access is
a problem, right?

Correct.

> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> >  arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > index 461307b89c3a..3b65b3b11041 100644
> > --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > @@ -1469,13 +1469,15 @@ int kvmppc_radix_init(void)
> >  {
> >       unsigned long size = sizeof(void *) << RADIX_PTE_INDEX_SIZE;
> >
> > -     kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, 0, pte_ctor);
> > +     kvm_pte_cache = kmem_cache_create("kvm-pte", size, size,
> > +                                       SLAB_TYPESAFE_BY_RCU, pte_ctor);
> >       if (!kvm_pte_cache)
> >               return -ENOMEM;
> >
> >       size = sizeof(void *) << RADIX_PMD_INDEX_SIZE;
> >
> > -     kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, pmd_ctor);
> > +     kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size,
> > +                                       SLAB_TYPESAFE_BY_RCU, pmd_ctor);
> >       if (!kvm_pmd_cache) {
> >               kmem_cache_destroy(kvm_pte_cache);
> >               return -ENOMEM;
>
> KVM PPC HV radix PUD level page tables use the arch/powerpc allocators
> (for some reason), which are not RCU freed. I think you need them too?

We don't. The use of the arch/powerpc allocator for PUD tables seems
appropriate to me because, unlike PMD/PTE tables, we never free PUD
tables during the lifetime of a VM:
* We don't free PUD/PMD/PTE tables when they become empty, i.e., not
mapping any pages but still attached. (We could in theory, as
x86/aarch64 do.)
* We have to free PMD/PTE tables when we replace them with 1GB/2MB
pages. (Otherwise we'd lose track of detached tables.) And we
currently don't support huge pages at P4D level, so we never detach
and free PUD tables.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe
  2023-06-20  8:00     ` Yu Zhao
@ 2023-06-20 10:49       ` Nicholas Piggin
  0 siblings, 0 replies; 40+ messages in thread
From: Nicholas Piggin @ 2023-06-20 10:49 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Oliver Upton, Paul Mackerras,
	Peter Xu, Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Tue Jun 20, 2023 at 6:00 PM AEST, Yu Zhao wrote:
> On Tue, Jun 20, 2023 at 12:33 AM Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> > > KVM page tables are currently not RCU safe against remapping, i.e.,
> > > kvmppc_unmap_free_pmd_entry_table() et al. The previous
> >
> > Minor nit but the "page table" is not RCU-safe against something. It
> > is RCU-freed, and therefore some algorithm that accesses it can have
> > the existence guarantee provided by RCU (usually there still needs
> > to be more to it).
> >
> > > mmu_notifier_ops members rely on kvm->mmu_lock to synchronize with
> > > that operation.
> > >
> > > However, the new mmu_notifier_ops member test_clear_young() provides
> > > a fast path that does not take kvm->mmu_lock. To implement
> > > kvm_arch_test_clear_young() for that path, orphan page tables need to
> > > be freed by RCU.
> >
> > Short version: clear the referenced bit using RCU instead of MMU lock
> > to protect against page table freeing, and there is no problem with
> > clearing the bit in a table that has been freed.
> >
> > Seems reasonable.
>
> Thanks. All above points taken.
>
> > > Unmapping, specifically kvm_unmap_radix(), does not free page tables,
> > > hence not a concern.
> >
> > Not sure if you really need to make the distinction about why the page
> > table is freed, we might free them via unmapping. The point is just
> > anything that frees them while there can be concurrent access, right?
>
> Correct.
>
> > > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > > ---
> > >  arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > index 461307b89c3a..3b65b3b11041 100644
> > > --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > @@ -1469,13 +1469,15 @@ int kvmppc_radix_init(void)
> > >  {
> > >       unsigned long size = sizeof(void *) << RADIX_PTE_INDEX_SIZE;
> > >
> > > -     kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, 0, pte_ctor);
> > > +     kvm_pte_cache = kmem_cache_create("kvm-pte", size, size,
> > > +                                       SLAB_TYPESAFE_BY_RCU, pte_ctor);
> > >       if (!kvm_pte_cache)
> > >               return -ENOMEM;
> > >
> > >       size = sizeof(void *) << RADIX_PMD_INDEX_SIZE;
> > >
> > > -     kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, pmd_ctor);
> > > +     kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size,
> > > +                                       SLAB_TYPESAFE_BY_RCU, pmd_ctor);
> > >       if (!kvm_pmd_cache) {
> > >               kmem_cache_destroy(kvm_pte_cache);
> > >               return -ENOMEM;
> >
> > KVM PPC HV radix PUD level page tables use the arch/powerpc allocators
> > (for some reason), which are not RCU freed. I think you need them too?
>
> We don't. The use of the arch/powerpc allocator for PUD tables seems
> appropriate to me because, unlike PMD/PTE tables, we never free PUD
> tables during the lifetime of a VM:

Ah you're right, the pud_free only comes from the double alloc case
so it's never visible to concurrent threads.

> * We don't free PUD/PMD/PTE tables when they become empty, i.e., not
> mapping any pages but still attached. (We could in theory, as
> x86/aarch64 do.)

We may try to do that at some point, but that's not related to your
patch for now so no worries.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young()
  2023-06-20  7:47   ` Nicholas Piggin
@ 2023-06-21  0:38     ` Yu Zhao
  2023-06-21  2:51       ` Nicholas Piggin
  0 siblings, 1 reply; 40+ messages in thread
From: Yu Zhao @ 2023-06-21  0:38 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Oliver Upton, Paul Mackerras,
	Peter Xu, Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Tue, Jun 20, 2023 at 1:48 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> > Implement kvm_arch_test_clear_young() to support the fast path in
> > mmu_notifier_ops->test_clear_young().
> >
> > It focuses on a simple case, i.e., radix MMU sets the accessed bit in
> > KVM PTEs and VMs are not nested, where it can rely on RCU and
> > pte_xchg() to safely clear the accessed bit without taking
> > kvm->mmu_lock. Complex cases fall back to the existing slow path
> > where kvm->mmu_lock is then taken.
> >
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> >  arch/powerpc/include/asm/kvm_host.h    |  8 ++++
> >  arch/powerpc/include/asm/kvm_ppc.h     |  1 +
> >  arch/powerpc/kvm/book3s.c              |  6 +++
> >  arch/powerpc/kvm/book3s.h              |  1 +
> >  arch/powerpc/kvm/book3s_64_mmu_radix.c | 59 ++++++++++++++++++++++++++
> >  arch/powerpc/kvm/book3s_hv.c           |  5 +++
> >  6 files changed, 80 insertions(+)
> >
> > diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> > index 14ee0dece853..75c260ea8a9e 100644
> > --- a/arch/powerpc/include/asm/kvm_host.h
> > +++ b/arch/powerpc/include/asm/kvm_host.h
> > @@ -883,4 +883,12 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
> >  static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
> >  static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
> >
> > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > +static inline bool kvm_arch_has_test_clear_young(void)
> > +{
> > +     return IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
> > +            cpu_has_feature(CPU_FTR_HVMODE) && cpu_has_feature(CPU_FTR_ARCH_300) &&
> > +            radix_enabled();
>
> This could probably be radix_enabled() && !kvmhv_on_pseries().

Will do. (I used !kvmhv_on_pseries() in v1 but had second thoughts on
moving kvmhv_on_pseries() into this file.)

> Although unclear why not nested hypervisor... I'd have to think about it a bit
> more. Do you have any idea what might go wrong, or just didn't have the
> time to consider it? (Not just powerpc nested but any others).

Yes, this series excludes nested KVM support on all architectures. The
common reason for such a decision on powerpc and x86 (aarch64 doesn't
support nested yet) is that it's quite challenging to make the rmap, a
complex data structure that maps one PFN to multiple GFNs, lockless.
(See kvmhv_insert_nest_rmap().) It might be possible; however, the
potential ROI would be questionable.

> > +}
> > +
> >  #endif /* __POWERPC_KVM_HOST_H__ */
> > diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> > index 79a9c0bb8bba..ff1af6a7b44f 100644
> > --- a/arch/powerpc/include/asm/kvm_ppc.h
> > +++ b/arch/powerpc/include/asm/kvm_ppc.h
> > @@ -287,6 +287,7 @@ struct kvmppc_ops {
> >       bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
> >       bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> >       bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> > +     bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range *range);
> >       bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> >       void (*free_memslot)(struct kvm_memory_slot *slot);
> >       int (*init_vm)(struct kvm *kvm);
> > diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> > index 686d8d9eda3e..37bf40b0c4ff 100644
> > --- a/arch/powerpc/kvm/book3s.c
> > +++ b/arch/powerpc/kvm/book3s.c
> > @@ -899,6 +899,12 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >       return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
> >  }
> >
> > +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
> > +{
> > +     return !kvm->arch.kvm_ops->test_clear_young ||
> > +            kvm->arch.kvm_ops->test_clear_young(kvm, range);
> > +}
> > +
> >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> >       return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
> > diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
> > index 58391b4b32ed..fa2659e21ccc 100644
> > --- a/arch/powerpc/kvm/book3s.h
> > +++ b/arch/powerpc/kvm/book3s.h
> > @@ -12,6 +12,7 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
> >  extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> >  extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> >  extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > +extern bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> >  extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> >
> >  extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
> > diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > index 3b65b3b11041..0a392e9a100a 100644
> > --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > @@ -1088,6 +1088,65 @@ bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
> >       return ref;
> >  }
> >
> > +bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range)
> > +{
> > +     bool err;
> > +     gfn_t gfn = range->start;
> > +
> > +     rcu_read_lock();
> > +
> > +     err = !kvm_is_radix(kvm);
> > +     if (err)
> > +             goto unlock;
> > +
> > +     /*
> > +      * Case 1:  This function          kvmppc_switch_mmu_to_hpt()
> > +      *
> > +      *          rcu_read_lock()
> > +      *          Test kvm_is_radix()    kvm->arch.radix = 0
> > +      *          Use kvm->arch.pgtable  synchronize_rcu()
> > +      *          rcu_read_unlock()
> > +      *                                 kvmppc_free_radix()
> > +      *
> > +      *
> > +      * Case 2:  This function          kvmppc_switch_mmu_to_radix()
> > +      *
> > +      *                                 kvmppc_init_vm_radix()
> > +      *                                 smp_wmb()
> > +      *          Test kvm_is_radix()    kvm->arch.radix = 1
> > +      *          smp_rmb()
> > +      *          Use kvm->arch.pgtable
> > +      */
> > +     smp_rmb();
>
> Comment could stand to expand slightly on what you are solving, in
> words.

Will do.

> If you use synchronize_rcu() on both sides, you wouldn't need the
> barrier, right?

Case 2 is about memory ordering, which is orthogonal to case 1 (RCU
freeing). So we need the r/w barrier regardless.

> > +     while (gfn < range->end) {
> > +             pte_t *ptep;
> > +             pte_t old, new;
> > +             unsigned int shift;
> > +
> > +             ptep = find_kvm_secondary_pte_unlocked(kvm, gfn * PAGE_SIZE, &shift);
> > +             if (!ptep)
> > +                     goto next;
> > +
> > +             VM_WARN_ON_ONCE(!page_count(virt_to_page(ptep)));
>
> Not really appropriate at the KVM level. mm enforces this kind of
> thing (with notifiers).

Will remove this.

> > +
> > +             old = READ_ONCE(*ptep);
> > +             if (!pte_present(old) || !pte_young(old))
> > +                     goto next;
> > +
> > +             new = pte_mkold(old);
> > +
> > +             if (kvm_should_clear_young(range, gfn))
> > +                     pte_xchg(ptep, old, new);
>
> *Probably* will work. I can't think of a reason why not at the
> moment anyway :)

My reasoning:
* It should work when we only change a dedicated A bit, i.e., one not
shared for other purposes, because losing such a race is harmless (the
case here).
* It may not work for x86 EPT without the A bit (excluded in this
series), where accesses are trapped via the R/X bits, because racing
on the R/X bits can be unsafe.
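
For illustration, the kind of interleaving I have in mind (hypothetical
timing, dedicated A bit):

  This function                      Hardware/other walker
  old = READ_ONCE(*ptep)  /* A set */
                                     modifies *ptep (e.g. sets A again
                                     or changes another bit)
  pte_xchg(ptep, old, new) fails

Worst case the accessed bit stays set and the page is simply reported
young again later; nothing is lost. With R/X-based access tracking,
losing a similar race could drop a permission update, which would not
be safe.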

> > +next:
> > +             gfn += shift ? BIT(shift - PAGE_SHIFT) : 1;
> > +     }
> > +unlock:
> > +     rcu_read_unlock();
> > +
> > +     return err;
> > +}
> > +
> >  /* Returns the number of PAGE_SIZE pages that are dirty */
> >  static int kvm_radix_test_clear_dirty(struct kvm *kvm,
> >                               struct kvm_memory_slot *memslot, int pagenum)
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 130bafdb1430..20a81ec9fde8 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -5262,6 +5262,8 @@ int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
> >       spin_lock(&kvm->mmu_lock);
> >       kvm->arch.radix = 0;
> >       spin_unlock(&kvm->mmu_lock);
> > +     /* see the comments in kvm_test_clear_young_hv() */
>
> I'm guilty of such comments at times, but it wouldn't hurt to say
>         /* Finish concurrent kvm_test_clear_young_hv access to page tables */
>
> Then you know where to look for more info and you have a vague
> idea what it's for.

Will do.

> > +     synchronize_rcu();
>
> >       kvmppc_free_radix(kvm);
> >
> >       lpcr = LPCR_VPM1;
> > @@ -5286,6 +5288,8 @@ int kvmppc_switch_mmu_to_radix(struct kvm *kvm)
> >       if (err)
> >               return err;
> >       kvmppc_rmap_reset(kvm);
> > +     /* see the comments in kvm_test_clear_young_hv() */
> > +     smp_wmb();
> >       /* Mutual exclusion with kvm_unmap_gfn_range etc. */
> >       spin_lock(&kvm->mmu_lock);
> >       kvm->arch.radix = 1;
> > @@ -6185,6 +6189,7 @@ static struct kvmppc_ops kvm_ops_hv = {
> >       .unmap_gfn_range = kvm_unmap_gfn_range_hv,
> >       .age_gfn = kvm_age_gfn_hv,
> >       .test_age_gfn = kvm_test_age_gfn_hv,
> > +     .test_clear_young = kvm_test_clear_young_hv,
> >       .set_spte_gfn = kvm_set_spte_gfn_hv,
> >       .free_memslot = kvmppc_core_free_memslot_hv,
> >       .init_vm =  kvmppc_core_init_vm_hv,
>
> Thanks for looking at the powerpc conversion!

Thanks for reviewing!

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young()
  2023-06-21  0:38     ` Yu Zhao
@ 2023-06-21  2:51       ` Nicholas Piggin
  0 siblings, 0 replies; 40+ messages in thread
From: Nicholas Piggin @ 2023-06-21  2:51 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Paolo Bonzini, Alistair Popple, Anup Patel,
	Ben Gardon, Borislav Petkov, Catalin Marinas, Chao Peng,
	Christophe Leroy, Dave Hansen, Fabiano Rosas, Gaosheng Cui,
	Gavin Shan, H. Peter Anvin, Ingo Molnar, James Morse,
	Jason A. Donenfeld, Jason Gunthorpe, Jonathan Corbet,
	Marc Zyngier, Masami Hiramatsu, Michael Ellerman,
	Michael Larabel, Mike Rapoport, Oliver Upton, Paul Mackerras,
	Peter Xu, Sean Christopherson, Steven Rostedt, Suzuki K Poulose,
	Thomas Gleixner, Thomas Huth, Will Deacon, Zenghui Yu, kvmarm,
	kvm, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	linuxppc-dev, linux-trace-kernel, x86, linux-mm

On Wed Jun 21, 2023 at 10:38 AM AEST, Yu Zhao wrote:
> On Tue, Jun 20, 2023 at 1:48 AM Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> > > Implement kvm_arch_test_clear_young() to support the fast path in
> > > mmu_notifier_ops->test_clear_young().
> > >
> > > It focuses on a simple case, i.e., radix MMU sets the accessed bit in
> > > KVM PTEs and VMs are not nested, where it can rely on RCU and
> > > pte_xchg() to safely clear the accessed bit without taking
> > > kvm->mmu_lock. Complex cases fall back to the existing slow path
> > > where kvm->mmu_lock is then taken.
> > >
> > > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > > ---
> > >  arch/powerpc/include/asm/kvm_host.h    |  8 ++++
> > >  arch/powerpc/include/asm/kvm_ppc.h     |  1 +
> > >  arch/powerpc/kvm/book3s.c              |  6 +++
> > >  arch/powerpc/kvm/book3s.h              |  1 +
> > >  arch/powerpc/kvm/book3s_64_mmu_radix.c | 59 ++++++++++++++++++++++++++
> > >  arch/powerpc/kvm/book3s_hv.c           |  5 +++
> > >  6 files changed, 80 insertions(+)
> > >
> > > diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> > > index 14ee0dece853..75c260ea8a9e 100644
> > > --- a/arch/powerpc/include/asm/kvm_host.h
> > > +++ b/arch/powerpc/include/asm/kvm_host.h
> > > @@ -883,4 +883,12 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
> > >  static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
> > >  static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
> > >
> > > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > > +static inline bool kvm_arch_has_test_clear_young(void)
> > > +{
> > > +     return IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
> > > +            cpu_has_feature(CPU_FTR_HVMODE) && cpu_has_feature(CPU_FTR_ARCH_300) &&
> > > +            radix_enabled();
> >
> > This could probably be radix_enabled() && !kvmhv_on_pseries().
>
> Will do. (I used !kvmhv_on_pseries() in v1 but had second thoughts on
> moving kvmhv_on_pseries() into this file.)

That should be okay. kvmhv_on_pseries is a property of the host so it
seems reasonable to move it here if needed.

> > Although unclear why not nested hypervisor... I'd have to think about it a bit
> > more. Do you have any idea what might go wrong, or just didn't have the
> > time to consider it? (Not just powerpc nested but any others).
>
> Yes, this series excludes nested KVM support on all architures. The
> common reason for such a decision on powerpc and x86 (aarch64 doesn't
> support nested yet) is that it's quite challenging to make the rmap, a
> complex data structure that maps one PFN to multiple GFNs, lockless.
> (See kvmhv_insert_nest_rmap().) It might be possible, however, the
> potential ROI would be in question.

Okay, just wondering. The rmap (at least the powerpc one) is just a
list I think, with a few details. If that is all it is, it might not
be so hard to make it lock-free, or to use a fine-grained lock on the
rmap chains. But it's fine to ignore it to start with.

> > > +}
> > > +
> > >  #endif /* __POWERPC_KVM_HOST_H__ */
> > > diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> > > index 79a9c0bb8bba..ff1af6a7b44f 100644
> > > --- a/arch/powerpc/include/asm/kvm_ppc.h
> > > +++ b/arch/powerpc/include/asm/kvm_ppc.h
> > > @@ -287,6 +287,7 @@ struct kvmppc_ops {
> > >       bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
> > >       bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> > >       bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> > > +     bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range *range);
> > >       bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> > >       void (*free_memslot)(struct kvm_memory_slot *slot);
> > >       int (*init_vm)(struct kvm *kvm);
> > > diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> > > index 686d8d9eda3e..37bf40b0c4ff 100644
> > > --- a/arch/powerpc/kvm/book3s.c
> > > +++ b/arch/powerpc/kvm/book3s.c
> > > @@ -899,6 +899,12 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > >       return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
> > >  }
> > >
> > > +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
> > > +{
> > > +     return !kvm->arch.kvm_ops->test_clear_young ||
> > > +            kvm->arch.kvm_ops->test_clear_young(kvm, range);
> > > +}
> > > +
> > >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > >  {
> > >       return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
> > > diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
> > > index 58391b4b32ed..fa2659e21ccc 100644
> > > --- a/arch/powerpc/kvm/book3s.h
> > > +++ b/arch/powerpc/kvm/book3s.h
> > > @@ -12,6 +12,7 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
> > >  extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > >  extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > >  extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > > +extern bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > >  extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > >
> > >  extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
> > > diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > index 3b65b3b11041..0a392e9a100a 100644
> > > --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > @@ -1088,6 +1088,65 @@ bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > >       return ref;
> > >  }
> > >
> > > +bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range)
> > > +{
> > > +     bool err;
> > > +     gfn_t gfn = range->start;
> > > +
> > > +     rcu_read_lock();
> > > +
> > > +     err = !kvm_is_radix(kvm);
> > > +     if (err)
> > > +             goto unlock;
> > > +
> > > +     /*
> > > +      * Case 1:  This function          kvmppc_switch_mmu_to_hpt()
> > > +      *
> > > +      *          rcu_read_lock()
> > > +      *          Test kvm_is_radix()    kvm->arch.radix = 0
> > > +      *          Use kvm->arch.pgtable  synchronize_rcu()
> > > +      *          rcu_read_unlock()
> > > +      *                                 kvmppc_free_radix()
> > > +      *
> > > +      *
> > > +      * Case 2:  This function          kvmppc_switch_mmu_to_radix()
> > > +      *
> > > +      *                                 kvmppc_init_vm_radix()
> > > +      *                                 smp_wmb()
> > > +      *          Test kvm_is_radix()    kvm->arch.radix = 1
> > > +      *          smp_rmb()
> > > +      *          Use kvm->arch.pgtable
> > > +      */
> > > +     smp_rmb();
> >
> > Comment could stand to expand slightly on what you are solving, in
> > words.
>
> Will do.
>
> > If you use synchronize_rcu() on both sides, you wouldn't need the
> > barrier, right?
>
> Case 2 is about memory ordering, which is orthogonal to case 1 (RCU
> freeing). So we need the r/w barrier regardless.

RCU can take care of memory ordering too, though. If you had
synchronize_rcu() where the smp_wmb() is, then no smp_rmb() would be
needed here.
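
I.e., roughly this in kvmppc_switch_mmu_to_radix(), untested and
relying on the grace-period ordering guarantee:

	kvmppc_rmap_reset(kvm);
	/*
	 * Order the radix tree setup above before the kvm->arch.radix
	 * store below for any lockless walker, so the reader would not
	 * need smp_rmb() after testing kvm_is_radix().
	 */
	synchronize_rcu();
	/* Mutual exclusion with kvm_unmap_gfn_range etc. */
	spin_lock(&kvm->mmu_lock);
	kvm->arch.radix = 1;
	spin_unlock(&kvm->mmu_lock);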

>
> > > +     while (gfn < range->end) {
> > > +             pte_t *ptep;
> > > +             pte_t old, new;
> > > +             unsigned int shift;
> > > +
> > > +             ptep = find_kvm_secondary_pte_unlocked(kvm, gfn * PAGE_SIZE, &shift);
> > > +             if (!ptep)
> > > +                     goto next;
> > > +
> > > +             VM_WARN_ON_ONCE(!page_count(virt_to_page(ptep)));
> >
> > Not really appropriate at the KVM level. mm enforces this kind of
> > thing (with notifiers).
>
> Will remove this.
>
> > > +
> > > +             old = READ_ONCE(*ptep);
> > > +             if (!pte_present(old) || !pte_young(old))
> > > +                     goto next;
> > > +
> > > +             new = pte_mkold(old);
> > > +
> > > +             if (kvm_should_clear_young(range, gfn))
> > > +                     pte_xchg(ptep, old, new);
> >
> > *Probably* will work. I can't think of a reason why not at the
> > moment anyway :)
>
> My reasoning:
> * It should work if we only change the dedicated A bit, i.e., not
> shared for other purposes, because races are safe (the case here).
> * It may not work for x86 EPT without the A bit (excluded in this
> series) where accesses are trapped by the R/X bits, because races in
> changing the R/X bits can be unsafe.

(For the benefit of others reading: it works because powerpc's
pte_xchg() is actually a cmpxchg, for some reason we really should
fix.) It can fail to clear the bit if the cmpxchg loses a race,
though.

I think pte_xchg is only used with the hash MMU in Linux before this
change. I think we may want to keep it that way and use something
like kvmppc_radix_update_pte() here to clear out the bit.
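
E.g. something along these lines (untested; assuming that helper keeps
its current signature and its retry-on-failure behaviour):

	if (pte_present(old) && pte_young(old) &&
	    kvm_should_clear_young(range, gfn))
		kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
					gfn * PAGE_SIZE, shift);

That also wouldn't silently give up when the update races with a
hardware change to the PTE.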
But don't worry too much about such fine details before sorting out
the core changes; I will have a better look after that.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2023-06-21  2:51 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-26 23:44 [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Yu Zhao
2023-05-26 23:44 ` [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young() Yu Zhao
2023-06-06  8:34   ` Tzung-Bi Shih
2023-06-09  1:00     ` Yu Zhao
     [not found]   ` <ZHedMX470b7EMwbe@ziepe.ca>
2023-06-09  9:04     ` Paolo Bonzini
2023-06-15 17:42   ` Sean Christopherson
2023-06-20  7:30   ` Nicholas Piggin
2023-05-26 23:44 ` [PATCH mm-unstable v2 02/10] mm/kvm: use mmu_notifier_ops->test_clear_young() Yu Zhao
2023-05-26 23:44 ` [PATCH mm-unstable v2 03/10] kvm/arm64: export stage2_try_set_pte() and macros Yu Zhao
2023-05-26 23:44 ` [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe Yu Zhao
2023-05-27 18:08   ` Oliver Upton
2023-05-27 20:13     ` Yu Zhao
2023-05-30 19:37       ` Oliver Upton
2023-05-30 20:06         ` Yu Zhao
     [not found]           ` <ZHef0VsZvZ1Vnz0u@linux.dev>
2023-05-31 23:10             ` Yu Zhao
2023-05-31 23:22               ` Oliver Upton
2023-05-31 23:41                 ` Yu Zhao
2023-05-26 23:44 ` [PATCH mm-unstable v2 05/10] kvm/arm64: add kvm_arch_test_clear_young() Yu Zhao
2023-05-26 23:44 ` [PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe Yu Zhao
2023-06-20  6:32   ` Nicholas Piggin
2023-06-20  8:00     ` Yu Zhao
2023-06-20 10:49       ` Nicholas Piggin
2023-05-26 23:44 ` [PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young() Yu Zhao
2023-06-20  7:47   ` Nicholas Piggin
2023-06-21  0:38     ` Yu Zhao
2023-06-21  2:51       ` Nicholas Piggin
2023-05-26 23:44 ` [PATCH mm-unstable v2 08/10] kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask Yu Zhao
2023-06-15 16:59   ` Sean Christopherson
2023-05-26 23:44 ` [PATCH mm-unstable v2 09/10] kvm/x86: add kvm_arch_test_clear_young() Yu Zhao
2023-06-09  9:06   ` Paolo Bonzini
2023-06-15 18:26   ` Sean Christopherson
2023-05-26 23:44 ` [PATCH mm-unstable v2 10/10] mm: multi-gen LRU: use mmu_notifier_test_clear_young() Yu Zhao
2023-06-09  0:59 ` kvm/arm64: Spark benchmark Yu Zhao
2023-06-09 13:04   ` Marc Zyngier
2023-06-18 20:11     ` Yu Zhao
2023-06-09  0:59 ` kvm/powerpc: memcached benchmark Yu Zhao
2023-06-09  0:59 ` kvm/x86: multichase benchmark Yu Zhao
2023-06-18 19:19   ` Yu Zhao
2023-06-09  9:07 ` [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit Paolo Bonzini
2023-06-20  2:19   ` Yu Zhao
