* [PATCH 00/24] Allow parallel page faults with TDP MMU
@ 2021-01-12 18:10 Ben Gardon
  2021-01-12 18:10 ` [PATCH 01/24] locking/rwlocks: Add contention detection for rwlocks Ben Gardon
                   ` (23 more replies)
  0 siblings, 24 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

The TDP MMU was implemented to simplify and improve the performance of
KVM's memory management on modern hardware with TDP (EPT / NPT). To build
on the existing performance improvements of the TDP MMU, add the ability
to handle vCPU page faults in parallel. In the current implementation,
vCPU page faults (actually EPT/NPT violations/misconfigurations) are the
largest source of MMU lock contention on VMs with many vCPUs. This
contention, and the resulting page fault latency, can soft-lock guests
and degrade performance. Handling page faults in parallel is especially
useful when booting VMs, enabling dirty logging, and handling demand
paging. In all these cases vCPUs are constantly incurring page faults on
each new page accessed.

Broadly, the following changes were required to allow parallel page
faults:
-- Contention detection and yielding added to rwlocks to bring them up to
   feature parity with spin locks, at least as far as the use of the MMU
   lock is concerned.
-- TDP MMU page table memory is protected with RCU and freed in RCU
   callbacks to allow multiple threads to operate on that memory
   concurrently.
-- When the TDP MMU is enabled, a rwlock is used instead of a spin lock on
   x86. This allows the page fault handlers to acquire the MMU lock in read
   mode and handle page faults in parallel while other operations maintain
   exclusive use of the lock by acquiring it in write mode.
-- An additional lock is added to protect some data structures needed by
   the page fault handlers, for relatively infrequent operations.
-- The page fault handler is modified to use atomic cmpxchgs to set SPTEs
   and some page fault handler operations are modified slightly to work
   concurrently with other threads.
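
To illustrate the last item, here is a minimal user-space sketch of the
compare-and-exchange pattern, written with C11 atomics. It is not the code
from this series: spte_slot_t and try_set_spte() are hypothetical names
chosen for the example, and a real SPTE update involves more bookkeeping.
The point is only that a fault handler holding the lock in read mode
installs its entry if, and only if, nobody else changed the slot first.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one shadow page table entry slot. */
typedef _Atomic uint64_t spte_slot_t;

/*
 * Install new_spte only if the slot still holds old_spte.
 * Returns true on success. A false return means another thread changed
 * the entry first; the caller should re-read the slot and retry.
 */
static bool try_set_spte(spte_slot_t *slot, uint64_t old_spte,
			 uint64_t new_spte)
{
	return atomic_compare_exchange_strong(slot, &old_spte, new_spte);
}
```

If two handlers race on the same empty slot, exactly one call succeeds and
the loser sees false, which corresponds to the "retries" mentioned in the
performance discussion.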

This series also contains a few bug fixes and optimizations, related to
the above, but not strictly part of enabling parallel page fault handling.

Performance testing:
The KVM selftests dirty_log_perf_test demonstrates the performance
improvements associated with this patch series. dirty_log_perf_test
was run on a two-socket Indus Skylake, with a VM with 96 vCPUs.
5 get-dirty-log iterations were run. Each test was run 3 times and the
results averaged. The test was conducted with 3 different variables:
-- Overlapping versus partitioned memory
   With overlapping memory vCPUs are more likely to incur retries handling
   parallel page faults, so the TDP MMU with parallel page faults is expected
   to fare the worst in this situation.
   Partitioned memory between vCPUs is a best case for parallel page faults
   with the TDP MMU as it should minimize contention and retries.
   When running with partitioned memory, 3G was allocated for each vCPU's
   data region. When running with overlapping memory accesses, a total of 6G
   was allocated for the VM's data region. This meant that the VM was much
   smaller overall, but each vCPU had more memory to access. Since the VMs
   were very different in size, the results cannot be reliably compared. The
   VM sizes were chosen to balance test runtime and repeatability of results.
   The option to overlap memory accesses will be added to dirty_log_perf_test
   in a (near-)future series.
-- With this patch set applied versus without
   In these tests the series was applied on commit:
   9f1abbe97c08 Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
   That commit was also used as the baseline.
-- TDP MMU enabled versus disabled
   This is primarily included to ensure that this series does not regress
   performance with the TDP MMU disabled.

Does this series improve performance with the TDP MMU enabled?

Baseline, TDP MMU enabled, partitioned accesses:
Populate memory time (s)		110.193
Enabling dirty logging time (s)		4.829
Dirty memory time (s)			3.949
Get dirty log time (s)			0.822
Disabling dirty logging time (s)	2.995
Parallel PFs series, TDP MMU enabled, partitioned accesses:
Populate memory time (s)		16.112
Enabling dirty logging time (s)		7.057
Dirty memory time (s)			0.275
Get dirty log time (s)			5.468
Disabling dirty logging time (s)	3.088

This scenario demonstrates the big win in this series: an 85% reduction in
the time taken to populate memory! Note that the time taken to dirty memory
is much shorter and the time to get the dirty log higher with this series.
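
As a quick check of the 85% figure, the reduction can be recomputed from
the two populate times listed above. The helper name percent_reduction()
is made up for this example.

```c
/* Percent reduction going from `before` to `after` (same units). */
static double percent_reduction(double before, double after)
{
	return 100.0 * (before - after) / before;
}
```

Calling percent_reduction(110.193, 16.112) gives a value between 85 and 86,
consistent with the number quoted in the text.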

Baseline, TDP MMU enabled, overlapping accesses:
Populate memory time (s)		117.31
Enabling dirty logging time (s)		0.191
Dirty memory time (s)			0.193
Get dirty log time (s)			2.33
Disabling dirty logging time (s)	0.059
Parallel PFs series, TDP MMU enabled, overlapping accesses:
Populate memory time (s)		141.155
Enabling dirty logging time (s)		0.245
Dirty memory time (s)			0.236
Get dirty log time (s)			2.35
Disabling dirty logging time (s)	0.075

With overlapping accesses, we can see that this parallel page faults
series actually reduces performance when populating memory. In profiling,
it appeared that most of the time was spent in get_user_pages, so it's
possible the extra concurrency hit the main MM subsystem harder, creating
contention there.

Does this series degrade performance with the TDP MMU disabled?

Baseline, TDP MMU disabled, partitioned accesses:
Populate memory time (s)		110.193
Enabling dirty logging time (s)		4.829
Dirty memory time (s)			3.949
Get dirty log time (s)			0.822
Disabling dirty logging time (s)	2.995
Parallel PFs series, TDP MMU disabled, partitioned accesses:
Populate memory time (s)		110.917
Enabling dirty logging time (s)		5.196
Dirty memory time (s)			4.559
Get dirty log time (s)			0.879
Disabling dirty logging time (s)	3.278

Here we can see that the parallel PFs series appears to have made enabling
and disabling dirty logging, and dirtying memory, slightly slower. It's
possible that this is a result of additional checks around MMU lock
acquisition.

Baseline, TDP MMU disabled, overlapping accesses:
Populate memory time (s)		103.115
Enabling dirty logging time (s)		0.222
Dirty memory time (s)			0.189
Get dirty log time (s)			2.341
Disabling dirty logging time (s)	0.126
Parallel PFs series, TDP MMU disabled, overlapping accesses:
Populate memory time (s)		85.392
Enabling dirty logging time (s)		0.224
Dirty memory time (s)			0.201
Get dirty log time (s)			2.363
Disabling dirty logging time (s)	0.131

From the above results we can see that the parallel PF series only had a
significant effect on the population time, with overlapping accesses and
the TDP MMU disabled. It is not currently known what in this series caused
the improvement.

Correctness testing:
The following tests were performed with an SMP kernel and a DBX kernel on an
Intel Skylake machine. The tests were run both with and without the TDP
MMU enabled.
-- This series introduces no new failures in kvm-unit-tests
   SMP + no TDP MMU: no new failures
   SMP + TDP MMU: no new failures
   DBX + no TDP MMU: no new failures
   DBX + TDP MMU: no new failures
-- All KVM selftests behave as expected
   SMP + no TDP MMU: all pass except ./x86_64/vmx_preemption_timer_test
   SMP + TDP MMU: all pass except ./x86_64/vmx_preemption_timer_test
   (./x86_64/vmx_preemption_timer_test also fails without this patch set,
   both with the TDP MMU on and off.)
   DBX + no TDP MMU: all pass
   DBX + TDP MMU: all pass
-- A VM running Debian 9 can be booted and all of its memory accessed
   SMP + no TDP MMU: works
   SMP + TDP MMU: works
   DBX + no TDP MMU: works
   DBX + TDP MMU: works
Cross-compilation was also checked for PowerPC and ARM64.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/7172

Ben Gardon (24):
  locking/rwlocks: Add contention detection for rwlocks
  sched: Add needbreak for rwlocks
  sched: Add cond_resched_rwlock
  kvm: x86/mmu: change TDP MMU yield function returns to match
    cond_resched
  kvm: x86/mmu: Fix yielding in TDP MMU
  kvm: x86/mmu: Skip no-op changes in TDP MMU functions
  kvm: x86/mmu: Add comment on __tdp_mmu_set_spte
  kvm: x86/mmu: Add lockdep when setting a TDP MMU SPTE
  kvm: x86/mmu: Don't redundantly clear TDP MMU pt memory
  kvm: x86/mmu: Factor out handle disconnected pt
  kvm: x86/mmu: Put TDP MMU PT walks in RCU read-critical section
  kvm: x86/kvm: RCU dereference tdp mmu page table links
  kvm: x86/mmu: Only free tdp_mmu pages after a grace period
  kvm: mmu: Wrap mmu_lock lock / unlock in a function
  kvm: mmu: Wrap mmu_lock cond_resched and needbreak
  kvm: mmu: Wrap mmu_lock assertions
  kvm: mmu: Move mmu_lock to struct kvm_arch
  kvm: x86/mmu: Use an rwlock for the x86 TDP MMU
  kvm: x86/mmu: Protect tdp_mmu_pages with a lock
  kvm: x86/mmu: Add atomic option for setting SPTEs
  kvm: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
  kvm: x86/mmu: Flush TLBs after zap in TDP MMU PF handler
  kvm: x86/mmu: Freeze SPTEs in disconnected pages
  kvm: x86/mmu: Allow parallel page faults for the TDP MMU

 Documentation/virt/kvm/locking.rst       |   2 +-
 arch/arm64/include/asm/kvm_host.h        |   2 +
 arch/arm64/kvm/arm.c                     |   2 +
 arch/arm64/kvm/mmu.c                     |  40 +-
 arch/mips/include/asm/kvm_host.h         |   2 +
 arch/mips/kvm/mips.c                     |  10 +-
 arch/mips/kvm/mmu.c                      |  20 +-
 arch/powerpc/include/asm/kvm_book3s_64.h |   7 +-
 arch/powerpc/include/asm/kvm_host.h      |   2 +
 arch/powerpc/kvm/book3s_64_mmu_host.c    |   4 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c      |  12 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c   |  32 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c      |   4 +-
 arch/powerpc/kvm/book3s_hv.c             |   8 +-
 arch/powerpc/kvm/book3s_hv_nested.c      |  59 ++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |  14 +-
 arch/powerpc/kvm/book3s_mmu_hpte.c       |  10 +-
 arch/powerpc/kvm/e500_mmu_host.c         |   6 +-
 arch/powerpc/kvm/powerpc.c               |   2 +
 arch/s390/include/asm/kvm_host.h         |   2 +
 arch/s390/kvm/kvm-s390.c                 |   2 +
 arch/x86/include/asm/kvm_host.h          |  23 +
 arch/x86/kvm/mmu/mmu.c                   | 189 ++++++--
 arch/x86/kvm/mmu/mmu_internal.h          |  16 +-
 arch/x86/kvm/mmu/page_track.c            |   8 +-
 arch/x86/kvm/mmu/paging_tmpl.h           |   8 +-
 arch/x86/kvm/mmu/spte.h                  |  16 +-
 arch/x86/kvm/mmu/tdp_iter.c              |   6 +-
 arch/x86/kvm/mmu/tdp_mmu.c               | 540 +++++++++++++++++++----
 arch/x86/kvm/x86.c                       |   4 +-
 drivers/gpu/drm/i915/gvt/kvmgt.c         |  12 +-
 include/asm-generic/qrwlock.h            |  24 +-
 include/linux/kvm_host.h                 |   7 +-
 include/linux/rwlock.h                   |   7 +
 include/linux/sched.h                    |  29 ++
 kernel/sched/core.c                      |  40 ++
 virt/kvm/dirty_ring.c                    |   4 +-
 virt/kvm/kvm_main.c                      |  58 ++-
 38 files changed, 938 insertions(+), 295 deletions(-)

-- 
2.30.0.284.gd98b1dd5eaa7-goog



* [PATCH 01/24] locking/rwlocks: Add contention detection for rwlocks
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-12 18:10 ` [PATCH 02/24] sched: Add needbreak " Ben Gardon
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon,
	Ingo Molnar, Will Deacon, Peter Zijlstra, Davidlohr Bueso,
	Waiman Long

rwlocks do not currently have any facility to detect contention
like spinlocks do. In order to allow users of rwlocks to better manage
latency, add contention detection for queued rwlocks.
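
The idea can be pictured with a small user-space model. This is only a
sketch under the assumption that waiters serialize on an internal wait
lock, which is what the diff below relies on when it tests
lock->wait_lock. struct toy_qrwlock and the toy_* helpers are invented for
the example and are not kernel interfaces.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model: only the part of a queued rwlock that waiters touch. */
struct toy_qrwlock {
	atomic_bool wait_lock;	/* held while some task is queued for the lock */
};

/* A (simulated) task starts waiting for the lock. */
static void toy_waiter_enqueue(struct toy_qrwlock *lock)
{
	atomic_store(&lock->wait_lock, true);
}

/* The (simulated) waiter got the lock and leaves the queue. */
static void toy_waiter_dequeue(struct toy_qrwlock *lock)
{
	atomic_store(&lock->wait_lock, false);
}

/* Contended means: somebody is currently queued behind the holder. */
static bool toy_rwlock_is_contended(struct toy_qrwlock *lock)
{
	return atomic_load(&lock->wait_lock);
}
```

A lock holder can poll toy_rwlock_is_contended() inside a long loop and
decide to drop the lock early when it returns true.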

CC: Ingo Molnar <mingo@redhat.com>
CC: Will Deacon <will@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 include/asm-generic/qrwlock.h | 24 ++++++++++++++++++------
 include/linux/rwlock.h        |  7 +++++++
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 84ce841ce735..0020d3b820a7 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -14,6 +14,7 @@
 #include <asm/processor.h>
 
 #include <asm-generic/qrwlock_types.h>
+#include <asm-generic/qspinlock.h>
 
 /*
  * Writer states & reader shift and bias.
@@ -116,15 +117,26 @@ static inline void queued_write_unlock(struct qrwlock *lock)
 	smp_store_release(&lock->wlocked, 0);
 }
 
+/**
+ * queued_rwlock_is_contended - check if the lock is contended
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static inline int queued_rwlock_is_contended(struct qrwlock *lock)
+{
+	return arch_spin_is_locked(&lock->wait_lock);
+}
+
 /*
  * Remapping rwlock architecture specific functions to the corresponding
  * queue rwlock functions.
  */
-#define arch_read_lock(l)	queued_read_lock(l)
-#define arch_write_lock(l)	queued_write_lock(l)
-#define arch_read_trylock(l)	queued_read_trylock(l)
-#define arch_write_trylock(l)	queued_write_trylock(l)
-#define arch_read_unlock(l)	queued_read_unlock(l)
-#define arch_write_unlock(l)	queued_write_unlock(l)
+#define arch_read_lock(l)		queued_read_lock(l)
+#define arch_write_lock(l)		queued_write_lock(l)
+#define arch_read_trylock(l)		queued_read_trylock(l)
+#define arch_write_trylock(l)		queued_write_trylock(l)
+#define arch_read_unlock(l)		queued_read_unlock(l)
+#define arch_write_unlock(l)		queued_write_unlock(l)
+#define arch_rwlock_is_contended(l)	queued_rwlock_is_contended(l)
 
 #endif /* __ASM_GENERIC_QRWLOCK_H */
diff --git a/include/linux/rwlock.h b/include/linux/rwlock.h
index 3dcd617e65ae..7ce9a51ae5c0 100644
--- a/include/linux/rwlock.h
+++ b/include/linux/rwlock.h
@@ -128,4 +128,11 @@ do {								\
 	1 : ({ local_irq_restore(flags); 0; }); \
 })
 
+#ifdef arch_rwlock_is_contended
+#define rwlock_is_contended(lock) \
+	 arch_rwlock_is_contended(&(lock)->raw_lock)
+#else
+#define rwlock_is_contended(lock)	((void)(lock), 0)
+#endif /* arch_rwlock_is_contended */
+
 #endif /* __LINUX_RWLOCK_H */
-- 
2.30.0.284.gd98b1dd5eaa7-goog



* [PATCH 02/24] sched: Add needbreak for rwlocks
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
  2021-01-12 18:10 ` [PATCH 01/24] locking/rwlocks: Add contention detection for rwlocks Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-12 18:10 ` [PATCH 03/24] sched: Add cond_resched_rwlock Ben Gardon
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon,
	Ingo Molnar, Will Deacon, Peter Zijlstra, Davidlohr Bueso,
	Waiman Long

Contention awareness while holding a spin lock is essential for reducing
latency when long running kernel operations can hold that lock. Add the
same contention detection interface for read/write spin locks.

CC: Ingo Molnar <mingo@redhat.com>
CC: Will Deacon <will@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 include/linux/sched.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..5d1378e5a040 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1912,6 +1912,23 @@ static inline int spin_needbreak(spinlock_t *lock)
 #endif
 }
 
+/*
+ * Check if a rwlock is contended.
+ * Returns non-zero if there is another task waiting on the rwlock.
+ * Returns zero if the lock is not contended or the system / underlying
+ * rwlock implementation does not support contention detection.
+ * Technically does not depend on CONFIG_PREEMPTION, but a general need
+ * for low latency.
+ */
+static inline int rwlock_needbreak(rwlock_t *lock)
+{
+#ifdef CONFIG_PREEMPTION
+	return rwlock_is_contended(lock);
+#else
+	return 0;
+#endif
+}
+
 static __always_inline bool need_resched(void)
 {
 	return unlikely(tif_need_resched());
-- 
2.30.0.284.gd98b1dd5eaa7-goog



* [PATCH 03/24] sched: Add cond_resched_rwlock
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
  2021-01-12 18:10 ` [PATCH 01/24] locking/rwlocks: Add contention detection for rwlocks Ben Gardon
  2021-01-12 18:10 ` [PATCH 02/24] sched: Add needbreak " Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-12 18:10 ` [PATCH 04/24] kvm: x86/mmu: change TDP MMU yield function returns to match cond_resched Ben Gardon
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon,
	Ingo Molnar, Will Deacon, Peter Zijlstra, Davidlohr Bueso,
	Waiman Long

Safely rescheduling while holding a spin lock is essential for keeping
long running kernel operations running smoothly. Add the facility to
cond_resched rwlocks.
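
The control flow this adds can be sketched in user space as follows. The
toy lock here only counts readers and is not a real lock, the contended
flag stands in for rwlock_needbreak(), and sched_yield() stands in for the
kernel's rescheduling. All toy_* names are assumptions made for the
example.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock: counts read-side holders, nothing more. */
struct toy_rwlock {
	atomic_int readers;	/* number of read-side holders */
	atomic_bool contended;	/* set by a (simulated) waiter */
};

static void toy_read_lock(struct toy_rwlock *l)
{
	atomic_fetch_add(&l->readers, 1);
}

static void toy_read_unlock(struct toy_rwlock *l)
{
	atomic_fetch_sub(&l->readers, 1);
}

/*
 * If the lock is contended, drop it, let other threads run, and retake
 * it. Returns 1 if the lock was dropped and retaken, 0 otherwise.
 */
static int toy_cond_resched_read(struct toy_rwlock *l)
{
	if (!atomic_load(&l->contended))
		return 0;

	toy_read_unlock(l);
	sched_yield();
	toy_read_lock(l);
	return 1;
}
```

The return value matters to callers: a return of 1 means the lock was
released for a while, so any state derived under the lock may be stale and
has to be revalidated.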

CC: Ingo Molnar <mingo@redhat.com>
CC: Will Deacon <will@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 include/linux/sched.h | 12 ++++++++++++
 kernel/sched/core.c   | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5d1378e5a040..3052d16da3cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1883,12 +1883,24 @@ static inline int _cond_resched(void) { return 0; }
 })
 
 extern int __cond_resched_lock(spinlock_t *lock);
+extern int __cond_resched_rwlock_read(rwlock_t *lock);
+extern int __cond_resched_rwlock_write(rwlock_t *lock);
 
 #define cond_resched_lock(lock) ({				\
 	___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\
 	__cond_resched_lock(lock);				\
 })
 
+#define cond_resched_rwlock_read(lock) ({			\
+	__might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);	\
+	__cond_resched_rwlock_read(lock);			\
+})
+
+#define cond_resched_rwlock_write(lock) ({			\
+	__might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);	\
+	__cond_resched_rwlock_write(lock);			\
+})
+
 static inline void cond_resched_rcu(void)
 {
 #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15d2562118d1..ade357642279 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6695,6 +6695,46 @@ int __cond_resched_lock(spinlock_t *lock)
 }
 EXPORT_SYMBOL(__cond_resched_lock);
 
+int __cond_resched_rwlock_read(rwlock_t *lock)
+{
+	int resched = should_resched(PREEMPT_LOCK_OFFSET);
+	int ret = 0;
+
+	lockdep_assert_held_read(lock);
+
+	if (rwlock_needbreak(lock) || resched) {
+		read_unlock(lock);
+		if (resched)
+			preempt_schedule_common();
+		else
+			cpu_relax();
+		ret = 1;
+		read_lock(lock);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(__cond_resched_rwlock_read);
+
+int __cond_resched_rwlock_write(rwlock_t *lock)
+{
+	int resched = should_resched(PREEMPT_LOCK_OFFSET);
+	int ret = 0;
+
+	lockdep_assert_held_write(lock);
+
+	if (rwlock_needbreak(lock) || resched) {
+		write_unlock(lock);
+		if (resched)
+			preempt_schedule_common();
+		else
+			cpu_relax();
+		ret = 1;
+		write_lock(lock);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(__cond_resched_rwlock_write);
+
 /**
  * yield - yield the current processor to other threads.
  *
-- 
2.30.0.284.gd98b1dd5eaa7-goog



* [PATCH 04/24] kvm: x86/mmu: change TDP MMU yield function returns to match cond_resched
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (2 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 03/24] sched: Add cond_resched_rwlock Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-20 18:38   ` Sean Christopherson
  2021-01-12 18:10 ` [PATCH 05/24] kvm: x86/mmu: Fix yielding in TDP MMU Ben Gardon
                   ` (19 subsequent siblings)
  23 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Currently the TDP MMU yield / cond_resched functions either return
nothing or return true if the TLBs were not flushed. These are confusing
semantics, especially when making control flow decisions in calling
functions.

To clean things up, change both functions to have the same
return value semantics as cond_resched: true if the thread yielded,
false if it did not. If the function yielded in the _flush_ version,
then the TLBs will have been flushed.
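
A caller-side sketch of the new convention, in plain C. struct toy_ctx and
both functions are hypothetical and only model the bookkeeping: the yield
helper returns true when it yielded (and therefore flushed), so a caller
that tracks a pending flush negates the result.

```c
#include <stdbool.h>

/* Hypothetical state for the example. */
struct toy_ctx {
	bool should_yield;	/* stands in for need_resched() / contention */
	int flushes;		/* number of TLB flushes performed */
};

/* Returns true if it yielded; a yield always flushes first. */
static bool toy_flush_cond_resched(struct toy_ctx *c)
{
	if (!c->should_yield)
		return false;

	c->flushes++;
	return true;
}

/*
 * Zap one entry, then maybe yield. Returns whether the caller still
 * owes a flush: true unless the yield path already flushed.
 */
static bool toy_zap_one(struct toy_ctx *c)
{
	/* ...an entry would be zapped here, making a flush necessary... */
	return !toy_flush_cond_resched(c);
}
```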

Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 38 +++++++++++++++++++++++++++++---------
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2ef8615f9dba..b2784514ca2d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -413,8 +413,15 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 			 _mmu->shadow_root_level, _start, _end)
 
 /*
- * Flush the TLB if the process should drop kvm->mmu_lock.
- * Return whether the caller still needs to flush the tlb.
+ * Flush the TLB and yield if the MMU lock is contended or this thread needs to
+ * return control to the scheduler.
+ *
+ * If this function yields, it will also reset the tdp_iter's walk over the
+ * paging structure and the calling function should allow the iterator to
+ * continue its traversal from the paging structure root.
+ *
+ * Return true if this function yielded, the TLBs were flushed, and the
+ * iterator's traversal was reset. Return false if a yield was not needed.
  */
 static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
 {
@@ -422,18 +429,30 @@ static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *it
 		kvm_flush_remote_tlbs(kvm);
 		cond_resched_lock(&kvm->mmu_lock);
 		tdp_iter_refresh_walk(iter);
-		return false;
-	} else {
 		return true;
-	}
+	} else
+		return false;
 }
 
-static void tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
+/*
+ * Yield if the MMU lock is contended or this thread needs to return control
+ * to the scheduler.
+ *
+ * If this function yields, it will also reset the tdp_iter's walk over the
+ * paging structure and the calling function should allow the iterator to
+ * continue its traversal from the paging structure root.
+ *
+ * Return true if this function yielded and the iterator's traversal was reset.
+ * Return false if a yield was not needed.
+ */
+static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
 {
 	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
 		cond_resched_lock(&kvm->mmu_lock);
 		tdp_iter_refresh_walk(iter);
-	}
+		return true;
+	} else
+		return false;
 }
 
 /*
@@ -470,7 +489,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		tdp_mmu_set_spte(kvm, &iter, 0);
 
 		if (can_yield)
-			flush_needed = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
+			flush_needed = !tdp_mmu_iter_flush_cond_resched(kvm,
+									&iter);
 		else
 			flush_needed = true;
 	}
@@ -1072,7 +1092,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 
 		tdp_mmu_set_spte(kvm, &iter, 0);
 
-		spte_set = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
+		spte_set = !tdp_mmu_iter_flush_cond_resched(kvm, &iter);
 	}
 
 	if (spte_set)
-- 
2.30.0.284.gd98b1dd5eaa7-goog



* [PATCH 05/24] kvm: x86/mmu: Fix yielding in TDP MMU
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (3 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 04/24] kvm: x86/mmu: change TDP MMU yield function returns to match cond_resched Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-20 19:28   ` Sean Christopherson
  2021-01-12 18:10 ` [PATCH 06/24] kvm: x86/mmu: Skip no-op changes in TDP MMU functions Ben Gardon
                   ` (18 subsequent siblings)
  23 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

There are two problems with the way the TDP MMU yields in long running
functions. 1.) Given certain conditions, the function may not yield
reliably / frequently enough. 2.) In some functions the TDP iter risks
not making forward progress if two threads livelock yielding to
one another.

Case 1 is possible if, for example, a paging structure was very large
but had few, if any, writable entries. wrprot_gfn_range could traverse many
entries before finding a writable entry and yielding.

Case 2 is possible if two threads were trying to execute wrprot_gfn_range.
Each could write protect an entry and then yield. This would reset the
tdp_iter's walk over the paging structure and the loop would end up
repeating the same entry over and over, preventing either thread from
making forward progress.

Fix these issues by moving the yield to the beginning of the loop,
before other checks and only yielding if the loop has made forward
progress since the last yield.
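
The difference between the two schemes can be seen in a deliberately
simplified model. It assumes a permanently contended lock and a yield that
restarts the walk at the current position; both functions and their names
are invented for the example and do not mirror the tdp_iter code.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Old scheme (model of Case 2): process an entry, then yield if the lock
 * is contended. The yield resets the walk to the same entry, so under
 * constant contention the position never advances.
 * Returns true if the walk over [0, n) finished within max_steps.
 */
static bool walk_yield_after_each_entry(size_t n, bool always_contended,
					size_t max_steps)
{
	size_t pos = 0;

	for (size_t steps = 0; steps < max_steps; steps++) {
		if (pos >= n)
			return true;
		/* Entry `pos` is processed here. */
		if (always_contended)
			continue;	/* yielded: walk restarts at pos */
		pos++;
	}
	return pos >= n;
}

/*
 * New scheme: check for a yield first, and only yield if the position
 * moved since the last yield. Returns the number of loop steps taken.
 */
static size_t walk_with_progress_check(size_t n, bool always_contended)
{
	size_t steps = 0;
	size_t last_yield_pos = 0;	/* start of the range */
	size_t pos = 0;

	while (pos < n) {
		steps++;
		if (always_contended && pos != last_yield_pos) {
			last_yield_pos = pos;
			continue;	/* yielded: retry the same entry */
		}
		pos++;			/* entry processed, move on */
	}
	return steps;
}
```

In this model the new scheme needs at most two steps per entry even under
constant contention, while the old scheme makes no progress at all.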

Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 83 +++++++++++++++++++++++++++++++-------
 1 file changed, 69 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b2784514ca2d..1987da0da66e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -470,9 +470,23 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			  gfn_t start, gfn_t end, bool can_yield)
 {
 	struct tdp_iter iter;
+	gfn_t last_goal_gfn = start;
 	bool flush_needed = false;
 
 	tdp_root_for_each_pte(iter, root, start, end) {
+		/* Ensure forward progress has been made before yielding. */
+		if (can_yield && iter.goal_gfn != last_goal_gfn &&
+		    tdp_mmu_iter_flush_cond_resched(kvm, &iter)) {
+			last_goal_gfn = iter.goal_gfn;
+			flush_needed = false;
+			/*
+			 * Yielding caused the paging structure walk to be
+			 * reset so skip to the next iteration to continue the
+			 * walk from the root.
+			 */
+			continue;
+		}
+
 		if (!is_shadow_present_pte(iter.old_spte))
 			continue;
 
@@ -487,12 +501,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			continue;
 
 		tdp_mmu_set_spte(kvm, &iter, 0);
-
-		if (can_yield)
-			flush_needed = !tdp_mmu_iter_flush_cond_resched(kvm,
-									&iter);
-		else
-			flush_needed = true;
+		flush_needed = true;
 	}
 	return flush_needed;
 }
@@ -850,12 +859,25 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	struct tdp_iter iter;
 	u64 new_spte;
+	gfn_t last_goal_gfn = start;
 	bool spte_set = false;
 
 	BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
 
 	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
 				   min_level, start, end) {
+		/* Ensure forward progress has been made before yielding. */
+		if (iter.goal_gfn != last_goal_gfn &&
+		    tdp_mmu_iter_cond_resched(kvm, &iter)) {
+			last_goal_gfn = iter.goal_gfn;
+			/*
+			 * Yielding caused the paging structure walk to be
+			 * reset so skip to the next iteration to continue the
+			 * walk from the root.
+			 */
+			continue;
+		}
+
 		if (!is_shadow_present_pte(iter.old_spte) ||
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
@@ -864,8 +886,6 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
 		spte_set = true;
-
-		tdp_mmu_iter_cond_resched(kvm, &iter);
 	}
 	return spte_set;
 }
@@ -906,9 +926,22 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	struct tdp_iter iter;
 	u64 new_spte;
+	gfn_t last_goal_gfn = start;
 	bool spte_set = false;
 
 	tdp_root_for_each_leaf_pte(iter, root, start, end) {
+		/* Ensure forward progress has been made before yielding. */
+		if (iter.goal_gfn != last_goal_gfn &&
+		    tdp_mmu_iter_cond_resched(kvm, &iter)) {
+			last_goal_gfn = iter.goal_gfn;
+			/*
+			 * Yielding caused the paging structure walk to be
+			 * reset so skip to the next iteration to continue the
+			 * walk from the root.
+			 */
+			continue;
+		}
+
 		if (spte_ad_need_write_protect(iter.old_spte)) {
 			if (is_writable_pte(iter.old_spte))
 				new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
@@ -923,8 +956,6 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
 		spte_set = true;
-
-		tdp_mmu_iter_cond_resched(kvm, &iter);
 	}
 	return spte_set;
 }
@@ -1029,9 +1060,22 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	struct tdp_iter iter;
 	u64 new_spte;
+	gfn_t last_goal_gfn = start;
 	bool spte_set = false;
 
 	tdp_root_for_each_pte(iter, root, start, end) {
+		/* Ensure forward progress has been made before yielding. */
+		if (iter.goal_gfn != last_goal_gfn &&
+		    tdp_mmu_iter_cond_resched(kvm, &iter)) {
+			last_goal_gfn = iter.goal_gfn;
+			/*
+			 * Yielding caused the paging structure walk to be
+			 * reset so skip to the next iteration to continue the
+			 * walk from the root.
+			 */
+			continue;
+		}
+
 		if (!is_shadow_present_pte(iter.old_spte))
 			continue;
 
@@ -1039,8 +1083,6 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		tdp_mmu_set_spte(kvm, &iter, new_spte);
 		spte_set = true;
-
-		tdp_mmu_iter_cond_resched(kvm, &iter);
 	}
 
 	return spte_set;
@@ -1078,9 +1120,23 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 {
 	struct tdp_iter iter;
 	kvm_pfn_t pfn;
+	gfn_t last_goal_gfn = start;
 	bool spte_set = false;
 
 	tdp_root_for_each_pte(iter, root, start, end) {
+		/* Ensure forward progress has been made before yielding. */
+		if (iter.goal_gfn != last_goal_gfn &&
+		    tdp_mmu_iter_flush_cond_resched(kvm, &iter)) {
+			last_goal_gfn = iter.goal_gfn;
+			spte_set = false;
+			/*
+			 * Yielding caused the paging structure walk to be
+			 * reset so skip to the next iteration to continue the
+			 * walk from the root.
+			 */
+			continue;
+		}
+
 		if (!is_shadow_present_pte(iter.old_spte) ||
 		    is_last_spte(iter.old_spte, iter.level))
 			continue;
@@ -1091,8 +1147,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 			continue;
 
 		tdp_mmu_set_spte(kvm, &iter, 0);
-
-		spte_set = !tdp_mmu_iter_flush_cond_resched(kvm, &iter);
+		spte_set = true;
 	}
 
 	if (spte_set)
-- 
2.30.0.284.gd98b1dd5eaa7-goog
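
The yield checks that patch 05 moves to the top of each loop share one
guard: yield only once iter.goal_gfn has moved past the value recorded at
the previous yield, so a walk that was reset cannot yield again before
doing some work. Below is a minimal userspace sketch of that guard. The
names toy_should_yield, toy_walk_stats and toy_walk_range are hypothetical
stand-ins, not KVM code, and the toy loop simply retries the same position
after a yield, which simplifies how the real iterator resumes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

struct toy_walk_stats {
	unsigned int processed;	/* entries handled */
	unsigned int yields;	/* times the walk yielded */
};

/* Hypothetical worst case: the scheduler always wants us to yield. */
static bool toy_should_yield(void)
{
	return true;
}

/* Walk [start, end), yielding only after forward progress was made. */
static struct toy_walk_stats toy_walk_range(gfn_t start, gfn_t end)
{
	struct toy_walk_stats st = { 0, 0 };
	gfn_t last_goal_gfn = start;
	gfn_t goal_gfn = start;

	assert(start <= end);

	while (goal_gfn < end) {
		/* Ensure forward progress has been made before yielding. */
		if (goal_gfn != last_goal_gfn && toy_should_yield()) {
			last_goal_gfn = goal_gfn;
			st.yields++;
			/* The walk restarts at goal_gfn; retry this entry. */
			continue;
		}

		st.processed++;
		goal_gfn++;
	}

	return st;
}
```

Even when every check asks for a yield, the guard is false on the pass
right after a yield, so each entry is still handled exactly once and the
loop terminates.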


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 06/24] kvm: x86/mmu: Skip no-op changes in TDP MMU functions
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (4 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 05/24] kvm: x86/mmu: Fix yielding in TDP MMU Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-20 19:51   ` Sean Christopherson
  2021-01-12 18:10 ` [PATCH 07/24] kvm: x86/mmu: Add comment on __tdp_mmu_set_spte Ben Gardon
                   ` (17 subsequent siblings)
  23 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Skip setting SPTEs if no change is expected.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1987da0da66e..2650fa9fe066 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -882,6 +882,9 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
+		if (!(iter.old_spte & PT_WRITABLE_MASK))
+			continue;
+
 		new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
 
 		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
@@ -1079,6 +1082,9 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		if (!is_shadow_present_pte(iter.old_spte))
 			continue;
 
+		if (iter.old_spte & shadow_dirty_mask)
+			continue;
+
 		new_spte = iter.old_spte | shadow_dirty_mask;
 
 		tdp_mmu_set_spte(kvm, &iter, new_spte);
-- 
2.30.0.284.gd98b1dd5eaa7-goog
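
The two checks added by patch 06 skip the SPTE write when the bit being
changed already has the target value. A minimal userspace sketch of the
write-protect case follows, under stated assumptions: TOY_WRITABLE_MASK
and toy_wrprot are hypothetical names, and a plain array stands in for
the page table entries that the real code visits through the TDP
iterator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit position; stands in for PT_WRITABLE_MASK. */
#define TOY_WRITABLE_MASK (1ULL << 1)

/*
 * Clear the writable bit in each entry and return how many entries were
 * actually written. Entries that are already read-only are skipped.
 */
static size_t toy_wrprot(uint64_t *sptes, size_t n)
{
	size_t writes = 0;
	size_t i;

	assert(sptes != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* No change expected, so skip the write entirely. */
		if (!(sptes[i] & TOY_WRITABLE_MASK))
			continue;

		sptes[i] &= ~TOY_WRITABLE_MASK;
		writes++;
	}

	return writes;
}
```

Running the helper a second time over the same array performs no writes.
In the real code the skipped work is the tdp_mmu_set_spte* call for that
entry.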


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 07/24] kvm: x86/mmu: Add comment on __tdp_mmu_set_spte
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (5 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 06/24] kvm: x86/mmu: Skip no-op changes in TDP MMU functions Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-26 14:13   ` Paolo Bonzini
  2021-01-12 18:10 ` [PATCH 08/24] kvm: x86/mmu: Add lockdep when setting a TDP MMU SPTE Ben Gardon
                   ` (16 subsequent siblings)
  23 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

__tdp_mmu_set_spte is a very important function in the TDP MMU which
already accepts several arguments and will take more in future commits.
To offset this complexity, add a comment to the function describing each
of the arguments.

No functional change intended.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2650fa9fe066..b033da8243fc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -357,6 +357,22 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 				      new_spte, level);
 }
 
+/*
+ * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
+ * @kvm: kvm instance
+ * @iter: a tdp_iter instance currently on the SPTE that should be set
+ * @new_spte: The value the SPTE should be set to
+ * @record_acc_track: Notify the MM subsystem of changes to the accessed state
+ *		      of the page. Should be set unless handling an MMU
+ *		      notifier for access tracking. Leaving record_acc_track
+ *		      unset in that case prevents page accesses from being
+ *		      double counted.
+ * @record_dirty_log: Record the page as dirty in the dirty bitmap if
+ *		      appropriate for the change being made. Should be set
+ *		      unless performing certain dirty logging operations.
+ *		      Leaving record_dirty_log unset in that case prevents page
+ *		      writes from being double counted.
+ */
 static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 				      u64 new_spte, bool record_acc_track,
 				      bool record_dirty_log)
-- 
2.30.0.284.gd98b1dd5eaa7-goog


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 08/24] kvm: x86/mmu: Add lockdep when setting a TDP MMU SPTE
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (6 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 07/24] kvm: x86/mmu: Add comment on __tdp_mmu_set_spte Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-20 19:58   ` Sean Christopherson
  2021-01-26 14:13   ` Paolo Bonzini
  2021-01-12 18:10 ` [PATCH 09/24] kvm: x86/mmu: Don't redundantly clear TDP MMU pt memory Ben Gardon
                   ` (15 subsequent siblings)
  23 siblings, 2 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Add a lockdep assertion to __tdp_mmu_set_spte to ensure that SPTEs are
only modified under the MMU lock. This assertion will be updated in future
commits to reflect and validate changes to the TDP MMU's synchronization
strategy.

No functional change intended.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b033da8243fc..411938e97a00 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -381,6 +381,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
 	int as_id = kvm_mmu_page_as_id(root);
 
+	lockdep_assert_held(&kvm->mmu_lock);
+
 	WRITE_ONCE(*iter->sptep, new_spte);
 
 	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
-- 
2.30.0.284.gd98b1dd5eaa7-goog


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 09/24] kvm: x86/mmu: Don't redundantly clear TDP MMU pt memory
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (7 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 08/24] kvm: x86/mmu: Add lockdep when setting a TDP MMU SPTE Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-20 20:06   ` Sean Christopherson
  2021-01-26 14:14   ` Paolo Bonzini
  2021-01-12 18:10 ` [PATCH 10/24] kvm: x86/mmu: Factor out handle disconnected pt Ben Gardon
                   ` (14 subsequent siblings)
  23 siblings, 2 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

The KVM MMU caches already guarantee that shadow page table memory will
be zeroed, so there is no reason to re-zero the page in the TDP MMU page
fault handler.

No functional change intended.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 411938e97a00..55df596696c7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -665,7 +665,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
 			list_add(&sp->link, &vcpu->kvm->arch.tdp_mmu_pages);
 			child_pt = sp->spt;
-			clear_page(child_pt);
 			new_spte = make_nonleaf_spte(child_pt,
 						     !shadow_accessed_mask);
 
-- 
2.30.0.284.gd98b1dd5eaa7-goog


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 10/24] kvm: x86/mmu: Factor out handle disconnected pt
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (8 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 09/24] kvm: x86/mmu: Don't redundantly clear TDP MMU pt memory Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-20 20:30   ` Sean Christopherson
  2021-01-26 14:14   ` Paolo Bonzini
  2021-01-12 18:10 ` [PATCH 11/24] kvm: x86/mmu: Put TDP MMU PT walks in RCU read-critical section Ben Gardon
                   ` (13 subsequent siblings)
  23 siblings, 2 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Factor out the code to handle a disconnected subtree of the TDP paging
structure from the code to handle the change to an individual SPTE.
Future commits will build on this to allow asynchronous page freeing.

No functional change intended.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 75 +++++++++++++++++++++++---------------
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 55df596696c7..e8f35cd46b4c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -234,6 +234,49 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
 	}
 }
 
+/**
+ * handle_disconnected_tdp_mmu_page - handle a pt removed from the TDP structure
+ *
+ * @kvm: kvm instance
+ * @pt: the page removed from the paging structure
+ *
+ * Given a page table that has been removed from the TDP paging structure,
+ * iterates through the page table to clear SPTEs and free child page tables.
+ */
+static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt)
+{
+	struct kvm_mmu_page *sp;
+	gfn_t gfn;
+	int level;
+	u64 old_child_spte;
+	int i;
+
+	sp = sptep_to_sp(pt);
+	gfn = sp->gfn;
+	level = sp->role.level;
+
+	trace_kvm_mmu_prepare_zap_page(sp);
+
+	list_del(&sp->link);
+
+	if (sp->lpage_disallowed)
+		unaccount_huge_nx_page(kvm, sp);
+
+	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
+		old_child_spte = READ_ONCE(*(pt + i));
+		WRITE_ONCE(*(pt + i), 0);
+		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp),
+			gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
+			old_child_spte, 0, level - 1);
+	}
+
+	kvm_flush_remote_tlbs_with_address(kvm, gfn,
+					   KVM_PAGES_PER_HPAGE(level));
+
+	free_page((unsigned long)pt);
+	kmem_cache_free(mmu_page_header_cache, sp);
+}
+
 /**
  * handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
@@ -254,10 +297,6 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
-	u64 *pt;
-	struct kvm_mmu_page *sp;
-	u64 old_child_spte;
-	int i;
 
 	WARN_ON(level > PT64_ROOT_MAX_LEVEL);
 	WARN_ON(level < PG_LEVEL_4K);
@@ -321,31 +360,9 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	 * Recursively handle child PTs if the change removed a subtree from
 	 * the paging structure.
 	 */
-	if (was_present && !was_leaf && (pfn_changed || !is_present)) {
-		pt = spte_to_child_pt(old_spte, level);
-		sp = sptep_to_sp(pt);
-
-		trace_kvm_mmu_prepare_zap_page(sp);
-
-		list_del(&sp->link);
-
-		if (sp->lpage_disallowed)
-			unaccount_huge_nx_page(kvm, sp);
-
-		for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
-			old_child_spte = READ_ONCE(*(pt + i));
-			WRITE_ONCE(*(pt + i), 0);
-			handle_changed_spte(kvm, as_id,
-				gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
-				old_child_spte, 0, level - 1);
-		}
-
-		kvm_flush_remote_tlbs_with_address(kvm, gfn,
-						   KVM_PAGES_PER_HPAGE(level));
-
-		free_page((unsigned long)pt);
-		kmem_cache_free(mmu_page_header_cache, sp);
-	}
+	if (was_present && !was_leaf && (pfn_changed || !is_present))
+		handle_disconnected_tdp_mmu_page(kvm,
+				spte_to_child_pt(old_spte, level));
 }
 
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-- 
2.30.0.284.gd98b1dd5eaa7-goog


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 11/24] kvm: x86/mmu: Put TDP MMU PT walks in RCU read-critical section
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (9 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 10/24] kvm: x86/mmu: Factor out handle disconnected pt Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-20 22:19   ` Sean Christopherson
  2021-01-12 18:10 ` [PATCH 12/24] kvm: x86/kvm: RCU dereference tdp mmu page table links Ben Gardon
                   ` (12 subsequent siblings)
  23 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

In order to enable concurrent modifications to the paging structures in
the TDP MMU, threads must be able to safely remove pages of page table
memory while other threads are traversing the same memory. To ensure
threads do not access PT memory after it is freed, protect PT memory
with RCU.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 53 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e8f35cd46b4c..662907d374b3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -458,11 +458,14 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
  * Return true if this function yielded, the TLBs were flushed, and the
  * iterator's traversal was reset. Return false if a yield was not needed.
  */
-static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
+static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm,
+		struct tdp_iter *iter)
 {
 	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
 		kvm_flush_remote_tlbs(kvm);
+		rcu_read_unlock();
 		cond_resched_lock(&kvm->mmu_lock);
+		rcu_read_lock();
 		tdp_iter_refresh_walk(iter);
 		return true;
 	} else
@@ -483,7 +486,9 @@ static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *it
 static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
 {
 	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		rcu_read_unlock();
 		cond_resched_lock(&kvm->mmu_lock);
+		rcu_read_lock();
 		tdp_iter_refresh_walk(iter);
 		return true;
 	} else
@@ -508,6 +513,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	gfn_t last_goal_gfn = start;
 	bool flush_needed = false;
 
+	rcu_read_lock();
+
 	tdp_root_for_each_pte(iter, root, start, end) {
 		/* Ensure forward progress has been made before yielding. */
 		if (can_yield && iter.goal_gfn != last_goal_gfn &&
@@ -538,6 +545,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		tdp_mmu_set_spte(kvm, &iter, 0);
 		flush_needed = true;
 	}
+
+	rcu_read_unlock();
 	return flush_needed;
 }
 
@@ -650,6 +659,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 					huge_page_disallowed, &req_level);
 
 	trace_kvm_mmu_spte_requested(gpa, level, pfn);
+
+	rcu_read_lock();
+
 	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
 		if (nx_huge_page_workaround_enabled)
 			disallowed_hugepage_adjust(iter.old_spte, gfn,
@@ -693,11 +705,14 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		}
 	}
 
-	if (WARN_ON(iter.level != level))
+	if (WARN_ON(iter.level != level)) {
+		rcu_read_unlock();
 		return RET_PF_RETRY;
+	}
 
 	ret = tdp_mmu_map_handle_target_level(vcpu, write, map_writable, &iter,
 					      pfn, prefault);
+	rcu_read_unlock();
 
 	return ret;
 }
@@ -768,6 +783,8 @@ static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
 	int young = 0;
 	u64 new_spte = 0;
 
+	rcu_read_lock();
+
 	tdp_root_for_each_leaf_pte(iter, root, start, end) {
 		/*
 		 * If we have a non-accessed entry we don't need to change the
@@ -799,6 +816,8 @@ static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
 		trace_kvm_age_page(iter.gfn, iter.level, slot, young);
 	}
 
+	rcu_read_unlock();
+
 	return young;
 }
 
@@ -844,6 +863,8 @@ static int set_tdp_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
 	u64 new_spte;
 	int need_flush = 0;
 
+	rcu_read_lock();
+
 	WARN_ON(pte_huge(*ptep));
 
 	new_pfn = pte_pfn(*ptep);
@@ -872,6 +893,8 @@ static int set_tdp_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
 	if (need_flush)
 		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
 
+	rcu_read_unlock();
+
 	return 0;
 }
 
@@ -896,6 +919,8 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	gfn_t last_goal_gfn = start;
 	bool spte_set = false;
 
+	rcu_read_lock();
+
 	BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
 
 	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
@@ -924,6 +949,8 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
 		spte_set = true;
 	}
+
+	rcu_read_unlock();
 	return spte_set;
 }
 
@@ -966,6 +993,8 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	gfn_t last_goal_gfn = start;
 	bool spte_set = false;
 
+	rcu_read_lock();
+
 	tdp_root_for_each_leaf_pte(iter, root, start, end) {
 		/* Ensure forward progress has been made before yielding. */
 		if (iter.goal_gfn != last_goal_gfn &&
@@ -994,6 +1023,8 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
 		spte_set = true;
 	}
+
+	rcu_read_unlock();
 	return spte_set;
 }
 
@@ -1035,6 +1066,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 	struct tdp_iter iter;
 	u64 new_spte;
 
+	rcu_read_lock();
+
 	tdp_root_for_each_leaf_pte(iter, root, gfn + __ffs(mask),
 				    gfn + BITS_PER_LONG) {
 		if (!mask)
@@ -1060,6 +1093,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		mask &= ~(1UL << (iter.gfn - gfn));
 	}
+
+	rcu_read_unlock();
 }
 
 /*
@@ -1100,6 +1135,8 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	gfn_t last_goal_gfn = start;
 	bool spte_set = false;
 
+	rcu_read_lock();
+
 	tdp_root_for_each_pte(iter, root, start, end) {
 		/* Ensure forward progress has been made before yielding. */
 		if (iter.goal_gfn != last_goal_gfn &&
@@ -1125,6 +1162,7 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		spte_set = true;
 	}
 
+	rcu_read_unlock();
 	return spte_set;
 }
 
@@ -1163,6 +1201,8 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 	gfn_t last_goal_gfn = start;
 	bool spte_set = false;
 
+	rcu_read_lock();
+
 	tdp_root_for_each_pte(iter, root, start, end) {
 		/* Ensure forward progress has been made before yielding. */
 		if (iter.goal_gfn != last_goal_gfn &&
@@ -1190,6 +1230,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 		spte_set = true;
 	}
 
+	rcu_read_unlock();
 	if (spte_set)
 		kvm_flush_remote_tlbs(kvm);
 }
@@ -1226,6 +1267,8 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
 	u64 new_spte;
 	bool spte_set = false;
 
+	rcu_read_lock();
+
 	tdp_root_for_each_leaf_pte(iter, root, gfn, gfn + 1) {
 		if (!is_writable_pte(iter.old_spte))
 			break;
@@ -1237,6 +1280,8 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
 		spte_set = true;
 	}
 
+	rcu_read_unlock();
+
 	return spte_set;
 }
 
@@ -1277,10 +1322,14 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 
 	*root_level = vcpu->arch.mmu->shadow_root_level;
 
+	rcu_read_lock();
+
 	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
 		leaf = iter.level;
 		sptes[leaf] = iter.old_spte;
 	}
 
+	rcu_read_unlock();
+
 	return leaf;
 }
-- 
2.30.0.284.gd98b1dd5eaa7-goog
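
In patch 11 each walk is bracketed by rcu_read_lock() and
rcu_read_unlock(), and the two cond_resched helpers leave the read-side
critical section before rescheduling and re-enter it afterwards. The
sketch below is a toy model of that bracketing only, not real RCU: a
depth counter stands in for the read-side section, and every toy_* name
is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

static int toy_rcu_depth;	/* read-side nesting depth */
static int toy_yields;		/* number of reschedules performed */

static void toy_rcu_read_lock(void)
{
	toy_rcu_depth++;
}

static void toy_rcu_read_unlock(void)
{
	assert(toy_rcu_depth > 0);
	toy_rcu_depth--;
}

/* Stand-in for rescheduling: must not run inside a read-side section. */
static void toy_resched(void)
{
	assert(toy_rcu_depth == 0);
	toy_yields++;
}

/* Mirrors the unlock / resched / lock ordering used by the helpers. */
static bool toy_iter_cond_resched(bool need_resched)
{
	if (!need_resched)
		return false;

	toy_rcu_read_unlock();
	toy_resched();
	toy_rcu_read_lock();
	return true;
}

/* Visit nr_entries entries, asking to yield on every odd index. */
static int toy_walk(int nr_entries)
{
	int visited = 0;
	int i;

	toy_rcu_read_lock();
	for (i = 0; i < nr_entries; i++) {
		toy_iter_cond_resched(i % 2 == 1);
		visited++;
	}
	toy_rcu_read_unlock();

	return visited;
}
```

The assertion in toy_resched() would fire if a helper rescheduled without
first leaving the read-side section, which is the ordering the patch
establishes around cond_resched_lock().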


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 12/24] kvm: x86/kvm: RCU dereference tdp mmu page table links
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (10 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 11/24] kvm: x86/mmu: Put TDP MMU PT walks in RCU read-critical section Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-22 18:32   ` Sean Christopherson
  2021-01-12 18:10 ` [PATCH 13/24] kvm: x86/mmu: Only free tdp_mmu pages after a grace period Ben Gardon
                   ` (11 subsequent siblings)
  23 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

In order to protect TDP MMU PT memory with RCU, ensure that page table
links are properly rcu_dereferenced.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_iter.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 87b7e16911db..82855613ffa0 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -49,6 +49,8 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
  */
 u64 *spte_to_child_pt(u64 spte, int level)
 {
+	u64 *child_pt;
+
 	/*
 	 * There's no child entry if this entry isn't present or is a
 	 * last-level entry.
@@ -56,7 +58,9 @@ u64 *spte_to_child_pt(u64 spte, int level)
 	if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
 		return NULL;
 
-	return __va(spte_to_pfn(spte) << PAGE_SHIFT);
+	child_pt = __va(spte_to_pfn(spte) << PAGE_SHIFT);
+
+	return rcu_dereference(child_pt);
 }
 
 /*
-- 
2.30.0.284.gd98b1dd5eaa7-goog


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 13/24] kvm: x86/mmu: Only free tdp_mmu pages after a grace period
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (11 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 12/24] kvm: x86/kvm: RCU dereference tdp mmu page table links Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-12 18:10 ` [PATCH 14/24] kvm: mmu: Wrap mmu_lock lock / unlock in a function Ben Gardon
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

By waiting until an RCU grace period has elapsed to free TDP MMU PT memory,
the system can ensure that no kernel threads access the memory after it
has been freed.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu_internal.h |  3 +++
 arch/x86/kvm/mmu/tdp_mmu.c      | 31 +++++++++++++++++++++++++++++--
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index bfc6389edc28..7f599cc64178 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -57,6 +57,9 @@ struct kvm_mmu_page {
 	atomic_t write_flooding_count;
 
 	bool tdp_mmu_page;
+
+	/* Used for freeing the page asynchronously if it is a TDP MMU page. */
+	struct rcu_head rcu_head;
 };
 
 extern struct kmem_cache *mmu_page_header_cache;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 662907d374b3..dc5b4bf34ca2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -42,6 +42,12 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 		return;
 
 	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
+
+	/*
+	 * Ensure that all the outstanding RCU callbacks to free shadow pages
+	 * can run before the VM is torn down.
+	 */
+	rcu_barrier();
 }
 
 static void tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
@@ -196,6 +202,28 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 	return __pa(root->spt);
 }
 
+static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
+{
+	free_page((unsigned long)sp->spt);
+	kmem_cache_free(mmu_page_header_cache, sp);
+}
+
+/*
+ * This is called through call_rcu in order to free TDP page table memory
+ * safely with respect to other kernel threads that may be operating on
+ * the memory.
+ * By only accessing TDP MMU page table memory in an RCU read critical
+ * section, and freeing it after a grace period, lockless access to that
+ * memory won't use it after it is freed.
+ */
+static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
+{
+	struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
+					       rcu_head);
+
+	tdp_mmu_free_sp(sp);
+}
+
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 				u64 old_spte, u64 new_spte, int level);
 
@@ -273,8 +301,7 @@ static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt)
 	kvm_flush_remote_tlbs_with_address(kvm, gfn,
 					   KVM_PAGES_PER_HPAGE(level));
 
-	free_page((unsigned long)pt);
-	kmem_cache_free(mmu_page_header_cache, sp);
+	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
 
 /**
-- 
2.30.0.284.gd98b1dd5eaa7-goog
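
Patch 13 embeds a struct rcu_head in struct kvm_mmu_page and recovers the
enclosing page in the callback with container_of(). A minimal userspace
sketch of that pattern follows. It is an assumption-laden toy: the
toy_call_rcu and toy_grace_period_elapsed helpers only record one
callback and run it on demand, and do not model real grace periods. All
toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_rcu_head {
	void (*func)(struct toy_rcu_head *head);
};

struct toy_mmu_page {
	void *spt;			/* page table memory */
	struct toy_rcu_head rcu_head;	/* embedded, as in the patch */
};

/* Recover a pointer to the enclosing struct from a member pointer. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static int toy_pages_freed;
static struct toy_rcu_head *toy_pending;	/* single-slot queue */

/* Callback: free the page table memory, then the page header. */
static void toy_free_sp_callback(struct toy_rcu_head *head)
{
	struct toy_mmu_page *sp =
		toy_container_of(head, struct toy_mmu_page, rcu_head);

	assert(sp != NULL);
	free(sp->spt);
	free(sp);
	toy_pages_freed++;
}

/* Toy stand-in for call_rcu(): defer the callback instead of running it. */
static void toy_call_rcu(struct toy_rcu_head *head,
			 void (*func)(struct toy_rcu_head *head))
{
	head->func = func;
	toy_pending = head;
}

/* Toy stand-in for the end of a grace period: run the deferred callback. */
static void toy_grace_period_elapsed(void)
{
	struct toy_rcu_head *head = toy_pending;

	if (!head)
		return;

	toy_pending = NULL;
	head->func(head);
}
```

The point of the sketch is the ordering: the page is unlinked and queued
first, and its memory is released only when the deferred callback runs.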


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 14/24] kvm: mmu: Wrap mmu_lock lock / unlock in a function
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (12 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 13/24] kvm: x86/mmu: Only free tdp_mmu pages after a grace period Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-12 18:10 ` [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak Ben Gardon
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Wrap locking and unlocking the mmu_lock in functions. This will
facilitate future logging and stat collection for the lock. More
immediately, it supports a refactoring to move the lock into the struct
kvm_arch(s) so that x86 can change the spinlock to a rwlock without
affecting the performance of other archs.

No functional change intended.

Signed-off-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/arm64/kvm/mmu.c                   | 36 ++++++-------
 arch/mips/kvm/mips.c                   |  8 +--
 arch/mips/kvm/mmu.c                    | 14 ++---
 arch/powerpc/kvm/book3s_64_mmu_host.c  |  4 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c    | 12 ++---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 22 ++++----
 arch/powerpc/kvm/book3s_hv.c           |  8 +--
 arch/powerpc/kvm/book3s_hv_nested.c    | 52 +++++++++---------
 arch/powerpc/kvm/book3s_mmu_hpte.c     | 10 ++--
 arch/powerpc/kvm/e500_mmu_host.c       |  4 +-
 arch/x86/kvm/mmu/mmu.c                 | 74 +++++++++++++-------------
 arch/x86/kvm/mmu/page_track.c          |  8 +--
 arch/x86/kvm/mmu/paging_tmpl.h         |  8 +--
 arch/x86/kvm/mmu/tdp_mmu.c             |  6 +--
 arch/x86/kvm/x86.c                     |  4 +-
 drivers/gpu/drm/i915/gvt/kvmgt.c       | 12 ++---
 include/linux/kvm_host.h               |  3 ++
 virt/kvm/dirty_ring.c                  |  4 +-
 virt/kvm/kvm_main.c                    | 42 +++++++++------
 19 files changed, 172 insertions(+), 159 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7d2257cc5438..402b1642c944 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -164,13 +164,13 @@ static void stage2_flush_vm(struct kvm *kvm)
 	int idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 
 	slots = kvm_memslots(kvm);
 	kvm_for_each_memslot(memslot, slots)
 		stage2_flush_memslot(kvm, memslot);
 
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
@@ -456,13 +456,13 @@ void stage2_unmap_vm(struct kvm *kvm)
 
 	idx = srcu_read_lock(&kvm->srcu);
 	mmap_read_lock(current->mm);
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 
 	slots = kvm_memslots(kvm);
 	kvm_for_each_memslot(memslot, slots)
 		stage2_unmap_memslot(kvm, memslot);
 
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	mmap_read_unlock(current->mm);
 	srcu_read_unlock(&kvm->srcu, idx);
 }
@@ -472,14 +472,14 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
 	struct kvm *kvm = mmu->kvm;
 	struct kvm_pgtable *pgt = NULL;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	pgt = mmu->pgt;
 	if (pgt) {
 		mmu->pgd_phys = 0;
 		mmu->pgt = NULL;
 		free_percpu(mmu->last_vcpu_ran);
 	}
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	if (pgt) {
 		kvm_pgtable_stage2_destroy(pgt);
@@ -516,10 +516,10 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 		if (ret)
 			break;
 
-		spin_lock(&kvm->mmu_lock);
+		kvm_mmu_lock(kvm);
 		ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
 					     &cache);
-		spin_unlock(&kvm->mmu_lock);
+		kvm_mmu_unlock(kvm);
 		if (ret)
 			break;
 
@@ -567,9 +567,9 @@ void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
 	start = memslot->base_gfn << PAGE_SHIFT;
 	end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	stage2_wp_range(&kvm->arch.mmu, start, end);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	kvm_flush_remote_tlbs(kvm);
 }
 
@@ -867,7 +867,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (exec_fault && device)
 		return -ENOEXEC;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	pgt = vcpu->arch.hw_mmu->pgt;
 	if (mmu_notifier_retry(kvm, mmu_seq))
 		goto out_unlock;
@@ -912,7 +912,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	}
 
 out_unlock:
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	kvm_set_pfn_accessed(pfn);
 	kvm_release_pfn_clean(pfn);
 	return ret;
@@ -927,10 +927,10 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 
 	trace_kvm_access_fault(fault_ipa);
 
-	spin_lock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_lock(vcpu->kvm);
 	mmu = vcpu->arch.hw_mmu;
 	kpte = kvm_pgtable_stage2_mkyoung(mmu->pgt, fault_ipa);
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_unlock(vcpu->kvm);
 
 	pte = __pte(kpte);
 	if (pte_valid(pte))
@@ -1365,12 +1365,12 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 	if (change == KVM_MR_FLAGS_ONLY)
 		goto out;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	if (ret)
 		unmap_stage2_range(&kvm->arch.mmu, mem->guest_phys_addr, mem->memory_size);
 	else if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
 		stage2_flush_memslot(kvm, memslot);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 out:
 	mmap_read_unlock(current->mm);
 	return ret;
@@ -1395,9 +1395,9 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 	gpa_t gpa = slot->base_gfn << PAGE_SHIFT;
 	phys_addr_t size = slot->npages << PAGE_SHIFT;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	unmap_stage2_range(&kvm->arch.mmu, gpa, size);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 }
 
 /*
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 3d6a7f5827b1..4e393d93c1aa 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -217,13 +217,13 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 	 * need to ensure that it can no longer be accessed by any guest VCPUs.
 	 */
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	/* Flush slot from GPA */
 	kvm_mips_flush_gpa_pt(kvm, slot->base_gfn,
 			      slot->base_gfn + slot->npages - 1);
 	/* Let implementation do the rest */
 	kvm_mips_callbacks->flush_shadow_memslot(kvm, slot);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 }
 
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
@@ -258,14 +258,14 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 	if (change == KVM_MR_FLAGS_ONLY &&
 	    (!(old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
 	     new->flags & KVM_MEM_LOG_DIRTY_PAGES)) {
-		spin_lock(&kvm->mmu_lock);
+		kvm_mmu_lock(kvm);
 		/* Write protect GPA page table entries */
 		needs_flush = kvm_mips_mkclean_gpa_pt(kvm, new->base_gfn,
 					new->base_gfn + new->npages - 1);
 		/* Let implementation do the rest */
 		if (needs_flush)
 			kvm_mips_callbacks->flush_shadow_memslot(kvm, new);
-		spin_unlock(&kvm->mmu_lock);
+		kvm_mmu_unlock(kvm);
 	}
 }
 
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index 3dabeda82458..449663152b3c 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -593,7 +593,7 @@ static int _kvm_mips_map_page_fast(struct kvm_vcpu *vcpu, unsigned long gpa,
 	bool pfn_valid = false;
 	int ret = 0;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 
 	/* Fast path - just check GPA page table for an existing entry */
 	ptep = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
@@ -628,7 +628,7 @@ static int _kvm_mips_map_page_fast(struct kvm_vcpu *vcpu, unsigned long gpa,
 		*out_buddy = *ptep_buddy(ptep);
 
 out:
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	if (pfn_valid)
 		kvm_set_pfn_accessed(pfn);
 	return ret;
@@ -710,7 +710,7 @@ static int kvm_mips_map_page(struct kvm_vcpu *vcpu, unsigned long gpa,
 		goto out;
 	}
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	/* Check if an invalidation has taken place since we got pfn */
 	if (mmu_notifier_retry(kvm, mmu_seq)) {
 		/*
@@ -718,7 +718,7 @@ static int kvm_mips_map_page(struct kvm_vcpu *vcpu, unsigned long gpa,
 		 * also synchronously if a COW is triggered by
 		 * gfn_to_pfn_prot().
 		 */
-		spin_unlock(&kvm->mmu_lock);
+		kvm_mmu_unlock(kvm);
 		kvm_release_pfn_clean(pfn);
 		goto retry;
 	}
@@ -748,7 +748,7 @@ static int kvm_mips_map_page(struct kvm_vcpu *vcpu, unsigned long gpa,
 	if (out_buddy)
 		*out_buddy = *ptep_buddy(ptep);
 
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	kvm_release_pfn_clean(pfn);
 	kvm_set_pfn_accessed(pfn);
 out:
@@ -1041,12 +1041,12 @@ int kvm_mips_handle_mapped_seg_tlb_fault(struct kvm_vcpu *vcpu,
 	/* And its GVA buddy's GPA page table entry if it also exists */
 	pte_gpa[!idx] = pfn_pte(0, __pgprot(0));
 	if (tlb_lo[!idx] & ENTRYLO_V) {
-		spin_lock(&kvm->mmu_lock);
+		kvm_mmu_lock(kvm);
 		ptep_buddy = kvm_mips_pte_for_gpa(kvm, NULL,
 					mips3_tlbpfn_to_paddr(tlb_lo[!idx]));
 		if (ptep_buddy)
 			pte_gpa[!idx] = *ptep_buddy;
-		spin_unlock(&kvm->mmu_lock);
+		kvm_mmu_unlock(kvm);
 	}
 
 	/* Get the GVA page table entry pair */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c
index e452158a18d7..4039a90c250c 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
@@ -148,7 +148,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
 
 	cpte = kvmppc_mmu_hpte_cache_next(vcpu);
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	if (!cpte || mmu_notifier_retry(kvm, mmu_seq)) {
 		r = -EAGAIN;
 		goto out_unlock;
@@ -200,7 +200,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
 	}
 
 out_unlock:
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	kvm_release_pfn_clean(pfn);
 	if (cpte)
 		kvmppc_mmu_hpte_cache_free(cpte);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 38ea396a23d6..b1300a18efa7 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -605,12 +605,12 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
 	 * Read the PTE from the process' radix tree and use that
 	 * so we get the shift and attribute bits.
 	 */
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	ptep = find_kvm_host_pte(kvm, mmu_seq, hva, &shift);
 	pte = __pte(0);
 	if (ptep)
 		pte = READ_ONCE(*ptep);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	/*
 	 * If the PTE disappeared temporarily due to a THP
 	 * collapse, just return and let the guest try again.
@@ -739,14 +739,14 @@ void kvmppc_rmap_reset(struct kvm *kvm)
 	slots = kvm_memslots(kvm);
 	kvm_for_each_memslot(memslot, slots) {
 		/* Mutual exclusion with kvm_unmap_hva_range etc. */
-		spin_lock(&kvm->mmu_lock);
+		kvm_mmu_lock(kvm);
 		/*
 		 * This assumes it is acceptable to lose reference and
 		 * change bits across a reset.
 		 */
 		memset(memslot->arch.rmap, 0,
 		       memslot->npages * sizeof(*memslot->arch.rmap));
-		spin_unlock(&kvm->mmu_lock);
+		kvm_mmu_unlock(kvm);
 	}
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
 }
@@ -1405,14 +1405,14 @@ static void resize_hpt_pivot(struct kvm_resize_hpt *resize)
 
 	resize_hpt_debug(resize, "resize_hpt_pivot()\n");
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	asm volatile("ptesync" : : : "memory");
 
 	hpt_tmp = kvm->arch.hpt;
 	kvmppc_set_hpt(kvm, &resize->hpt);
 	resize->hpt = hpt_tmp;
 
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	synchronize_srcu_expedited(&kvm->srcu);
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index bb35490400e9..b628980c871b 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -613,7 +613,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 		new_ptep = kvmppc_pte_alloc();
 
 	/* Check if we might have been invalidated; let the guest retry if so */
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	ret = -EAGAIN;
 	if (mmu_notifier_retry(kvm, mmu_seq))
 		goto out_unlock;
@@ -749,7 +749,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 	ret = 0;
 
  out_unlock:
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	if (new_pud)
 		pud_free(kvm->mm, new_pud);
 	if (new_pmd)
@@ -837,12 +837,12 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 	 * Read the PTE from the process' radix tree and use that
 	 * so we get the shift and attribute bits.
 	 */
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	ptep = find_kvm_host_pte(kvm, mmu_seq, hva, &shift);
 	pte = __pte(0);
 	if (ptep)
 		pte = READ_ONCE(*ptep);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	/*
 	 * If the PTE disappeared temporarily due to a THP
 	 * collapse, just return and let the guest try again.
@@ -972,11 +972,11 @@ int kvmppc_book3s_radix_page_fault(struct kvm_vcpu *vcpu,
 
 	/* Failed to set the reference/change bits */
 	if (dsisr & DSISR_SET_RC) {
-		spin_lock(&kvm->mmu_lock);
+		kvm_mmu_lock(kvm);
 		if (kvmppc_hv_handle_set_rc(kvm, false, writing,
 					    gpa, kvm->arch.lpid))
 			dsisr &= ~DSISR_SET_RC;
-		spin_unlock(&kvm->mmu_lock);
+		kvm_mmu_unlock(kvm);
 
 		if (!(dsisr & (DSISR_BAD_FAULT_64S | DSISR_NOHPTE |
 			       DSISR_PROTFAULT | DSISR_SET_RC)))
@@ -1082,7 +1082,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 
 	pte = READ_ONCE(*ptep);
 	if (pte_present(pte) && pte_dirty(pte)) {
-		spin_lock(&kvm->mmu_lock);
+		kvm_mmu_lock(kvm);
 		/*
 		 * Recheck the pte again
 		 */
@@ -1094,7 +1094,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 			 * walk.
 			 */
 			if (!pte_present(*ptep) || !pte_dirty(*ptep)) {
-				spin_unlock(&kvm->mmu_lock);
+				kvm_mmu_unlock(kvm);
 				return 0;
 			}
 		}
@@ -1109,7 +1109,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 		kvmhv_update_nest_rmap_rc_list(kvm, rmapp, _PAGE_DIRTY, 0,
 					       old & PTE_RPN_MASK,
 					       1UL << shift);
-		spin_unlock(&kvm->mmu_lock);
+		kvm_mmu_unlock(kvm);
 	}
 	return ret;
 }
@@ -1154,7 +1154,7 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
 		return;
 
 	gpa = memslot->base_gfn << PAGE_SHIFT;
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	for (n = memslot->npages; n; --n) {
 		ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
 		if (ptep && pte_present(*ptep))
@@ -1167,7 +1167,7 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
 	 * fault that read the memslot earlier from writing a PTE.
 	 */
 	kvm->mmu_notifier_seq++;
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 }
 
 static void add_rmmu_ap_encoding(struct kvm_ppc_rmmu_info *info,
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6f612d240392..ec08abd532f1 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4753,9 +4753,9 @@ int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
 	kvmppc_rmap_reset(kvm);
 	kvm->arch.process_table = 0;
 	/* Mutual exclusion with kvm_unmap_hva_range etc. */
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	kvm->arch.radix = 0;
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	kvmppc_free_radix(kvm);
 	kvmppc_update_lpcr(kvm, LPCR_VPM1,
 			   LPCR_VPM1 | LPCR_UPRT | LPCR_GTSE | LPCR_HR);
@@ -4775,9 +4775,9 @@ int kvmppc_switch_mmu_to_radix(struct kvm *kvm)
 		return err;
 	kvmppc_rmap_reset(kvm);
 	/* Mutual exclusion with kvm_unmap_hva_range etc. */
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	kvm->arch.radix = 1;
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	kvmppc_free_hpt(&kvm->arch.hpt);
 	kvmppc_update_lpcr(kvm, LPCR_UPRT | LPCR_GTSE | LPCR_HR,
 			   LPCR_VPM1 | LPCR_UPRT | LPCR_GTSE | LPCR_HR);
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 33b58549a9aa..18890dca9476 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -628,7 +628,7 @@ static void kvmhv_remove_nested(struct kvm_nested_guest *gp)
 	int lpid = gp->l1_lpid;
 	long ref;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	if (gp == kvm->arch.nested_guests[lpid]) {
 		kvm->arch.nested_guests[lpid] = NULL;
 		if (lpid == kvm->arch.max_nested_lpid) {
@@ -639,7 +639,7 @@ static void kvmhv_remove_nested(struct kvm_nested_guest *gp)
 		--gp->refcnt;
 	}
 	ref = gp->refcnt;
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	if (ref == 0)
 		kvmhv_release_nested(gp);
 }
@@ -658,7 +658,7 @@ void kvmhv_release_all_nested(struct kvm *kvm)
 	struct kvm_memory_slot *memslot;
 	int srcu_idx;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	for (i = 0; i <= kvm->arch.max_nested_lpid; i++) {
 		gp = kvm->arch.nested_guests[i];
 		if (!gp)
@@ -670,7 +670,7 @@ void kvmhv_release_all_nested(struct kvm *kvm)
 		}
 	}
 	kvm->arch.max_nested_lpid = -1;
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	while ((gp = freelist) != NULL) {
 		freelist = gp->next;
 		kvmhv_release_nested(gp);
@@ -687,9 +687,9 @@ static void kvmhv_flush_nested(struct kvm_nested_guest *gp)
 {
 	struct kvm *kvm = gp->l1_host;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	kvmppc_free_pgtable_radix(kvm, gp->shadow_pgtable, gp->shadow_lpid);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	kvmhv_flush_lpid(gp->shadow_lpid);
 	kvmhv_update_ptbl_cache(gp);
 	if (gp->l1_gr_to_hr == 0)
@@ -705,11 +705,11 @@ struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, int l1_lpid,
 	    l1_lpid >= (1ul << ((kvm->arch.l1_ptcr & PRTS_MASK) + 12 - 4)))
 		return NULL;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	gp = kvm->arch.nested_guests[l1_lpid];
 	if (gp)
 		++gp->refcnt;
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	if (gp || !create)
 		return gp;
@@ -717,7 +717,7 @@ struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, int l1_lpid,
 	newgp = kvmhv_alloc_nested(kvm, l1_lpid);
 	if (!newgp)
 		return NULL;
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	if (kvm->arch.nested_guests[l1_lpid]) {
 		/* someone else beat us to it */
 		gp = kvm->arch.nested_guests[l1_lpid];
@@ -730,7 +730,7 @@ struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, int l1_lpid,
 			kvm->arch.max_nested_lpid = l1_lpid;
 	}
 	++gp->refcnt;
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	if (newgp)
 		kvmhv_release_nested(newgp);
@@ -743,9 +743,9 @@ void kvmhv_put_nested(struct kvm_nested_guest *gp)
 	struct kvm *kvm = gp->l1_host;
 	long ref;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	ref = --gp->refcnt;
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	if (ref == 0)
 		kvmhv_release_nested(gp);
 }
@@ -940,7 +940,7 @@ static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu,
 	pte_t *ptep;
 	int shift;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	ptep = find_kvm_nested_guest_pte(kvm, gp->l1_lpid, gpa, &shift);
 	if (!shift)
 		shift = PAGE_SHIFT;
@@ -948,7 +948,7 @@ static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu,
 		kvmppc_unmap_pte(kvm, ptep, gpa, shift, NULL, gp->shadow_lpid);
 		ret = true;
 	}
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	if (shift_ret)
 		*shift_ret = shift;
@@ -1035,11 +1035,11 @@ static void kvmhv_emulate_tlbie_lpid(struct kvm_vcpu *vcpu,
 	switch (ric) {
 	case 0:
 		/* Invalidate TLB */
-		spin_lock(&kvm->mmu_lock);
+		kvm_mmu_lock(kvm);
 		kvmppc_free_pgtable_radix(kvm, gp->shadow_pgtable,
 					  gp->shadow_lpid);
 		kvmhv_flush_lpid(gp->shadow_lpid);
-		spin_unlock(&kvm->mmu_lock);
+		kvm_mmu_unlock(kvm);
 		break;
 	case 1:
 		/*
@@ -1063,16 +1063,16 @@ static void kvmhv_emulate_tlbie_all_lpid(struct kvm_vcpu *vcpu, int ric)
 	struct kvm_nested_guest *gp;
 	int i;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	for (i = 0; i <= kvm->arch.max_nested_lpid; i++) {
 		gp = kvm->arch.nested_guests[i];
 		if (gp) {
-			spin_unlock(&kvm->mmu_lock);
+			kvm_mmu_unlock(kvm);
 			kvmhv_emulate_tlbie_lpid(vcpu, gp, ric);
-			spin_lock(&kvm->mmu_lock);
+			kvm_mmu_lock(kvm);
 		}
 	}
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 }
 
 static int kvmhv_emulate_priv_tlbie(struct kvm_vcpu *vcpu, unsigned int instr,
@@ -1230,7 +1230,7 @@ static long kvmhv_handle_nested_set_rc(struct kvm_vcpu *vcpu,
 	if (pgflags & ~gpte.rc)
 		return RESUME_HOST;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	/* Set the rc bit in the pte of our (L0) pgtable for the L1 guest */
 	ret = kvmppc_hv_handle_set_rc(kvm, false, writing,
 				      gpte.raddr, kvm->arch.lpid);
@@ -1248,7 +1248,7 @@ static long kvmhv_handle_nested_set_rc(struct kvm_vcpu *vcpu,
 		ret = 0;
 
 out_unlock:
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	return ret;
 }
 
@@ -1380,13 +1380,13 @@ static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu,
 
 	/* See if can find translation in our partition scoped tables for L1 */
 	pte = __pte(0);
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	pte_p = find_kvm_secondary_pte(kvm, gpa, &shift);
 	if (!shift)
 		shift = PAGE_SHIFT;
 	if (pte_p)
 		pte = *pte_p;
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	if (!pte_present(pte) || (writing && !(pte_val(pte) & _PAGE_WRITE))) {
 		/* No suitable pte found -> try to insert a mapping */
@@ -1461,13 +1461,13 @@ int kvmhv_nested_next_lpid(struct kvm *kvm, int lpid)
 {
 	int ret = -1;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	while (++lpid <= kvm->arch.max_nested_lpid) {
 		if (kvm->arch.nested_guests[lpid]) {
 			ret = lpid;
 			break;
 		}
 	}
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	return ret;
 }
diff --git a/arch/powerpc/kvm/book3s_mmu_hpte.c b/arch/powerpc/kvm/book3s_mmu_hpte.c
index ce79ac33e8d3..ec1b5a6dfee1 100644
--- a/arch/powerpc/kvm/book3s_mmu_hpte.c
+++ b/arch/powerpc/kvm/book3s_mmu_hpte.c
@@ -60,7 +60,7 @@ void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, struct hpte_cache *pte)
 
 	trace_kvm_book3s_mmu_map(pte);
 
-	spin_lock(&vcpu3s->mmu_lock);
+	spin_lock(&vcpu3s->mmu_lock);
 
 	/* Add to ePTE list */
 	index = kvmppc_mmu_hash_pte(pte->pte.eaddr);
@@ -89,7 +89,7 @@ void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, struct hpte_cache *pte)
 
 	vcpu3s->hpte_cache_count++;
 
-	spin_unlock(&vcpu3s->mmu_lock);
+	spin_unlock(&vcpu3s->mmu_lock);
 }
 
 static void free_pte_rcu(struct rcu_head *head)
@@ -107,11 +107,11 @@ static void invalidate_pte(struct kvm_vcpu *vcpu, struct hpte_cache *pte)
 	/* Different for 32 and 64 bit */
 	kvmppc_mmu_invalidate_pte(vcpu, pte);
 
-	spin_lock(&vcpu3s->mmu_lock);
+	spin_lock(&vcpu3s->mmu_lock);
 
 	/* pte already invalidated in between? */
 	if (hlist_unhashed(&pte->list_pte)) {
-		spin_unlock(&vcpu3s->mmu_lock);
+		spin_unlock(&vcpu3s->mmu_lock);
 		return;
 	}
 
@@ -124,7 +124,7 @@ static void invalidate_pte(struct kvm_vcpu *vcpu, struct hpte_cache *pte)
 #endif
 	vcpu3s->hpte_cache_count--;
 
-	spin_unlock(&vcpu3s->mmu_lock);
+	spin_unlock(&vcpu3s->mmu_lock);
 
 	call_rcu(&pte->rcu_head, free_pte_rcu);
 }
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index ed0c9c43d0cf..633ae418ba0e 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -459,7 +459,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 		gvaddr &= ~((tsize_pages << PAGE_SHIFT) - 1);
 	}
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	if (mmu_notifier_retry(kvm, mmu_seq)) {
 		ret = -EAGAIN;
 		goto out;
@@ -499,7 +499,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 	kvmppc_mmu_flush_icache(pfn);
 
 out:
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	/* Drop refcount on page, so that mmu notifiers can clear it */
 	kvm_release_pfn_clean(pfn);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d16481aa29d..5a4577830606 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2470,7 +2470,7 @@ static int make_mmu_pages_available(struct kvm_vcpu *vcpu)
  */
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages)
 {
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 
 	if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) {
 		kvm_mmu_zap_oldest_mmu_pages(kvm, kvm->arch.n_used_mmu_pages -
@@ -2481,7 +2481,7 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages)
 
 	kvm->arch.n_max_mmu_pages = goal_nr_mmu_pages;
 
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 }
 
 int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
@@ -2492,7 +2492,7 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
 
 	pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
 	r = 0;
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
 		pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
 			 sp->role.word);
@@ -2500,7 +2500,7 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
 		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 	}
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	return r;
 }
@@ -3192,7 +3192,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			return;
 	}
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 		if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i))
@@ -3215,7 +3215,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	}
 
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_free_roots);
 
@@ -3236,16 +3236,16 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 {
 	struct kvm_mmu_page *sp;
 
-	spin_lock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_lock(vcpu->kvm);
 
 	if (make_mmu_pages_available(vcpu)) {
-		spin_unlock(&vcpu->kvm->mmu_lock);
+		kvm_mmu_unlock(vcpu->kvm);
 		return INVALID_PAGE;
 	}
 	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
 	++sp->root_count;
 
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_unlock(vcpu->kvm);
 	return __pa(sp->spt);
 }
 
@@ -3416,17 +3416,17 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 		    !smp_load_acquire(&sp->unsync_children))
 			return;
 
-		spin_lock(&vcpu->kvm->mmu_lock);
+		kvm_mmu_lock(vcpu->kvm);
 		kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
 
 		mmu_sync_children(vcpu, sp);
 
 		kvm_mmu_audit(vcpu, AUDIT_POST_SYNC);
-		spin_unlock(&vcpu->kvm->mmu_lock);
+		kvm_mmu_unlock(vcpu->kvm);
 		return;
 	}
 
-	spin_lock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_lock(vcpu->kvm);
 	kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
 
 	for (i = 0; i < 4; ++i) {
@@ -3440,7 +3440,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 	}
 
 	kvm_mmu_audit(vcpu, AUDIT_POST_SYNC);
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_unlock(vcpu->kvm);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_sync_roots);
 
@@ -3724,7 +3724,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		return r;
 
 	r = RET_PF_RETRY;
-	spin_lock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_lock(vcpu->kvm);
 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
 		goto out_unlock;
 	r = make_mmu_pages_available(vcpu);
@@ -3739,7 +3739,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 				 prefault, is_tdp);
 
 out_unlock:
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_unlock(vcpu->kvm);
 	kvm_release_pfn_clean(pfn);
 	return r;
 }
@@ -4999,7 +4999,7 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	 */
 	mmu_topup_memory_caches(vcpu, true);
 
-	spin_lock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_lock(vcpu->kvm);
 
 	gentry = mmu_pte_write_fetch_gpte(vcpu, &gpa, &bytes);
 
@@ -5035,7 +5035,7 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	}
 	kvm_mmu_flush_or_zap(vcpu, &invalid_list, remote_flush, local_flush);
 	kvm_mmu_audit(vcpu, AUDIT_POST_PTE_WRITE);
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_unlock(vcpu->kvm);
 }
 
 int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
@@ -5423,7 +5423,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 {
 	lockdep_assert_held(&kvm->slots_lock);
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	trace_kvm_mmu_zap_all_fast(kvm);
 
 	/*
@@ -5450,7 +5450,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 	if (kvm->arch.tdp_mmu_enabled)
 		kvm_tdp_mmu_zap_all(kvm);
 
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 }
 
 static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
@@ -5492,7 +5492,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 	int i;
 	bool flush;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 		slots = __kvm_memslots(kvm, i);
 		kvm_for_each_memslot(memslot, slots) {
@@ -5516,7 +5516,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 			kvm_flush_remote_tlbs(kvm);
 	}
 
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 }
 
 static bool slot_rmap_write_protect(struct kvm *kvm,
@@ -5531,12 +5531,12 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 {
 	bool flush;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
 				start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
 	if (kvm->arch.tdp_mmu_enabled)
 		flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, PG_LEVEL_4K);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	/*
 	 * We can flush all the TLBs out of the mmu lock without TLB
@@ -5596,13 +5596,13 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot)
 {
 	/* FIXME: const-ify all uses of struct kvm_memory_slot.  */
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
 			 kvm_mmu_zap_collapsible_spte, true);
 
 	if (kvm->arch.tdp_mmu_enabled)
 		kvm_tdp_mmu_zap_collapsible_sptes(kvm, memslot);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 }
 
 void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
@@ -5625,11 +5625,11 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 {
 	bool flush;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false);
 	if (kvm->arch.tdp_mmu_enabled)
 		flush |= kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	/*
 	 * It's also safe to flush TLBs out of mmu lock here as currently this
@@ -5647,12 +5647,12 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
 {
 	bool flush;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	flush = slot_handle_large_level(kvm, memslot, slot_rmap_write_protect,
 					false);
 	if (kvm->arch.tdp_mmu_enabled)
 		flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, PG_LEVEL_2M);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	if (flush)
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
@@ -5664,11 +5664,11 @@ void kvm_mmu_slot_set_dirty(struct kvm *kvm,
 {
 	bool flush;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	flush = slot_handle_all_level(kvm, memslot, __rmap_set_dirty, false);
 	if (kvm->arch.tdp_mmu_enabled)
 		flush |= kvm_tdp_mmu_slot_set_dirty(kvm, memslot);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	if (flush)
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
@@ -5681,7 +5681,7 @@ void kvm_mmu_zap_all(struct kvm *kvm)
 	LIST_HEAD(invalid_list);
 	int ign;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 restart:
 	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
 		if (WARN_ON(sp->role.invalid))
@@ -5697,7 +5697,7 @@ void kvm_mmu_zap_all(struct kvm *kvm)
 	if (kvm->arch.tdp_mmu_enabled)
 		kvm_tdp_mmu_zap_all(kvm);
 
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 }
 
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
@@ -5757,7 +5757,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 			continue;
 
 		idx = srcu_read_lock(&kvm->srcu);
-		spin_lock(&kvm->mmu_lock);
+		kvm_mmu_lock(kvm);
 
 		if (kvm_has_zapped_obsolete_pages(kvm)) {
 			kvm_mmu_commit_zap_page(kvm,
@@ -5768,7 +5768,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 		freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
 
 unlock:
-		spin_unlock(&kvm->mmu_lock);
+		kvm_mmu_unlock(kvm);
 		srcu_read_unlock(&kvm->srcu, idx);
 
 		/*
@@ -5988,7 +5988,7 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 	ulong to_zap;
 
 	rcu_idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 
 	ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
 	to_zap = ratio ? DIV_ROUND_UP(kvm->stat.nx_lpage_splits, ratio) : 0;
@@ -6020,7 +6020,7 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 	}
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
 
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 8443a675715b..7ae4567c58bf 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -184,9 +184,9 @@ kvm_page_track_register_notifier(struct kvm *kvm,
 
 	head = &kvm->arch.track_notifier_head;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	hlist_add_head_rcu(&n->node, &head->track_notifier_list);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 }
 EXPORT_SYMBOL_GPL(kvm_page_track_register_notifier);
 
@@ -202,9 +202,9 @@ kvm_page_track_unregister_notifier(struct kvm *kvm,
 
 	head = &kvm->arch.track_notifier_head;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	hlist_del_rcu(&n->node);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	synchronize_srcu(&head->track_srcu);
 }
 EXPORT_SYMBOL_GPL(kvm_page_track_unregister_notifier);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 50e268eb8e1a..a7a29bf6c683 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -868,7 +868,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	}
 
 	r = RET_PF_RETRY;
-	spin_lock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_lock(vcpu->kvm);
 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
 		goto out_unlock;
 
@@ -881,7 +881,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 
 out_unlock:
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_unlock(vcpu->kvm);
 	kvm_release_pfn_clean(pfn);
 	return r;
 }
@@ -919,7 +919,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
 		return;
 	}
 
-	spin_lock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_lock(vcpu->kvm);
 	for_each_shadow_entry_using_root(vcpu, root_hpa, gva, iterator) {
 		level = iterator.level;
 		sptep = iterator.sptep;
@@ -954,7 +954,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
 		if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
 			break;
 	}
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_unlock(vcpu->kvm);
 }
 
 /* Note, @addr is a GPA when gva_to_gpa() translates an L2 GPA to an L1 GPA. */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index dc5b4bf34ca2..90807f2d928f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -170,13 +170,13 @@ static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
 
 	role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 
 	/* Check for an existing root before allocating a new one. */
 	for_each_tdp_mmu_root(kvm, root) {
 		if (root->role.word == role.word) {
 			kvm_mmu_get_root(kvm, root);
-			spin_unlock(&kvm->mmu_lock);
+			kvm_mmu_unlock(kvm);
 			return root;
 		}
 	}
@@ -186,7 +186,7 @@ static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
 
 	list_add(&root->link, &kvm->arch.tdp_mmu_roots);
 
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	return root;
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a8969a6dd06..302042af87ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7088,9 +7088,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	if (vcpu->arch.mmu->direct_map) {
 		unsigned int indirect_shadow_pages;
 
-		spin_lock(&vcpu->kvm->mmu_lock);
+		kvm_mmu_lock(vcpu->kvm);
 		indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
-		spin_unlock(&vcpu->kvm->mmu_lock);
+		kvm_mmu_unlock(vcpu->kvm);
 
 		if (indirect_shadow_pages)
 			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 60f1a386dd06..069e189961ff 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -1703,7 +1703,7 @@ static int kvmgt_page_track_add(unsigned long handle, u64 gfn)
 		return -EINVAL;
 	}
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 
 	if (kvmgt_gfn_is_write_protected(info, gfn))
 		goto out;
@@ -1712,7 +1712,7 @@ static int kvmgt_page_track_add(unsigned long handle, u64 gfn)
 	kvmgt_protect_table_add(info, gfn);
 
 out:
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 	return 0;
 }
@@ -1737,7 +1737,7 @@ static int kvmgt_page_track_remove(unsigned long handle, u64 gfn)
 		return -EINVAL;
 	}
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 
 	if (!kvmgt_gfn_is_write_protected(info, gfn))
 		goto out;
@@ -1746,7 +1746,7 @@ static int kvmgt_page_track_remove(unsigned long handle, u64 gfn)
 	kvmgt_protect_table_del(info, gfn);
 
 out:
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 	return 0;
 }
@@ -1772,7 +1772,7 @@ static void kvmgt_page_track_flush_slot(struct kvm *kvm,
 	struct kvmgt_guest_info *info = container_of(node,
 					struct kvmgt_guest_info, track_node);
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	for (i = 0; i < slot->npages; i++) {
 		gfn = slot->base_gfn + i;
 		if (kvmgt_gfn_is_write_protected(info, gfn)) {
@@ -1781,7 +1781,7 @@ static void kvmgt_page_track_flush_slot(struct kvm *kvm,
 			kvmgt_protect_table_del(info, gfn);
 		}
 	}
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 }
 
 static bool __kvmgt_vgpu_exist(struct intel_vgpu *vgpu, struct kvm *kvm)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f3b1013fb22c..433d14fdae30 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1495,4 +1495,7 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+void kvm_mmu_lock(struct kvm *kvm);
+void kvm_mmu_unlock(struct kvm *kvm);
+
 #endif
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 9d01299563ee..e1c1538f59a6 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -60,9 +60,9 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
 	if (!memslot || (offset + __fls(mask)) >= memslot->npages)
 		return;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 }
 
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fa9e3614d30e..32f97ed1188d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -432,6 +432,16 @@ void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_destroy);
 
+void kvm_mmu_lock(struct kvm *kvm)
+{
+	spin_lock(&kvm->mmu_lock);
+}
+
+void kvm_mmu_unlock(struct kvm *kvm)
+{
+	spin_unlock(&kvm->mmu_lock);
+}
+
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 {
@@ -459,13 +469,13 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	int idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	kvm->mmu_notifier_seq++;
 
 	if (kvm_set_spte_hva(kvm, address, pte))
 		kvm_flush_remote_tlbs(kvm);
 
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
@@ -476,7 +486,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	int need_tlb_flush = 0, idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	/*
 	 * The count increase must become visible at unlock time as no
 	 * spte can be established without taking the mmu_lock and
@@ -489,7 +499,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	if (need_tlb_flush || kvm->tlbs_dirty)
 		kvm_flush_remote_tlbs(kvm);
 
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return 0;
@@ -500,7 +510,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	/*
 	 * This sequence increase will notify the kvm page fault that
 	 * the page that is going to be mapped in the spte could have
@@ -514,7 +524,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	 * in conjunction with the smp_rmb in mmu_notifier_retry().
 	 */
 	kvm->mmu_notifier_count--;
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	BUG_ON(kvm->mmu_notifier_count < 0);
 }
@@ -528,13 +538,13 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 	int young, idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 
 	young = kvm_age_hva(kvm, start, end);
 	if (young)
 		kvm_flush_remote_tlbs(kvm);
 
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
@@ -549,7 +559,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	int young, idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	/*
 	 * Even though we do not flush TLB, this will still adversely
 	 * affect performance on pre-Haswell Intel EPT, where there is
@@ -564,7 +574,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	 * more sophisticated heuristic later.
 	 */
 	young = kvm_age_hva(kvm, start, end);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
@@ -578,9 +588,9 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 	int young, idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	young = kvm_test_age_hva(kvm, address);
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return young;
@@ -1524,7 +1534,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 		dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
 		memset(dirty_bitmap_buffer, 0, n);
 
-		spin_lock(&kvm->mmu_lock);
+		kvm_mmu_lock(kvm);
 		for (i = 0; i < n / sizeof(long); i++) {
 			unsigned long mask;
 			gfn_t offset;
@@ -1540,7 +1550,7 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
 								offset, mask);
 		}
-		spin_unlock(&kvm->mmu_lock);
+		kvm_mmu_unlock(kvm);
 	}
 
 	if (flush)
@@ -1635,7 +1645,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
 		return -EFAULT;
 
-	spin_lock(&kvm->mmu_lock);
+	kvm_mmu_lock(kvm);
 	for (offset = log->first_page, i = offset / BITS_PER_LONG,
 		 n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
 	     i++, offset += BITS_PER_LONG) {
@@ -1658,7 +1668,7 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 								offset, mask);
 		}
 	}
-	spin_unlock(&kvm->mmu_lock);
+	kvm_mmu_unlock(kvm);
 
 	if (flush)
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
-- 
2.30.0.284.gd98b1dd5eaa7-goog



* [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (13 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 14/24] kvm: mmu: Wrap mmu_lock lock / unlock in a function Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-21  0:19   ` Sean Christopherson
  2021-01-12 18:10 ` [PATCH 16/24] kvm: mmu: Wrap mmu_lock assertions Ben Gardon
                   ` (8 subsequent siblings)
  23 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Wrap the MMU lock cond_resched and needbreak operations in functions.
This will support a refactoring to move the lock into struct kvm_arch so
that x86 can change the spinlock to a rwlock without affecting the
performance of other archs.

No functional change intended.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
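Note for readers: the two wrappers in this patch forward to
spin_needbreak() and cond_resched_lock(). The sketch below is a rough
user-space analogue, not kernel code, of the calling pattern the converted
loops rely on: poll for contention and, if there is any, drop the lock,
yield, retake it, and report that the lock was dropped. struct vm, the
vm_* helpers and the waiters counter are hypothetical names made up for
this illustration. The real cond_resched_lock() also yields when a
reschedule is pending, which the sketch does not model.

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical user-space stand-in for struct kvm; not kernel code. */
struct vm {
	atomic_flag mmu_lock;	/* toy spinlock */
	atomic_int waiters;	/* threads currently spinning on mmu_lock */
};

void vm_init(struct vm *vm)
{
	atomic_flag_clear(&vm->mmu_lock);
	atomic_init(&vm->waiters, 0);
}

void vm_mmu_lock(struct vm *vm)
{
	atomic_fetch_add(&vm->waiters, 1);
	while (atomic_flag_test_and_set_explicit(&vm->mmu_lock,
						 memory_order_acquire))
		sched_yield();
	atomic_fetch_sub(&vm->waiters, 1);
}

void vm_mmu_unlock(struct vm *vm)
{
	atomic_flag_clear_explicit(&vm->mmu_lock, memory_order_release);
}

/* Rough analogue of spin_needbreak(): nonzero if another thread is waiting. */
int vm_mmu_lock_needbreak(struct vm *vm)
{
	return atomic_load(&vm->waiters) > 0;
}

/*
 * Rough analogue of cond_resched_lock(): returns 1 if the lock was
 * dropped and retaken, 0 if it was held throughout.
 */
int vm_mmu_lock_cond_resched(struct vm *vm)
{
	if (!vm_mmu_lock_needbreak(vm))
		return 0;

	vm_mmu_unlock(vm);
	sched_yield();		/* give the waiter a chance to run */
	vm_mmu_lock(vm);
	return 1;
}
```

Callers such as kvm_zap_obsolete_pages() and kvm_mmu_zap_all() use the
nonzero return value to restart their walk, since anything protected by
the lock may have changed while it was dropped.
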
 arch/arm64/kvm/mmu.c       |  2 +-
 arch/x86/kvm/mmu/mmu.c     | 16 ++++++++--------
 arch/x86/kvm/mmu/tdp_mmu.c |  8 ++++----
 include/linux/kvm_host.h   |  2 ++
 virt/kvm/kvm_main.c        | 10 ++++++++++
 5 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 402b1642c944..57ef1ec23b56 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -58,7 +58,7 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
 			break;
 
 		if (resched && next != end)
-			cond_resched_lock(&kvm->mmu_lock);
+			kvm_mmu_lock_cond_resched(kvm);
 	} while (addr = next, addr != end);
 
 	return ret;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5a4577830606..659ed0a2875f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2016,9 +2016,9 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 			flush |= kvm_sync_page(vcpu, sp, &invalid_list);
 			mmu_pages_clear_parents(&parents);
 		}
-		if (need_resched() || spin_needbreak(&vcpu->kvm->mmu_lock)) {
+		if (need_resched() || kvm_mmu_lock_needbreak(vcpu->kvm)) {
 			kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
-			cond_resched_lock(&vcpu->kvm->mmu_lock);
+			kvm_mmu_lock_cond_resched(vcpu->kvm);
 			flush = false;
 		}
 	}
@@ -5233,14 +5233,14 @@ slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		if (iterator.rmap)
 			flush |= fn(kvm, iterator.rmap);
 
-		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		if (need_resched() || kvm_mmu_lock_needbreak(kvm)) {
 			if (flush && lock_flush_tlb) {
 				kvm_flush_remote_tlbs_with_address(kvm,
 						start_gfn,
 						iterator.gfn - start_gfn + 1);
 				flush = false;
 			}
-			cond_resched_lock(&kvm->mmu_lock);
+			kvm_mmu_lock_cond_resched(kvm);
 		}
 	}
 
@@ -5390,7 +5390,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
 		 * be in active use by the guest.
 		 */
 		if (batch >= BATCH_ZAP_PAGES &&
-		    cond_resched_lock(&kvm->mmu_lock)) {
+		    kvm_mmu_lock_cond_resched(kvm)) {
 			batch = 0;
 			goto restart;
 		}
@@ -5688,7 +5688,7 @@ void kvm_mmu_zap_all(struct kvm *kvm)
 			continue;
 		if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
 			goto restart;
-		if (cond_resched_lock(&kvm->mmu_lock))
+		if (kvm_mmu_lock_cond_resched(kvm))
 			goto restart;
 	}
 
@@ -6013,9 +6013,9 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 			WARN_ON_ONCE(sp->lpage_disallowed);
 		}
 
-		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		if (need_resched() || kvm_mmu_lock_needbreak(kvm)) {
 			kvm_mmu_commit_zap_page(kvm, &invalid_list);
-			cond_resched_lock(&kvm->mmu_lock);
+			kvm_mmu_lock_cond_resched(kvm);
 		}
 	}
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 90807f2d928f..fb911ca428b2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -488,10 +488,10 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm,
 		struct tdp_iter *iter)
 {
-	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+	if (need_resched() || kvm_mmu_lock_needbreak(kvm)) {
 		kvm_flush_remote_tlbs(kvm);
 		rcu_read_unlock();
-		cond_resched_lock(&kvm->mmu_lock);
+		kvm_mmu_lock_cond_resched(kvm);
 		rcu_read_lock();
 		tdp_iter_refresh_walk(iter);
 		return true;
@@ -512,9 +512,9 @@ static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm,
  */
 static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
 {
-	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+	if (need_resched() || kvm_mmu_lock_needbreak(kvm)) {
 		rcu_read_unlock();
-		cond_resched_lock(&kvm->mmu_lock);
+		kvm_mmu_lock_cond_resched(kvm);
 		rcu_read_lock();
 		tdp_iter_refresh_walk(iter);
 		return true;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 433d14fdae30..6e2773fc406c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1497,5 +1497,7 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 
 void kvm_mmu_lock(struct kvm *kvm);
 void kvm_mmu_unlock(struct kvm *kvm);
+int kvm_mmu_lock_needbreak(struct kvm *kvm);
+int kvm_mmu_lock_cond_resched(struct kvm *kvm);
 
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 32f97ed1188d..b4c49a7e0556 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -442,6 +442,16 @@ void kvm_mmu_unlock(struct kvm *kvm)
 	spin_unlock(&kvm->mmu_lock);
 }
 
+int kvm_mmu_lock_needbreak(struct kvm *kvm)
+{
+	return spin_needbreak(&kvm->mmu_lock);
+}
+
+int kvm_mmu_lock_cond_resched(struct kvm *kvm)
+{
+	return cond_resched_lock(&kvm->mmu_lock);
+}
+
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 {
-- 
2.30.0.284.gd98b1dd5eaa7-goog



* [PATCH 16/24] kvm: mmu: Wrap mmu_lock assertions
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (14 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-26 14:29   ` Paolo Bonzini
  2021-01-12 18:10 ` [PATCH 17/24] kvm: mmu: Move mmu_lock to struct kvm_arch Ben Gardon
                   ` (7 subsequent siblings)
  23 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Wrap assertions and warnings checking the MMU lock state in a function
which uses lockdep_assert_held. While the existing checks use a few
different functions to check the lock state, they are all better off
using lockdep_assert_held. This will support a refactoring to move the
mmu_lock to struct kvm_arch so that it can be replaced with an rwlock for
x86.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
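Note for readers: the checks replaced in this patch differ in strength.
spin_is_locked() only reports that some context holds the lock, while
lockdep_assert_held() checks, when lockdep is enabled, that the current
context holds it. The sketch below is a user-space illustration of that
difference, not kernel code. struct vm, the vm_* helpers, NO_OWNER and the
caller-supplied integer ids are all hypothetical, and the owner field is a
toy substitute for lockdep's per-task tracking.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

#define NO_OWNER (-1)

/* Hypothetical user-space stand-in for struct kvm; not kernel code. */
struct vm {
	atomic_flag mmu_lock;	/* toy spinlock */
	atomic_int owner;	/* caller-chosen id of the holder, or NO_OWNER */
};

void vm_init(struct vm *vm)
{
	atomic_flag_clear(&vm->mmu_lock);
	atomic_init(&vm->owner, NO_OWNER);
}

void vm_mmu_lock(struct vm *vm, int self)
{
	while (atomic_flag_test_and_set_explicit(&vm->mmu_lock,
						 memory_order_acquire))
		sched_yield();
	atomic_store(&vm->owner, self);
}

void vm_mmu_unlock(struct vm *vm)
{
	atomic_store(&vm->owner, NO_OWNER);
	atomic_flag_clear_explicit(&vm->mmu_lock, memory_order_release);
}

/* Like spin_is_locked(): true if anyone holds the lock. */
int vm_mmu_lock_is_locked(struct vm *vm)
{
	return atomic_load(&vm->owner) != NO_OWNER;
}

/* Closer to what lockdep checks: true only if 'self' is the holder. */
int vm_mmu_lock_held_by(struct vm *vm, int self)
{
	return atomic_load(&vm->owner) == self;
}

/* Single assertion entry point, in the spirit of kvm_mmu_lock_assert_held(). */
void vm_mmu_lock_assert_held(struct vm *vm, int self)
{
	assert(vm_mmu_lock_held_by(vm, self));
}
```

With the lock taken by id 1, vm_mmu_lock_is_locked() is true for every
caller, but vm_mmu_lock_held_by() is true only for id 1. A check of the
first kind cannot catch a caller that forgot to take the lock while
another context happens to hold it.
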
 arch/arm64/kvm/mmu.c                     | 2 +-
 arch/powerpc/include/asm/kvm_book3s_64.h | 7 +++----
 arch/powerpc/kvm/book3s_hv_nested.c      | 3 +--
 arch/x86/kvm/mmu/mmu_internal.h          | 4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c               | 8 ++++----
 include/linux/kvm_host.h                 | 1 +
 virt/kvm/kvm_main.c                      | 5 +++++
 7 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 57ef1ec23b56..8b54eb58bf47 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -130,7 +130,7 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64
 	struct kvm *kvm = mmu->kvm;
 	phys_addr_t end = start + size;
 
-	assert_spin_locked(&kvm->mmu_lock);
+	kvm_mmu_lock_assert_held(kvm);
 	WARN_ON(size & ~PAGE_MASK);
 	WARN_ON(stage2_apply_range(kvm, start, end, kvm_pgtable_stage2_unmap,
 				   may_block));
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 9bb9bb370b53..db2e437cd97c 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -650,8 +650,8 @@ static inline pte_t *find_kvm_secondary_pte(struct kvm *kvm, unsigned long ea,
 {
 	pte_t *pte;
 
-	VM_WARN(!spin_is_locked(&kvm->mmu_lock),
-		"%s called with kvm mmu_lock not held \n", __func__);
+	kvm_mmu_lock_assert_held(kvm);
+
 	pte = __find_linux_pte(kvm->arch.pgtable, ea, NULL, hshift);
 
 	return pte;
@@ -662,8 +662,7 @@ static inline pte_t *find_kvm_host_pte(struct kvm *kvm, unsigned long mmu_seq,
 {
 	pte_t *pte;
 
-	VM_WARN(!spin_is_locked(&kvm->mmu_lock),
-		"%s called with kvm mmu_lock not held \n", __func__);
+	kvm_mmu_lock_assert_held(kvm);
 
 	if (mmu_notifier_retry(kvm, mmu_seq))
 		return NULL;
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 18890dca9476..6d5987d1eee7 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -767,8 +767,7 @@ pte_t *find_kvm_nested_guest_pte(struct kvm *kvm, unsigned long lpid,
 	if (!gp)
 		return NULL;
 
-	VM_WARN(!spin_is_locked(&kvm->mmu_lock),
-		"%s called with kvm mmu_lock not held \n", __func__);
+	kvm_mmu_lock_assert_held(kvm);
 	pte = __find_linux_pte(gp->shadow_pgtable, ea, NULL, hshift);
 
 	return pte;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 7f599cc64178..cc8268cf28d2 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -101,14 +101,14 @@ void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
 static inline void kvm_mmu_get_root(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	BUG_ON(!sp->root_count);
-	lockdep_assert_held(&kvm->mmu_lock);
+	kvm_mmu_lock_assert_held(kvm);
 
 	++sp->root_count;
 }
 
 static inline bool kvm_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	lockdep_assert_held(&kvm->mmu_lock);
+	kvm_mmu_lock_assert_held(kvm);
 	--sp->root_count;
 
 	return !sp->root_count;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index fb911ca428b2..1d7c01300495 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -117,7 +117,7 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
 {
 	gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
 
-	lockdep_assert_held(&kvm->mmu_lock);
+	kvm_mmu_lock_assert_held(kvm);
 
 	WARN_ON(root->root_count);
 	WARN_ON(!root->tdp_mmu_page);
@@ -425,7 +425,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
 	int as_id = kvm_mmu_page_as_id(root);
 
-	lockdep_assert_held(&kvm->mmu_lock);
+	kvm_mmu_lock_assert_held(kvm);
 
 	WRITE_ONCE(*iter->sptep, new_spte);
 
@@ -1139,7 +1139,7 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 	struct kvm_mmu_page *root;
 	int root_as_id;
 
-	lockdep_assert_held(&kvm->mmu_lock);
+	kvm_mmu_lock_assert_held(kvm);
 	for_each_tdp_mmu_root(kvm, root) {
 		root_as_id = kvm_mmu_page_as_id(root);
 		if (root_as_id != slot->as_id)
@@ -1324,7 +1324,7 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 	int root_as_id;
 	bool spte_set = false;
 
-	lockdep_assert_held(&kvm->mmu_lock);
+	kvm_mmu_lock_assert_held(kvm);
 	for_each_tdp_mmu_root(kvm, root) {
 		root_as_id = kvm_mmu_page_as_id(root);
 		if (root_as_id != slot->as_id)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6e2773fc406c..022e3522788f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1499,5 +1499,6 @@ void kvm_mmu_lock(struct kvm *kvm);
 void kvm_mmu_unlock(struct kvm *kvm);
 int kvm_mmu_lock_needbreak(struct kvm *kvm);
 int kvm_mmu_lock_cond_resched(struct kvm *kvm);
+void kvm_mmu_lock_assert_held(struct kvm *kvm);
 
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b4c49a7e0556..c504f876176b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -452,6 +452,11 @@ int kvm_mmu_lock_cond_resched(struct kvm *kvm)
 	return cond_resched_lock(&kvm->mmu_lock);
 }
 
+void kvm_mmu_lock_assert_held(struct kvm *kvm)
+{
+	lockdep_assert_held(&kvm->mmu_lock);
+}
+
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 {
-- 
2.30.0.284.gd98b1dd5eaa7-goog



* [PATCH 17/24] kvm: mmu: Move mmu_lock to struct kvm_arch
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (15 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 16/24] kvm: mmu: Wrap mmu_lock assertions Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-12 18:10 ` [PATCH 18/24] kvm: x86/mmu: Use an rwlock for the x86 TDP MMU Ben Gardon
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Move the mmu_lock to struct kvm_arch so that it can be replaced with a
rwlock on x86 without affecting the performance of other archs.

No functional change intended.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
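Note for readers: the layout this patch moves to can be pictured with the
user-space sketch below, which is not kernel code. The lock lives in the
per-arch struct, the arch initializes it, and generic code reaches it only
through the wrappers, so an arch could later change the lock type by
editing its struct and the wrapper bodies. struct vm_arch, struct vm, the
vm_* helpers and notifier_seq are hypothetical names for illustration.

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct kvm_arch: the arch owns the lock type. */
struct vm_arch {
	atomic_flag mmu_lock;	/* toy spinlock */
};

/* Hypothetical stand-in for struct kvm: generic state only. */
struct vm {
	struct vm_arch arch;
	unsigned long notifier_seq;	/* example of data guarded by mmu_lock */
};

/* Plays the role of the spin_lock_init() call in kvm_arch_init_vm(). */
void vm_init(struct vm *vm)
{
	atomic_flag_clear(&vm->arch.mmu_lock);
	vm->notifier_seq = 0;
}

/* Only the wrapper bodies know where the lock lives and what type it is. */
void vm_mmu_lock(struct vm *vm)
{
	while (atomic_flag_test_and_set_explicit(&vm->arch.mmu_lock,
						 memory_order_acquire))
		sched_yield();
}

void vm_mmu_unlock(struct vm *vm)
{
	atomic_flag_clear_explicit(&vm->arch.mmu_lock, memory_order_release);
}

/* Generic caller: unchanged if the arch later swaps the lock type. */
void vm_bump_notifier_seq(struct vm *vm)
{
	vm_mmu_lock(vm);
	vm->notifier_seq++;
	vm_mmu_unlock(vm);
}
```

Code that bypasses the wrappers, such as the powerpc real-mode paths that
call arch_spin_lock() on the raw lock, still has to name the new location
directly, which is why those sites are edited by hand in this patch.
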
 Documentation/virt/kvm/locking.rst     |  2 +-
 arch/arm64/include/asm/kvm_host.h      |  2 ++
 arch/arm64/kvm/arm.c                   |  2 ++
 arch/mips/include/asm/kvm_host.h       |  2 ++
 arch/mips/kvm/mips.c                   |  2 ++
 arch/mips/kvm/mmu.c                    |  6 +++---
 arch/powerpc/include/asm/kvm_host.h    |  2 ++
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 10 +++++-----
 arch/powerpc/kvm/book3s_64_vio_hv.c    |  4 ++--
 arch/powerpc/kvm/book3s_hv_nested.c    |  4 ++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c    | 14 +++++++-------
 arch/powerpc/kvm/e500_mmu_host.c       |  2 +-
 arch/powerpc/kvm/powerpc.c             |  2 ++
 arch/s390/include/asm/kvm_host.h       |  2 ++
 arch/s390/kvm/kvm-s390.c               |  2 ++
 arch/x86/include/asm/kvm_host.h        |  2 ++
 arch/x86/kvm/mmu/mmu.c                 |  2 +-
 arch/x86/kvm/x86.c                     |  2 ++
 include/linux/kvm_host.h               |  1 -
 virt/kvm/kvm_main.c                    | 11 +++++------
 20 files changed, 47 insertions(+), 29 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index b21a34c34a21..06c006c73c4b 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -212,7 +212,7 @@ which time it will be set using the Dirty tracking mechanism described above.
 		- tsc offset in vmcb
 :Comment:	'raw' because updating the tsc offsets must not be preempted.
 
-:Name:		kvm->mmu_lock
+:Name:		kvm_arch::mmu_lock
 :Type:		spinlock_t
 :Arch:		any
 :Protects:	-shadow page/shadow tlb entry
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 8fcfab0c2567..6fd4d64eb202 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -102,6 +102,8 @@ struct kvm_arch_memory_slot {
 };
 
 struct kvm_arch {
+	spinlock_t mmu_lock;
+
 	struct kvm_s2_mmu mmu;
 
 	/* VTCR_EL2 value for this VM */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 04c44853b103..90f4fcd84bb5 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -130,6 +130,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 {
 	int ret;
 
+	spin_lock_init(&kvm->arch.mmu_lock);
+
 	ret = kvm_arm_setup_stage2(kvm, type);
 	if (ret)
 		return ret;
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 24f3d0f9996b..eb3caeffaf91 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -216,6 +216,8 @@ struct loongson_kvm_ipi {
 #endif
 
 struct kvm_arch {
+	spinlock_t mmu_lock;
+
 	/* Guest physical mm */
 	struct mm_struct gpa_mm;
 	/* Mask of CPUs needing GPA ASID flush */
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 4e393d93c1aa..7b8d65d8c863 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -150,6 +150,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 		return -EINVAL;
 	};
 
+	spin_lock_init(&kvm->arch.mmu_lock);
+
 	/* Allocate page table to map GPA -> RPA */
 	kvm->arch.gpa_mm.pgd = kvm_pgd_alloc();
 	if (!kvm->arch.gpa_mm.pgd)
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index 449663152b3c..68fcda1e48f9 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -263,7 +263,7 @@ static bool kvm_mips_flush_gpa_pgd(pgd_t *pgd, unsigned long start_gpa,
  *
  * Flushes a range of GPA mappings from the GPA page tables.
  *
- * The caller must hold the @kvm->mmu_lock spinlock.
+ * The caller must hold the @kvm->arch.mmu_lock spinlock.
  *
  * Returns:	Whether its safe to remove the top level page directory because
  *		all lower levels have been removed.
@@ -388,7 +388,7 @@ BUILD_PTE_RANGE_OP(mkclean, pte_mkclean)
  * Make a range of GPA mappings clean so that guest writes will fault and
  * trigger dirty page logging.
  *
- * The caller must hold the @kvm->mmu_lock spinlock.
+ * The caller must hold the @kvm->arch.mmu_lock spinlock.
  *
  * Returns:	Whether any GPA mappings were modified, which would require
  *		derived mappings (GVA page tables & TLB enties) to be
@@ -410,7 +410,7 @@ int kvm_mips_mkclean_gpa_pt(struct kvm *kvm, gfn_t start_gfn, gfn_t end_gfn)
  *		slot to be write protected
  *
  * Walks bits set in mask write protects the associated pte's. Caller must
- * acquire @kvm->mmu_lock.
+ * acquire @kvm->arch.mmu_lock.
  */
 void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		struct kvm_memory_slot *slot,
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index d67a470e95a3..7bb8e5847fb4 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -282,6 +282,8 @@ struct kvm_resize_hpt;
 #define KVMPPC_SECURE_INIT_ABORT 0x4 /* H_SVM_INIT_ABORT issued */
 
 struct kvm_arch {
+	spinlock_t mmu_lock;
+
 	unsigned int lpid;
 	unsigned int smt_mode;		/* # vcpus per virtual core */
 	unsigned int emul_smt_mode;	/* emualted SMT mode, on P9 */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index b628980c871b..522d19723512 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -388,7 +388,7 @@ static void kvmppc_pmd_free(pmd_t *pmdp)
 	kmem_cache_free(kvm_pmd_cache, pmdp);
 }
 
-/* Called with kvm->mmu_lock held */
+/* Called with kvm->arch.mmu_lock held */
 void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte, unsigned long gpa,
 		      unsigned int shift,
 		      const struct kvm_memory_slot *memslot,
@@ -992,7 +992,7 @@ int kvmppc_book3s_radix_page_fault(struct kvm_vcpu *vcpu,
 	return ret;
 }
 
-/* Called with kvm->mmu_lock held */
+/* Called with kvm->arch.mmu_lock held */
 int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		    unsigned long gfn)
 {
@@ -1012,7 +1012,7 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	return 0;
 }
 
-/* Called with kvm->mmu_lock held */
+/* Called with kvm->arch.mmu_lock held */
 int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		  unsigned long gfn)
 {
@@ -1040,7 +1040,7 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	return ref;
 }
 
-/* Called with kvm->mmu_lock held */
+/* Called with kvm->arch.mmu_lock held */
 int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		       unsigned long gfn)
 {
@@ -1073,7 +1073,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 		return ret;
 
 	/*
-	 * For performance reasons we don't hold kvm->mmu_lock while walking the
+	 * For performance reasons we don't hold kvm->arch.mmu_lock while walking the
 	 * partition scoped table.
 	 */
 	ptep = find_kvm_secondary_pte_unlocked(kvm, gpa, &shift);
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 083a4e037718..adffa111ebe9 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -545,7 +545,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		if (kvmppc_rm_tce_to_ua(vcpu->kvm, tce_list, &ua))
 			return H_TOO_HARD;
 
-		arch_spin_lock(&kvm->mmu_lock.rlock.raw_lock);
+		arch_spin_lock(&kvm->arch.mmu_lock.rlock.raw_lock);
 		if (kvmppc_rm_ua_to_hpa(vcpu, mmu_seq, ua, &tces)) {
 			ret = H_TOO_HARD;
 			goto unlock_exit;
@@ -590,7 +590,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 unlock_exit:
 	if (!prereg)
-		arch_spin_unlock(&kvm->mmu_lock.rlock.raw_lock);
+		arch_spin_unlock(&kvm->arch.mmu_lock.rlock.raw_lock);
 	return ret;
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 6d5987d1eee7..fe0a4e3fef1b 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -611,7 +611,7 @@ static void kvmhv_release_nested(struct kvm_nested_guest *gp)
 		/*
 		 * No vcpu is using this struct and no call to
 		 * kvmhv_get_nested can find this struct,
-		 * so we don't need to hold kvm->mmu_lock.
+		 * so we don't need to hold kvm->arch.mmu_lock.
 		 */
 		kvmppc_free_pgtable_radix(kvm, gp->shadow_pgtable,
 					  gp->shadow_lpid);
@@ -892,7 +892,7 @@ static void kvmhv_remove_nest_rmap_list(struct kvm *kvm, unsigned long *rmapp,
 	}
 }
 
-/* called with kvm->mmu_lock held */
+/* called with kvm->arch.mmu_lock held */
 void kvmhv_remove_nest_rmap_range(struct kvm *kvm,
 				  const struct kvm_memory_slot *memslot,
 				  unsigned long gpa, unsigned long hpa,
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 88da2764c1bb..897baf210a2d 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -249,7 +249,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	/* Translate to host virtual address */
 	hva = __gfn_to_hva_memslot(memslot, gfn);
 
-	arch_spin_lock(&kvm->mmu_lock.rlock.raw_lock);
+	arch_spin_lock(&kvm->arch.mmu_lock.rlock.raw_lock);
 	ptep = find_kvm_host_pte(kvm, mmu_seq, hva, &hpage_shift);
 	if (ptep) {
 		pte_t pte;
@@ -264,7 +264,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 		 * to <= host page size, if host is using hugepage
 		 */
 		if (host_pte_size < psize) {
-			arch_spin_unlock(&kvm->mmu_lock.rlock.raw_lock);
+			arch_spin_unlock(&kvm->arch.mmu_lock.rlock.raw_lock);
 			return H_PARAMETER;
 		}
 		pte = kvmppc_read_update_linux_pte(ptep, writing);
@@ -278,7 +278,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 			pa |= gpa & ~PAGE_MASK;
 		}
 	}
-	arch_spin_unlock(&kvm->mmu_lock.rlock.raw_lock);
+	arch_spin_unlock(&kvm->arch.mmu_lock.rlock.raw_lock);
 
 	ptel &= HPTE_R_KEY | HPTE_R_PP0 | (psize-1);
 	ptel |= pa;
@@ -933,7 +933,7 @@ static long kvmppc_do_h_page_init_zero(struct kvm_vcpu *vcpu,
 	mmu_seq = kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	arch_spin_lock(&kvm->mmu_lock.rlock.raw_lock);
+	arch_spin_lock(&kvm->arch.mmu_lock.rlock.raw_lock);
 
 	ret = kvmppc_get_hpa(vcpu, mmu_seq, dest, 1, &pa, &memslot);
 	if (ret != H_SUCCESS)
@@ -945,7 +945,7 @@ static long kvmppc_do_h_page_init_zero(struct kvm_vcpu *vcpu,
 	kvmppc_update_dirty_map(memslot, dest >> PAGE_SHIFT, PAGE_SIZE);
 
 out_unlock:
-	arch_spin_unlock(&kvm->mmu_lock.rlock.raw_lock);
+	arch_spin_unlock(&kvm->arch.mmu_lock.rlock.raw_lock);
 	return ret;
 }
 
@@ -961,7 +961,7 @@ static long kvmppc_do_h_page_init_copy(struct kvm_vcpu *vcpu,
 	mmu_seq = kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	arch_spin_lock(&kvm->mmu_lock.rlock.raw_lock);
+	arch_spin_lock(&kvm->arch.mmu_lock.rlock.raw_lock);
 	ret = kvmppc_get_hpa(vcpu, mmu_seq, dest, 1, &dest_pa, &dest_memslot);
 	if (ret != H_SUCCESS)
 		goto out_unlock;
@@ -976,7 +976,7 @@ static long kvmppc_do_h_page_init_copy(struct kvm_vcpu *vcpu,
 	kvmppc_update_dirty_map(dest_memslot, dest >> PAGE_SHIFT, PAGE_SIZE);
 
 out_unlock:
-	arch_spin_unlock(&kvm->mmu_lock.rlock.raw_lock);
+	arch_spin_unlock(&kvm->arch.mmu_lock.rlock.raw_lock);
 	return ret;
 }
 
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 633ae418ba0e..fef60e614aaf 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -470,7 +470,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 	/*
 	 * We are just looking at the wimg bits, so we don't
 	 * care much about the trans splitting bit.
-	 * We are holding kvm->mmu_lock so a notifier invalidate
+	 * We are holding kvm->arch.mmu_lock so a notifier invalidate
 	 * can't run hence pfn won't change.
 	 */
 	local_irq_save(flags);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index cf52d26f49cd..11e35ba0272e 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -452,6 +452,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	} else
 		goto err_out;
 
+	spin_lock_init(&kvm->arch.mmu_lock);
+
 	if (kvm_ops->owner && !try_module_get(kvm_ops->owner))
 		return -ENOENT;
 
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 74f9a036bab2..1299deef70b5 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -926,6 +926,8 @@ struct kvm_s390_pv {
 };
 
 struct kvm_arch{
+	spinlock_t mmu_lock;
+
 	void *sca;
 	int use_esca;
 	rwlock_t sca_lock;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index dbafd057ca6a..20c6ae7bc25b 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2642,6 +2642,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 		goto out_err;
 #endif
 
+	spin_lock_init(&kvm->arch.mmu_lock);
+
 	rc = s390_enable_sie();
 	if (rc)
 		goto out_err;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3d6616f6f6ef..3087de84fad3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -902,6 +902,8 @@ enum kvm_irqchip_mode {
 #define APICV_INHIBIT_REASON_X2APIC	5
 
 struct kvm_arch {
+	spinlock_t mmu_lock;
+
 	unsigned long n_used_mmu_pages;
 	unsigned long n_requested_mmu_pages;
 	unsigned long n_max_mmu_pages;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 659ed0a2875f..ba296ad051c3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5747,7 +5747,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 		if (!nr_to_scan--)
 			break;
 		/*
-		 * n_used_mmu_pages is accessed without holding kvm->mmu_lock
+		 * n_used_mmu_pages is accessed without holding kvm->arch.mmu_lock
 		 * here. We may skip a VM instance errorneosly, but we do not
 		 * want to shrink a VM that only started to populate its MMU
 		 * anyway.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 302042af87ee..a6cc34e8ccad 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10366,6 +10366,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	if (type)
 		return -EINVAL;
 
+	spin_lock_init(&kvm->arch.mmu_lock);
+
 	INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list);
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 022e3522788f..97e301b8cafd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -451,7 +451,6 @@ struct kvm_memslots {
 };
 
 struct kvm {
-	spinlock_t mmu_lock;
 	struct mutex slots_lock;
 	struct mm_struct *mm; /* userspace tied to this vm */
 	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c504f876176b..d168bd4517d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -434,27 +434,27 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_destroy);
 
 void kvm_mmu_lock(struct kvm *kvm)
 {
-	spin_lock(&kvm->mmu_lock);
+	spin_lock(&kvm->arch.mmu_lock);
 }
 
 void kvm_mmu_unlock(struct kvm *kvm)
 {
-	spin_unlock(&kvm->mmu_lock);
+	spin_unlock(&kvm->arch.mmu_lock);
 }
 
 int kvm_mmu_lock_needbreak(struct kvm *kvm)
 {
-	return spin_needbreak(&kvm->mmu_lock);
+	return spin_needbreak(&kvm->arch.mmu_lock);
 }
 
 int kvm_mmu_lock_cond_resched(struct kvm *kvm)
 {
-	return cond_resched_lock(&kvm->mmu_lock);
+	return cond_resched_lock(&kvm->arch.mmu_lock);
 }
 
 void kvm_mmu_lock_assert_held(struct kvm *kvm)
 {
-	lockdep_assert_held(&kvm->mmu_lock);
+	lockdep_assert_held(&kvm->arch.mmu_lock);
 }
 
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
@@ -770,7 +770,6 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	if (!kvm)
 		return ERR_PTR(-ENOMEM);
 
-	spin_lock_init(&kvm->mmu_lock);
 	mmgrab(current->mm);
 	kvm->mm = current->mm;
 	kvm_eventfd_init(kvm);
-- 
2.30.0.284.gd98b1dd5eaa7-goog


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 18/24] kvm: x86/mmu: Use an rwlock for the x86 TDP MMU
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (16 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 17/24] kvm: mmu: Move mmu_lock to struct kvm_arch Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-21  0:45   ` Sean Christopherson
  2021-01-12 18:10 ` [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock Ben Gardon
                   ` (5 subsequent siblings)
  23 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Add a read / write lock to be used in place of the MMU spinlock when the
TDP MMU is enabled. The rwlock will enable the TDP MMU to handle page
faults in parallel in a future commit. When the TDP MMU is not in use,
no operation acquires the lock in read mode, so a regular spin lock is
still used there because locking and unlocking a spin lock is slightly
faster.
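
For illustration only (not part of this patch): the difference between
the shared and exclusive modes described above can be modelled in
userspace with a toy try-lock built on C11 atomics. The toy_* names are
hypothetical, and this is not how the kernel's rwlock_t is implemented;
it only shows that many readers may hold the lock together while a
writer needs it alone.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock: state > 0 counts readers, state == -1 is one writer. */
struct toy_rwlock {
	atomic_int state;
};

/* Shared mode: succeeds unless a writer currently holds the lock. */
bool toy_read_trylock(struct toy_rwlock *l)
{
	int cur = atomic_load(&l->state);

	while (cur >= 0) {
		/* On failure, cur is refreshed with the value now in state. */
		if (atomic_compare_exchange_weak(&l->state, &cur, cur + 1))
			return true;
	}
	return false;
}

void toy_read_unlock(struct toy_rwlock *l)
{
	int prev = atomic_fetch_sub(&l->state, 1);

	assert(prev > 0);	/* must have been held in read mode */
}

/* Exclusive mode: succeeds only if there are no readers and no writer. */
bool toy_write_trylock(struct toy_rwlock *l)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&l->state, &expected, -1);
}

void toy_write_unlock(struct toy_rwlock *l)
{
	int prev = atomic_exchange(&l->state, 0);

	assert(prev == -1);	/* must have been held in write mode */
}
```

In this model, two toy_read_trylock() calls both succeed, while
toy_write_trylock() fails until every reader has dropped the lock.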

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/include/asm/kvm_host.h |  8 ++-
 arch/x86/kvm/mmu/mmu.c          | 89 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/mmu_internal.h |  9 ++++
 arch/x86/kvm/mmu/tdp_mmu.c      | 10 ++--
 arch/x86/kvm/x86.c              |  2 -
 virt/kvm/kvm_main.c             | 10 ++--
 6 files changed, 115 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3087de84fad3..92d5340842c8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -902,7 +902,13 @@ enum kvm_irqchip_mode {
 #define APICV_INHIBIT_REASON_X2APIC	5
 
 struct kvm_arch {
-	spinlock_t mmu_lock;
+	union {
+		/* Used if the TDP MMU is enabled. */
+		rwlock_t mmu_rwlock;
+
+		/* Used if the TDP MMU is not enabled. */
+		spinlock_t mmu_lock;
+	};
 
 	unsigned long n_used_mmu_pages;
 	unsigned long n_requested_mmu_pages;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ba296ad051c3..280d7cd6f94b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5471,6 +5471,11 @@ void kvm_mmu_init_vm(struct kvm *kvm)
 
 	kvm_mmu_init_tdp_mmu(kvm);
 
+	if (kvm->arch.tdp_mmu_enabled)
+		rwlock_init(&kvm->arch.mmu_rwlock);
+	else
+		spin_lock_init(&kvm->arch.mmu_lock);
+
 	node->track_write = kvm_mmu_pte_write;
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
@@ -6074,3 +6079,87 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 	if (kvm->arch.nx_lpage_recovery_thread)
 		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
 }
+
+void kvm_mmu_lock_shared(struct kvm *kvm)
+{
+	WARN_ON(!kvm->arch.tdp_mmu_enabled);
+	read_lock(&kvm->arch.mmu_rwlock);
+}
+
+void kvm_mmu_unlock_shared(struct kvm *kvm)
+{
+	WARN_ON(!kvm->arch.tdp_mmu_enabled);
+	read_unlock(&kvm->arch.mmu_rwlock);
+}
+
+void kvm_mmu_lock_exclusive(struct kvm *kvm)
+{
+	WARN_ON(!kvm->arch.tdp_mmu_enabled);
+	write_lock(&kvm->arch.mmu_rwlock);
+}
+
+void kvm_mmu_unlock_exclusive(struct kvm *kvm)
+{
+	WARN_ON(!kvm->arch.tdp_mmu_enabled);
+	write_unlock(&kvm->arch.mmu_rwlock);
+}
+
+void kvm_mmu_lock(struct kvm *kvm)
+{
+	if (kvm->arch.tdp_mmu_enabled)
+		kvm_mmu_lock_exclusive(kvm);
+	else
+		spin_lock(&kvm->arch.mmu_lock);
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_lock);
+
+void kvm_mmu_unlock(struct kvm *kvm)
+{
+	if (kvm->arch.tdp_mmu_enabled)
+		kvm_mmu_unlock_exclusive(kvm);
+	else
+		spin_unlock(&kvm->arch.mmu_lock);
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_unlock);
+
+int kvm_mmu_lock_needbreak(struct kvm *kvm)
+{
+	if (kvm->arch.tdp_mmu_enabled)
+		return rwlock_needbreak(&kvm->arch.mmu_rwlock);
+	else
+		return spin_needbreak(&kvm->arch.mmu_lock);
+}
+
+int kvm_mmu_lock_cond_resched_exclusive(struct kvm *kvm)
+{
+	WARN_ON(!kvm->arch.tdp_mmu_enabled);
+	return cond_resched_rwlock_write(&kvm->arch.mmu_rwlock);
+}
+
+int kvm_mmu_lock_cond_resched(struct kvm *kvm)
+{
+	if (kvm->arch.tdp_mmu_enabled)
+		return kvm_mmu_lock_cond_resched_exclusive(kvm);
+	else
+		return cond_resched_lock(&kvm->arch.mmu_lock);
+}
+
+void kvm_mmu_lock_assert_held_shared(struct kvm *kvm)
+{
+	WARN_ON(!kvm->arch.tdp_mmu_enabled);
+	lockdep_assert_held_read(&kvm->arch.mmu_rwlock);
+}
+
+void kvm_mmu_lock_assert_held_exclusive(struct kvm *kvm)
+{
+	WARN_ON(!kvm->arch.tdp_mmu_enabled);
+	lockdep_assert_held_write(&kvm->arch.mmu_rwlock);
+}
+
+void kvm_mmu_lock_assert_held(struct kvm *kvm)
+{
+	if (kvm->arch.tdp_mmu_enabled)
+		lockdep_assert_held(&kvm->arch.mmu_rwlock);
+	else
+		lockdep_assert_held(&kvm->arch.mmu_lock);
+}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index cc8268cf28d2..53a789b8a820 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -149,4 +149,13 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+void kvm_mmu_lock_shared(struct kvm *kvm);
+void kvm_mmu_unlock_shared(struct kvm *kvm);
+void kvm_mmu_lock_exclusive(struct kvm *kvm);
+void kvm_mmu_unlock_exclusive(struct kvm *kvm);
+int kvm_mmu_lock_cond_resched_exclusive(struct kvm *kvm);
+void kvm_mmu_lock_assert_held_shared(struct kvm *kvm);
+void kvm_mmu_lock_assert_held_exclusive(struct kvm *kvm);
+void kvm_mmu_lock_assert_held(struct kvm *kvm);
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1d7c01300495..8b61bdb391a0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -59,7 +59,7 @@ static void tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
 static inline bool tdp_mmu_next_root_valid(struct kvm *kvm,
 					   struct kvm_mmu_page *root)
 {
-	lockdep_assert_held(&kvm->mmu_lock);
+	kvm_mmu_lock_assert_held_exclusive(kvm);
 
 	if (list_entry_is_head(root, &kvm->arch.tdp_mmu_roots, link))
 		return false;
@@ -117,7 +117,7 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
 {
 	gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
 
-	kvm_mmu_lock_assert_held(kvm);
+	kvm_mmu_lock_assert_held_exclusive(kvm);
 
 	WARN_ON(root->root_count);
 	WARN_ON(!root->tdp_mmu_page);
@@ -425,7 +425,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
 	int as_id = kvm_mmu_page_as_id(root);
 
-	kvm_mmu_lock_assert_held(kvm);
+	kvm_mmu_lock_assert_held_exclusive(kvm);
 
 	WRITE_ONCE(*iter->sptep, new_spte);
 
@@ -1139,7 +1139,7 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 	struct kvm_mmu_page *root;
 	int root_as_id;
 
-	kvm_mmu_lock_assert_held(kvm);
+	kvm_mmu_lock_assert_held_exclusive(kvm);
 	for_each_tdp_mmu_root(kvm, root) {
 		root_as_id = kvm_mmu_page_as_id(root);
 		if (root_as_id != slot->as_id)
@@ -1324,7 +1324,7 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 	int root_as_id;
 	bool spte_set = false;
 
-	kvm_mmu_lock_assert_held(kvm);
+	kvm_mmu_lock_assert_held_exclusive(kvm);
 	for_each_tdp_mmu_root(kvm, root) {
 		root_as_id = kvm_mmu_page_as_id(root);
 		if (root_as_id != slot->as_id)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a6cc34e8ccad..302042af87ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10366,8 +10366,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	if (type)
 		return -EINVAL;
 
-	spin_lock_init(&kvm->arch.mmu_lock);
-
 	INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list);
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d168bd4517d4..dcbdb3beb084 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -432,27 +432,27 @@ void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_destroy);
 
-void kvm_mmu_lock(struct kvm *kvm)
+__weak void kvm_mmu_lock(struct kvm *kvm)
 {
 	spin_lock(&kvm->arch.mmu_lock);
 }
 
-void kvm_mmu_unlock(struct kvm *kvm)
+__weak void kvm_mmu_unlock(struct kvm *kvm)
 {
 	spin_unlock(&kvm->arch.mmu_lock);
 }
 
-int kvm_mmu_lock_needbreak(struct kvm *kvm)
+__weak int kvm_mmu_lock_needbreak(struct kvm *kvm)
 {
 	return spin_needbreak(&kvm->arch.mmu_lock);
 }
 
-int kvm_mmu_lock_cond_resched(struct kvm *kvm)
+__weak int kvm_mmu_lock_cond_resched(struct kvm *kvm)
 {
 	return cond_resched_lock(&kvm->arch.mmu_lock);
 }
 
-void kvm_mmu_lock_assert_held(struct kvm *kvm)
+__weak void kvm_mmu_lock_assert_held(struct kvm *kvm)
 {
 	lockdep_assert_held(&kvm->arch.mmu_lock);
 }
-- 
2.30.0.284.gd98b1dd5eaa7-goog


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (17 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 18/24] kvm: x86/mmu: Use an rwlock for the x86 TDP MMU Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-21 19:22   ` Sean Christopherson
  2021-01-26 13:37   ` Paolo Bonzini
  2021-01-12 18:10 ` [PATCH 20/24] kvm: x86/mmu: Add atomic option for setting SPTEs Ben Gardon
                   ` (4 subsequent siblings)
  23 siblings, 2 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Add a lock to protect the data structures that track the page table
memory used by the TDP MMU. In order to handle multiple TDP MMU
operations in parallel, pages of PT memory must be added and removed
without the exclusive protection of the MMU lock. A new lock to protect
the list(s) of in-use pages will cause some serialization, but only on
non-leaf page table entries, so the lock is not expected to be very
contended.
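
For illustration only (not part of this patch): a minimal userspace
sketch of the conditional-locking pattern described above, in which a
small inner lock is taken only when the caller does not already have
exclusive access. All toy_* names are hypothetical stand-ins, and the
spin lock here is a bare C11 atomic_flag rather than a kernel
spinlock_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page of page-table memory. */
struct toy_page {
	struct toy_page *next;
};

struct toy_mmu {
	atomic_flag pages_lock;		/* plays the role of tdp_mmu_pages_lock */
	struct toy_page *pages;		/* list of in-use pages */
	size_t nr_pages;
};

void toy_mmu_init(struct toy_mmu *mmu)
{
	atomic_flag_clear(&mmu->pages_lock);
	mmu->pages = NULL;
	mmu->nr_pages = 0;
}

void toy_spin_lock(atomic_flag *lock)
{
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

void toy_spin_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/*
 * shared == true: other threads may be updating the list concurrently, so
 * the inner lock serializes the update.
 * shared == false: the caller claims exclusive access through an outer
 * lock (not modelled here), so the inner lock is skipped.
 */
void toy_link_page(struct toy_mmu *mmu, struct toy_page *page, bool shared)
{
	assert(page != NULL);

	if (shared)
		toy_spin_lock(&mmu->pages_lock);

	page->next = mmu->pages;
	mmu->pages = page;
	mmu->nr_pages++;

	if (shared)
		toy_spin_unlock(&mmu->pages_lock);
}
```

Only the list update is serialized; the caller's other work can still
run in parallel with other threads.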

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/include/asm/kvm_host.h | 15 ++++++++
 arch/x86/kvm/mmu/tdp_mmu.c      | 67 +++++++++++++++++++++++++++++----
 2 files changed, 74 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 92d5340842c8..f8dccb27c722 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1034,6 +1034,21 @@ struct kvm_arch {
 	 * tdp_mmu_page set and a root_count of 0.
 	 */
 	struct list_head tdp_mmu_pages;
+
+	/*
+	 * Protects accesses to the following fields when the MMU lock is
+	 * not held exclusively:
+	 *  - tdp_mmu_pages (above)
+	 *  - the link field of each struct kvm_mmu_page used by the TDP MMU
+	 *    while it is part of tdp_mmu_pages (but not while it is part of
+	 *    the tdp_mmu_free_list or tdp_mmu_disconnected_list)
+	 *  - lpage_disallowed_mmu_pages
+	 *  - the lpage_disallowed_link field of each struct kvm_mmu_page
+	 *    used by the TDP MMU
+	 * May be acquired under the MMU lock in read mode or non-overlapping
+	 * with the MMU lock.
+	 */
+	spinlock_t tdp_mmu_pages_lock;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8b61bdb391a0..264594947c3b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -33,6 +33,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 	kvm->arch.tdp_mmu_enabled = true;
 
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
+	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
 }
 
@@ -262,6 +263,58 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
 	}
 }
 
+/**
+ * tdp_mmu_link_page - Add a new page to the list of pages used by the TDP MMU
+ *
+ * @kvm: kvm instance
+ * @sp: the new page
+ * @atomic: This operation is not running under the exclusive use of the MMU
+ *	    lock and the operation must be atomic with respect to other threads
+ *	    that might be adding or removing pages.
+ * @account_nx: This page replaces a NX large page and should be marked for
+ *		eventual reclaim.
+ */
+static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+			      bool atomic, bool account_nx)
+{
+	if (atomic)
+		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+	else
+		kvm_mmu_lock_assert_held_exclusive(kvm);
+
+	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
+	if (account_nx)
+		account_huge_nx_page(kvm, sp);
+
+	if (atomic)
+		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+}
+
+/**
+ * tdp_mmu_unlink_page - Remove page from the list of pages used by the TDP MMU
+ *
+ * @kvm: kvm instance
+ * @sp: the page to be removed
+ * @atomic: This operation is not running under the exclusive use of the MMU
+ *	    lock and the operation must be atomic with respect to other threads
+ *	    that might be adding or removing pages.
+ */
+static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+				bool atomic)
+{
+	if (atomic)
+		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+	else
+		kvm_mmu_lock_assert_held_exclusive(kvm);
+
+	list_del(&sp->link);
+	if (sp->lpage_disallowed)
+		unaccount_huge_nx_page(kvm, sp);
+
+	if (atomic)
+		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+}
+
 /**
  * handle_disconnected_tdp_mmu_page - handle a pt removed from the TDP structure
  *
@@ -285,10 +338,7 @@ static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt)
 
 	trace_kvm_mmu_prepare_zap_page(sp);
 
-	list_del(&sp->link);
-
-	if (sp->lpage_disallowed)
-		unaccount_huge_nx_page(kvm, sp);
+	tdp_mmu_unlink_page(kvm, sp, atomic);
 
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
 		old_child_spte = READ_ONCE(*(pt + i));
@@ -719,15 +769,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
 			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
-			list_add(&sp->link, &vcpu->kvm->arch.tdp_mmu_pages);
 			child_pt = sp->spt;
+
+			tdp_mmu_link_page(vcpu->kvm, sp, false,
+					  huge_page_disallowed &&
+					  req_level >= iter.level);
+
 			new_spte = make_nonleaf_spte(child_pt,
 						     !shadow_accessed_mask);
 
 			trace_kvm_mmu_get_page(sp, true);
-			if (huge_page_disallowed && req_level >= iter.level)
-				account_huge_nx_page(vcpu->kvm, sp);
-
 			tdp_mmu_set_spte(vcpu->kvm, &iter, new_spte);
 		}
 	}
-- 
2.30.0.284.gd98b1dd5eaa7-goog


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 20/24] kvm: x86/mmu: Add atomic option for setting SPTEs
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (18 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-26 14:21   ` Paolo Bonzini
  2021-01-12 18:10 ` [PATCH 21/24] kvm: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map Ben Gardon
                   ` (3 subsequent siblings)
  23 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

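In order to allow multiple TDP MMU operations to proceed in parallel,
there must be an option to modify SPTEs atomically so that changes are
not lost. Add an atomic parameter to handle_changed_spte,
__handle_changed_spte and handle_disconnected_tdp_mmu_page, and add
tdp_mmu_set_spte_atomic, which sets an SPTE with cmpxchg64.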
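
For illustration only (not part of this patch): the "changes are not
lost" property comes from compare-and-exchange. A userspace sketch with
C11 atomics, using the hypothetical name toy_set_spte_atomic and a plain
64-bit value in place of a real SPTE:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Install new_spte only if the entry still holds old_spte, the value the
 * caller read earlier. Returns false, leaving the entry untouched, if
 * another thread changed it in the meantime.
 */
bool toy_set_spte_atomic(_Atomic uint64_t *sptep, uint64_t old_spte,
			 uint64_t new_spte)
{
	/* The caller is expected to filter out no-op updates beforehand. */
	assert(old_spte != new_spte);

	/* old_spte is a local copy, so overwriting it on failure is harmless. */
	return atomic_compare_exchange_strong(sptep, &old_spte, new_spte);
}
```

A caller that gets false back has lost a race and has to re-read the
entry before deciding what to do next.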

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 67 ++++++++++++++++++++++++++++++++------
 1 file changed, 57 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 264594947c3b..1380ed313476 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -7,6 +7,7 @@
 #include "tdp_mmu.h"
 #include "spte.h"
 
+#include <asm/cmpxchg.h>
 #include <trace/events/kvm.h>
 
 #ifdef CONFIG_X86_64
@@ -226,7 +227,8 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
 }
 
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level);
+				u64 old_spte, u64 new_spte, int level,
+				bool atomic);
 
 static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
 {
@@ -320,15 +322,19 @@ static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
  *
  * @kvm: kvm instance
  * @pt: the page removed from the paging structure
+ * @atomic: Use atomic operations to clear the SPTEs in any disconnected
+ *	    pages of memory.
  *
  * Given a page table that has been removed from the TDP paging structure,
  * iterates through the page table to clear SPTEs and free child page tables.
  */
-static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt)
+static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt,
+					     bool atomic)
 {
 	struct kvm_mmu_page *sp;
 	gfn_t gfn;
 	int level;
+	u64 *sptep;
 	u64 old_child_spte;
 	int i;
 
@@ -341,11 +347,17 @@ static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt)
 	tdp_mmu_unlink_page(kvm, sp, atomic);
 
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
-		old_child_spte = READ_ONCE(*(pt + i));
-		WRITE_ONCE(*(pt + i), 0);
+		sptep = pt + i;
+
+		if (atomic) {
+			old_child_spte = xchg(sptep, 0);
+		} else {
+			old_child_spte = READ_ONCE(*sptep);
+			WRITE_ONCE(*sptep, 0);
+		}
 		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp),
 			gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
-			old_child_spte, 0, level - 1);
+			old_child_spte, 0, level - 1, atomic);
 	}
 
 	kvm_flush_remote_tlbs_with_address(kvm, gfn,
@@ -362,12 +374,15 @@ static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt)
  * @old_spte: The value of the SPTE before the change
  * @new_spte: The value of the SPTE after the change
  * @level: the level of the PT the SPTE is part of in the paging structure
+ * @atomic: Use atomic operations to clear the SPTEs in any disconnected
+ *	    pages of memory.
  *
  * Handle bookkeeping that might result from the modification of a SPTE.
  * This function must be called for all TDP SPTE modifications.
  */
 static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level)
+				  u64 old_spte, u64 new_spte, int level,
+				  bool atomic)
 {
 	bool was_present = is_shadow_present_pte(old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
@@ -439,18 +454,50 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	 */
 	if (was_present && !was_leaf && (pfn_changed || !is_present))
 		handle_disconnected_tdp_mmu_page(kvm,
-				spte_to_child_pt(old_spte, level));
+				spte_to_child_pt(old_spte, level), atomic);
 }
 
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level)
+				u64 old_spte, u64 new_spte, int level,
+				bool atomic)
 {
-	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
+	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
+			      atomic);
 	handle_changed_spte_acc_track(old_spte, new_spte, level);
 	handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
 				      new_spte, level);
 }
 
+/*
+ * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically and handle the
+ * associated bookkeeping
+ *
+ * @kvm: kvm instance
+ * @iter: a tdp_iter instance currently on the SPTE that should be set
+ * @new_spte: The value the SPTE should be set to
+ * Returns: true if the SPTE was set, false if it was not. If false is returned,
+ *	    this function will have no side-effects.
+ */
+static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
+					   struct tdp_iter *iter,
+					   u64 new_spte)
+{
+	u64 *root_pt = tdp_iter_root_pt(iter);
+	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
+	int as_id = kvm_mmu_page_as_id(root);
+
+	kvm_mmu_lock_assert_held_shared(kvm);
+
+	if (cmpxchg64(iter->sptep, iter->old_spte, new_spte) != iter->old_spte)
+		return false;
+
+	handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
+			    iter->level, true);
+
+	return true;
+}
+
+
 /*
  * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
  * @kvm: kvm instance
@@ -480,7 +527,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 	WRITE_ONCE(*iter->sptep, new_spte);
 
 	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
-			      iter->level);
+			      iter->level, false);
 	if (record_acc_track)
 		handle_changed_spte_acc_track(iter->old_spte, new_spte,
 					      iter->level);
-- 
2.30.0.284.gd98b1dd5eaa7-goog


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 21/24] kvm: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (19 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 20/24] kvm: x86/mmu: Add atomic option for setting SPTEs Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-12 18:10 ` [PATCH 22/24] kvm: x86/mmu: Flush TLBs after zap in TDP MMU PF handler Ben Gardon
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

To prepare for handling page faults in parallel, change the TDP MMU
page fault handler to use atomic operations to set SPTEs so that changes
are not lost if multiple threads attempt to modify the same SPTE.
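
For illustration only (not part of this patch): when two threads race to
install a child page table under the same entry, the loser must discard
its unused page and retry the fault. A userspace sketch of that
allocate, publish, free-on-failure sequence; the toy_* names and the use
of calloc are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

enum toy_ret { TOY_INSTALLED, TOY_RETRY };

/*
 * Allocate a child table and try to publish it in *slot, which the caller
 * observed to be empty (0). If another thread filled the slot first, the
 * new table was never visible to anyone, so it can be freed immediately.
 */
enum toy_ret toy_install_child(_Atomic uintptr_t *slot)
{
	uintptr_t expected = 0;
	void *child = calloc(512, sizeof(uint64_t));

	assert(child != NULL);	/* allocation-failure handling elided */

	if (atomic_compare_exchange_strong(slot, &expected, (uintptr_t)child))
		return TOY_INSTALLED;

	free(child);
	return TOY_RETRY;
}
```

The bookkeeping for the new page (linking it into a list, tracing) is
done only on the success path, so a lost race leaves no trace behind.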

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1380ed313476..7b12a87a4124 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -714,21 +714,18 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
 	int ret = 0;
 	int make_spte_ret = 0;
 
-	if (unlikely(is_noslot_pfn(pfn))) {
+	if (unlikely(is_noslot_pfn(pfn)))
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
-		trace_mark_mmio_spte(iter->sptep, iter->gfn, new_spte);
-	} else {
+	else
 		make_spte_ret = make_spte(vcpu, ACC_ALL, iter->level, iter->gfn,
 					 pfn, iter->old_spte, prefault, true,
 					 map_writable, !shadow_accessed_mask,
 					 &new_spte);
-		trace_kvm_mmu_set_spte(iter->level, iter->gfn, iter->sptep);
-	}
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
-	else
-		tdp_mmu_set_spte(vcpu->kvm, iter, new_spte);
+	else if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
+		return RET_PF_RETRY;
 
 	/*
 	 * If the page fault was caused by a write but the page is write
@@ -742,8 +739,11 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
 	}
 
 	/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
-	if (unlikely(is_mmio_spte(new_spte)))
+	if (unlikely(is_mmio_spte(new_spte))) {
+		trace_mark_mmio_spte(iter->sptep, iter->gfn, new_spte);
 		ret = RET_PF_EMULATE;
+	} else
+		trace_kvm_mmu_set_spte(iter->level, iter->gfn, iter->sptep);
 
 	trace_kvm_mmu_set_spte(iter->level, iter->gfn, iter->sptep);
 	if (!prefault)
@@ -801,7 +801,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		 */
 		if (is_shadow_present_pte(iter.old_spte) &&
 		    is_large_pte(iter.old_spte)) {
-			tdp_mmu_set_spte(vcpu->kvm, &iter, 0);
+			if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, 0))
+				break;
 
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn,
 					KVM_PAGES_PER_HPAGE(iter.level));
@@ -818,19 +819,24 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
 			child_pt = sp->spt;
 
-			tdp_mmu_link_page(vcpu->kvm, sp, false,
-					  huge_page_disallowed &&
-					  req_level >= iter.level);
-
 			new_spte = make_nonleaf_spte(child_pt,
 						     !shadow_accessed_mask);
 
-			trace_kvm_mmu_get_page(sp, true);
-			tdp_mmu_set_spte(vcpu->kvm, &iter, new_spte);
+			if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter,
+						    new_spte)) {
+				tdp_mmu_link_page(vcpu->kvm, sp, true,
+						  huge_page_disallowed &&
+						  req_level >= iter.level);
+
+				trace_kvm_mmu_get_page(sp, true);
+			} else {
+				tdp_mmu_free_sp(sp);
+				break;
+			}
 		}
 	}
 
-	if (WARN_ON(iter.level != level)) {
+	if (iter.level != level) {
 		rcu_read_unlock();
 		return RET_PF_RETRY;
 	}
-- 
2.30.0.284.gd98b1dd5eaa7-goog


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 22/24] kvm: x86/mmu: Flush TLBs after zap in TDP MMU PF handler
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (20 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 21/24] kvm: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-21  0:05   ` Sean Christopherson
  2021-01-12 18:10 ` [PATCH 23/24] kvm: x86/mmu: Freeze SPTEs in disconnected pages Ben Gardon
  2021-01-12 18:10 ` [PATCH 24/24] kvm: x86/mmu: Allow parallel page faults for the TDP MMU Ben Gardon
  23 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

When the TDP MMU is allowed to handle page faults in parallel there is
the possibility of a race where an SPTE is cleared and then immediately
replaced with a present SPTE pointing to a different PFN, before the
TLBs can be flushed. This race would violate architectural specs. Ensure
that the TLBs are flushed properly before other threads are allowed to
install any present value for the SPTE.
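
For illustration only (not part of this patch): one way to read the
sequence described above is freeze, flush, then release. The userspace
sketch below assumes that ordering; the toy_* names are hypothetical and
the flush is a stub that only counts calls.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_FROZEN_SPTE (1ull << 59)	/* same bit as FROZEN_SPTE below */

unsigned int toy_flush_count;	/* number of stand-in flushes issued */

void toy_flush_tlbs(void)
{
	toy_flush_count++;	/* stand-in for a remote TLB flush */
}

/*
 * Zap an entry so that no present value can be installed between clearing
 * the entry and flushing the TLBs. Returns false if the entry no longer
 * holds old_spte.
 */
bool toy_zap_spte_flush(_Atomic uint64_t *sptep, uint64_t old_spte)
{
	assert(old_spte != TOY_FROZEN_SPTE);

	/* Step 1: swap in the frozen marker; fail if the entry has changed. */
	if (!atomic_compare_exchange_strong(sptep, &old_spte, TOY_FROZEN_SPTE))
		return false;

	/* Step 2: flush while the entry is frozen and off limits to others. */
	toy_flush_tlbs();

	/* Step 3: release the entry by writing the final non-present value. */
	atomic_store(sptep, 0);
	return true;
}
```

A thread that reads the frozen value is expected to back off and retry
rather than install its own mapping over it.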

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/spte.h    | 16 +++++++++-
 arch/x86/kvm/mmu/tdp_mmu.c | 62 ++++++++++++++++++++++++++++++++------
 2 files changed, 68 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 2b3a30bd38b0..ecd9bfbccef4 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -130,6 +130,20 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
 					  PT64_EPT_EXECUTABLE_MASK)
 #define SHADOW_ACC_TRACK_SAVED_BITS_SHIFT PT64_SECOND_AVAIL_BITS_SHIFT
 
+/*
+ * If a thread running without exclusive control of the MMU lock must perform a
+ * multi-part operation on an SPTE, it can set the SPTE to FROZEN_SPTE as a
+ * non-present intermediate value. This will guarantee that other threads will
+ * not modify the spte.
+ *
+ * This constant works because it is considered non-present on both AMD and
+ * Intel CPUs and does not create a L1TF vulnerability because the pfn section
+ * is zeroed out.
+ *
+ * Only used by the TDP MMU.
+ */
+#define FROZEN_SPTE (1ull << 59)
+
 /*
  * In some cases, we need to preserve the GFN of a non-present or reserved
  * SPTE when we usurp the upper five bits of the physical address space to
@@ -187,7 +201,7 @@ static inline bool is_access_track_spte(u64 spte)
 
 static inline int is_shadow_present_pte(u64 pte)
 {
-	return (pte != 0) && !is_mmio_spte(pte);
+	return (pte != 0) && !is_mmio_spte(pte) && (pte != FROZEN_SPTE);
 }
 
 static inline int is_large_pte(u64 pte)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7b12a87a4124..5c9d053000ad 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -429,15 +429,19 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	 */
 	if (!was_present && !is_present) {
 		/*
-		 * If this change does not involve a MMIO SPTE, it is
-		 * unexpected. Log the change, though it should not impact the
-		 * guest since both the former and current SPTEs are nonpresent.
+		 * If this change does not involve a MMIO SPTE or FROZEN_SPTE,
+		 * it is unexpected. Log the change, though it should not
+		 * impact the guest since both the former and current SPTEs
+		 * are nonpresent.
 		 */
-		if (WARN_ON(!is_mmio_spte(old_spte) && !is_mmio_spte(new_spte)))
+		if (WARN_ON(!is_mmio_spte(old_spte) &&
+			    !is_mmio_spte(new_spte) &&
+			    new_spte != FROZEN_SPTE))
 			pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
 			       "should not be replaced with another,\n"
 			       "different nonpresent SPTE, unless one or both\n"
-			       "are MMIO SPTEs.\n"
+			       "are MMIO SPTEs, or the new SPTE is\n"
+			       "FROZEN_SPTE.\n"
 			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
 			       as_id, gfn, old_spte, new_spte, level);
 		return;
@@ -488,6 +492,13 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 
 	kvm_mmu_lock_assert_held_shared(kvm);
 
+	/*
+	 * Do not change FROZEN_SPTEs. Only the thread that froze the SPTE
+	 * may modify it.
+	 */
+	if (iter->old_spte == FROZEN_SPTE)
+		return false;
+
 	if (cmpxchg64(iter->sptep, iter->old_spte, new_spte) != iter->old_spte)
 		return false;
 
@@ -497,6 +508,34 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	return true;
 }
 
+static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
+					   struct tdp_iter *iter)
+{
+	/*
+	 * Freeze the SPTE by setting it to a special,
+	 * non-present value. This will stop other threads from
+	 * immediately installing a present entry in its place
+	 * before the TLBs are flushed.
+	 */
+	if (!tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE))
+		return false;
+
+	kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
+					   KVM_PAGES_PER_HPAGE(iter->level));
+
+	/*
+	 * No other thread can overwrite the frozen SPTE as they
+	 * must either wait on the MMU lock or use
+	 * tdp_mmu_set_spte_atomic which will not overwrite the
+	 * special frozen SPTE value. No bookkeeping is needed
+	 * here since the SPTE is going from non-present
+	 * to non-present.
+	 */
+	WRITE_ONCE(*iter->sptep, 0);
+
+	return true;
+}
+
 
 /*
  * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
@@ -524,6 +563,14 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 
 	kvm_mmu_lock_assert_held_exclusive(kvm);
 
+	/*
+	 * No thread should be using this function to set SPTEs to FROZEN_SPTE.
+	 * If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
+	 * should be used. If operating under the MMU lock in write mode, the
+	 * use of FROZEN_SPTE should not be necessary.
+	 */
+	WARN_ON(iter->old_spte == FROZEN_SPTE);
+
 	WRITE_ONCE(*iter->sptep, new_spte);
 
 	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
@@ -801,12 +848,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		 */
 		if (is_shadow_present_pte(iter.old_spte) &&
 		    is_large_pte(iter.old_spte)) {
-			if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, 0))
+			if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
 				break;
 
-			kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn,
-					KVM_PAGES_PER_HPAGE(iter.level));
-
 			/*
 			 * The iter must explicitly re-read the spte here
 			 * because the new value informs the !present
-- 
2.30.0.284.gd98b1dd5eaa7-goog


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 23/24] kvm: x86/mmu: Freeze SPTEs in disconnected pages
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (21 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 22/24] kvm: x86/mmu: Flush TLBs after zap in TDP MMU PF handler Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-12 18:10 ` [PATCH 24/24] kvm: x86/mmu: Allow parallel page faults for the TDP MMU Ben Gardon
  23 siblings, 0 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

When clearing TDP MMU pages that have been disconnected from the paging
structure root, set the SPTEs to a special non-present value which will
not be overwritten by other threads. This is needed to prevent races in
which a thread is clearing a disconnected page table, but another thread
has already acquired a pointer to that memory and installs a mapping in
an already cleared entry. This can lead to memory leaks and accounting
errors.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
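Illustration only, not part of the patch: the retry loop used in the atomic
case can be sketched in userspace C11 as below. freeze_child_spte is a
hypothetical name for this sketch, and the kernel's cpu_relax() is replaced by
a plain retry.

```c
#include <stdatomic.h>
#include <stdint.h>

#define FROZEN_SPTE (1ull << 59)	/* same value as in the series */

/*
 * Freeze one entry of a disconnected page table and return the value it held.
 * If the exchange returns FROZEN_SPTE, another thread is in the middle of its
 * own freeze/flush/clear sequence on this entry and will overwrite the frozen
 * value when it finishes, so retry until this thread is the one that performed
 * the nonfrozen -> frozen transition.
 */
static uint64_t freeze_child_spte(_Atomic uint64_t *sptep)
{
	uint64_t old;

	for (;;) {
		old = atomic_exchange(sptep, FROZEN_SPTE);
		if (old != FROZEN_SPTE)
			return old;
		/* The kernel code calls cpu_relax() here before retrying. */
	}
}
```

The returned value is what the bookkeeping sees as the old SPTE, so it is
never FROZEN_SPTE itself.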
 arch/x86/kvm/mmu/tdp_mmu.c | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5c9d053000ad..45160ff84e91 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -333,13 +333,14 @@ static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt,
 {
 	struct kvm_mmu_page *sp;
 	gfn_t gfn;
+	gfn_t base_gfn;
 	int level;
 	u64 *sptep;
 	u64 old_child_spte;
 	int i;
 
 	sp = sptep_to_sp(pt);
-	gfn = sp->gfn;
+	base_gfn = sp->gfn;
 	level = sp->role.level;
 
 	trace_kvm_mmu_prepare_zap_page(sp);
@@ -348,16 +349,38 @@ static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt,
 
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
 		sptep = pt + i;
+		gfn = base_gfn + (i * KVM_PAGES_PER_HPAGE(level - 1));
 
 		if (atomic) {
-			old_child_spte = xchg(sptep, 0);
+			/*
+			 * Set the SPTE to a nonpresent value that other
+			 * threads will not overwrite. If the SPTE was already
+			 * frozen then another thread handling a page fault
+			 * could overwrite it, so set the SPTE until it is set
+			 * from nonfrozen -> frozen.
+			 */
+			for (;;) {
+				old_child_spte = xchg(sptep, FROZEN_SPTE);
+				if (old_child_spte != FROZEN_SPTE)
+					break;
+				cpu_relax();
+			}
 		} else {
 			old_child_spte = READ_ONCE(*sptep);
-			WRITE_ONCE(*sptep, 0);
+
+			/*
+			 * Setting the SPTE to FROZEN_SPTE is not strictly
+			 * necessary here as the MMU lock should stop other
+			 * threads from concurrently modifying this SPTE.
+			 * Using FROZEN_SPTE keeps the atomic and
+			 * non-atomic cases consistent and simplifies the
+			 * function.
+			 */
+			WRITE_ONCE(*sptep, FROZEN_SPTE);
 		}
-		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp),
-			gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
-			old_child_spte, 0, level - 1, atomic);
+		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
+				    old_child_spte, FROZEN_SPTE, level - 1,
+				    atomic);
 	}
 
 	kvm_flush_remote_tlbs_with_address(kvm, gfn,
-- 
2.30.0.284.gd98b1dd5eaa7-goog


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 24/24] kvm: x86/mmu: Allow parallel page faults for the TDP MMU
  2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
                   ` (22 preceding siblings ...)
  2021-01-12 18:10 ` [PATCH 23/24] kvm: x86/mmu: Freeze SPTEs in disconnected pages Ben Gardon
@ 2021-01-12 18:10 ` Ben Gardon
  2021-01-21  0:55   ` Sean Christopherson
  2021-01-26 13:37   ` Paolo Bonzini
  23 siblings, 2 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-12 18:10 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong, Ben Gardon

Make the last few changes necessary to enable the TDP MMU to handle page
faults in parallel while holding the mmu_lock in read mode.

Reviewed-by: Peter Feiner <pfeiner@google.com>

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 280d7cd6f94b..fa111ceb67d4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3724,7 +3724,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		return r;
 
 	r = RET_PF_RETRY;
-	kvm_mmu_lock(vcpu->kvm);
+
+	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+		kvm_mmu_lock_shared(vcpu->kvm);
+	else
+		kvm_mmu_lock(vcpu->kvm);
+
 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
 		goto out_unlock;
 	r = make_mmu_pages_available(vcpu);
@@ -3739,7 +3744,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 				 prefault, is_tdp);
 
 out_unlock:
-	kvm_mmu_unlock(vcpu->kvm);
+	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+		kvm_mmu_unlock_shared(vcpu->kvm);
+	else
+		kvm_mmu_unlock(vcpu->kvm);
 	kvm_release_pfn_clean(pfn);
 	return r;
 }
-- 
2.30.0.284.gd98b1dd5eaa7-goog


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 04/24] kvm: x86/mmu: change TDP MMU yield function returns to match cond_resched
  2021-01-12 18:10 ` [PATCH 04/24] kvm: x86/mmu: change TDP MMU yield function returns to match cond_resched Ben Gardon
@ 2021-01-20 18:38   ` Sean Christopherson
  2021-01-21 20:22     ` Paolo Bonzini
  2021-01-26 14:11     ` Paolo Bonzini
  0 siblings, 2 replies; 70+ messages in thread
From: Sean Christopherson @ 2021-01-20 18:38 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 12, 2021, Ben Gardon wrote:
> Currently the TDP MMU yield / cond_resched functions either return
> nothing or return true if the TLBs were not flushed. These are confusing
> semantics, especially when making control flow decisions in calling
> functions.
> 
> To clean things up, change both functions to have the same
> return value semantics as cond_resched: true if the thread yielded,
> false if it did not. If the function yielded in the _flush_ version,
> then the TLBs will have been flushed.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 38 +++++++++++++++++++++++++++++---------
>  1 file changed, 29 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 2ef8615f9dba..b2784514ca2d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -413,8 +413,15 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
>  			 _mmu->shadow_root_level, _start, _end)
>  
>  /*
> - * Flush the TLB if the process should drop kvm->mmu_lock.
> - * Return whether the caller still needs to flush the tlb.
> + * Flush the TLB and yield if the MMU lock is contended or this thread needs to
> + * return control to the scheduler.
> + *
> + * If this function yields, it will also reset the tdp_iter's walk over the
> + * paging structure and the calling function should allow the iterator to
> + * continue its traversal from the paging structure root.
> + *
> + * Return true if this function yielded, the TLBs were flushed, and the
> + * iterator's traversal was reset. Return false if a yield was not needed.
>   */
>  static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
>  {
> @@ -422,18 +429,30 @@ static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *it
>  		kvm_flush_remote_tlbs(kvm);
>  		cond_resched_lock(&kvm->mmu_lock);
>  		tdp_iter_refresh_walk(iter);
> -		return false;
> -	} else {
>  		return true;
> -	}
> +	} else
> +		return false;

Kernel style is to have curly braces on all branches if any branch has 'em.  Or,
omit the else since the taken branch always returns.  I think I prefer the latter?

>  }
>  
> -static void tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
> +/*
> + * Yield if the MMU lock is contended or this thread needs to return control
> + * to the scheduler.
> + *
> + * If this function yields, it will also reset the tdp_iter's walk over the
> + * paging structure and the calling function should allow the iterator to
> + * continue its traversal from the paging structure root.
> + *
> + * Return true if this function yielded and the iterator's traversal was reset.
> + * Return false if a yield was not needed.
> + */
> +static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
>  {
>  	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>  		cond_resched_lock(&kvm->mmu_lock);
>  		tdp_iter_refresh_walk(iter);
> -	}
> +		return true;
> +	} else
> +		return false;

Same here.

>  }
>  
>  /*
> @@ -470,7 +489,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  		tdp_mmu_set_spte(kvm, &iter, 0);
>  
>  		if (can_yield)
> -			flush_needed = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
> +			flush_needed = !tdp_mmu_iter_flush_cond_resched(kvm,
> +									&iter);

As with the existing code, I'd let this poke out.  Alternatively, this could be
written as:

		flush_needed = !can_yield ||
			       !tdp_mmu_iter_flush_cond_resched(kvm, &iter);

>  		else
>  			flush_needed = true;
>  	}
> @@ -1072,7 +1092,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>  
>  		tdp_mmu_set_spte(kvm, &iter, 0);
>  
> -		spte_set = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
> +		spte_set = !tdp_mmu_iter_flush_cond_resched(kvm, &iter);
>  	}
>  
>  	if (spte_set)
> -- 
> 2.30.0.284.gd98b1dd5eaa7-goog
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 05/24] kvm: x86/mmu: Fix yielding in TDP MMU
  2021-01-12 18:10 ` [PATCH 05/24] kvm: x86/mmu: Fix yielding in TDP MMU Ben Gardon
@ 2021-01-20 19:28   ` Sean Christopherson
  2021-01-22  1:06     ` Ben Gardon
  0 siblings, 1 reply; 70+ messages in thread
From: Sean Christopherson @ 2021-01-20 19:28 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 12, 2021, Ben Gardon wrote:
> There are two problems with the way the TDP MMU yields in long running
> functions. 1.) Given certain conditions, the function may not yield
> reliably / frequently enough. 2.) In some functions the TDP iter risks
> not making forward progress if two threads livelock yielding to
> one another.
> 
> Case 1 is possible if for example, a paging structure was very large
> but had few, if any writable entries. wrprot_gfn_range could traverse many
> entries before finding a writable entry and yielding.
> 
> Case 2 is possible if two threads were trying to execute wrprot_gfn_range.
> Each could write protect an entry and then yield. This would reset the
> tdp_iter's walk over the paging structure and the loop would end up
> repeating the same entry over and over, preventing either thread from
> making forward progress.
> 
> Fix these issues by moving the yield to the beginning of the loop,
> before other checks and only yielding if the loop has made forward
> progress since the last yield.

I think it'd be best to split this into two patches, e.g. ensure forward
progress and then yield more aggressively.  They are two separate bugs, and I
don't think that ensuring forward progress would exacerbate case #1.  I'm not
worried about breaking things so much as getting more helpful shortlogs; "Fix
yielding in TDP MMU" doesn't provide any insight into what exactly was broken.
E.g. something like:

  KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter
  KVM: x86/mmu: Yield in TDP MMU iter even if no real work was done

> Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 83 +++++++++++++++++++++++++++++++-------
>  1 file changed, 69 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b2784514ca2d..1987da0da66e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -470,9 +470,23 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  			  gfn_t start, gfn_t end, bool can_yield)
>  {
>  	struct tdp_iter iter;
> +	gfn_t last_goal_gfn = start;
>  	bool flush_needed = false;
>  
>  	tdp_root_for_each_pte(iter, root, start, end) {
> +		/* Ensure forward progress has been made before yielding. */
> +		if (can_yield && iter.goal_gfn != last_goal_gfn &&

Make last_goal_gfn a property of the iterator, that way all this logic can be
shoved into tdp_mmu_iter_flush_cond_resched(), and the comments about ensuring
forward progress and effectively invalidating/resetting the iterator (the
comment below) can be a function comment, as opposed to being copied everywhere.
E.g. there can be a big scary warning in the function comment stating that the
caller must restart its loop if the helper yielded.

Tangentially related, the name goal_gfn is quite confusing.  "goal" and "end"
are synonyms, but "goal" is often initialized with "start", and it's not used to
terminate the walk.  Maybe next_gfn instead?  And maybe yielded_gfn, since
last_next_gfn is pretty horrendous.

> +		    tdp_mmu_iter_flush_cond_resched(kvm, &iter)) {

This isn't quite correct, as tdp_mmu_iter_flush_cond_resched() will do an
expensive remote TLB flush on every yield, even if no flush is needed.  The
cleanest solution is likely to drop tdp_mmu_iter_flush_cond_resched() and
instead add a @flush param to tdp_mmu_iter_cond_resched().  If it's tagged
__always_inline, then the callers that unconditionally pass true/false will
optimize out the conditional code.

At that point, I think it would also make sense to fold tdp_iter_refresh_walk()
into tdp_mmu_iter_cond_resched(), because really we shouldn't be mucking with
the guts of the iter except for the yield case.

> +			last_goal_gfn = iter.goal_gfn;

Another argument for both renaming goal_gfn and moving last_*_gfn into the iter:
it's not at all obvious that updating the last gfn _after_ tdp_iter_refresh_walk()
is indeed correct.

You can also avoid a local variable by doing max(iter->next_gfn, iter->gfn) when
calling tdp_iter_refresh_walk().  IMO, that's also a bit easier to understand
than an open-coded equivalent.

E.g. putting it all together, with yielded_gfn set by tdp_iter_start():

static __always_inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
						     struct tdp_iter *iter,
						     bool flush)
{
	/* Ensure forward progress has been made since the last yield. */
	if (iter->next_gfn == iter->yielded_gfn)
		return false;

	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
		if (flush)
			kvm_flush_remote_tlbs(kvm);
		cond_resched_lock(&kvm->mmu_lock);

		/*
		 * Restart the walk over the paging structure from the root,
		 * starting from the highest gfn the iterator had previously
		 * reached.  The entire paging structure, except the root, may
		 * have been completely torn down and rebuilt while we yielded.
		 */
		tdp_iter_start(iter, iter->pt_path[iter->root_level - 1],
			       iter->root_level, iter->min_level,
			       max(iter->next_gfn, iter->gfn));
		return true;
	}

	return false;
}

> +			flush_needed = false;
> +			/*
> +			 * Yielding caused the paging structure walk to be
> +			 * reset so skip to the next iteration to continue the
> +			 * walk from the root.
> +			 */
> +			continue;
> +		}
> +
>  		if (!is_shadow_present_pte(iter.old_spte))
>  			continue;
>  

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 06/24] kvm: x86/mmu: Skip no-op changes in TDP MMU functions
  2021-01-12 18:10 ` [PATCH 06/24] kvm: x86/mmu: Skip no-op changes in TDP MMU functions Ben Gardon
@ 2021-01-20 19:51   ` Sean Christopherson
  2021-01-25 23:51     ` Ben Gardon
  0 siblings, 1 reply; 70+ messages in thread
From: Sean Christopherson @ 2021-01-20 19:51 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 12, 2021, Ben Gardon wrote:
> Skip setting SPTEs if no change is expected.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
>
Nit on all of these, can you remove the extra newline between the Reviewed-by
and SOB?

> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 1987da0da66e..2650fa9fe066 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -882,6 +882,9 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  		    !is_last_spte(iter.old_spte, iter.level))
>  			continue;
>  
> +		if (!(iter.old_spte & PT_WRITABLE_MASK))

Include the new check with the existing if statement?  I think it makes sense to
group all the checks on old_spte.

> +			continue;
> +
>  		new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
>  
>  		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
> @@ -1079,6 +1082,9 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  		if (!is_shadow_present_pte(iter.old_spte))
>  			continue;
>  
> +		if (iter.old_spte & shadow_dirty_mask)

Same comment here.

> +			continue;
> +

Unrelated to this patch, but it got me looking at the code: shouldn't
clear_dirty_pt_masked() clear the bit in @mask before checking whether or not
the spte needs to be modified?  That way the early break kicks in after sptes
are checked, not necessarily written.  E.g.

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2650fa9fe066..d8eeae910cbf 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1010,21 +1010,21 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
                    !(mask & (1UL << (iter.gfn - gfn))))
                        continue;

-               if (wrprot || spte_ad_need_write_protect(iter.old_spte)) {
-                       if (is_writable_pte(iter.old_spte))
-                               new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
-                       else
-                               continue;
-               } else {
-                       if (iter.old_spte & shadow_dirty_mask)
-                               new_spte = iter.old_spte & ~shadow_dirty_mask;
-                       else
-                               continue;
-               }
-
-               tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
-
                mask &= ~(1UL << (iter.gfn - gfn));
+
+               if (wrprot || spte_ad_need_write_protect(iter.old_spte)) {
+                       if (is_writable_pte(iter.old_spte))
+                               new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
+                       else
+                               continue;
+               } else {
+                       if (iter.old_spte & shadow_dirty_mask)
+                               new_spte = iter.old_spte & ~shadow_dirty_mask;
+                       else
+                               continue;
+               }
+
+               tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
        }
 }
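
To make the effect of the reordering concrete, here is a small userspace
sketch. It assumes the walk stops as soon as the mask is empty (the "early
break" mentioned above); the WRITABLE bit, the flat array standing in for the
page table walk, and the wrprot_masked() helper are all invented for the
illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WRITABLE (1ull << 1)	/* hypothetical writable bit */

/*
 * Write-protect the entries selected by mask, walking from index 0 and
 * stopping once mask is empty.  n must not exceed the bits in a long.
 * Returns the number of entries visited before the walk stopped.
 */
static size_t wrprot_masked(uint64_t *sptes, size_t n, unsigned long mask,
			    bool clear_before_check)
{
	size_t visited = 0;

	for (size_t i = 0; i < n && mask; i++) {
		visited++;

		if (!(mask & (1UL << i)))
			continue;

		/* Proposed order: the entry counts as handled once checked. */
		if (clear_before_check)
			mask &= ~(1UL << i);

		if (!(sptes[i] & WRITABLE))
			continue;	/* nothing to change */

		sptes[i] &= ~WRITABLE;

		/* Existing order: the bit is cleared only after a write. */
		if (!clear_before_check)
			mask &= ~(1UL << i);
	}

	return visited;
}
```

With the existing order, a masked entry that is already write-protected never
clears its bit, so the walk runs to the end of the range; with the bit cleared
first, the walk stops right after the last masked entry has been checked.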


>  		new_spte = iter.old_spte | shadow_dirty_mask;
>  
>  		tdp_mmu_set_spte(kvm, &iter, new_spte);
> -- 
> 2.30.0.284.gd98b1dd5eaa7-goog
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 08/24] kvm: x86/mmu: Add lockdep when setting a TDP MMU SPTE
  2021-01-12 18:10 ` [PATCH 08/24] kvm: x86/mmu: Add lockdep when setting a TDP MMU SPTE Ben Gardon
@ 2021-01-20 19:58   ` Sean Christopherson
  2021-01-26 14:13   ` Paolo Bonzini
  1 sibling, 0 replies; 70+ messages in thread
From: Sean Christopherson @ 2021-01-20 19:58 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 12, 2021, Ben Gardon wrote:
> Add lockdep to __tdp_mmu_set_spte to ensure that SPTEs are only modified
> under the MMU lock. This lockdep will be updated in future commits to
> reflect and validate changes to the TDP MMU's synchronization strategy.

I'd omit the "updated in future commits" justification.  IMO this is a good
change even if we never build on it, and the extra justification would be
confusing if this is merged separately from the parallelization patches.

> No functional change intended.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>

Reviewed-by: Sean Christopherson <seanjc@google.com> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 09/24] kvm: x86/mmu: Don't redundantly clear TDP MMU pt memory
  2021-01-12 18:10 ` [PATCH 09/24] kvm: x86/mmu: Don't redundantly clear TDP MMU pt memory Ben Gardon
@ 2021-01-20 20:06   ` Sean Christopherson
  2021-01-26 14:14   ` Paolo Bonzini
  1 sibling, 0 replies; 70+ messages in thread
From: Sean Christopherson @ 2021-01-20 20:06 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 12, 2021, Ben Gardon wrote:
> The KVM MMU caches already guarantee that shadow page table memory will
> be zeroed, so there is no reason to re-zero the page in the TDP MMU page
> fault handler.
> 
> No functional change intended.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>

Reviewed-by: Sean Christopherson <seanjc@google.com> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 10/24] kvm: x86/mmu: Factor out handle disconnected pt
  2021-01-12 18:10 ` [PATCH 10/24] kvm: x86/mmu: Factor out handle disconnected pt Ben Gardon
@ 2021-01-20 20:30   ` Sean Christopherson
  2021-01-26 14:14   ` Paolo Bonzini
  1 sibling, 0 replies; 70+ messages in thread
From: Sean Christopherson @ 2021-01-20 20:30 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

Spell out "page tables"?  Not short on chars.  The grammar is also a bit funky.

  KVM: x86/mmu: Factor out handling of disconnected page tables

On Tue, Jan 12, 2021, Ben Gardon wrote:
> Factor out the code to handle a disconnected subtree of the TDP paging
> structure from the code to handle the change to an individual SPTE.
> Future commits will build on this to allow asynchronous page freeing.
> 
> No functional change intended.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 75 +++++++++++++++++++++++---------------
>  1 file changed, 46 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 55df596696c7..e8f35cd46b4c 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -234,6 +234,49 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
>  	}
>  }
>  
> +/**
> + * handle_disconnected_tdp_mmu_page - handle a pt removed from the TDP structure

Maybe s/disconnected/removed?

I completely understand why you used "disconnected", and to a large extent I
agree it's a good descriptor, but all of the surrounding comments talk about the
page tables being "removed".  And for me, "disconnected" implies that it
could be reconnected in the future, whereas "removed" is a more firm "this page,
in its current form, is gone for good".

> + *
> + * @kvm: kvm instance
> + * @pt: the page removed from the paging structure
> + *
> + * Given a page table that has been removed from the TDP paging structure,
> + * iterates through the page table to clear SPTEs and free child page tables.
> + */
> +static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt)
> +{
> +	struct kvm_mmu_page *sp;
> +	gfn_t gfn;
> +	int level;
> +	u64 old_child_spte;
> +	int i;

Nit: use reverse fir tree?  I don't think KVM needs to be as strict as tip for
that rule/guideline, but I do think it's usually a net positive for readability.

> +	sp = sptep_to_sp(pt);
> +	gfn = sp->gfn;
> +	level = sp->role.level;

Initialize these from the get-go?  That would help the reader understand these
are local snapshots to shorten lines, as opposed to scratch variables.

	struct kvm_mmu_page *sp = sptep_to_sp(pt);
	int level = sp->role.level;
	gfn_t gfn = sp->gfn;
	u64 old_child_spte;
	int i;

> +
> +	trace_kvm_mmu_prepare_zap_page(sp);
> +
> +	list_del(&sp->link);
> +
> +	if (sp->lpage_disallowed)
> +		unaccount_huge_nx_page(kvm, sp);
> +
> +	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> +		old_child_spte = READ_ONCE(*(pt + i));
> +		WRITE_ONCE(*(pt + i), 0);
> +		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp),
> +			gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
> +			old_child_spte, 0, level - 1);
> +	}
> +
> +	kvm_flush_remote_tlbs_with_address(kvm, gfn,
> +					   KVM_PAGES_PER_HPAGE(level));
> +
> +	free_page((unsigned long)pt);
> +	kmem_cache_free(mmu_page_header_cache, sp);
> +}

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 11/24] kvm: x86/mmu: Put TDP MMU PT walks in RCU read-critical section
  2021-01-12 18:10 ` [PATCH 11/24] kvm: x86/mmu: Put TDP MMU PT walks in RCU read-critical section Ben Gardon
@ 2021-01-20 22:19   ` Sean Christopherson
  0 siblings, 0 replies; 70+ messages in thread
From: Sean Christopherson @ 2021-01-20 22:19 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 12, 2021, Ben Gardon wrote:                                         
> In order to enable concurrent modifications to the paging structures in       
> the TDP MMU, threads must be able to safely remove pages of page table        
> memory while other threads are traversing the same memory. To ensure          
> threads do not access PT memory after it is freed, protect PT memory          
> with RCU.                                                                     
                                                                                
Normally I like splitting up patches, but the three RCU patches (11-13) probably
need to be combined into a single patch.  I assume you introduced the RCU       
readers in a separate patch to isolate deadlocks, but it's impossible to give   
this patch a proper review without peeking ahead to see what's actually
being protected with RCU.                                                       
                                                                                
The combined changelog should also explain why READING_SHADOW_PAGE_TABLES isn't 
a good solution.  I suspect the answer is because the longer-running walks would
disable IRQs for too long, but that should be explicitly documented.

> Reviewed-by: Peter Feiner <pfeiner@google.com>                                
>                                                                               
> Signed-off-by: Ben Gardon <bgardon@google.com>                                
> ---                                                                           
>  arch/x86/kvm/mmu/tdp_mmu.c | 53 ++++++++++++++++++++++++++++++++++++--       
>  1 file changed, 51 insertions(+), 2 deletions(-)                             
>                                                                               
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c          
> index e8f35cd46b4c..662907d374b3 100644                                       
> --- a/arch/x86/kvm/mmu/tdp_mmu.c                                              
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c                                              
> @@ -458,11 +458,14 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
>   * Return true if this function yielded, the TLBs were flushed, and the      
>   * iterator's traversal was reset. Return false if a yield was not needed.   
>   */                                                                          
> -static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
> +static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm,                 
> +             struct tdp_iter *iter)                                          
                                                                                
Unrelated newline.                                                              
                                                                                
>  {                                                                            
>       if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {                 
>               kvm_flush_remote_tlbs(kvm);                                     
> +             rcu_read_unlock();                                              
                                                                                
I'm 99% certain rcu_read_unlock() can be moved before the TLB flush.  IIUC, RCU 
is protecting only the host kernel's software walks; the only true "writer" is  
immediately preceded by a remote TLB flush (in patch 13).                       
                                                                                
        kvm_flush_remote_tlbs_with_address(kvm, gfn,                            
                                           KVM_PAGES_PER_HPAGE(level));         
                                                                                
        call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);                  
                                                                                
That also resolves an inconsistency with zap_gfn_range(), which unlocks before
doing the remote flush.  Ditto for zap_collapsible_spte_range(), and I think a
few other flows.

>  		cond_resched_lock(&kvm->mmu_lock);
> +		rcu_read_lock();
>  		tdp_iter_refresh_walk(iter);
>  		return true;
>  	} else
> @@ -483,7 +486,9 @@ static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *it
>  static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
>  {
>  	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> +		rcu_read_unlock();
>  		cond_resched_lock(&kvm->mmu_lock);
> +		rcu_read_lock();
>  		tdp_iter_refresh_walk(iter);
>  		return true;
>  	} else
> @@ -508,6 +513,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  	gfn_t last_goal_gfn = start;
>  	bool flush_needed = false;
>  
> +	rcu_read_lock();
> +
>  	tdp_root_for_each_pte(iter, root, start, end) {
>  		/* Ensure forward progress has been made before yielding. */
>  		if (can_yield && iter.goal_gfn != last_goal_gfn &&
> @@ -538,6 +545,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  		tdp_mmu_set_spte(kvm, &iter, 0);
>  		flush_needed = true;
>  	}
> +
> +	rcu_read_unlock();

Unlock before TLB flush.  <-------

>  	return flush_needed;
>  }

...

> @@ -844,6 +863,8 @@ static int set_tdp_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	u64 new_spte;
>  	int need_flush = 0;
>  
> +	rcu_read_lock();
> +
>  	WARN_ON(pte_huge(*ptep));
>  
>  	new_pfn = pte_pfn(*ptep);
> @@ -872,6 +893,8 @@ static int set_tdp_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	if (need_flush)
>  		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
>  
> +	rcu_read_unlock();

Unlock before flush?

> +
>  	return 0;
>  }
>  
  
...

> @@ -1277,10 +1322,14 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
>  
>  	*root_level = vcpu->arch.mmu->shadow_root_level;
>  
> +	rcu_read_lock();

Hrm, isn't this an existing bug?  And also not really the correct fix?  mmu_lock
is not held here, so the existing code has no protections.  Using
walk_shadow_page_lockless_begin/end() feels more appropriate for this particular
walk.

> +
>  	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
>  		leaf = iter.level;
>  		sptes[leaf] = iter.old_spte;
>  	}
>  
> +	rcu_read_unlock();
> +
>  	return leaf;
>  }
> -- 
> 2.30.0.284.gd98b1dd5eaa7-goog
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 22/24] kvm: x86/mmu: Flush TLBs after zap in TDP MMU PF handler
  2021-01-12 18:10 ` [PATCH 22/24] kvm: x86/mmu: Flush TLBs after zap in TDP MMU PF handler Ben Gardon
@ 2021-01-21  0:05   ` Sean Christopherson
  0 siblings, 0 replies; 70+ messages in thread
From: Sean Christopherson @ 2021-01-21  0:05 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 12, 2021, Ben Gardon wrote:
> When the TDP MMU is allowed to handle page faults in parallel there is
> the possibility of a race where an SPTE is cleared and then immediately
> replaced with a present SPTE pointing to a different PFN, before the
> TLBs can be flushed. This race would violate architectural specs. Ensure
> that the TLBs are flushed properly before other threads are allowed to
> install any present value for the SPTE.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/spte.h    | 16 +++++++++-
>  arch/x86/kvm/mmu/tdp_mmu.c | 62 ++++++++++++++++++++++++++++++++------
>  2 files changed, 68 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 2b3a30bd38b0..ecd9bfbccef4 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -130,6 +130,20 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
>  					  PT64_EPT_EXECUTABLE_MASK)
>  #define SHADOW_ACC_TRACK_SAVED_BITS_SHIFT PT64_SECOND_AVAIL_BITS_SHIFT
>  
> +/*
> + * If a thread running without exclusive control of the MMU lock must perform a
> + * multi-part operation on an SPTE, it can set the SPTE to FROZEN_SPTE as a
> + * non-present intermediate value. This will guarantee that other threads will
> + * not modify the spte.
> + *
> + * This constant works because it is considered non-present on both AMD and
> + * Intel CPUs and does not create a L1TF vulnerability because the pfn section
> + * is zeroed out.
> + *
> + * Only used by the TDP MMU.
> + */
> +#define FROZEN_SPTE (1ull << 59)

I dislike FROZEN, for similar reasons that I disliked "disconnected".  The SPTE
isn't frozen in the sense that it's temporarily immutable, rather it's been
removed but hasn't been flushed and so can't yet be reused.  Given that
FROZEN_SPTEs are treated as not-present SPTEs, there's zero chance that this can
be extended in the future to be a generic temporary freeze mechanism.

Maybe REMOVED_SPTE to match earlier feedback?

> +
>  /*
>   * In some cases, we need to preserve the GFN of a non-present or reserved
>   * SPTE when we usurp the upper five bits of the physical address space to
> @@ -187,7 +201,7 @@ static inline bool is_access_track_spte(u64 spte)
>  
>  static inline int is_shadow_present_pte(u64 pte)

Waaaay off topic, I'm going to send a patch to have this, and any other pte
helpers that return an int, return a bool.  While futzing around with ideas I
managed to turn this into a nop by doing

	return pte & SPTE_PRESENT;

which is guaranteed to be 0 if SPTE_PRESENT is a bit > 31.  I'm sure others will
point out that I'm a heathen for not doing !!(pte & SPTE_PRESENT), but still...
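
To make the truncation concrete, here is a minimal userspace sketch.  It assumes
the present flag lives at bit 62 (as in the untested diff further down), and the
helper names are made up for illustration, they are not KVM functions.  The
int conversion is implementation-defined in C; GCC and Clang reduce the value
modulo 2^32, which drops bit 62.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: present flag at bit 62, mirroring the untested diff. */
#define SPTE_PRESENT (1ULL << 62)

/* int return: the 64-bit result is narrowed to int, losing bit 62. */
static inline int present_as_int(uint64_t pte)
{
	return pte & SPTE_PRESENT;
}

/* bool return: any non-zero value converts to true. */
static inline bool present_as_bool(uint64_t pte)
{
	return pte & SPTE_PRESENT;
}

/* The "!!" idiom normalizes to 0/1 before narrowing, so int is safe. */
static inline int present_as_int_fixed(uint64_t pte)
{
	return !!(pte & SPTE_PRESENT);
}

/* Expected behavior with GCC/Clang on a target with 32-bit int. */
static inline void present_self_check(void)
{
	assert(present_as_int(SPTE_PRESENT) == 0);	/* silently "not present" */
	assert(present_as_bool(SPTE_PRESENT));
	assert(present_as_int_fixed(SPTE_PRESENT) == 1);
}
```

Neither compiler warns about the narrowing by default, which is why the bug is
easy to miss.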

>  {
> -	return (pte != 0) && !is_mmio_spte(pte);
> +	return (pte != 0) && !is_mmio_spte(pte) && (pte != FROZEN_SPTE);

For all other checks, I'd strongly prefer to add a helper, e.g. is_removed_spte()
or whatever.  That way changing the implementation won't be as painful, and we
can add assertions and whatnot if we break things.  Especially since FROZEN_SPTE
is a single bit, which makes it look like a flag even though it's used as a full
64-bit constant.
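
A rough sketch of what I mean by a helper, compilable in userspace.  The value
is the one from the quoted patch; the REMOVED_SPTE and is_removed_spte() names
are only the suggestions from this mail and don't exist anywhere yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as FROZEN_SPTE in the quoted patch; the name is hypothetical. */
#define REMOVED_SPTE (1ULL << 59)

/*
 * Hypothetical helper.  It compares the full 64-bit value, so an SPTE that
 * merely has bit 59 set alongside other bits is not treated as removed.
 */
static inline bool is_removed_spte(uint64_t spte)
{
	return spte == REMOVED_SPTE;
}

static inline void removed_self_check(void)
{
	assert(is_removed_spte(REMOVED_SPTE));
	assert(!is_removed_spte(0));
	assert(!is_removed_spte(REMOVED_SPTE | 1));	/* bit 59 plus another bit */
}
```

With that in place, open-coded "!= FROZEN_SPTE" checks become
!is_removed_spte(), and a rename or re-encoding touches a single definition.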

For this, I worry that is_shadow_present_pte() is getting bloated.  It's also a
bit unfortunate that it's bloated for the old MMU, without any benefit. That
being said, most of that bloat is from the existing MMIO checks.  Looking
elsewhere, TDX's SEPT also has a similar concept that may or may not need to
hook is_shadow_present_pte().

Rather than bundle MMIO SPTEs into the access-tracking flags and have a bunch of
special cases for not-present SPTEs, what if we add an explicit flag to mark
SPTEs as present (or not-present)?  Defining SPTE_PRESENT instead of
SPTE_NOT_PRESENT might require a few more changes, but it would be the most
optimal for is_shadow_present_pte().

I'm thinking something like this (completely untested):

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index c51ad544f25b..86f6c84569c4 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -38,7 +38,7 @@ static u64 generation_mmio_spte_mask(u64 gen)
        u64 mask;

        WARN_ON(gen & ~MMIO_SPTE_GEN_MASK);
-       BUILD_BUG_ON((MMIO_SPTE_GEN_HIGH_MASK | MMIO_SPTE_GEN_LOW_MASK) & SPTE_SPECIAL_MASK);
+       BUILD_BUG_ON((MMIO_SPTE_GEN_HIGH_MASK | MMIO_SPTE_GEN_LOW_MASK) & SPTE_MMIO);

        mask = (gen << MMIO_SPTE_GEN_LOW_SHIFT) & MMIO_SPTE_GEN_LOW_MASK;
        mask |= (gen << MMIO_SPTE_GEN_HIGH_SHIFT) & MMIO_SPTE_GEN_HIGH_MASK;
@@ -86,7 +86,7 @@ int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
                     bool can_unsync, bool host_writable, bool ad_disabled,
                     u64 *new_spte)
 {
-       u64 spte = 0;
+       u64 spte = SPTE_PRESENT;
        int ret = 0;

        if (ad_disabled)
@@ -247,7 +247,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 access_mask)
        BUG_ON((u64)(unsigned)access_mask != access_mask);
        WARN_ON(mmio_value & (shadow_nonpresent_or_rsvd_mask << SHADOW_NONPRESENT_OR_RSVD_MASK_LEN));
        WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
-       shadow_mmio_value = mmio_value | SPTE_MMIO_MASK;
+       shadow_mmio_value = mmio_value | SPTE_MMIO;
        shadow_mmio_access_mask = access_mask;
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index ecd9bfbccef4..465e43d34034 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -5,18 +5,15 @@

 #include "mmu_internal.h"

+/* Software available bits for present SPTEs. */
 #define PT_FIRST_AVAIL_BITS_SHIFT 10
 #define PT64_SECOND_AVAIL_BITS_SHIFT 54

-/*
- * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
- * Access Tracking SPTEs.
- */
+/* The mask used to denote Access Tracking SPTEs.  Note, val=3 is available. */
 #define SPTE_SPECIAL_MASK (3ULL << 52)
 #define SPTE_AD_ENABLED_MASK (0ULL << 52)
 #define SPTE_AD_DISABLED_MASK (1ULL << 52)
 #define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
-#define SPTE_MMIO_MASK (3ULL << 52)

 #ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
 #define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
@@ -55,12 +52,16 @@
 #define SPTE_HOST_WRITEABLE    (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
 #define SPTE_MMU_WRITEABLE     (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))

+#define SPTE_REMOVED           BIT_ULL(60)
+#define SPTE_MMIO              BIT_ULL(61)
+#define SPTE_PRESENT           BIT_ULL(62)
+
 /*
  * Due to limited space in PTEs, the MMIO generation is a 18 bit subset of
  * the memslots generation and is derived as follows:
  *
  * Bits 0-8 of the MMIO generation are propagated to spte bits 3-11
- * Bits 9-17 of the MMIO generation are propagated to spte bits 54-62
+ * Bits 9-17 of the MMIO generation are propagated to spte bits 52-60
  *
  * The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
  * the MMIO generation number, as doing so would require stealing a bit from
@@ -73,8 +74,8 @@
 #define MMIO_SPTE_GEN_LOW_START                3
 #define MMIO_SPTE_GEN_LOW_END          11

-#define MMIO_SPTE_GEN_HIGH_START       PT64_SECOND_AVAIL_BITS_SHIFT
-#define MMIO_SPTE_GEN_HIGH_END         62
+#define MMIO_SPTE_GEN_HIGH_START       52
+#define MMIO_SPTE_GEN_HIGH_END         60

 #define MMIO_SPTE_GEN_LOW_MASK         GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
                                                    MMIO_SPTE_GEN_LOW_START)
@@ -162,7 +163,7 @@ extern u8 __read_mostly shadow_phys_bits;

 static inline bool is_mmio_spte(u64 spte)
 {
-       return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
+       return spte & SPTE_MMIO;
 }

 static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
@@ -199,9 +200,9 @@ static inline bool is_access_track_spte(u64 spte)
        return !spte_ad_enabled(spte) && (spte & shadow_acc_track_mask) == 0;
 }

-static inline int is_shadow_present_pte(u64 pte)
+static inline bool is_shadow_present_pte(u64 pte)
 {
-       return (pte != 0) && !is_mmio_spte(pte) && (pte != FROZEN_SPTE);
+       return pte & SPTE_PRESENT;
 }

 static inline int is_large_pte(u64 pte)
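
For reference, a userspace sketch of how the 18-bit MMIO generation would be
spread across the SPTE with the proposed layout (low bits in 3-11, high bits in
52-60).  GENMASK64() is a local stand-in for the kernel's GENMASK_ULL(), the
GEN_* names are shortened for the example, and like the diff this assumes the
layout above rather than anything that has been merged.

```c
#include <assert.h>
#include <stdint.h>

/* Proposed layout: generation bits 0-8 -> spte 3-11, bits 9-17 -> spte 52-60. */
#define GEN_LOW_START	3
#define GEN_LOW_END	11
#define GEN_HIGH_START	52
#define GEN_HIGH_END	60
#define GEN_LOW_BITS	(GEN_LOW_END - GEN_LOW_START + 1)

/* Userspace stand-in for GENMASK_ULL(high, low). */
#define GENMASK64(h, l)	((~0ULL << (l)) & (~0ULL >> (63 - (h))))

#define GEN_LOW_MASK	GENMASK64(GEN_LOW_END, GEN_LOW_START)
#define GEN_HIGH_MASK	GENMASK64(GEN_HIGH_END, GEN_HIGH_START)

/* Scatter the generation into the two SPTE bit ranges. */
static inline uint64_t gen_to_spte_bits(uint64_t gen)
{
	uint64_t bits;

	bits = (gen << GEN_LOW_START) & GEN_LOW_MASK;
	bits |= (gen << (GEN_HIGH_START - GEN_LOW_BITS)) & GEN_HIGH_MASK;
	return bits;
}

/* Gather the generation back out of an SPTE. */
static inline uint64_t spte_bits_to_gen(uint64_t spte)
{
	uint64_t gen;

	gen = (spte & GEN_LOW_MASK) >> GEN_LOW_START;
	gen |= (spte & GEN_HIGH_MASK) >> (GEN_HIGH_START - GEN_LOW_BITS);
	return gen;
}

static inline void gen_self_check(void)
{
	/* All 18 generation bits land exactly in the two ranges. */
	assert(gen_to_spte_bits(0x3ffff) == (GEN_LOW_MASK | GEN_HIGH_MASK));
	/* Bits 61 (MMIO) and 62 (present) in the proposal stay untouched. */
	assert((gen_to_spte_bits(0x3ffff) & (3ULL << 61)) == 0);
}
```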


>  }
>  
>  static inline int is_large_pte(u64 pte)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7b12a87a4124..5c9d053000ad 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -429,15 +429,19 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  	 */
>  	if (!was_present && !is_present) {
>  		/*
> -		 * If this change does not involve a MMIO SPTE, it is
> -		 * unexpected. Log the change, though it should not impact the
> -		 * guest since both the former and current SPTEs are nonpresent.
> +		 * If this change does not involve a MMIO SPTE or FROZEN_SPTE,

For comments and error message, I think we should avoid using the exact constant
name, and instead call them "removed SPTE", similar to MMIO SPTE.  That will
help reduce thrash and/or stale comments if the name changes.

> +		 * it is unexpected. Log the change, though it should not
> +		 * impact the guest since both the former and current SPTEs
> +		 * are nonpresent.
>  		 */
> -		if (WARN_ON(!is_mmio_spte(old_spte) && !is_mmio_spte(new_spte)))
> +		if (WARN_ON(!is_mmio_spte(old_spte) &&
> +			    !is_mmio_spte(new_spte) &&
> +			    new_spte != FROZEN_SPTE))
>  			pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
>  			       "should not be replaced with another,\n"
>  			       "different nonpresent SPTE, unless one or both\n"
> -			       "are MMIO SPTEs.\n"
> +			       "are MMIO SPTEs, or the new SPTE is\n"
> +			       "FROZEN_SPTE.\n"
>  			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
>  			       as_id, gfn, old_spte, new_spte, level);
>  		return;

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak
  2021-01-12 18:10 ` [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak Ben Gardon
@ 2021-01-21  0:19   ` Sean Christopherson
  2021-01-21 20:17     ` Paolo Bonzini
  2021-01-26 14:38     ` Paolo Bonzini
  0 siblings, 2 replies; 70+ messages in thread
From: Sean Christopherson @ 2021-01-21  0:19 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 12, 2021, Ben Gardon wrote:
> Wrap the MMU lock cond_resched and needbreak operations in a function.
> This will support a refactoring to move the lock into the struct
> kvm_arch(s) so that x86 can change the spinlock to a rwlock without
> affecting the performance of other archs.

IMO, moving the lock to arch-specific code is bad for KVM.  The architectures'
MMUs already diverge pretty horribly, and once things diverge it's really hard
to go the other direction.  And this change, along with all of the wrappers,
thrash a lot of code and add a fair amount of indirection without any real
benefit to the other architectures.

What if we simply make the common mmu_lock a union?  The rwlock_t is probably a
bit bigger, but that's a few bytes for an entire VM.  And maybe this would
entice/inspire other architectures to move to a similar MMU model.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f3b1013fb22c..bbc8efd4af62 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -451,7 +451,10 @@ struct kvm_memslots {
 };

 struct kvm {
-       spinlock_t mmu_lock;
+       union {
+               rwlock_t mmu_rwlock;
+               spinlock_t mmu_lock;
+       };
        struct mutex slots_lock;
        struct mm_struct *mm; /* userspace tied to this vm */
        struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 18/24] kvm: x86/mmu: Use an rwlock for the x86 TDP MMU
  2021-01-12 18:10 ` [PATCH 18/24] kvm: x86/mmu: Use an rwlock for the x86 TDP MMU Ben Gardon
@ 2021-01-21  0:45   ` Sean Christopherson
  0 siblings, 0 replies; 70+ messages in thread
From: Sean Christopherson @ 2021-01-21  0:45 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 12, 2021, Ben Gardon wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ba296ad051c3..280d7cd6f94b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5471,6 +5471,11 @@ void kvm_mmu_init_vm(struct kvm *kvm)
>  
>  	kvm_mmu_init_tdp_mmu(kvm);
>  
> +	if (kvm->arch.tdp_mmu_enabled)
> +		rwlock_init(&kvm->arch.mmu_rwlock);
> +	else
> +		spin_lock_init(&kvm->arch.mmu_lock);

Rather than use different lock types, what if we always use a rwlock, but only
acquire it for read when handling page faults for TDP MMUs?  That would
significantly reduce the amount of boilerplate conditionals.

The fast paths for write_lock() and spin_lock() are nearly identical, and
I would hope any differences in the slow paths are hidden in the noise.

> +
>  	node->track_write = kvm_mmu_pte_write;
>  	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
>  	kvm_page_track_register_notifier(kvm, node);
> @@ -6074,3 +6079,87 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>  	if (kvm->arch.nx_lpage_recovery_thread)
>  		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
>  }
> +
> +void kvm_mmu_lock_shared(struct kvm *kvm)
> +{
> +	WARN_ON(!kvm->arch.tdp_mmu_enabled);
> +	read_lock(&kvm->arch.mmu_rwlock);
> +}
> +
> +void kvm_mmu_unlock_shared(struct kvm *kvm)
> +{
> +	WARN_ON(!kvm->arch.tdp_mmu_enabled);
> +	read_unlock(&kvm->arch.mmu_rwlock);
> +}
> +
> +void kvm_mmu_lock_exclusive(struct kvm *kvm)
> +{
> +	WARN_ON(!kvm->arch.tdp_mmu_enabled);
> +	write_lock(&kvm->arch.mmu_rwlock);
> +}
> +
> +void kvm_mmu_unlock_exclusive(struct kvm *kvm)
> +{
> +	WARN_ON(!kvm->arch.tdp_mmu_enabled);
> +	write_unlock(&kvm->arch.mmu_rwlock);
> +}

I'm not a fan of all of these wrappers.  It's extra layers and WARNs, and
introduces terminology that differs from the kernel's locking terminology,
e.g. read vs. shared.  The WARNs are particularly wasteful as these all have
exactly one caller that explicitly checks kvm->arch.tdp_mmu_enabled.

Even if we don't unconditionally use the rwlock, I think I'd prefer to omit
these rwlock wrappers and instead use read/write_lock directly (and drop the
WARNs). 

> +
> +void kvm_mmu_lock(struct kvm *kvm)
> +{
> +	if (kvm->arch.tdp_mmu_enabled)
> +		kvm_mmu_lock_exclusive(kvm);
> +	else
> +		spin_lock(&kvm->arch.mmu_lock);
> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_lock);
> +
> +void kvm_mmu_unlock(struct kvm *kvm)
> +{
> +	if (kvm->arch.tdp_mmu_enabled)
> +		kvm_mmu_unlock_exclusive(kvm);
> +	else
> +		spin_unlock(&kvm->arch.mmu_lock);
> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_unlock);

These exports aren't needed, I don't see any callers in kvm_intel or kvm_amd.
That's a moot point if we use rwlock unconditionally.

> +

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 24/24] kvm: x86/mmu: Allow parallel page faults for the TDP MMU
  2021-01-12 18:10 ` [PATCH 24/24] kvm: x86/mmu: Allow parallel page faults for the TDP MMU Ben Gardon
@ 2021-01-21  0:55   ` Sean Christopherson
  2021-01-26 21:57     ` Ben Gardon
  2021-01-26 13:37   ` Paolo Bonzini
  1 sibling, 1 reply; 70+ messages in thread
From: Sean Christopherson @ 2021-01-21  0:55 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 12, 2021, Ben Gardon wrote:
> Make the last few changes necessary to enable the TDP MMU to handle page
> faults in parallel while holding the mmu_lock in read mode.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 280d7cd6f94b..fa111ceb67d4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3724,7 +3724,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  		return r;
>  
>  	r = RET_PF_RETRY;
> -	kvm_mmu_lock(vcpu->kvm);
> +
> +	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))

Off topic, what do you think about rewriting is_tdp_mmu_root() to be both more
performant and self-documenting as to when is_tdp_mmu_root() !=
kvm->arch.tdp_mmu_enabled?  E.g. key off is_guest_mode() and then do a thorough
audit/check when CONFIG_KVM_MMU_AUDIT=y?

#ifdef CONFIG_KVM_MMU_AUDIT
bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
{
	struct kvm_mmu_page *sp;

	if (!kvm->arch.tdp_mmu_enabled)
		return false;
	if (WARN_ON(!VALID_PAGE(hpa)))
		return false;

	sp = to_shadow_page(hpa);
	if (WARN_ON(!sp))
		return false;

	return sp->tdp_mmu_page && sp->root_count;
}
#endif

bool is_tdp_mmu(struct kvm_vcpu *vcpu)
{
	bool is_tdp_mmu = vcpu->kvm->arch.tdp_mmu_enabled && !is_guest_mode(vcpu);

#ifdef CONFIG_KVM_MMU_AUDIT
	WARN_ON(is_tdp_mmu != is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa));
#endif
	return is_tdp_mmu;
}

> +		kvm_mmu_lock_shared(vcpu->kvm);
> +	else
> +		kvm_mmu_lock(vcpu->kvm);
> +
>  	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
>  		goto out_unlock;
>  	r = make_mmu_pages_available(vcpu);
> @@ -3739,7 +3744,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  				 prefault, is_tdp);
>  
>  out_unlock:
> -	kvm_mmu_unlock(vcpu->kvm);
> +	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> +		kvm_mmu_unlock_shared(vcpu->kvm);
> +	else
> +		kvm_mmu_unlock(vcpu->kvm);
>  	kvm_release_pfn_clean(pfn);
>  	return r;
>  }
> -- 
> 2.30.0.284.gd98b1dd5eaa7-goog
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock
  2021-01-12 18:10 ` [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock Ben Gardon
@ 2021-01-21 19:22   ` Sean Christopherson
  2021-01-21 21:32     ` Sean Christopherson
  2021-01-26 13:37   ` Paolo Bonzini
  1 sibling, 1 reply; 70+ messages in thread
From: Sean Christopherson @ 2021-01-21 19:22 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 12, 2021, Ben Gardon wrote:
> Add a lock to protect the data structures that track the page table
> memory used by the TDP MMU. In order to handle multiple TDP MMU
> operations in parallel, pages of PT memory must be added and removed
> without the exclusive protection of the MMU lock. A new lock to protect
> the list(s) of in-use pages will cause some serialization, but only on
> non-leaf page table entries, so the lock is not expected to be very
> contended.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 15 ++++++++
>  arch/x86/kvm/mmu/tdp_mmu.c      | 67 +++++++++++++++++++++++++++++----
>  2 files changed, 74 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 92d5340842c8..f8dccb27c722 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1034,6 +1034,21 @@ struct kvm_arch {
>  	 * tdp_mmu_page set and a root_count of 0.
>  	 */
>  	struct list_head tdp_mmu_pages;
> +
> +	/*
> +	 * Protects accesses to the following fields when the MMU lock is
> +	 * not held exclusively:
> +	 *  - tdp_mmu_pages (above)
> +	 *  - the link field of struct kvm_mmu_pages used by the TDP MMU
> +	 *    when they are part of tdp_mmu_pages (but not when they are part
> +	 *    of the tdp_mmu_free_list or tdp_mmu_disconnected_list)

Neither tdp_mmu_free_list nor tdp_mmu_disconnected_list exists.

> +	 *  - lpage_disallowed_mmu_pages
> +	 *  - the lpage_disallowed_link field of struct kvm_mmu_pages used
> +	 *    by the TDP MMU
> +	 *  May be acquired under the MMU lock in read mode or non-overlapping
> +	 *  with the MMU lock.
> +	 */
> +	spinlock_t tdp_mmu_pages_lock;
>  };
>  
>  struct kvm_vm_stat {
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 8b61bdb391a0..264594947c3b 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -33,6 +33,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
>  	kvm->arch.tdp_mmu_enabled = true;
>  
>  	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
> +	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
>  	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
>  }
>  
> @@ -262,6 +263,58 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
>  	}
>  }
>  
> +/**
> + * tdp_mmu_link_page - Add a new page to the list of pages used by the TDP MMU
> + *
> + * @kvm: kvm instance
> + * @sp: the new page
> + * @atomic: This operation is not running under the exclusive use of the MMU
> + *	    lock and the operation must be atomic with respect to ther threads
> + *	    that might be adding or removing pages.
> + * @account_nx: This page replaces a NX large page and should be marked for
> + *		eventual reclaim.
> + */
> +static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> +			      bool atomic, bool account_nx)
> +{
> +	if (atomic)

This is unnecessary, there is exactly one caller and it is always "atomic".

Assuming some of this code lives on (see below), I'd prefer a different name
than "atomic".  Writing the SPTE is atomic (though even that is a bit of a lie,
e.g. tdp_mmu_zap_spte_atomic() is very much not atomic), but all the other
operations are the exact opposite of atomic.

Maybe change it from a bool to an enum with READ/WRITE_LOCKED or something?
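
Something along these lines, as a userspace toy.  The enum and its values are
hypothetical names, the singly linked list stands in for tdp_mmu_pages, and a
C11 atomic_flag spin loop stands in for tdp_mmu_pages_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical replacement for the "bool atomic" parameter. */
enum mmu_lock_mode {
	MMU_READ_LOCKED,	/* shared: other threads may touch the list */
	MMU_WRITE_LOCKED,	/* exclusive: no inner lock required */
};

struct toy_page {
	struct toy_page *next;
};

struct toy_page_list {
	atomic_flag lock;	/* stand-in for tdp_mmu_pages_lock */
	struct toy_page *head;
};

static inline void toy_list_lock(struct toy_page_list *l)
{
	while (atomic_flag_test_and_set_explicit(&l->lock, memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static inline void toy_list_unlock(struct toy_page_list *l)
{
	atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

/* Take the inner list lock only when mmu_lock is held for read. */
static inline void toy_link_page(struct toy_page_list *l, struct toy_page *p,
				 enum mmu_lock_mode mode)
{
	assert(p != NULL);

	if (mode == MMU_READ_LOCKED)
		toy_list_lock(l);

	p->next = l->head;
	l->head = p;

	if (mode == MMU_READ_LOCKED)
		toy_list_unlock(l);
}
```

Call sites then read toy_link_page(list, sp, MMU_READ_LOCKED), which says what
the caller holds instead of implying the operation itself is atomic.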

> +		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> +	else
> +		kvm_mmu_lock_assert_held_exclusive(kvm);
> +
> +	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
> +	if (account_nx)
> +		account_huge_nx_page(kvm, sp);
> +
> +	if (atomic)
> +		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> +}
> +
> +/**
> + * tdp_mmu_unlink_page - Remove page from the list of pages used by the TDP MMU
> + *
> + * @kvm: kvm instance
> + * @sp: the page to be removed
> + * @atomic: This operation is not running under the exclusive use of the MMU
> + *	    lock and the operation must be atomic with respect to ther threads
> + *	    that might be adding or removing pages.
> + */
> +static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> +				bool atomic)
> +{
> +	if (atomic)

Summarizing an off-list discussion with Ben:

This path isn't reachable in this series, which means all the RCU stuff is more
or less untestable.  Only the page fault path modifies the MMU while holding a
read lock, and it can't zap non-leaf shadow pages (only zaps large SPTEs and
installs new SPs).

The intent is to convert other zap-happy paths to a read lock, notably
kvm_mmu_zap_collapsible_sptes() and kvm_recover_nx_lpages().  Ben will include
patches to convert at least one of those in the next version of this series so
that there is justification and coverage for the RCU-deferred freeing.

> +		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> +	else
> +		kvm_mmu_lock_assert_held_exclusive(kvm);
> +	list_del(&sp->link);
> +	if (sp->lpage_disallowed)
> +		unaccount_huge_nx_page(kvm, sp);
> +
> +	if (atomic)
> +		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> +}
> +
>  /**
>   * handle_disconnected_tdp_mmu_page - handle a pt removed from the TDP structure
>   *

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak
  2021-01-21  0:19   ` Sean Christopherson
@ 2021-01-21 20:17     ` Paolo Bonzini
  2021-01-26 14:38     ` Paolo Bonzini
  1 sibling, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-21 20:17 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon, Marc Zyngier, Will Deacon,
	Paul Mackerras
  Cc: linux-kernel, kvm, Peter Xu, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov

On 21/01/21 01:19, Sean Christopherson wrote:
> IMO, moving the lock to arch-specific code is bad for KVM. The 
> architectures' MMUs already diverge pretty horribly, and once things 
> diverge it's really hard to go the other direction. And this change, 
> along with all of the wrappers, thrash a lot of code and add a fair 
> amount of indirection without any real benefit to the other 
> architectures. What if we simply make the common mmu_lock a union? The 
> rwlock_t is probably a bit bigger, but that's a few bytes for an entire 
> VM. And maybe this would entice/inspire other architectures to move to a 
> similar MMU model.
I agree.  Most architectures don't do the lockless tricks that x86 does, 
and being able to lock for read would be better than nothing.  For 
example, I took a look at ARM and stage2_update_leaf_attrs could be 
changed to operate in cmpxchg-like style while holding the rwlock for read.
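
A minimal userspace sketch of the "cmpxchg-like style" mentioned above,
assuming C11 atomics stand in for the kernel's cmpxchg primitives.  It does
not reproduce stage2_update_leaf_attrs(); the function name, the attribute
bit and the flat 64-bit entry layout are hypothetical and chosen only for
illustration.  The caller is assumed to hold the rwlock for read, so a
concurrent update of the same entry is possible and shows up as a failed
compare-and-exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute bit, chosen only for this illustration. */
#define EXAMPLE_ATTR_AF	(UINT64_C(1) << 10)

/*
 * Set and clear attribute bits in one leaf entry.  The caller is assumed to
 * hold the MMU lock for read, so other readers may race on the same entry.
 * Returns false if the entry changed underneath us; the caller can retry
 * or let the guest fault again.
 */
static bool example_update_leaf_attrs(_Atomic uint64_t *ptep,
				      uint64_t set, uint64_t clr)
{
	uint64_t old_pte, new_pte;

	assert(ptep != NULL);

	old_pte = atomic_load(ptep);
	new_pte = (old_pte & ~clr) | set;
	if (new_pte == old_pte)
		return true;	/* nothing to change */

	/* Publish only if nobody modified the entry since it was read. */
	return atomic_compare_exchange_strong(ptep, &old_pte, new_pte);
}
```

Operations that need a stable view of the whole table would still take the
lock for write; only single-entry updates like this one fit the read-mode
pattern.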

Paolo

> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f3b1013fb22c..bbc8efd4af62 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -451,7 +451,10 @@ struct kvm_memslots {
>  };
> 
>  struct kvm {
> -       spinlock_t mmu_lock;
> +       union {
> +               rwlock_t mmu_rwlock;
> +               spinlock_t mmu_lock;
> +       };
>         struct mutex slots_lock;
>         struct mm_struct *mm; /* userspace tied to this vm */
>         struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 04/24] kvm: x86/mmu: change TDP MMU yield function returns to match cond_resched
  2021-01-20 18:38   ` Sean Christopherson
@ 2021-01-21 20:22     ` Paolo Bonzini
  2021-01-26 14:11     ` Paolo Bonzini
  1 sibling, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-21 20:22 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: linux-kernel, kvm, Peter Xu, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 20/01/21 19:38, Sean Christopherson wrote:
> Currently the TDP MMU yield / cond_resched functions either return
> nothing or return true if the TLBs were not flushed. These are confusing
> semantics, especially when making control flow decisions in calling
> functions.
> 
> To clean things up, change both functions to have the same
> return value semantics as cond_resched: true if the thread yielded,
> false if it did not. If the function yielded in the _flush_ version,
> then the TLBs will have been flushed.

My fault here.  The return value was meant to simplify the assignments 
below.  But it's clearer to return true if the cond_resched happened, 
indeed.

>>
>>  
>>  		if (can_yield)
>> -			flush_needed = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
>> +			flush_needed = !tdp_mmu_iter_flush_cond_resched(kvm,
>> +									&iter);
> 
> As with the existing code, I'd let this poke out.  Alternatively, this could be
> written as:
> 
> 		flush_needed = !can_yield ||
> 			       !tdp_mmu_iter_flush_cond_resched(kvm, &iter);
> 

Yeah, no new line here.
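
For reference, a small userspace sketch of why the single-expression form
quoted above is equivalent to the if/else form once the helper returns
"true if yielded".  The stub below only stands in for
tdp_mmu_iter_flush_cond_resched(); the stub, its call counter and the
function names are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

static int example_calls;		/* how often the helper ran */
static bool example_yielded;		/* what the helper reports */

/* Stand-in for tdp_mmu_iter_flush_cond_resched(): true means "yielded". */
static bool example_flush_cond_resched(void)
{
	example_calls++;
	return example_yielded;
}

static bool flush_needed_if_else(bool can_yield)
{
	bool flush_needed;

	if (can_yield)
		flush_needed = !example_flush_cond_resched();
	else
		flush_needed = true;
	return flush_needed;
}

static bool flush_needed_single(bool can_yield)
{
	/* Short-circuit: the helper is not called when !can_yield. */
	return !can_yield || !example_flush_cond_resched();
}

/* Check both forms agree for every input and call the helper equally often. */
static void example_check_equivalence(void)
{
	for (int can_yield = 0; can_yield <= 1; can_yield++) {
		for (int yielded = 0; yielded <= 1; yielded++) {
			int before;
			bool a, b;

			example_yielded = yielded;
			before = example_calls;
			a = flush_needed_if_else(can_yield);
			assert(example_calls - before == can_yield);
			before = example_calls;
			b = flush_needed_single(can_yield);
			assert(example_calls - before == can_yield);
			assert(a == b);
		}
	}
}
```

The short-circuit evaluation of || is what keeps the helper from running,
and therefore from yielding, when can_yield is false.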

Paolo


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock
  2021-01-21 19:22   ` Sean Christopherson
@ 2021-01-21 21:32     ` Sean Christopherson
  2021-01-26 14:27       ` Paolo Bonzini
  0 siblings, 1 reply; 70+ messages in thread
From: Sean Christopherson @ 2021-01-21 21:32 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Thu, Jan 21, 2021, Sean Christopherson wrote:
> On Tue, Jan 12, 2021, Ben Gardon wrote:
> > +static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> > +				bool atomic)
> > +{
> > +	if (atomic)
> 
> Summarizing an off-list discussion with Ben:
> 
> This path isn't reachable in this series, which means all the RCU stuff is more
> or less untestable.  Only the page fault path modifies the MMU while holding a read
> lock, and it can't zap non-leaf shadow pages (only zaps large SPTEs and installs
> new SPs).

Aha!  I was wrong.  This will be hit when KVM zaps a 4k SPTE and installs a
large SPTE overtop a SP, e.g. if the host migrates a page for compaction and
creates a new THP.

  tdp_mmu_map_handle_target_level()
     tdp_mmu_set_spte_atomic()
       handle_changed_spte()
         __handle_changed_spte()
	   handle_disconnected_tdp_mmu_page()
	     tdp_mmu_unlink_page()

> The intent is to convert other zap-happy paths to a read lock, notably
> kvm_mmu_zap_collapsible_sptes() and kvm_recover_nx_lpages().  Ben will include
> patches to convert at least one of those in the next version of this series so
> that there is justification and coverage for the RCU-deferred freeing.

Somewhat offtopic, zap_collapsible_spte_range() looks wrong.  It zaps non-leaf
SPs, and has several comments that make it quite clear that that's its intent,
but the logic is messed up.  For non-leaf SPs, PFN points at the next table, not
the final PFN that is mapped into the guest.  That absolutely should never be a
reserved PFN, and whether or not it's a huge page is irrelevant.  My analysis is
more or less confirmed by looking at Ben's internal code, which explicitly does
the exact opposite in that it explicitly zaps leaf SPTEs.

	tdp_root_for_each_pte(iter, root, start, end) {
		/* Ensure forward progress has been made before yielding. */
		if (iter.goal_gfn != last_goal_gfn &&
		    tdp_mmu_iter_flush_cond_resched(kvm, &iter)) {
			last_goal_gfn = iter.goal_gfn;
			spte_set = false;
			/*
			 * Yielding caused the paging structure walk to be
			 * reset so skip to the next iteration to continue the
			 * walk from the root.
			 */
			continue;
		}

		if (!is_shadow_present_pte(iter.old_spte) ||
		    is_last_spte(iter.old_spte, iter.level)) <--- inverted?
			continue;

		pfn = spte_to_pfn(iter.old_spte); <-- this would be the page table?
		if (kvm_is_reserved_pfn(pfn) ||
		    !PageTransCompoundMap(pfn_to_page(pfn)))
			continue;

		tdp_mmu_set_spte(kvm, &iter, 0);
		spte_set = true;
	}


Coming back to this series, I wonder if the RCU approach is truly necessary to
get the desired scalability.  If both zap_collapsible_sptes() and NX huge page
recovery zap _only_ leaf SPTEs, then the only path that can actually unlink a
shadow page while holding the lock for read is the page fault path that installs
a huge page over an existing shadow page.

Assuming the above analysis is correct, I think it's worth exploring alternatives
to using RCU to defer freeing the SP memory, e.g. promoting to a write lock in
the specific case of overwriting a SP (though that may not exist for rwlocks),
or maybe something entirely different?

I actually do like the deferred free concept, but I find it difficult to reason
about exactly what protections are provided by RCU, and what even _needs_ to be
protected.  Maybe we just need to add some __rcu annotations?

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 05/24] kvm: x86/mmu: Fix yielding in TDP MMU
  2021-01-20 19:28   ` Sean Christopherson
@ 2021-01-22  1:06     ` Ben Gardon
  0 siblings, 0 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-22  1:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Paolo Bonzini, Peter Xu, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On Wed, Jan 20, 2021 at 11:28 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jan 12, 2021, Ben Gardon wrote:
> > There are two problems with the way the TDP MMU yields in long running
> > functions. 1.) Given certain conditions, the function may not yield
> > reliably / frequently enough. 2.) In some functions the TDP iter risks
> > not making forward progress if two threads livelock yielding to
> > one another.
> >
> > Case 1 is possible if for example, a paging structure was very large
> > but had few, if any writable entries. wrprot_gfn_range could traverse many
> > entries before finding a writable entry and yielding.
> >
> > Case 2 is possible if two threads were trying to execute wrprot_gfn_range.
> > Each could write protect an entry and then yield. This would reset the
> > tdp_iter's walk over the paging structure and the loop would end up
> > repeating the same entry over and over, preventing either thread from
> > making forward progress.
> >
> > Fix these issues by moving the yield to the beginning of the loop,
> > before other checks and only yielding if the loop has made forward
> > progress since the last yield.
>
> I think it'd be best to split this into two patches, e.g. ensure forward
> progress and then yield more aggressively.  They are two separate bugs, and I
> don't think that ensuring forward progress would exacerbate case #1.  I'm not
> worried about breaking things so much as getting more helpful shortlogs; "Fix
> yielding in TDP MMU" doesn't provide any insight into what exactly was broken.
> E.g. something like:
>
>   KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter
>   KVM: x86/mmu: Yield in TDP MMU iter even if no real work was done
>
> > Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
> > Reviewed-by: Peter Feiner <pfeiner@google.com>
> >
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 83 +++++++++++++++++++++++++++++++-------
> >  1 file changed, 69 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index b2784514ca2d..1987da0da66e 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -470,9 +470,23 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >                         gfn_t start, gfn_t end, bool can_yield)
> >  {
> >       struct tdp_iter iter;
> > +     gfn_t last_goal_gfn = start;
> >       bool flush_needed = false;
> >
> >       tdp_root_for_each_pte(iter, root, start, end) {
> > +             /* Ensure forward progress has been made before yielding. */
> > +             if (can_yield && iter.goal_gfn != last_goal_gfn &&
>
> Make last_goal_gfn a property of the iterator, that way all this logic can be
> shoved into tdp_mmu_iter_flush_cond_resched(), and the comments about ensuring
> forward progress and effectively invalidating/resetting the iterator (the
> comment below) can be a function comment, as opposed to being copied everywhere.
> E.g. there can be a big scary warning in the function comment stating that the
> caller must restart its loop if the helper yielded.
>
> Tangentially related, the name goal_gfn is quite confusing.  "goal" and "end"
> are synonyms, but "goal" is often initialized with "start", and it's not used to
> terminate the walk.  Maybe next_gfn instead?  And maybe yielded_gfn, since
> last_next_gfn is pretty horrendous.

All these are excellent suggestions and definitely make the code
cleaner. I'll definitely adopt yielded_gfn. While I agree goal_gfn is
a little odd, I think next_gfn could be more misleading because the
goal_gfn is really more of a target than the next step. It might take
4 or 5 steps to actually reach a last-level entry mapping that gfn.
target_last_level_gfn or next_last_level_gfn would probably be the
most accurate option.

>
> > +                 tdp_mmu_iter_flush_cond_resched(kvm, &iter)) {
>
> This isn't quite correct, as tdp_mmu_iter_flush_cond_resched() will do an
> expensive remote TLB flush on every yield, even if no flush is needed.  The
> cleanest solution is likely to drop tdp_mmu_iter_flush_cond_resched() and
> instead add a @flush param to tdp_mmu_iter_cond_resched().  If it's tagged
> __always_inline, then the callers that unconditionally pass true/false will
> optimize out the conditional code.
>
> At that point, I think it would also make sense to fold tdp_iter_refresh_walk()
> into tdp_mmu_iter_cond_resched(), because really we shouldn't be mucking with
> the guts of the iter except for the yield case.
>
> > +                     last_goal_gfn = iter.goal_gfn;
>
> Another argument for both renaming goal_gfn and moving last_*_gfn into the iter:
> it's not at all obvious that updating the last gfn _after_ tdp_iter_refresh_walk()
> is indeed correct.
>
> You can also avoid a local variable by doing max(iter->next_gfn, iter->gfn) when
> calling tdp_iter_refresh_walk().  IMO, that's also a bit easier to understand
> than an open-coded equivalent.
>
> E.g. putting it all together, with yielded_gfn set by tdp_iter_start():
>
> static __always_inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
>                                                      struct tdp_iter *iter,
>                                                      bool flush)
> {
>         /* Ensure forward progress has been made since the last yield. */
>         if (iter->next_gfn == iter->yielded_gfn)
>                 return false;
>
>         if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>                 if (flush)
>                         kvm_flush_remote_tlbs(kvm);
>                 cond_resched_lock(&kvm->mmu_lock);
>
>                 /*
>                  * Restart the walk over the paging structure from the root,
>                  * starting from the highest gfn the iterator had previously
>                  * reached.  The entire paging structure, except the root, may
>                  * have been completely torn down and rebuilt while we yielded.
>                  */
>                 tdp_iter_start(iter, iter->pt_path[iter->root_level - 1],
>                                iter->root_level, iter->min_level,
>                                max(iter->next_gfn, iter->gfn));
>                 return true;
>         }
>
>         return false;
> }
>
> > +                     flush_needed = false;
> > +                     /*
> > +                      * Yielding caused the paging structure walk to be
> > +                      * reset so skip to the next iteration to continue the
> > +                      * walk from the root.
> > +                      */
> > +                     continue;
> > +             }
> > +
> >               if (!is_shadow_present_pte(iter.old_spte))
> >                       continue;
> >
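
A minimal userspace sketch of the forward-progress rule discussed in this
message, assuming a flat walk over consecutive gfns.  The struct and the
helper are simplified stand-ins, not the real struct tdp_iter or
tdp_mmu_iter_cond_resched(); the should_yield parameter replaces
need_resched() / spin_needbreak(), and the real helper would also drop the
lock and restart the walk from the root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Simplified stand-in for struct tdp_iter; only the fields used here. */
struct example_iter {
	gfn_t next_gfn;		/* gfn the walk is heading toward */
	gfn_t yielded_gfn;	/* value of next_gfn at the last yield */
};

/* Returns true if the caller yielded and must restart its loop body. */
static bool example_iter_cond_resched(struct example_iter *iter,
				      bool should_yield)
{
	/* Ensure forward progress has been made since the last yield. */
	if (iter->next_gfn == iter->yielded_gfn)
		return false;

	if (!should_yield)
		return false;

	/* The real helper would drop and reacquire mmu_lock here. */
	iter->yielded_gfn = iter->next_gfn;
	return true;
}

/* Walk n entries under constant contention; returns entries processed. */
static unsigned int example_walk(gfn_t start, unsigned int n,
				 unsigned int *yields)
{
	struct example_iter iter = { .next_gfn = start, .yielded_gfn = start };
	unsigned int processed = 0;

	assert(yields != NULL);
	*yields = 0;

	while (processed < n) {
		if (example_iter_cond_resched(&iter, true)) {
			(*yields)++;
			continue;	/* resume at the same gfn */
		}
		processed++;		/* handle the entry at next_gfn */
		iter.next_gfn++;	/* step to the next entry */
	}
	return processed;
}
```

Even with should_yield permanently true, every entry is handled once between
two yields, so the walk terminates instead of repeating the same entry.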

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 12/24] kvm: x86/kvm: RCU dereference tdp mmu page table links
  2021-01-12 18:10 ` [PATCH 12/24] kvm: x86/kvm: RCU dereference tdp mmu page table links Ben Gardon
@ 2021-01-22 18:32   ` Sean Christopherson
  2021-01-26 18:17     ` Ben Gardon
  0 siblings, 1 reply; 70+ messages in thread
From: Sean Christopherson @ 2021-01-22 18:32 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 12, 2021, Ben Gardon wrote:
> In order to protect TDP MMU PT memory with RCU, ensure that page table
> links are properly rcu_dereferenced.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_iter.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index 87b7e16911db..82855613ffa0 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -49,6 +49,8 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
>   */
>  u64 *spte_to_child_pt(u64 spte, int level)
>  {
> +	u64 *child_pt;
> +
>  	/*
>  	 * There's no child entry if this entry isn't present or is a
>  	 * last-level entry.
> @@ -56,7 +58,9 @@ u64 *spte_to_child_pt(u64 spte, int level)
>  	if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
>  		return NULL;
>  
> -	return __va(spte_to_pfn(spte) << PAGE_SHIFT);
> +	child_pt = __va(spte_to_pfn(spte) << PAGE_SHIFT);
> +
> +	return rcu_dereference(child_pt);

This is what bugs me the most about the RCU usage.  We're reaping the functional
benefits of RCU without doing the grunt work to truly RCU-ify the TDP MMU.  The
above rcu_dereference() barely scratches the surface of what's being protected
by RCU.  There are already multiple mechanisms that protect the page tables,
throwing RCU into the mix without fully integrating RCU makes for simple code
and avoids reinventing the wheel (big thumbs up), but ends up adding complexity
to an already complex system.  E.g. the lockless walks in the old MMU are
complex on the surface, but I find them easier to think through because they
explicitly rely on the same mechanism (remote TLB flush) that is used to protect
guest usage of the page tables.

Ideally, I think struct kvm_mmu_page's 'u64 *spt' would be annotated with __rcu,
as that would provide a high level of enforcement and would also highlight where
we're using other mechanisms to ensure correctness.  E.g. dereferencing root->spt
in kvm_tdp_mmu_get_vcpu_root_hpa() relies on the root being pinned by
get_tdp_mmu_vcpu_root(), and _that_ in turn relies on holding the rwlock for
write.  Unfortunately, since kvm_mmu_page is shared with the old MMU, annotating
->spt that way doesn't work well.  We could employ a union to make it work, but that'd
probably do more harm than good.

The middle ground would be to annotate pt_path and sptep in struct tdp_iter.
That gets a decent chunk of the enforcement and also helps highlight what's
being protected with RCU.  Assuming we end up going with RCU, I think this
single rcu_dereference should be replaced with something like the below patch.

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 82855613ffa0..e000642d938d 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -12,7 +12,7 @@ static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
 {
        iter->sptep = iter->pt_path[iter->level - 1] +
                SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
-       iter->old_spte = READ_ONCE(*iter->sptep);
+       iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
 }

 static gfn_t round_gfn_for_level(gfn_t gfn, int level)
@@ -34,7 +34,7 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
        iter->root_level = root_level;
        iter->min_level = min_level;
        iter->level = root_level;
-       iter->pt_path[iter->level - 1] = root_pt;
+       iter->pt_path[iter->level - 1] = (tdp_ptep_t)root_pt;

        iter->gfn = round_gfn_for_level(iter->goal_gfn, iter->level);
        tdp_iter_refresh_sptep(iter);
@@ -47,9 +47,9 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
  * address of the child page table referenced by the SPTE. Returns null if
  * there is no such entry.
  */
-u64 *spte_to_child_pt(u64 spte, int level)
+tdp_ptep_t spte_to_child_pt(u64 spte, int level)
 {
-       u64 *child_pt;
+       tdp_ptep_t child_pt;

        /*
         * There's no child entry if this entry isn't present or is a
@@ -58,9 +58,9 @@ u64 *spte_to_child_pt(u64 spte, int level)
        if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
                return NULL;

-       child_pt = __va(spte_to_pfn(spte) << PAGE_SHIFT);
+       child_pt = (tdp_ptep_t)__va(spte_to_pfn(spte) << PAGE_SHIFT);

-       return rcu_dereference(child_pt);
+       return child_pt;
 }

 /*
@@ -69,7 +69,7 @@ u64 *spte_to_child_pt(u64 spte, int level)
  */
 static bool try_step_down(struct tdp_iter *iter)
 {
-       u64 *child_pt;
+       tdp_ptep_t child_pt;

        if (iter->level == iter->min_level)
                return false;
@@ -78,7 +78,7 @@ static bool try_step_down(struct tdp_iter *iter)
         * Reread the SPTE before stepping down to avoid traversing into page
         * tables that are no longer linked from this entry.
         */
-       iter->old_spte = READ_ONCE(*iter->sptep);
+       iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));

        child_pt = spte_to_child_pt(iter->old_spte, iter->level);
        if (!child_pt)
@@ -112,7 +112,7 @@ static bool try_step_side(struct tdp_iter *iter)
        iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
        iter->goal_gfn = iter->gfn;
        iter->sptep++;
-       iter->old_spte = READ_ONCE(*iter->sptep);
+       iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));

        return true;
 }
@@ -175,11 +175,11 @@ void tdp_iter_refresh_walk(struct tdp_iter *iter)
        if (iter->gfn > goal_gfn)
                goal_gfn = iter->gfn;

-       tdp_iter_start(iter, iter->pt_path[iter->root_level - 1],
+       tdp_iter_start(iter, rcu_dereference(iter->pt_path[iter->root_level - 1]),
                       iter->root_level, iter->min_level, goal_gfn);
 }

-u64 *tdp_iter_root_pt(struct tdp_iter *iter)
+tdp_ptep_t tdp_iter_root_pt(struct tdp_iter *iter)
 {
        return iter->pt_path[iter->root_level - 1];
 }
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 47170d0dc98e..bf882dab8ec5 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -7,6 +7,8 @@

 #include "mmu.h"

+typedef u64 __rcu *tdp_ptep_t;
+
 /*
  * A TDP iterator performs a pre-order walk over a TDP paging structure.
  */
@@ -17,9 +19,9 @@ struct tdp_iter {
         */
        gfn_t goal_gfn;
        /* Pointers to the page tables traversed to reach the current SPTE */
-       u64 *pt_path[PT64_ROOT_MAX_LEVEL];
+       tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
        /* A pointer to the current SPTE */
-       u64 *sptep;
+       tdp_ptep_t sptep;
        /* The lowest GFN mapped by the current SPTE */
        gfn_t gfn;
        /* The level of the root page given to the iterator */
@@ -49,12 +51,12 @@ struct tdp_iter {
 #define for_each_tdp_pte(iter, root, root_level, start, end) \
        for_each_tdp_pte_min_level(iter, root, root_level, PG_LEVEL_4K, start, end)

-u64 *spte_to_child_pt(u64 pte, int level);
+tdp_ptep_t spte_to_child_pt(u64 pte, int level);

 void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
                    int min_level, gfn_t goal_gfn);
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_refresh_walk(struct tdp_iter *iter);
-u64 *tdp_iter_root_pt(struct tdp_iter *iter);
+tdp_ptep_t tdp_iter_root_pt(struct tdp_iter *iter);

 #endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 45160ff84e91..27b850904230 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -509,7 +509,7 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
                                           struct tdp_iter *iter,
                                           u64 new_spte)
 {
-       u64 *root_pt = tdp_iter_root_pt(iter);
+       tdp_ptep_t root_pt = tdp_iter_root_pt(iter);
        struct kvm_mmu_page *root = sptep_to_sp(root_pt);
        int as_id = kvm_mmu_page_as_id(root);



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 06/24] kvm: x86/mmu: Skip no-op changes in TDP MMU functions
  2021-01-20 19:51   ` Sean Christopherson
@ 2021-01-25 23:51     ` Ben Gardon
  0 siblings, 0 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-25 23:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Paolo Bonzini, Peter Xu, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On Wed, Jan 20, 2021 at 11:51 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jan 12, 2021, Ben Gardon wrote:
> > Skip setting SPTEs if no change is expected.
> >
> > Reviewed-by: Peter Feiner <pfeiner@google.com>
> >
> Nit on all of these, can you remove the extra newline between the Reviewed-by
> and SOB?

Yeah, that line is annoying. I'll make sure it's not there on future patches.

>
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 1987da0da66e..2650fa9fe066 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -882,6 +882,9 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >                   !is_last_spte(iter.old_spte, iter.level))
> >                       continue;
> >
> > +             if (!(iter.old_spte & PT_WRITABLE_MASK))
>
> Include the new check with the existing if statement?  I think it makes sense to
> group all the checks on old_spte.

I agree that's cleaner. I'll group the checks in the next patch set version.

>
> > +                     continue;
> > +
> >               new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
> >
> >               tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
> > @@ -1079,6 +1082,9 @@ static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >               if (!is_shadow_present_pte(iter.old_spte))
> >                       continue;
> >
> > +             if (iter.old_spte & shadow_dirty_mask)
>
> Same comment here.
>
> > +                     continue;
> > +
>
> Unrelated to this patch, but it got me looking at the code: shouldn't
> clear_dirty_pt_masked() clear the bit in @mask before checking whether or not
> the spte needs to be modified?  That way the early break kicks in after sptes
> are checked, not necessarily written.  E.g.
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 2650fa9fe066..d8eeae910cbf 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1010,21 +1010,21 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
>                     !(mask & (1UL << (iter.gfn - gfn))))
>                         continue;
>
> -               if (wrprot || spte_ad_need_write_protect(iter.old_spte)) {
> -                       if (is_writable_pte(iter.old_spte))
> -                               new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
> -                       else
> -                               continue;
> -               } else {
> -                       if (iter.old_spte & shadow_dirty_mask)
> -                               new_spte = iter.old_spte & ~shadow_dirty_mask;
> -                       else
> -                               continue;
> -               }
> -
> -               tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
> -
>                 mask &= ~(1UL << (iter.gfn - gfn));
> +
> +               if (wrprot || spte_ad_need_write_protect(iter.old_spte)) {
> +                       if (is_writable_pte(iter.old_spte))
> +                               new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
> +                       else
> +                               continue;
> +               } else {
> +                       if (iter.old_spte & shadow_dirty_mask)
> +                               new_spte = iter.old_spte & ~shadow_dirty_mask;
> +                       else
> +                               continue;
> +               }
> +
> +               tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
>         }
>  }
>

Great point, that doesn't work as intended at all. I'll adopt your
proposed fix and include it in a patch after this one in the next
version of the series.
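
A minimal userspace sketch of the ordering issue in clear_dirty_pt_masked()
discussed above, assuming a flat array of entries in place of the TDP
iterator.  The dirty-bit position and all names are hypothetical; only the
order of "retire the mask bit" versus "check whether the entry needs a
write" mirrors the proposed fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_DIRTY_BIT	(UINT64_C(1) << 9)	/* hypothetical */

/*
 * Clear the dirty bit on the entries selected by mask.  Returns the number
 * of entries visited before the walk stopped, to make the early exit
 * observable.
 */
static size_t example_clear_dirty_masked(uint64_t *sptes, size_t nr,
					 unsigned long mask)
{
	size_t visited = 0;

	assert(nr <= 8 * sizeof(mask));

	for (size_t i = 0; i < nr && mask; i++) {
		visited++;
		if (!(mask & (1UL << i)))
			continue;

		/* Retire the bit first so clean entries also count as done. */
		mask &= ~(1UL << i);

		if (!(sptes[i] & EXAMPLE_DIRTY_BIT))
			continue;	/* checked, no write needed */

		sptes[i] &= ~EXAMPLE_DIRTY_BIT;
	}
	return visited;
}
```

With the mask update placed after the "no change needed" continue, an
already-clean entry would leave its bit set in mask and the loop would run
to the end of the range instead of stopping once every requested entry has
been checked.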

>
> >               new_spte = iter.old_spte | shadow_dirty_mask;
> >
> >               tdp_mmu_set_spte(kvm, &iter, new_spte);
> > --
> > 2.30.0.284.gd98b1dd5eaa7-goog
> >

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock
  2021-01-12 18:10 ` [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock Ben Gardon
  2021-01-21 19:22   ` Sean Christopherson
@ 2021-01-26 13:37   ` Paolo Bonzini
  2021-01-26 21:07     ` Ben Gardon
  1 sibling, 1 reply; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-26 13:37 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 12/01/21 19:10, Ben Gardon wrote:
> +	 *  May be acquired under the MMU lock in read mode or non-overlapping
> +	 *  with the MMU lock.
> +	 */
> +	spinlock_t tdp_mmu_pages_lock;

Is this correct?  My understanding is that:

- you can take tdp_mmu_pages_lock from a shared MMU lock critical section

- you don't need to take tdp_mmu_pages_lock from an exclusive MMU lock 
critical section, because you can't be concurrent with a shared critical 
section

- but then, you can't take tdp_mmu_pages_lock outside the MMU lock, 
because you could have

    write_lock(mmu_lock)
                                      spin_lock(tdp_mmu_pages_lock)
    do tdp_mmu_pages_lock stuff  !!!  do tdp_mmu_pages_lock stuff
    write_unlock(mmu_lock)
                                      spin_unlock(tdp_mmu_pages_lock)

Paolo


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 24/24] kvm: x86/mmu: Allow parallel page faults for the TDP MMU
  2021-01-12 18:10 ` [PATCH 24/24] kvm: x86/mmu: Allow parallel page faults for the TDP MMU Ben Gardon
  2021-01-21  0:55   ` Sean Christopherson
@ 2021-01-26 13:37   ` Paolo Bonzini
  1 sibling, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-26 13:37 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 12/01/21 19:10, Ben Gardon wrote:
> +	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> +		kvm_mmu_lock_shared(vcpu->kvm);
> +	else
> +		kvm_mmu_lock(vcpu->kvm);

Perhaps the better API would be kvm_mmu_lock/unlock_root; not exposing 
kvm_mmu_lock/unlock_shared and kvm_mmu_lock/unlock_exclusive at all, 
just like you use rwlock_needbreak directly in kvm_mmu_lock_needbreak.

Paolo


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 04/24] kvm: x86/mmu: change TDP MMU yield function returns to match cond_resched
  2021-01-20 18:38   ` Sean Christopherson
  2021-01-21 20:22     ` Paolo Bonzini
@ 2021-01-26 14:11     ` Paolo Bonzini
  1 sibling, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-26 14:11 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: linux-kernel, kvm, Peter Xu, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 20/01/21 19:38, Sean Christopherson wrote:
> On Tue, Jan 12, 2021, Ben Gardon wrote:
>> Currently the TDP MMU yield / cond_resched functions either return
>> nothing or return true if the TLBs were not flushed. These are confusing
>> semantics, especially when making control flow decisions in calling
>> functions.
>>
>> To clean things up, change both functions to have the same
>> return value semantics as cond_resched: true if the thread yielded,
>> false if it did not. If the function yielded in the _flush_ version,
>> then the TLBs will have been flushed.
>>
>> Reviewed-by: Peter Feiner <pfeiner@google.com>
>> Signed-off-by: Ben Gardon <bgardon@google.com>
>> ---
>>   arch/x86/kvm/mmu/tdp_mmu.c | 38 +++++++++++++++++++++++++++++---------
>>   1 file changed, 29 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
>> index 2ef8615f9dba..b2784514ca2d 100644
>> --- a/arch/x86/kvm/mmu/tdp_mmu.c
>> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
>> @@ -413,8 +413,15 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
>>   			 _mmu->shadow_root_level, _start, _end)
>>   
>>   /*
>> - * Flush the TLB if the process should drop kvm->mmu_lock.
>> - * Return whether the caller still needs to flush the tlb.
>> + * Flush the TLB and yield if the MMU lock is contended or this thread needs to
>> + * return control to the scheduler.
>> + *
>> + * If this function yields, it will also reset the tdp_iter's walk over the
>> + * paging structure and the calling function should allow the iterator to
>> + * continue its traversal from the paging structure root.
>> + *
>> + * Return true if this function yielded, the TLBs were flushed, and the
>> + * iterator's traversal was reset. Return false if a yield was not needed.
>>    */
>>   static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
>>   {
>> @@ -422,18 +429,30 @@ static bool tdp_mmu_iter_flush_cond_resched(struct kvm *kvm, struct tdp_iter *it
>>   		kvm_flush_remote_tlbs(kvm);
>>   		cond_resched_lock(&kvm->mmu_lock);
>>   		tdp_iter_refresh_walk(iter);
>> -		return false;
>> -	} else {
>>   		return true;
>> -	}
>> +	} else
>> +		return false;
> 
> Kernel style is to have curly braces on all branches if any branch has 'em.  Or,
> omit the else since the taken branch always returns.  I think I prefer the latter?
> 
>>   }
>>   
>> -static void tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
>> +/*
>> + * Yield if the MMU lock is contended or this thread needs to return control
>> + * to the scheduler.
>> + *
>> + * If this function yields, it will also reset the tdp_iter's walk over the
>> + * paging structure and the calling function should allow the iterator to
>> + * continue its traversal from the paging structure root.
>> + *
>> + * Return true if this function yielded and the iterator's traversal was reset.
>> + * Return false if a yield was not needed.
>> + */
>> +static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
>>   {
>>   	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>>   		cond_resched_lock(&kvm->mmu_lock);
>>   		tdp_iter_refresh_walk(iter);
>> -	}
>> +		return true;
>> +	} else
>> +		return false;
> 
> Same here.
> 
>>   }
>>   
>>   /*
>> @@ -470,7 +489,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>>   		tdp_mmu_set_spte(kvm, &iter, 0);
>>   
>>   		if (can_yield)
>> -			flush_needed = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
>> +			flush_needed = !tdp_mmu_iter_flush_cond_resched(kvm,
>> +									&iter);
> 
> As with the existing code, I'd let this poke out.  Alternatively, this could be
> written as:
> 
> 		flush_needed = !can_yield ||
> 			       !tdp_mmu_iter_flush_cond_resched(kvm, &iter);
> 
>>   		else
>>   			flush_needed = true;
>>   	}
>> @@ -1072,7 +1092,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>>   
>>   		tdp_mmu_set_spte(kvm, &iter, 0);
>>   
>> -		spte_set = tdp_mmu_iter_flush_cond_resched(kvm, &iter);
>> +		spte_set = !tdp_mmu_iter_flush_cond_resched(kvm, &iter);
>>   	}
>>   
>>   	if (spte_set)
>> -- 
>> 2.30.0.284.gd98b1dd5eaa7-goog
>>
> 

Tweaked and queued, thanks.

Paolo
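
For readers following the return-value change, here is a minimal userspace sketch of the semantics the patch settles on, written without the else branch as suggested in the review. It is not the kernel code: the model_ names and the should_yield flag are made up for illustration, and should_yield merely stands in for need_resched() || spin_needbreak().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the iterator/VM state; not a kernel type. */
struct model_iter {
	bool should_yield;	/* models need_resched() || spin_needbreak() */
	int flushes;		/* counts modelled TLB flushes */
	int walk_resets;	/* counts modelled iterator resets */
};

/*
 * Models tdp_mmu_iter_flush_cond_resched(): returns true if it yielded, in
 * which case the TLBs were flushed and the walk was reset.  There is no
 * else branch because the taken branch always returns.
 */
static bool model_flush_cond_resched(struct model_iter *iter)
{
	if (iter->should_yield) {
		iter->flushes++;
		iter->walk_resets++;
		return true;
	}

	return false;
}

/* Models the caller: a flush is still owed unless the helper did it. */
static bool model_zap_one(struct model_iter *iter, bool can_yield)
{
	return !can_yield || !model_flush_cond_resched(iter);
}
```

With these semantics the caller's flush_needed is simply the negation of "yielded", which is what the inverted assignments in zap_gfn_range() and zap_collapsible_spte_range() express.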



* Re: [PATCH 07/24] kvm: x86/mmu: Add comment on __tdp_mmu_set_spte
  2021-01-12 18:10 ` [PATCH 07/24] kvm: x86/mmu: Add comment on __tdp_mmu_set_spte Ben Gardon
@ 2021-01-26 14:13   ` Paolo Bonzini
  0 siblings, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-26 14:13 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 12/01/21 19:10, Ben Gardon wrote:
> __tdp_mmu_set_spte is a very important function in the TDP MMU which
> already accepts several arguments and will take more in future commits.
> To offset this complexity, add a comment to the function describing each
> of the arguments.
> 
> No functional change intended.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>   arch/x86/kvm/mmu/tdp_mmu.c | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 2650fa9fe066..b033da8243fc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -357,6 +357,22 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>   				      new_spte, level);
>   }
>   
> +/*
> + * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
> + * @kvm: kvm instance
> + * @iter: a tdp_iter instance currently on the SPTE that should be set
> + * @new_spte: The value the SPTE should be set to
> + * @record_acc_track: Notify the MM subsystem of changes to the accessed state
> + *		      of the page. Should be set unless handling an MMU
> + *		      notifier for access tracking. Leaving record_acc_track
> + *		      unset in that case prevents page accesses from being
> + *		      double counted.
> + * @record_dirty_log: Record the page as dirty in the dirty bitmap if
> + *		      appropriate for the change being made. Should be set
> + *		      unless performing certain dirty logging operations.
> + *		      Leaving record_dirty_log unset in that case prevents page
> + *		      writes from being double counted.
> + */
>   static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
>   				      u64 new_spte, bool record_acc_track,
>   				      bool record_dirty_log)
> 

Queued, thanks.

Paolo



* Re: [PATCH 08/24] kvm: x86/mmu: Add lockdep when setting a TDP MMU SPTE
  2021-01-12 18:10 ` [PATCH 08/24] kvm: x86/mmu: Add lockdep when setting a TDP MMU SPTE Ben Gardon
  2021-01-20 19:58   ` Sean Christopherson
@ 2021-01-26 14:13   ` Paolo Bonzini
  1 sibling, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-26 14:13 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 12/01/21 19:10, Ben Gardon wrote:
> Add lockdep to __tdp_mmu_set_spte to ensure that SPTEs are only modified
> under the MMU lock. This lockdep will be updated in future commits to
> reflect and validate changes to the TDP MMU's synchronization strategy.
> 
> No functional change intended.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>   arch/x86/kvm/mmu/tdp_mmu.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b033da8243fc..411938e97a00 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -381,6 +381,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
>   	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
>   	int as_id = kvm_mmu_page_as_id(root);
>   
> +	lockdep_assert_held(&kvm->mmu_lock);
> +
>   	WRITE_ONCE(*iter->sptep, new_spte);
>   
>   	__handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
> 

Queued, thanks.

Paolo



* Re: [PATCH 10/24] kvm: x86/mmu: Factor out handle disconnected pt
  2021-01-12 18:10 ` [PATCH 10/24] kvm: x86/mmu: Factor out handle disconnected pt Ben Gardon
  2021-01-20 20:30   ` Sean Christopherson
@ 2021-01-26 14:14   ` Paolo Bonzini
  1 sibling, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-26 14:14 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 12/01/21 19:10, Ben Gardon wrote:
> Factor out the code to handle a disconnected subtree of the TDP paging
> structure from the code to handle the change to an individual SPTE.
> Future commits will build on this to allow asynchronous page freeing.
> 
> No functional change intended.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>   arch/x86/kvm/mmu/tdp_mmu.c | 75 +++++++++++++++++++++++---------------
>   1 file changed, 46 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 55df596696c7..e8f35cd46b4c 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -234,6 +234,49 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
>   	}
>   }
>   
> +/**
> + * handle_disconnected_tdp_mmu_page - handle a pt removed from the TDP structure
> + *
> + * @kvm: kvm instance
> + * @pt: the page removed from the paging structure
> + *
> + * Given a page table that has been removed from the TDP paging structure,
> + * iterates through the page table to clear SPTEs and free child page tables.
> + */
> +static void handle_disconnected_tdp_mmu_page(struct kvm *kvm, u64 *pt)
> +{
> +	struct kvm_mmu_page *sp;
> +	gfn_t gfn;
> +	int level;
> +	u64 old_child_spte;
> +	int i;
> +
> +	sp = sptep_to_sp(pt);
> +	gfn = sp->gfn;
> +	level = sp->role.level;
> +
> +	trace_kvm_mmu_prepare_zap_page(sp);
> +
> +	list_del(&sp->link);
> +
> +	if (sp->lpage_disallowed)
> +		unaccount_huge_nx_page(kvm, sp);
> +
> +	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> +		old_child_spte = READ_ONCE(*(pt + i));
> +		WRITE_ONCE(*(pt + i), 0);
> +		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp),
> +			gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
> +			old_child_spte, 0, level - 1);
> +	}
> +
> +	kvm_flush_remote_tlbs_with_address(kvm, gfn,
> +					   KVM_PAGES_PER_HPAGE(level));
> +
> +	free_page((unsigned long)pt);
> +	kmem_cache_free(mmu_page_header_cache, sp);
> +}
> +
>   /**
>    * handle_changed_spte - handle bookkeeping associated with an SPTE change
>    * @kvm: kvm instance
> @@ -254,10 +297,6 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>   	bool was_leaf = was_present && is_last_spte(old_spte, level);
>   	bool is_leaf = is_present && is_last_spte(new_spte, level);
>   	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
> -	u64 *pt;
> -	struct kvm_mmu_page *sp;
> -	u64 old_child_spte;
> -	int i;
>   
>   	WARN_ON(level > PT64_ROOT_MAX_LEVEL);
>   	WARN_ON(level < PG_LEVEL_4K);
> @@ -321,31 +360,9 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>   	 * Recursively handle child PTs if the change removed a subtree from
>   	 * the paging structure.
>   	 */
> -	if (was_present && !was_leaf && (pfn_changed || !is_present)) {
> -		pt = spte_to_child_pt(old_spte, level);
> -		sp = sptep_to_sp(pt);
> -
> -		trace_kvm_mmu_prepare_zap_page(sp);
> -
> -		list_del(&sp->link);
> -
> -		if (sp->lpage_disallowed)
> -			unaccount_huge_nx_page(kvm, sp);
> -
> -		for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> -			old_child_spte = READ_ONCE(*(pt + i));
> -			WRITE_ONCE(*(pt + i), 0);
> -			handle_changed_spte(kvm, as_id,
> -				gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
> -				old_child_spte, 0, level - 1);
> -		}
> -
> -		kvm_flush_remote_tlbs_with_address(kvm, gfn,
> -						   KVM_PAGES_PER_HPAGE(level));
> -
> -		free_page((unsigned long)pt);
> -		kmem_cache_free(mmu_page_header_cache, sp);
> -	}
> +	if (was_present && !was_leaf && (pfn_changed || !is_present))
> +		handle_disconnected_tdp_mmu_page(kvm,
> +				spte_to_child_pt(old_spte, level));
>   }
>   
>   static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> 

Queued, thanks.

Paolo
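
The GFN arithmetic in the factored-out loop can be checked with a small userspace sketch. It assumes the x86 definition in which a mapping at level N covers 1 << ((N - 1) * 9) 4K pages, which is what KVM_PAGES_PER_HPAGE() is understood to evaluate to. The model_ names are illustrative only, and "level" is used exactly as the patch uses it (the value read from sp->role.level).

```c
#include <stdint.h>

#define MODEL_PT_ENTRIES 512	/* 64-bit entries in one 4 KiB page table */

/* Number of 4K pages covered by one mapping at @level (1 = 4K, 2 = 2M, 3 = 1G). */
static uint64_t model_pages_per_hpage(int level)
{
	return 1ULL << ((level - 1) * 9);
}

/*
 * GFN passed to handle_changed_spte() for entry @i of the removed table,
 * mirroring gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)) from the patch.
 */
static uint64_t model_child_gfn(uint64_t gfn, int level, int i)
{
	return gfn + (uint64_t)i * model_pages_per_hpage(level - 1);
}
```

Under that assumption, 512 entries with a stride of KVM_PAGES_PER_HPAGE(level - 1) span exactly KVM_PAGES_PER_HPAGE(level) pages, which matches the range handed to kvm_flush_remote_tlbs_with_address().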



* Re: [PATCH 09/24] kvm: x86/mmu: Don't redundantly clear TDP MMU pt memory
  2021-01-12 18:10 ` [PATCH 09/24] kvm: x86/mmu: Don't redundantly clear TDP MMU pt memory Ben Gardon
  2021-01-20 20:06   ` Sean Christopherson
@ 2021-01-26 14:14   ` Paolo Bonzini
  1 sibling, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-26 14:14 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 12/01/21 19:10, Ben Gardon wrote:
> The KVM MMU caches already guarantee that shadow page table memory will
> be zeroed, so there is no reason to re-zero the page in the TDP MMU page
> fault handler.
> 
> No functional change intended.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>   arch/x86/kvm/mmu/tdp_mmu.c | 1 -
>   1 file changed, 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 411938e97a00..55df596696c7 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -665,7 +665,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>   			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
>   			list_add(&sp->link, &vcpu->kvm->arch.tdp_mmu_pages);
>   			child_pt = sp->spt;
> -			clear_page(child_pt);
>   			new_spte = make_nonleaf_spte(child_pt,
>   						     !shadow_accessed_mask);
>   
> 

Queued, thanks.

Paolo



* Re: [PATCH 20/24] kvm: x86/mmu: Add atomic option for setting SPTEs
  2021-01-12 18:10 ` [PATCH 20/24] kvm: x86/mmu: Add atomic option for setting SPTEs Ben Gardon
@ 2021-01-26 14:21   ` Paolo Bonzini
  0 siblings, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-26 14:21 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 12/01/21 19:10, Ben Gardon wrote:
>  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -				u64 old_spte, u64 new_spte, int level);
> +				u64 old_spte, u64 new_spte, int level,
> +				bool atomic);

If you don't mind, I prefer "shared" as the name for the new argument 
(i.e. "this is what you need to know", rather than "this is what I want 
you to do").

> 
> +/*
> + * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically and handle the
> + * associated bookkeeping
> + *
> + * @kvm: kvm instance
> + * @iter: a tdp_iter instance currently on the SPTE that should be set
> + * @new_spte: The value the SPTE should be set to
> + * Returns: true if the SPTE was set, false if it was not. If false is returned,
> + *	    this function will have no side-effects.
> + */
> +static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
> +					   struct tdp_iter *iter,
> +					   u64 new_spte)
> +{
> +	u64 *root_pt = tdp_iter_root_pt(iter);
> +	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
> +	int as_id = kvm_mmu_page_as_id(root);
> +
> +	kvm_mmu_lock_assert_held_shared(kvm);
> +
> +	if (cmpxchg64(iter->sptep, iter->old_spte, new_spte) != iter->old_spte)
> +		return false;
> +
> +	handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
> +			    iter->level, true);
> +
> +	return true;
> +}
> +
> +

Still unused as of this patch, so please move it where it's used.

Note that in this case, "atomic" in the name is appropriate, think of 
hypothetical code like this:

	if (!shared)
		tdp_mmu_set_spte(...)
	else if (!tdp_mmu_set_spte_atomic(...))
		

which says "if there could be concurrent changes, be careful and do 
everything with atomic operations".

Paolo
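
The shared/atomic split discussed above can be sketched in userspace with C11 atomics. This is only a model under stated assumptions: atomic_compare_exchange_strong() stands in for cmpxchg64(), the model_ names are hypothetical, and the bookkeeping done by handle_changed_spte() is omitted.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, pared-down iterator: the SPTE and the value last read from it. */
struct model_spte_iter {
	_Atomic uint64_t *sptep;
	uint64_t old_spte;
};

/*
 * Models tdp_mmu_set_spte_atomic(): install @new_spte only if the SPTE still
 * holds the value the iterator last observed.  On failure nothing is written.
 */
static bool model_set_spte_atomic(struct model_spte_iter *iter,
				  uint64_t new_spte)
{
	uint64_t expected = iter->old_spte;

	return atomic_compare_exchange_strong(iter->sptep, &expected, new_spte);
}

/*
 * "shared" tells the callee what it needs to know: whether other threads may
 * be changing SPTEs concurrently.  Only then is the atomic variant required.
 */
static bool model_set_spte(struct model_spte_iter *iter, uint64_t new_spte,
			   bool shared)
{
	if (!shared) {
		atomic_store(iter->sptep, new_spte);
		return true;
	}

	return model_set_spte_atomic(iter, new_spte);
}
```

A caller that gets false back has lost a race with another writer and would typically re-read the SPTE before deciding what to do next.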



* Re: [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock
  2021-01-21 21:32     ` Sean Christopherson
@ 2021-01-26 14:27       ` Paolo Bonzini
  2021-01-26 21:47         ` Ben Gardon
  2021-01-26 22:02         ` Sean Christopherson
  0 siblings, 2 replies; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-26 14:27 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: linux-kernel, kvm, Peter Xu, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 21/01/21 22:32, Sean Christopherson wrote:
> Coming back to this series, I wonder if the RCU approach is truly necessary to
> get the desired scalability.  If both zap_collapsible_sptes() and NX huge page
> recovery zap _only_ leaf SPTEs, then the only path that can actually unlink a
> shadow page while holding the lock for read is the page fault path that installs
> a huge page over an existing shadow page.
> 
> Assuming the above analysis is correct, I think it's worth exploring alternatives
> to using RCU to defer freeing the SP memory, e.g. promoting to a write lock in
> the specific case of overwriting a SP (though that may not exist for rwlocks),
> or maybe something entirely different?

You can do the deferred freeing with a short write-side critical section 
to ensure all readers have terminated.

If the bool argument to handle_disconnected_tdp_mmu_page is true(*), the 
pages would be added to an llist, instead of being freed immediately. 
At the end of a shared critical section you would do

	if (!llist_empty(&kvm->arch.tdp_mmu_disconnected_pages)) {
		struct llist_node *first;
		kvm_mmu_lock(kvm);
		first = __llist_del_all(&kvm->arch.tdp_mmu_disconnected_pages);
		kvm_mmu_unlock(kvm);

		/*
		 * All vCPUs have already stopped using the pages when
		 * their TLBs were flushed.  The exclusive critical
		 * section above means that there can be no readers
		 * either.
		 */
		tdp_mmu_free_disconnected_pages(first);
	}

So this is still deferred reclamation, but it's done by one of the vCPUs 
rather than a worker RCU thread.  This would replace patches 11/12/13 
and probably would be implemented after patch 18.

Paolo

(*) this idea is what prompted the comment about s/atomic/shared/
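
As a rough illustration of the llist-based deferral described above, here is a userspace model using C11 atomics. It is a sketch, not the kernel's llist implementation: the model_ names are invented, the detach uses an atomic exchange rather than the non-atomic variant that an exclusive mmu_lock section would permit, and the "free" step only counts nodes.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct llist_node / struct llist_head. */
struct model_llist_node {
	struct model_llist_node *next;
};

struct model_llist_head {
	struct model_llist_node *_Atomic first;
};

/* Lock-free push, as a reader would do when it disconnects a page. */
static void model_llist_add(struct model_llist_node *node,
			    struct model_llist_head *head)
{
	struct model_llist_node *first = atomic_load(&head->first);

	do {
		node->next = first;
	} while (!atomic_compare_exchange_weak(&head->first, &first, node));
}

/* Detach the whole list in one step; the caller then owns every node. */
static struct model_llist_node *
model_llist_del_all(struct model_llist_head *head)
{
	return atomic_exchange(&head->first, (struct model_llist_node *)NULL);
}

/* Walk a detached list; a real version would free each page here. */
static int model_free_disconnected(struct model_llist_node *first)
{
	int n = 0;

	for (; first; first = first->next)
		n++;

	return n;
}
```

The point of the scheme is that the detach happens inside the short exclusive section, while the walk and the actual freeing happen after the lock is dropped.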



* Re: [PATCH 16/24] kvm: mmu: Wrap mmu_lock assertions
  2021-01-12 18:10 ` [PATCH 16/24] kvm: mmu: Wrap mmu_lock assertions Ben Gardon
@ 2021-01-26 14:29   ` Paolo Bonzini
  0 siblings, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-26 14:29 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Peter Xu, Sean Christopherson, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 12/01/21 19:10, Ben Gardon wrote:
> Wrap assertions and warnings checking the MMU lock state in a function
> which uses lockdep_assert_held. While the existing checks use a few
> different functions to check the lock state, they are all better off
> using lockdep_assert_held. This will support a refactoring to move the
> mmu_lock to struct kvm_arch so that it can be replaced with an rwlock for
> x86.
> 
> Reviewed-by: Peter Feiner <pfeiner@google.com>
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>   arch/arm64/kvm/mmu.c                     | 2 +-
>   arch/powerpc/include/asm/kvm_book3s_64.h | 7 +++----
>   arch/powerpc/kvm/book3s_hv_nested.c      | 3 +--
>   arch/x86/kvm/mmu/mmu_internal.h          | 4 ++--
>   arch/x86/kvm/mmu/tdp_mmu.c               | 8 ++++----
>   include/linux/kvm_host.h                 | 1 +
>   virt/kvm/kvm_main.c                      | 5 +++++
>   7 files changed, 17 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 57ef1ec23b56..8b54eb58bf47 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -130,7 +130,7 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64
>   	struct kvm *kvm = mmu->kvm;
>   	phys_addr_t end = start + size;
>   
> -	assert_spin_locked(&kvm->mmu_lock);
> +	kvm_mmu_lock_assert_held(kvm);
>   	WARN_ON(size & ~PAGE_MASK);
>   	WARN_ON(stage2_apply_range(kvm, start, end, kvm_pgtable_stage2_unmap,
>   				   may_block));
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
> index 9bb9bb370b53..db2e437cd97c 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -650,8 +650,8 @@ static inline pte_t *find_kvm_secondary_pte(struct kvm *kvm, unsigned long ea,
>   {
>   	pte_t *pte;
>   
> -	VM_WARN(!spin_is_locked(&kvm->mmu_lock),
> -		"%s called with kvm mmu_lock not held \n", __func__);
> +	kvm_mmu_lock_assert_held(kvm);
> +
>   	pte = __find_linux_pte(kvm->arch.pgtable, ea, NULL, hshift);
>   
>   	return pte;
> @@ -662,8 +662,7 @@ static inline pte_t *find_kvm_host_pte(struct kvm *kvm, unsigned long mmu_seq,
>   {
>   	pte_t *pte;
>   
> -	VM_WARN(!spin_is_locked(&kvm->mmu_lock),
> -		"%s called with kvm mmu_lock not held \n", __func__);
> +	kvm_mmu_lock_assert_held(kvm);
>   
>   	if (mmu_notifier_retry(kvm, mmu_seq))
>   		return NULL;
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
> index 18890dca9476..6d5987d1eee7 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -767,8 +767,7 @@ pte_t *find_kvm_nested_guest_pte(struct kvm *kvm, unsigned long lpid,
>   	if (!gp)
>   		return NULL;
>   
> -	VM_WARN(!spin_is_locked(&kvm->mmu_lock),
> -		"%s called with kvm mmu_lock not held \n", __func__);
> +	kvm_mmu_lock_assert_held(kvm);
>   	pte = __find_linux_pte(gp->shadow_pgtable, ea, NULL, hshift);
>   
>   	return pte;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 7f599cc64178..cc8268cf28d2 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -101,14 +101,14 @@ void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
>   static inline void kvm_mmu_get_root(struct kvm *kvm, struct kvm_mmu_page *sp)
>   {
>   	BUG_ON(!sp->root_count);
> -	lockdep_assert_held(&kvm->mmu_lock);
> +	kvm_mmu_lock_assert_held(kvm);
>   
>   	++sp->root_count;
>   }
>   
>   static inline bool kvm_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *sp)
>   {
> -	lockdep_assert_held(&kvm->mmu_lock);
> +	kvm_mmu_lock_assert_held(kvm);
>   	--sp->root_count;
>   
>   	return !sp->root_count;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index fb911ca428b2..1d7c01300495 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -117,7 +117,7 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
>   {
>   	gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
>   
> -	lockdep_assert_held(&kvm->mmu_lock);
> +	kvm_mmu_lock_assert_held(kvm);
>   
>   	WARN_ON(root->root_count);
>   	WARN_ON(!root->tdp_mmu_page);
> @@ -425,7 +425,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
>   	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
>   	int as_id = kvm_mmu_page_as_id(root);
>   
> -	lockdep_assert_held(&kvm->mmu_lock);
> +	kvm_mmu_lock_assert_held(kvm);
>   
>   	WRITE_ONCE(*iter->sptep, new_spte);
>   
> @@ -1139,7 +1139,7 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
>   	struct kvm_mmu_page *root;
>   	int root_as_id;
>   
> -	lockdep_assert_held(&kvm->mmu_lock);
> +	kvm_mmu_lock_assert_held(kvm);
>   	for_each_tdp_mmu_root(kvm, root) {
>   		root_as_id = kvm_mmu_page_as_id(root);
>   		if (root_as_id != slot->as_id)
> @@ -1324,7 +1324,7 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>   	int root_as_id;
>   	bool spte_set = false;
>   
> -	lockdep_assert_held(&kvm->mmu_lock);
> +	kvm_mmu_lock_assert_held(kvm);
>   	for_each_tdp_mmu_root(kvm, root) {
>   		root_as_id = kvm_mmu_page_as_id(root);
>   		if (root_as_id != slot->as_id)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 6e2773fc406c..022e3522788f 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1499,5 +1499,6 @@ void kvm_mmu_lock(struct kvm *kvm);
>   void kvm_mmu_unlock(struct kvm *kvm);
>   int kvm_mmu_lock_needbreak(struct kvm *kvm);
>   int kvm_mmu_lock_cond_resched(struct kvm *kvm);
> +void kvm_mmu_lock_assert_held(struct kvm *kvm);

Probably better to make this an empty inline if !defined(CONFIG_LOCKDEP).

Paolo

>   #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b4c49a7e0556..c504f876176b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -452,6 +452,11 @@ int kvm_mmu_lock_cond_resched(struct kvm *kvm)
>   	return cond_resched_lock(&kvm->mmu_lock);
>   }
>   
> +void kvm_mmu_lock_assert_held(struct kvm *kvm)
> +{
> +	lockdep_assert_held(&kvm->mmu_lock);
> +}
> +
>   #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>   static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>   {
> 
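
One possible shape for the "empty inline if !defined(CONFIG_LOCKDEP)" suggestion, as a userspace sketch. MODEL_LOCKDEP and struct model_kvm are hypothetical stand-ins for CONFIG_LOCKDEP and struct kvm, and abort() stands in for the lockdep splat. The real wrapper would live in a header rather than in kvm_main.c.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct kvm; only tracks whether the lock is held. */
struct model_kvm {
	bool mmu_lock_held;
};

#ifdef MODEL_LOCKDEP
/* Checking build: complain loudly if the caller does not hold the lock. */
static inline void model_mmu_lock_assert_held(struct model_kvm *kvm)
{
	if (!kvm->mmu_lock_held)
		abort();
}
#else
/* Non-checking build: an empty inline that the compiler can drop entirely. */
static inline void model_mmu_lock_assert_held(struct model_kvm *kvm)
{
	(void)kvm;
}
#endif
```

With the empty variant, call sites compile to nothing when the checker is disabled, instead of paying for an out-of-line call that does no work.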



* Re: [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak
  2021-01-21  0:19   ` Sean Christopherson
  2021-01-21 20:17     ` Paolo Bonzini
@ 2021-01-26 14:38     ` Paolo Bonzini
  2021-01-26 17:47       ` Ben Gardon
  1 sibling, 1 reply; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-26 14:38 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: linux-kernel, kvm, Peter Xu, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On 21/01/21 01:19, Sean Christopherson wrote:
> What if we simply make the common mmu_lock a union? The rwlock_t is 
> probably a bit bigger, but that's a few bytes for an entire VM. And 
> maybe this would entice/inspire other architectures to move to a similar 
> MMU model.

Looking more at this, there is a problem in that MMU notifier functions 
take the MMU lock.

Yes, a qrwlock is a bit larger than a qspinlock.  However, the fast 
path of qrwlocks is small, and the slow paths are tiny compared to 
the mmu_lock critical sections, which are so big as to require 
cond_resched.  So I would consider just changing all architectures to an 
rwlock.

Paolo



* Re: [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak
  2021-01-26 14:38     ` Paolo Bonzini
@ 2021-01-26 17:47       ` Ben Gardon
  2021-01-26 17:55         ` Paolo Bonzini
  0 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-26 17:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, LKML, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 26, 2021 at 6:38 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 21/01/21 01:19, Sean Christopherson wrote:
> > What if we simply make the common mmu_lock a union? The rwlock_t is
> > probably a bit bigger, but that's a few bytes for an entire VM. And
> > maybe this would entice/inspire other architectures to move to a similar
> > MMU model.
>
> Looking more at this, there is a problem in that MMU notifier functions
> take the MMU lock.
>
> Yes, a qrwlock is a bit larger than a qspinlock.  However, the fast
> path of qrwlocks is small, and the slow paths are tiny compared to
> the mmu_lock critical sections, which are so big as to require
> cond_resched.  So I would consider just changing all architectures to an
> rwlock.

I like the idea of putting the MMU lock union directly in struct KVM
and will make that change in the next version of this series. In my
testing, I found that wholesale replacing the spin lock with an rwlock
had a noticeable negative performance impact on the legacy / shadow
MMU. Enough that it motivated me to implement this more complex union
scheme. While the difference was pronounced in the dirty log perf test
microbenchmark, it's an open question as to whether it would matter in
practice.

>
> Paolo
>
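
A minimal sketch of the union scheme under discussion, assuming one flag chooses which member is live for the lifetime of the VM. The lock types here are plain state with no real synchronisation, and every model_ name is hypothetical. It only shows how a common wrapper could dispatch the exclusive acquire and release.

```c
#include <stdbool.h>

/* Hypothetical lock models: plain state, no real synchronisation. */
struct model_spinlock {
	bool locked;
};

struct model_rwlock {
	bool write_locked;
	int readers;
};

struct model_kvm {
	bool tdp_mmu_enabled;	/* selects which union member is live */
	union {
		struct model_spinlock mmu_lock;
		struct model_rwlock mmu_rwlock;
	};
};

/* Exclusive acquire: the spin lock, or the rwlock in write mode. */
static void model_mmu_lock(struct model_kvm *kvm)
{
	if (kvm->tdp_mmu_enabled)
		kvm->mmu_rwlock.write_locked = true;
	else
		kvm->mmu_lock.locked = true;
}

/* Matching release for model_mmu_lock(). */
static void model_mmu_unlock(struct model_kvm *kvm)
{
	if (kvm->tdp_mmu_enabled)
		kvm->mmu_rwlock.write_locked = false;
	else
		kvm->mmu_lock.locked = false;
}
```

The alternative raised in the thread, converting every architecture to an rwlock, would remove the flag and the union at the cost of the slower rwlock slow path reported for the legacy MMU.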


* Re: [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak
  2021-01-26 17:47       ` Ben Gardon
@ 2021-01-26 17:55         ` Paolo Bonzini
  2021-01-26 18:11           ` Ben Gardon
  0 siblings, 1 reply; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-26 17:55 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Sean Christopherson, LKML, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 26/01/21 18:47, Ben Gardon wrote:
> Enough that it motivated me to implement this more complex union
> scheme. While the difference was pronounced in the dirty log perf test
> microbenchmark, it's an open question as to whether it would matter in
> practice.

I'll look at getting some numbers if it's just the dirty log perf test. 
  Did you see anything in the profile pointing specifically at rwlock?

Paolo



* Re: [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak
  2021-01-26 17:55         ` Paolo Bonzini
@ 2021-01-26 18:11           ` Ben Gardon
  2021-01-26 20:47             ` Paolo Bonzini
  0 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-26 18:11 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, LKML, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 26, 2021 at 9:55 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 26/01/21 18:47, Ben Gardon wrote:
> > Enough that it motivated me to implement this more complex union
> > scheme. While the difference was pronounced in the dirty log perf test
> > microbenchmark, it's an open question as to whether it would matter in
> > practice.
>
> I'll look at getting some numbers if it's just the dirty log perf test.
>   Did you see anything in the profile pointing specifically at rwlock?

When I did a strict replacement I found ~10% worse memory population
performance.
Running dirty_log_perf_test -v 96 -b 3g -i 5 with the TDP MMU
disabled, I got 119 sec to populate memory as the baseline and 134 sec
with an earlier version of this series which just replaced the
spinlock with an rwlock. I believe this difference is statistically
significant, but didn't run multiple trials.
I didn't take notes when profiling, but I'm pretty sure the rwlock
slowpath showed up a lot. This was a very high contention scenario, so
it's probably not indicative of real-world performance.
In the slow path, the rwlock is certainly slower than a spin lock.

If the real impact doesn't seem too large, I'd be very happy to just
replace the spinlock.

>
> Paolo
>


* Re: [PATCH 12/24] kvm: x86/kvm: RCU dereference tdp mmu page table links
  2021-01-22 18:32   ` Sean Christopherson
@ 2021-01-26 18:17     ` Ben Gardon
  0 siblings, 0 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-26 18:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Paolo Bonzini, Peter Xu, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On Fri, Jan 22, 2021 at 10:32 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jan 12, 2021, Ben Gardon wrote:
> > In order to protect TDP MMU PT memory with RCU, ensure that page table
> > links are properly rcu_dereferenced.
> >
> > Reviewed-by: Peter Feiner <pfeiner@google.com>
> >
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_iter.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> > index 87b7e16911db..82855613ffa0 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.c
> > +++ b/arch/x86/kvm/mmu/tdp_iter.c
> > @@ -49,6 +49,8 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> >   */
> >  u64 *spte_to_child_pt(u64 spte, int level)
> >  {
> > +     u64 *child_pt;
> > +
> >       /*
> >        * There's no child entry if this entry isn't present or is a
> >        * last-level entry.
> > @@ -56,7 +58,9 @@ u64 *spte_to_child_pt(u64 spte, int level)
> >       if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
> >               return NULL;
> >
> > -     return __va(spte_to_pfn(spte) << PAGE_SHIFT);
> > +     child_pt = __va(spte_to_pfn(spte) << PAGE_SHIFT);
> > +
> > +     return rcu_dereference(child_pt);
>
> This is what bugs me the most about the RCU usage.  We're reaping the functional
> benefits of RCU without doing the grunt work to truly RCU-ify the TDP MMU.  The
> above rcu_dereference() barely scratches the surface of what's being protected
> by RCU.  There are already multiple mechanisms that protect the page tables,
> throwing RCU into the mix without fully integrating RCU makes for simple code
> and avoids reinventing the wheel (big thumbs up), but ends up adding complexity
> to an already complex system.  E.g. the lockless walks in the old MMU are
> complex on the surface, but I find them easier to think through because they
> explicitly rely on the same mechanism (remote TLB flush) that is used to protect
> guest usage of the page tables.
>
> Ideally, I think struct kvm_mmu_page's 'u64 *spt' would be annotated with __rcu,
> as that would provide a high level of enforcement and would also highlight where
> we're using other mechanisms to ensure correctness.  E.g. dereferencing root->spt
> in kvm_tdp_mmu_get_vcpu_root_hpa() relies on the root being pinned by
> get_tdp_mmu_vcpu_root(), and _that_ in turn relies on holding the rwlock for write.
> Unfortunately, since kvm_mmu_page is shared with the old MMU, annotating ->spt
> doesn't work well.  We could employ a union to make it work, but that'd
> probably do more harm than good.
>
> The middle ground would be to annotate pt_path and sptep in struct tdp_iter.
> That gets a decent chunk of the enforcement and also helps highlight what's
> being protected with RCU.  Assuming we end up going with RCU, I think this
> single rcu_dereference should be replaced with something like the below patch.

Thank you for explaining your thought process here. You make an
excellent point that this results in code that is substantially less
self-documenting than it could be. It seems like your patch below will
substantially improve the automated checker's ability to validate the
RCU usage as well. I'll happily include it in the next version of this
series. I appreciate the way that the patch below makes all references
to the entries of the page table RCU dereferences. Not doing those
dereferences was certainly an error in the original patch.
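
For readers less familiar with the annotation, here is a minimal userspace
sketch of the access pattern under discussion: a pointer type tagged __rcu,
with every read of a page table entry going through rcu_dereference(). The
macro definitions are stand-ins that assume GCC or Clang builtins, not the
kernel's implementations, and model_read_spte() is a hypothetical name. In the
kernel, __rcu is chiefly a static-checker (sparse) annotation, which is why it
is defined away here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Defined away so the sketch builds as ordinary userspace C. */
#define __rcu

/* Userspace stand-ins (assumptions, not the kernel macros). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define rcu_dereference(p) __atomic_load_n(&(p), __ATOMIC_CONSUME)

/* Same shape as the typedef in the quoted patch. */
typedef uint64_t __rcu *tdp_ptep_t;

/* Hypothetical helper: read one SPTE through an RCU-annotated pointer. */
static uint64_t model_read_spte(tdp_ptep_t sptep)
{
	assert(sptep != NULL);
	return READ_ONCE(*rcu_dereference(sptep));
}
```

With the annotation on the iterator fields, a plain *iter->sptep becomes
something the checker can flag, which is the enforcement benefit described
above.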

>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index 82855613ffa0..e000642d938d 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -12,7 +12,7 @@ static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
>  {
>         iter->sptep = iter->pt_path[iter->level - 1] +
>                 SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
> -       iter->old_spte = READ_ONCE(*iter->sptep);
> +       iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
>  }
>
>  static gfn_t round_gfn_for_level(gfn_t gfn, int level)
> @@ -34,7 +34,7 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
>         iter->root_level = root_level;
>         iter->min_level = min_level;
>         iter->level = root_level;
> -       iter->pt_path[iter->level - 1] = root_pt;
> +       iter->pt_path[iter->level - 1] = (tdp_ptep_t)root_pt;
>
>         iter->gfn = round_gfn_for_level(iter->goal_gfn, iter->level);
>         tdp_iter_refresh_sptep(iter);
> @@ -47,9 +47,9 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
>   * address of the child page table referenced by the SPTE. Returns null if
>   * there is no such entry.
>   */
> -u64 *spte_to_child_pt(u64 spte, int level)
> +tdp_ptep_t spte_to_child_pt(u64 spte, int level)
>  {
> -       u64 *child_pt;
> +       tdp_ptep_t child_pt;
>
>         /*
>          * There's no child entry if this entry isn't present or is a
> @@ -58,9 +58,9 @@ u64 *spte_to_child_pt(u64 spte, int level)
>         if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
>                 return NULL;
>
> -       child_pt = __va(spte_to_pfn(spte) << PAGE_SHIFT);
> +       child_pt = (tdp_ptep_t)__va(spte_to_pfn(spte) << PAGE_SHIFT);
>
> -       return rcu_dereference(child_pt);
> +       return child_pt;
>  }
>
>  /*
> @@ -69,7 +69,7 @@ u64 *spte_to_child_pt(u64 spte, int level)
>   */
>  static bool try_step_down(struct tdp_iter *iter)
>  {
> -       u64 *child_pt;
> +       tdp_ptep_t child_pt;
>
>         if (iter->level == iter->min_level)
>                 return false;
> @@ -78,7 +78,7 @@ static bool try_step_down(struct tdp_iter *iter)
>          * Reread the SPTE before stepping down to avoid traversing into page
>          * tables that are no longer linked from this entry.
>          */
> -       iter->old_spte = READ_ONCE(*iter->sptep);
> +       iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
>
>         child_pt = spte_to_child_pt(iter->old_spte, iter->level);
>         if (!child_pt)
> @@ -112,7 +112,7 @@ static bool try_step_side(struct tdp_iter *iter)
>         iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
>         iter->goal_gfn = iter->gfn;
>         iter->sptep++;
> -       iter->old_spte = READ_ONCE(*iter->sptep);
> +       iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
>
>         return true;
>  }
> @@ -175,11 +175,11 @@ void tdp_iter_refresh_walk(struct tdp_iter *iter)
>         if (iter->gfn > goal_gfn)
>                 goal_gfn = iter->gfn;
>
> -       tdp_iter_start(iter, iter->pt_path[iter->root_level - 1],
> +       tdp_iter_start(iter, rcu_dereference(iter->pt_path[iter->root_level - 1]),
>                        iter->root_level, iter->min_level, goal_gfn);
>  }
>
> -u64 *tdp_iter_root_pt(struct tdp_iter *iter)
> +tdp_ptep_t tdp_iter_root_pt(struct tdp_iter *iter)
>  {
>         return iter->pt_path[iter->root_level - 1];
>  }
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index 47170d0dc98e..bf882dab8ec5 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -7,6 +7,8 @@
>
>  #include "mmu.h"
>
> +typedef u64 __rcu *tdp_ptep_t;
> +
>  /*
>   * A TDP iterator performs a pre-order walk over a TDP paging structure.
>   */
> @@ -17,9 +19,9 @@ struct tdp_iter {
>          */
>         gfn_t goal_gfn;
>         /* Pointers to the page tables traversed to reach the current SPTE */
> -       u64 *pt_path[PT64_ROOT_MAX_LEVEL];
> +       tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
>         /* A pointer to the current SPTE */
> -       u64 *sptep;
> +       tdp_ptep_t sptep;
>         /* The lowest GFN mapped by the current SPTE */
>         gfn_t gfn;
>         /* The level of the root page given to the iterator */
> @@ -49,12 +51,12 @@ struct tdp_iter {
>  #define for_each_tdp_pte(iter, root, root_level, start, end) \
>         for_each_tdp_pte_min_level(iter, root, root_level, PG_LEVEL_4K, start, end)
>
> -u64 *spte_to_child_pt(u64 pte, int level);
> +tdp_ptep_t spte_to_child_pt(u64 pte, int level);
>
>  void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
>                     int min_level, gfn_t goal_gfn);
>  void tdp_iter_next(struct tdp_iter *iter);
>  void tdp_iter_refresh_walk(struct tdp_iter *iter);
> -u64 *tdp_iter_root_pt(struct tdp_iter *iter);
> +tdp_ptep_t tdp_iter_root_pt(struct tdp_iter *iter);
>
>  #endif /* __KVM_X86_MMU_TDP_ITER_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 45160ff84e91..27b850904230 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -509,7 +509,7 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>                                            struct tdp_iter *iter,
>                                            u64 new_spte)
>  {
> -       u64 *root_pt = tdp_iter_root_pt(iter);
> +       tdp_ptep_t root_pt = tdp_iter_root_pt(iter);
>         struct kvm_mmu_page *root = sptep_to_sp(root_pt);
>         int as_id = kvm_mmu_page_as_id(root);
>
>


* Re: [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak
  2021-01-26 18:11           ` Ben Gardon
@ 2021-01-26 20:47             ` Paolo Bonzini
  2021-01-27 20:08               ` Ben Gardon
  0 siblings, 1 reply; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-26 20:47 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Sean Christopherson, LKML, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 26/01/21 19:11, Ben Gardon wrote:
> When I did a strict replacement I found ~10% worse memory population
> performance.
> Running dirty_log_perf_test -v 96 -b 3g -i 5 with the TDP MMU
> disabled, I got 119 sec to populate memory as the baseline and 134 sec
> with an earlier version of this series which just replaced the
> spinlock with an rwlock. I believe this difference is statistically
> significant, but didn't run multiple trials.
> I didn't take notes when profiling, but I'm pretty sure the rwlock
> slowpath showed up a lot. This was a very high contention scenario, so
> it's probably not indicative of real-world performance.
> In the slow path, the rwlock is certainly slower than a spin lock.
> 
> If the real impact doesn't seem too large, I'd be very happy to just
> replace the spinlock.

Ok, so let's use the union idea and add a "#define KVM_HAVE_MMU_RWLOCK" 
to x86.  The virt/kvm/kvm_main.c MMU notifier functions can use the
#define to pick between write_lock and spin_lock.

For x86 I want to switch to tdp_mmu=1 by default as soon as parallel 
page faults are in, so we can use the rwlock unconditionally and drop 
the wrappers, except possibly for some kind of kvm_mmu_lock/unlock_root 
that chooses between read_lock for TDP MMU and write_lock for shadow MMU.

Thanks!

Paolo



* Re: [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock
  2021-01-26 13:37   ` Paolo Bonzini
@ 2021-01-26 21:07     ` Ben Gardon
  0 siblings, 0 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-26 21:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 26, 2021 at 5:37 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 12/01/21 19:10, Ben Gardon wrote:
> > +      *  May be acquired under the MMU lock in read mode or non-overlapping
> > +      *  with the MMU lock.
> > +      */
> > +     spinlock_t tdp_mmu_pages_lock;
>
> Is this correct?  My understanding is that:
>
> - you can take tdp_mmu_pages_lock from a shared MMU lock critical section
>
> - you don't need to take tdp_mmu_pages_lock from an exclusive MMU lock
> critical section, because you can't be concurrent with a shared critical
> section
>
> - but then, you can't take tdp_mmu_pages_lock outside the MMU lock,
> because you could have
>
>     write_lock(mmu_lock)
>                                       spin_lock(tdp_mmu_pages_lock)
>     do tdp_mmu_pages_lock stuff  !!!  do tdp_mmu_pages_lock stuff
>     write_unlock(mmu_lock)
>                                       spin_unlock(tdp_mmu_pages_lock)
>

You're absolutely right, that would cause a problem. I'll amend the
comment to specify that the lock should only be held under the mmu
lock in read mode.
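
To make the amended rule concrete, here is a small single-threaded model
written in the spirit of lockdep assertions. It does not implement any locking;
it only records which locks a caller claims to hold and checks the rule stated
above. All names (lock_ctx, may_take_pages_lock, and so on) are hypothetical
and do not exist in KVM.

```c
#include <assert.h>
#include <stdbool.h>

enum mmu_lock_mode { MMU_UNLOCKED, MMU_READ, MMU_WRITE };

/* What one thread currently holds (bookkeeping only, no real locks). */
struct lock_ctx {
	enum mmu_lock_mode mmu_lock;
	bool pages_lock;
};

/* Rule: tdp_mmu_pages_lock is only taken under mmu_lock held for read. */
static bool may_take_pages_lock(const struct lock_ctx *ctx)
{
	return ctx->mmu_lock == MMU_READ && !ctx->pages_lock;
}

static void take_pages_lock(struct lock_ctx *ctx)
{
	assert(may_take_pages_lock(ctx));
	ctx->pages_lock = true;
}

/*
 * tdp_mmu_pages may be modified either by a writer (already exclusive)
 * or by a reader that also holds tdp_mmu_pages_lock.
 */
static bool may_modify_tdp_mmu_pages(const struct lock_ctx *ctx)
{
	if (ctx->mmu_lock == MMU_WRITE)
		return true;
	return ctx->mmu_lock == MMU_READ && ctx->pages_lock;
}
```

The case Paolo describes, taking the spinlock with no MMU lock held at all, is
rejected by may_take_pages_lock() because it would race with a writer that
legitimately skips the spinlock.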

> Paolo
>


* Re: [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock
  2021-01-26 14:27       ` Paolo Bonzini
@ 2021-01-26 21:47         ` Ben Gardon
  2021-01-26 22:02         ` Sean Christopherson
  1 sibling, 0 replies; 70+ messages in thread
From: Ben Gardon @ 2021-01-26 21:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, LKML, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 26, 2021 at 6:28 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 21/01/21 22:32, Sean Christopherson wrote:
> > Coming back to this series, I wonder if the RCU approach is truly necessary to
> > get the desired scalability.  If both zap_collapsible_sptes() and NX huge page
> > recovery zap _only_ leaf SPTEs, then the only path that can actually unlink a
> > shadow page while holding the lock for read is the page fault path that installs
> > a huge page over an existing shadow page.
> >
> > Assuming the above analysis is correct, I think it's worth exploring alternatives
> > to using RCU to defer freeing the SP memory, e.g. promoting to a write lock in
> > the specific case of overwriting a SP (though that may not exist for rwlocks),
> > or maybe something entirely different?
>
> You can do the deferred freeing with a short write-side critical section
> to ensure all readers have terminated.
>
> If the bool argument to handle_disconnected_tdp_mmu_page is true(*), the
> pages would be added to an llist, instead of being freed immediately.
> At the end of a shared critical section you would do
>
>         if (!llist_empty(&kvm->arch.tdp_mmu_disconnected_pages)) {
>                 struct llist_node *first;
>                 kvm_mmu_lock(kvm);
>                 first = __llist_del_all(&kvm->arch.tdp_mmu_disconnected_pages);
>                 kvm_mmu_unlock(kvm);
>
>                 /*
>                  * All vCPUs have already stopped using the pages when
>                  * their TLBs were flushed.  The exclusive critical
>                  * section above means that there can be no readers
>                  * either.
>                  */
>                 tdp_mmu_free_disconnected_pages(first);
>         }
>
> So this is still deferred reclamation, but it's done by one of the vCPUs
> rather than a worker RCU thread.  This would replace patches 11/12/13
> and probably would be implemented after patch 18.

While I agree that this would work, it could be a major performance
bottleneck as it could result in the MMU lock being acquired in write
mode by a vCPU thread handling a page fault. Even though the critical
section is very short, it still has to serialize with the potentially
many overlapping page fault handlers which want the MMU read lock. In
order to perform well with hundreds of vCPUs, the vCPU threads really
cannot be acquiring the MMU lock in write mode. The MMU lock above
could be replaced with the TDP MMU pages lock, but that still adds
serialization where it's not really necessary.
The use of RCU also provides a nice separation of concerns, freeing
the various functions which need to remove pages from the paging
structure from having to follow up on freeing them later.
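
For reference, the llist-based alternative quoted above can be modeled in
userspace roughly as follows. This is a sketch using C11 atomics rather than
the kernel's llist implementation; the model_ prefixed names are hypothetical,
and model_reap() only counts nodes where real code would free pages.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct llist_node { struct llist_node *next; };
struct llist_head { _Atomic(struct llist_node *) first; };

/* Push one node; safe against concurrent pushers thanks to the CAS loop. */
static void model_llist_add(struct llist_node *node, struct llist_head *head)
{
	struct llist_node *old = atomic_load(&head->first);

	assert(node != NULL);
	do {
		node->next = old;
	} while (!atomic_compare_exchange_weak(&head->first, &old, node));
}

/* Detach the entire list with a single atomic exchange. */
static struct llist_node *model_llist_del_all(struct llist_head *head)
{
	return atomic_exchange(&head->first, NULL);
}

/* Walk a detached list; counting stands in for freeing each page. */
static size_t model_reap(struct llist_node *first)
{
	size_t n = 0;

	for (; first; first = first->next)
		n++;
	return n;
}
```

Adding to the list is cheap for readers. The cost being objected to here is
the exclusive section needed before the detached pages can safely be freed.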

>
> Paolo
>
> (*) this idea is what prompted the comment about s/atomic/shared/
>


* Re: [PATCH 24/24] kvm: x86/mmu: Allow parallel page faults for the TDP MMU
  2021-01-21  0:55   ` Sean Christopherson
@ 2021-01-26 21:57     ` Ben Gardon
  2021-01-27 17:14       ` Sean Christopherson
  0 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-26 21:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Paolo Bonzini, Peter Xu, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On Wed, Jan 20, 2021 at 4:56 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jan 12, 2021, Ben Gardon wrote:
> > Make the last few changes necessary to enable the TDP MMU to handle page
> > faults in parallel while holding the mmu_lock in read mode.
> >
> > Reviewed-by: Peter Feiner <pfeiner@google.com>
> >
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 12 ++++++++++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 280d7cd6f94b..fa111ceb67d4 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3724,7 +3724,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> >               return r;
> >
> >       r = RET_PF_RETRY;
> > -     kvm_mmu_lock(vcpu->kvm);
> > +
> > +     if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
>
> Off topic, what do you think about rewriting is_tdp_mmu_root() to be both more
> performant and self-documenting as to when is_tdp_mmu_root() !=
> kvm->arch.tdp_mmu_enabled?  E.g. key off is_guest_mode() and then do a thorough
> audit/check when CONFIG_KVM_MMU_AUDIT=y?
>
> #ifdef CONFIG_KVM_MMU_AUDIT
> bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
> {
>         struct kvm_mmu_page *sp;
>
>         if (!kvm->arch.tdp_mmu_enabled)
>                 return false;
>         if (WARN_ON(!VALID_PAGE(hpa)))
>                 return false;
>
>         sp = to_shadow_page(hpa);
>         if (WARN_ON(!sp))
>                 return false;
>
>         return sp->tdp_mmu_page && sp->root_count;
> }
> #endif
>
> bool is_tdp_mmu(struct kvm_vcpu *vcpu)
> {
>         bool is_tdp_mmu = vcpu->kvm->arch.tdp_mmu_enabled && !is_guest_mode(vcpu);
>
> #ifdef CONFIG_KVM_MMU_AUDIT
>         WARN_ON(is_tdp_mmu != is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa));
> #endif
>         return is_tdp_mmu;
> }

Great suggestions. In the interest of keeping this (already enormous)
series small, I'm inclined to make those changes in a future series if
that's alright with you.

>
> > +             kvm_mmu_lock_shared(vcpu->kvm);
> > +     else
> > +             kvm_mmu_lock(vcpu->kvm);
> > +
> >       if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
> >               goto out_unlock;
> >       r = make_mmu_pages_available(vcpu);
> > @@ -3739,7 +3744,10 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> >                                prefault, is_tdp);
> >
> >  out_unlock:
> > -     kvm_mmu_unlock(vcpu->kvm);
> > +     if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> > +             kvm_mmu_unlock_shared(vcpu->kvm);
> > +     else
> > +             kvm_mmu_unlock(vcpu->kvm);
> >       kvm_release_pfn_clean(pfn);
> >       return r;
> >  }
> > --
> > 2.30.0.284.gd98b1dd5eaa7-goog
> >


* Re: [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock
  2021-01-26 14:27       ` Paolo Bonzini
  2021-01-26 21:47         ` Ben Gardon
@ 2021-01-26 22:02         ` Sean Christopherson
  2021-01-26 22:09           ` Sean Christopherson
  2021-01-27 12:40           ` Paolo Bonzini
  1 sibling, 2 replies; 70+ messages in thread
From: Sean Christopherson @ 2021-01-26 22:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ben Gardon, linux-kernel, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 26, 2021, Paolo Bonzini wrote:
> On 21/01/21 22:32, Sean Christopherson wrote:
> > Coming back to this series, I wonder if the RCU approach is truly necessary to
> > get the desired scalability.  If both zap_collapsible_sptes() and NX huge page
> > recovery zap _only_ leaf SPTEs, then the only path that can actually unlink a
> > shadow page while holding the lock for read is the page fault path that installs
> > a huge page over an existing shadow page.
> > 
> > Assuming the above analysis is correct, I think it's worth exploring alternatives
> > to using RCU to defer freeing the SP memory, e.g. promoting to a write lock in
> > the specific case of overwriting a SP (though that may not exist for rwlocks),
> > or maybe something entirely different?
> 
> You can do the deferred freeing with a short write-side critical section to
> ensure all readers have terminated.

Hmm, the most obvious downside I see is that the zap_collapsible_sptes() case
will not scale as well as the RCU approach.  E.g. the lock may be heavily
contested when refaulting all of guest memory to (re)install huge pages after a
failed migration.

Though I wonder, could we do something even more clever for that particular
case?  And I suppose it would apply to NX huge pages as well.  Instead of
zapping the leaf PTEs and letting the fault handler install the huge page, do an
in-place promotion when dirty logging is disabled.  That could all be done under
the read lock, and with Paolo's method for deferred free on the back end.  That
way only the thread doing the memslot update would take mmu_lock for write, and
only once per memslot update.

> If the bool argument to handle_disconnected_tdp_mmu_page is true(*), the
> pages would be added to an llist, instead of being freed immediately. At the
> end of a shared critical section you would do
> 
> 	if (!llist_empty(&kvm->arch.tdp_mmu_disconnected_pages)) {
> 		struct llist_node *first;
> 		kvm_mmu_lock(kvm);
> 		first = __llist_del_all(&kvm->arch.tdp_mmu_disconnected_pages);
> 		kvm_mmu_unlock(kvm);
> 
> 		/*
> 		 * All vCPUs have already stopped using the pages when
> 		 * their TLBs were flushed.  The exclusive critical
> 		 * section above means that there can be no readers
> 		 * either.
> 		 */
> 		tdp_mmu_free_disconnected_pages(first);
> 	}
> 
> So this is still deferred reclamation, but it's done by one of the vCPUs
> rather than a worker RCU thread.  This would replace patches 11/12/13 and
> probably would be implemented after patch 18.
> 
> Paolo
> 
> (*) this idea is what prompted the comment about s/atomic/shared/
> 


* Re: [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock
  2021-01-26 22:02         ` Sean Christopherson
@ 2021-01-26 22:09           ` Sean Christopherson
  2021-01-27 12:40           ` Paolo Bonzini
  1 sibling, 0 replies; 70+ messages in thread
From: Sean Christopherson @ 2021-01-26 22:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ben Gardon, linux-kernel, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 26, 2021, Sean Christopherson wrote:
> On Tue, Jan 26, 2021, Paolo Bonzini wrote:
> > On 21/01/21 22:32, Sean Christopherson wrote:
> > > Coming back to this series, I wonder if the RCU approach is truly necessary to
> > > get the desired scalability.  If both zap_collapsible_sptes() and NX huge page
> > > recovery zap _only_ leaf SPTEs, then the only path that can actually unlink a
> > > shadow page while holding the lock for read is the page fault path that installs
> > > a huge page over an existing shadow page.
> > > 
> > > Assuming the above analysis is correct, I think it's worth exploring alternatives
> > > to using RCU to defer freeing the SP memory, e.g. promoting to a write lock in
> > > the specific case of overwriting a SP (though that may not exist for rwlocks),
> > > or maybe something entirely different?
> > 
> > You can do the deferred freeing with a short write-side critical section to
> > ensure all readers have terminated.
> 
> Hmm, the most obvious downside I see is that the zap_collapsible_sptes() case
> will not scale as well as the RCU approach.  E.g. the lock may be heavily
> contested when refaulting all of guest memory to (re)install huge pages after a
> failed migration.
> 
> Though I wonder, could we do something even more clever for that particular
> case?  And I suppose it would apply to NX huge pages as well.  Instead of
> zapping the leaf PTEs and letting the fault handler install the huge page, do an
> in-place promotion when dirty logging is disabled.  That could all be done under
> the read lock, and with Paolo's method for deferred free on the back end.  That
> way only the thread doing the memslot update would take mmu_lock for write, and
> only once per memslot update.

Oh, and we could even skip the remote TLB flush in that case since the GPA->HPA
translation is unchanged.


* Re: [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock
  2021-01-26 22:02         ` Sean Christopherson
  2021-01-26 22:09           ` Sean Christopherson
@ 2021-01-27 12:40           ` Paolo Bonzini
  1 sibling, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-27 12:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ben Gardon, linux-kernel, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 26/01/21 23:02, Sean Christopherson wrote:
>> You can do the deferred freeing with a short write-side critical section to
>> ensure all readers have terminated.
>
> Hmm, the most obvious downside I see is that the zap_collapsible_sptes() case
> will not scale as well as the RCU approach.  E.g. the lock may be heavily
> contested when refaulting all of guest memory to (re)install huge pages after a
> failed migration.

The simplest solution is to use a write_trylock on the read_unlock() 
path; if it fails, schedule a delayed work item 1 second in the future 
so that it's possible to do some batching.

(The work item would also have to re-check the llist after each iteration.)
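
A rough userspace model of the trylock idea above, under stated assumptions:
the rwlock is a single atomic counter, "pending" counts disconnected pages
instead of holding a real list, and setting work_scheduled stands in for
queueing a delayed work item. None of these names exist in KVM.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct mmu_model {
	atomic_int lock;	/* -1: writer, 0: free, >0: reader count */
	atomic_int pending;	/* disconnected pages awaiting free */
	int freed;		/* pages freed so far */
	bool work_scheduled;	/* stand-in for a delayed work item */
};

static void model_read_lock(struct mmu_model *m)
{
	int v = atomic_load(&m->lock);

	do {
		while (v < 0)	/* spin while a writer holds the lock */
			v = atomic_load(&m->lock);
	} while (!atomic_compare_exchange_weak(&m->lock, &v, v + 1));
}

static bool model_write_trylock(struct mmu_model *m)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&m->lock, &expected, -1);
}

static void model_write_unlock(struct mmu_model *m)
{
	assert(atomic_load(&m->lock) == -1);
	atomic_store(&m->lock, 0);
}

/* Drop the read lock, then opportunistically reap disconnected pages. */
static void model_read_unlock(struct mmu_model *m)
{
	int n;

	atomic_fetch_sub(&m->lock, 1);
	if (atomic_load(&m->pending) == 0)
		return;
	if (!model_write_trylock(m)) {
		m->work_scheduled = true;	/* defer, allowing batching */
		return;
	}
	n = atomic_exchange(&m->pending, 0);	/* detach under the lock */
	model_write_unlock(m);
	m->freed += n;				/* free outside the lock */
}
```

The point of the trylock is that a reader never waits for exclusive access: it
either gets the lock immediately or hands the work to the deferred path.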

Paolo



* Re: [PATCH 24/24] kvm: x86/mmu: Allow parallel page faults for the TDP MMU
  2021-01-26 21:57     ` Ben Gardon
@ 2021-01-27 17:14       ` Sean Christopherson
  0 siblings, 0 replies; 70+ messages in thread
From: Sean Christopherson @ 2021-01-27 17:14 UTC (permalink / raw)
  To: Ben Gardon
  Cc: LKML, kvm, Paolo Bonzini, Peter Xu, Peter Shier, Peter Feiner,
	Junaid Shahid, Jim Mattson, Yulei Zhang, Wanpeng Li,
	Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 26, 2021, Ben Gardon wrote:
> On Wed, Jan 20, 2021 at 4:56 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Jan 12, 2021, Ben Gardon wrote:
> > > Make the last few changes necessary to enable the TDP MMU to handle page
> > > faults in parallel while holding the mmu_lock in read mode.
> > >
> > > Reviewed-by: Peter Feiner <pfeiner@google.com>
> > >
> > > Signed-off-by: Ben Gardon <bgardon@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c | 12 ++++++++++--
> > >  1 file changed, 10 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 280d7cd6f94b..fa111ceb67d4 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -3724,7 +3724,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> > >               return r;
> > >
> > >       r = RET_PF_RETRY;
> > > -     kvm_mmu_lock(vcpu->kvm);
> > > +
> > > +     if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> >
> > Off topic, what do you think about rewriting is_tdp_mmu_root() to be both more
> > performant and self-documenting as to when is_tdp_mmu_root() !=
> > kvm->arch.tdp_mmu_enabled?  E.g. key off is_guest_mode() and then do a thorough
> > audit/check when CONFIG_KVM_MMU_AUDIT=y?
> >
> > #ifdef CONFIG_KVM_MMU_AUDIT
> > bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
> > {
> >         struct kvm_mmu_page *sp;
> >
> >         if (!kvm->arch.tdp_mmu_enabled)
> >                 return false;
> >         if (WARN_ON(!VALID_PAGE(hpa)))
> >                 return false;
> >
> >         sp = to_shadow_page(hpa);
> >         if (WARN_ON(!sp))
> >                 return false;
> >
> >         return sp->tdp_mmu_page && sp->root_count;
> > }
> > #endif
> >
> > bool is_tdp_mmu(struct kvm_vcpu *vcpu)
> > {
> > >         bool is_tdp_mmu = vcpu->kvm->arch.tdp_mmu_enabled && !is_guest_mode(vcpu);
> >
> > #ifdef CONFIG_KVM_MMU_AUDIT
> >         WARN_ON(is_tdp_mmu != is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa));
> > #endif
> >         return is_tdp_mmu;
> > }
> 
> Great suggestions. In the interest of keeping this (already enormous)
> series small, I'm inclined to make those changes in a future series if
> that's alright with you.

Yep, definitely a different series.


* Re: [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak
  2021-01-26 20:47             ` Paolo Bonzini
@ 2021-01-27 20:08               ` Ben Gardon
  2021-01-27 20:55                 ` Paolo Bonzini
  0 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-27 20:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, LKML, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Jan 26, 2021 at 12:48 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 26/01/21 19:11, Ben Gardon wrote:
> > When I did a strict replacement I found ~10% worse memory population
> > performance.
> > Running dirty_log_perf_test -v 96 -b 3g -i 5 with the TDP MMU
> > disabled, I got 119 sec to populate memory as the baseline and 134 sec
> > with an earlier version of this series which just replaced the
> > spinlock with an rwlock. I believe this difference is statistically
> > significant, but didn't run multiple trials.
> > I didn't take notes when profiling, but I'm pretty sure the rwlock
> > slowpath showed up a lot. This was a very high contention scenario, so
> > it's probably not indicative of real-world performance.
> > In the slow path, the rwlock is certainly slower than a spin lock.
> >
> > If the real impact doesn't seem too large, I'd be very happy to just
> > replace the spinlock.
>
> Ok, so let's use the union idea and add a "#define KVM_HAVE_MMU_RWLOCK"
> to x86.  The virt/kvm/kvm_main.c MMU notifier functions can use the
> #define to pick between write_lock and spin_lock.

I'm not entirely sure I understand this suggestion. Are you suggesting
we'd have the spinlock and rwlock in a union in struct kvm but then
use a static define to choose which one is used by other functions? It
seems like if we're using static defines the union doesn't add value.
If we do use the union, I think the advantages offered by __weak
wrapper functions, overridden on a per-arch basis, are worthwhile.

>
> For x86 I want to switch to tdp_mmu=1 by default as soon as parallel
> page faults are in, so we can use the rwlock unconditionally and drop
> the wrappers, except possibly for some kind of kvm_mmu_lock/unlock_root
> that chooses between read_lock for TDP MMU and write_lock for shadow MMU.
>
> Thanks!
>
> Paolo
>


* Re: [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak
  2021-01-27 20:08               ` Ben Gardon
@ 2021-01-27 20:55                 ` Paolo Bonzini
  2021-01-27 21:20                   ` Ben Gardon
  0 siblings, 1 reply; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-27 20:55 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Sean Christopherson, LKML, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 27/01/21 21:08, Ben Gardon wrote:
> I'm not entirely sure I understand this suggestion. Are you suggesting
> we'd have the spinlock and rwlock in a union in struct kvm but then
> use a static define to choose which one is used by other functions? It
> seems like if we're using static defines the union doesn't add value.

Of course you're right.  You'd just place the #ifdef in the struct kvm 
definition.

You can place static inline functions for lock/unlock in 
virt/kvm/mmu_lock.h, in order to avoid a proliferation of #ifdefs.
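
As an illustration only, such a header could take roughly the following shape.
This is a userspace sketch with C11 atomics standing in for rwlock_t and
spinlock_t; struct kvm_model and the kvm_model_ function names are
hypothetical, and KVM_HAVE_MMU_RWLOCK is defined locally here where the
proposal would have x86 define it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumption: the architecture (x86 in the proposal) would define this. */
#define KVM_HAVE_MMU_RWLOCK 1

struct kvm_model {
#ifdef KVM_HAVE_MMU_RWLOCK
	atomic_int mmu_lock;	/* rwlock model: -1 writer, 0 free, >0 readers */
#else
	atomic_bool mmu_lock;	/* spinlock model: true while held */
#endif
};

/* Exclusive acquisition: write_lock() or spin_lock(), chosen at build time. */
static inline void kvm_model_mmu_lock(struct kvm_model *kvm)
{
#ifdef KVM_HAVE_MMU_RWLOCK
	int expected = 0;

	/* Wait until there is neither a writer nor any reader. */
	while (!atomic_compare_exchange_weak(&kvm->mmu_lock, &expected, -1))
		expected = 0;
#else
	while (atomic_exchange(&kvm->mmu_lock, true))
		;
#endif
}

static inline void kvm_model_mmu_unlock(struct kvm_model *kvm)
{
	assert(atomic_load(&kvm->mmu_lock) != 0);
#ifdef KVM_HAVE_MMU_RWLOCK
	atomic_store(&kvm->mmu_lock, 0);
#else
	atomic_store(&kvm->mmu_lock, false);
#endif
}

static inline bool kvm_model_mmu_is_locked(struct kvm_model *kvm)
{
	return atomic_load(&kvm->mmu_lock) != 0;
}
```

Callers such as the MMU notifier functions would then use only the wrappers,
so the #ifdef appears once in the struct and once per helper.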

Paolo



* Re: [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak
  2021-01-27 20:55                 ` Paolo Bonzini
@ 2021-01-27 21:20                   ` Ben Gardon
  2021-01-28  8:18                     ` Paolo Bonzini
  0 siblings, 1 reply; 70+ messages in thread
From: Ben Gardon @ 2021-01-27 21:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, LKML, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Jan 27, 2021 at 12:55 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 27/01/21 21:08, Ben Gardon wrote:
> > I'm not entirely sure I understand this suggestion. Are you suggesting
> > we'd have the spinlock and rwlock in a union in struct kvm but then
> > use a static define to choose which one is used by other functions? It
> > seems like if we're using static defines the union doesn't add value.
>
> Of course you're right.  You'd just place the #ifdef in the struct kvm
> definition.

Ah okay, thanks for clarifying.

>
> You can place static inline functions for lock/unlock in
> virt/kvm/mmu_lock.h, in order to avoid a proliferation of #ifdefs.

Would you prefer to make that change in this series or at a later
date? I'm assuming this would replace all the wrapper functions and
mean that x86 is rwlock only.

>
> Paolo
>


* Re: [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak
  2021-01-27 21:20                   ` Ben Gardon
@ 2021-01-28  8:18                     ` Paolo Bonzini
  0 siblings, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2021-01-28  8:18 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Sean Christopherson, LKML, kvm, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 27/01/21 22:20, Ben Gardon wrote:
> On Wed, Jan 27, 2021 at 12:55 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 27/01/21 21:08, Ben Gardon wrote:
>>> I'm not entirely sure I understand this suggestion. Are you suggesting
>>> we'd have the spinlock and rwlock in a union in struct kvm but then
>>> use a static define to choose which one is used by other functions? It
>>> seems like if we're using static defines the union doesn't add value.
>>
>> Of course you're right.  You'd just place the #ifdef in the struct kvm
>> definition.
> 
> Ah okay, thanks for clarifying.
> 
>>
>> You can place static inline functions for lock/unlock in
>> virt/kvm/mmu_lock.h, in order to avoid a proliferation of #ifdefs.
> 
> Would you prefer to make that change in this series or at a later
> date? I'm assuming this would replace all the wrapper functions and
> mean that x86 is rwlock only.

Yes, exactly.  I would like to make tdp_mmu=1 the default as soon as 
parallel page faults are in (and thus scalability should be on par with 
the shadow MMU).

Paolo



end of thread, other threads:[~2021-01-28  8:20 UTC | newest]

Thread overview: 70+ messages
2021-01-12 18:10 [PATCH 00/24] Allow parallel page faults with TDP MMU Ben Gardon
2021-01-12 18:10 ` [PATCH 01/24] locking/rwlocks: Add contention detection for rwlocks Ben Gardon
2021-01-12 18:10 ` [PATCH 02/24] sched: Add needbreak " Ben Gardon
2021-01-12 18:10 ` [PATCH 03/24] sched: Add cond_resched_rwlock Ben Gardon
2021-01-12 18:10 ` [PATCH 04/24] kvm: x86/mmu: change TDP MMU yield function returns to match cond_resched Ben Gardon
2021-01-20 18:38   ` Sean Christopherson
2021-01-21 20:22     ` Paolo Bonzini
2021-01-26 14:11     ` Paolo Bonzini
2021-01-12 18:10 ` [PATCH 05/24] kvm: x86/mmu: Fix yielding in TDP MMU Ben Gardon
2021-01-20 19:28   ` Sean Christopherson
2021-01-22  1:06     ` Ben Gardon
2021-01-12 18:10 ` [PATCH 06/24] kvm: x86/mmu: Skip no-op changes in TDP MMU functions Ben Gardon
2021-01-20 19:51   ` Sean Christopherson
2021-01-25 23:51     ` Ben Gardon
2021-01-12 18:10 ` [PATCH 07/24] kvm: x86/mmu: Add comment on __tdp_mmu_set_spte Ben Gardon
2021-01-26 14:13   ` Paolo Bonzini
2021-01-12 18:10 ` [PATCH 08/24] kvm: x86/mmu: Add lockdep when setting a TDP MMU SPTE Ben Gardon
2021-01-20 19:58   ` Sean Christopherson
2021-01-26 14:13   ` Paolo Bonzini
2021-01-12 18:10 ` [PATCH 09/24] kvm: x86/mmu: Don't redundantly clear TDP MMU pt memory Ben Gardon
2021-01-20 20:06   ` Sean Christopherson
2021-01-26 14:14   ` Paolo Bonzini
2021-01-12 18:10 ` [PATCH 10/24] kvm: x86/mmu: Factor out handle disconnected pt Ben Gardon
2021-01-20 20:30   ` Sean Christopherson
2021-01-26 14:14   ` Paolo Bonzini
2021-01-12 18:10 ` [PATCH 11/24] kvm: x86/mmu: Put TDP MMU PT walks in RCU read-critical section Ben Gardon
2021-01-20 22:19   ` Sean Christopherson
2021-01-12 18:10 ` [PATCH 12/24] kvm: x86/kvm: RCU dereference tdp mmu page table links Ben Gardon
2021-01-22 18:32   ` Sean Christopherson
2021-01-26 18:17     ` Ben Gardon
2021-01-12 18:10 ` [PATCH 13/24] kvm: x86/mmu: Only free tdp_mmu pages after a grace period Ben Gardon
2021-01-12 18:10 ` [PATCH 14/24] kvm: mmu: Wrap mmu_lock lock / unlock in a function Ben Gardon
2021-01-12 18:10 ` [PATCH 15/24] kvm: mmu: Wrap mmu_lock cond_resched and needbreak Ben Gardon
2021-01-21  0:19   ` Sean Christopherson
2021-01-21 20:17     ` Paolo Bonzini
2021-01-26 14:38     ` Paolo Bonzini
2021-01-26 17:47       ` Ben Gardon
2021-01-26 17:55         ` Paolo Bonzini
2021-01-26 18:11           ` Ben Gardon
2021-01-26 20:47             ` Paolo Bonzini
2021-01-27 20:08               ` Ben Gardon
2021-01-27 20:55                 ` Paolo Bonzini
2021-01-27 21:20                   ` Ben Gardon
2021-01-28  8:18                     ` Paolo Bonzini
2021-01-12 18:10 ` [PATCH 16/24] kvm: mmu: Wrap mmu_lock assertions Ben Gardon
2021-01-26 14:29   ` Paolo Bonzini
2021-01-12 18:10 ` [PATCH 17/24] kvm: mmu: Move mmu_lock to struct kvm_arch Ben Gardon
2021-01-12 18:10 ` [PATCH 18/24] kvm: x86/mmu: Use an rwlock for the x86 TDP MMU Ben Gardon
2021-01-21  0:45   ` Sean Christopherson
2021-01-12 18:10 ` [PATCH 19/24] kvm: x86/mmu: Protect tdp_mmu_pages with a lock Ben Gardon
2021-01-21 19:22   ` Sean Christopherson
2021-01-21 21:32     ` Sean Christopherson
2021-01-26 14:27       ` Paolo Bonzini
2021-01-26 21:47         ` Ben Gardon
2021-01-26 22:02         ` Sean Christopherson
2021-01-26 22:09           ` Sean Christopherson
2021-01-27 12:40           ` Paolo Bonzini
2021-01-26 13:37   ` Paolo Bonzini
2021-01-26 21:07     ` Ben Gardon
2021-01-12 18:10 ` [PATCH 20/24] kvm: x86/mmu: Add atomic option for setting SPTEs Ben Gardon
2021-01-26 14:21   ` Paolo Bonzini
2021-01-12 18:10 ` [PATCH 21/24] kvm: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map Ben Gardon
2021-01-12 18:10 ` [PATCH 22/24] kvm: x86/mmu: Flush TLBs after zap in TDP MMU PF handler Ben Gardon
2021-01-21  0:05   ` Sean Christopherson
2021-01-12 18:10 ` [PATCH 23/24] kvm: x86/mmu: Freeze SPTEs in disconnected pages Ben Gardon
2021-01-12 18:10 ` [PATCH 24/24] kvm: x86/mmu: Allow parallel page faults for the TDP MMU Ben Gardon
2021-01-21  0:55   ` Sean Christopherson
2021-01-26 21:57     ` Ben Gardon
2021-01-27 17:14       ` Sean Christopherson
2021-01-26 13:37   ` Paolo Bonzini
