* [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU
@ 2021-12-13 22:59 David Matlack
  2021-12-13 22:59 ` [PATCH v1 01/13] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn David Matlack
                   ` (12 more replies)
  0 siblings, 13 replies; 55+ messages in thread
From: David Matlack @ 2021-12-13 22:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

This series implements Eager Page Splitting for the TDP MMU.

This is a follow-up to the RFC implementation [1] that incorporates
review feedback and bug fixes discovered during testing. See the "v1"
section below for a list of all changes.

"Eager Page Splitting" is an optimization that has been in use in Google
Cloud since 2016 to reduce the performance impact of live migration on
customer workloads. It was originally designed and implemented by Peter
Feiner <pfeiner@google.com>.

For background and performance motivation for this feature, please
see "RFC: KVM: x86/mmu: Eager Page Splitting" [2].

Implementation
==============

This series implements support for splitting all huge pages mapped by
the TDP MMU. Pages mapped by the shadow MMU are not split, although I
plan to add that support in a future patchset.

Eager page splitting is triggered in two ways:

- KVM_SET_USER_MEMORY_REGION ioctl: If this ioctl is invoked to enable
  dirty logging on a memslot and KVM_DIRTY_LOG_INITIALLY_SET is not
  enabled, KVM will attempt to split all huge pages in the memslot down
  to the 4K level.

- KVM_CLEAR_DIRTY_LOG ioctl: If this ioctl is invoked and
  KVM_DIRTY_LOG_INITIALLY_SET is enabled, KVM will attempt to split all
  huge pages cleared by the ioctl down to the 4K level before attempting
  to write-protect them.

Eager page splitting is enabled by default in both paths but can be
disabled with the writable module parameter
eagerly_split_huge_pages_for_dirty_logging.
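
For reference, the first trigger point looks roughly like the following,
condensed from the x86.c hunk in patch 9 (the CLEAR_DIRTY_LOG hook added
in patch 11 is analogous but only covers the GFN range being cleared):

  /* In kvm_mmu_slot_apply_flags(), when dirty logging is being enabled: */
  if (kvm_dirty_log_manual_protect_and_init_set(kvm))
          return;

  /*
   * Not using initially-all-set, so attempt to split every huge page in
   * the memslot down to 4K before it gets write-protected.
   */
  if (READ_ONCE(eagerly_split_huge_pages_for_dirty_logging))
          kvm_mmu_slot_try_split_huge_pages(kvm, new, PG_LEVEL_4K);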

Splitting for pages mapped by the TDP MMU is done under the MMU lock in
read mode. The lock is dropped and the thread rescheduled if contention
or need_resched() is detected.
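
The yield logic reuses the TDP MMU's existing pattern. Roughly (a
simplified sketch of the pre-existing tdp_mmu_iter_cond_resched()
helper, not code added by this series; the real helper also handles
pending TLB flushes and write-locked callers):

  if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
          /* Drop RCU and the MMU lock, reschedule, then restart the walk. */
          rcu_read_unlock();
          cond_resched_rwlock_read(&kvm->mmu_lock);
          rcu_read_lock();
          tdp_iter_restart(&iter);
  }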

To allocate memory for the lower-level page tables, we attempt to
allocate without dropping the MMU lock, using GFP_NOWAIT to avoid doing
direct reclaim or invoking filesystem callbacks. If that fails, we drop
the lock and perform a normal GFP_KERNEL allocation.
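
That allocation scheme is implemented by alloc_tdp_mmu_page_for_split()
in patch 9; condensed here for reference:

  sp = alloc_tdp_mmu_page_from_kernel(GFP_NOWAIT | __GFP_ACCOUNT);
  if (sp)
          return sp;

  /* Fall back: drop RCU and the MMU lock so direct reclaim is allowed. */
  rcu_read_unlock();
  read_unlock(&kvm->mmu_lock);

  *dropped_lock = true;
  sp = alloc_tdp_mmu_page_from_kernel(GFP_KERNEL_ACCOUNT);

  read_lock(&kvm->mmu_lock);
  rcu_read_lock();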

Performance
===========

Eager page splitting moves the cost of splitting huge pages off of the
vCPU thread and onto the thread invoking one of the aforementioned
ioctls. This is useful because:

 - Splitting on the vCPU thread interrupts vCPU execution and is
   disruptive to customers, whereas splitting on VM ioctl threads can
   run in parallel with vCPU execution.

 - Splitting on the VM ioctl thread is more efficient because it does
   not require performing VM-exit handling and page table walks for every
   4K page.

To measure the performance impact of Eager Page Splitting, I ran
dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB
HugeTLB memory.
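
For reference, that corresponds to an invocation along the lines of the
following (the exact selftest flags may vary slightly between kernel
trees, so treat this as illustrative):

  ./dirty_log_perf_test -v 96 -b 1G -s anonymous_hugetlb_1gb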

When KVM_DIRTY_LOG_INITIALLY_SET is set, we can see that the first
KVM_CLEAR_DIRTY_LOG iteration takes longer because KVM is splitting
huge pages, but the time it takes for vCPUs to dirty their memory
is significantly shorter since they do not have to take
write-protection faults.

           | Iteration 1 clear dirty log time | Iteration 2 dirty memory time
---------- | -------------------------------- | -----------------------------
Before     | 0.049572219s                     | 2.751442902s
After      | 1.667811687s                     | 0.127016504s

Eager page splitting does make subsequent KVM_CLEAR_DIRTY_LOG ioctls
about 4% slower since it always walks the page tables looking for pages
to split.  This can be avoided but will require extra memory and/or code
complexity to track when splitting can be skipped.

           | Iteration 3 clear dirty log time
---------- | --------------------------------
Before     | 1.374501209s
After      | 1.422478617s

When not using KVM_DIRTY_LOG_INITIALLY_SET, KVM splits the entire
memslot during the KVM_SET_USER_MEMORY_REGION ioctl that enables dirty
logging, which shows up as an increase in the time it takes to enable
dirty logging. vCPUs again avoid taking write-protection faults, which
is reflected in the much shorter dirty memory time.

           | Enabling dirty logging time      | Iteration 1 dirty memory time
---------- | -------------------------------- | -----------------------------
Before     | 0.001683739s                     | 2.943733325s
After      | 1.546904175s                     | 0.145979748s

Testing
=======

- Ran all kvm-unit-tests and KVM selftests on debug and non-debug kernels.

- Ran dirty_log_perf_test with different backing sources (anonymous,
  anonymous_thp, anonymous_hugetlb_2mb, anonymous_hugetlb_1gb) with and
  without Eager Page Splitting enabled.

- Added a tracepoint locally to time the GFP_NOWAIT allocations. Across
  40 runs of dirty_log_perf_test using 1GiB HugeTLB with 96 vCPUs there
  were only 4 allocations that took longer than 20 microseconds and the
  longest was 60 microseconds. None of the GFP_NOWAIT allocations
  failed.

- I have not been able to trigger a GFP_NOWAIT allocation failure (to
  exercise the fallback path). However, I did manually modify the code
  to force every allocation to fall back, by removing the GFP_NOWAIT
  allocation altogether, to make sure the fallback logic works correctly.

Version Log
===========

v1:

[Overall Changes]
 - Use "huge page" instead of "large page" [Sean Christopherson]

[RFC PATCH 02/15] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect
 - Add Ben's Reviewed-by.
 - Add Peter's Reviewed-by.

[RFC PATCH 03/15] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
 - Add comment when updating old_spte [Ben Gardon]
 - Follow kernel style of else case in zap_gfn_range [Ben Gardon]
 - Don't delete old_spte update after zapping in kvm_tdp_mmu_map [me]

[RFC PATCH 04/15] KVM: x86/mmu: Factor out logic to atomically install a new page table
 - Add blurb to commit message describing intentional drop of tracepoint [Ben Gardon]
 - Consolidate "u64 spte = make_nonleaf_spte(...);" onto one line [Sean Christopherson]
 - Do not free the sp if set fails  [Sean Christopherson]

[RFC PATCH 05/15] KVM: x86/mmu: Abstract mmu caches out to a separate struct
 - Drop to adopt Sean's proposed allocation scheme.

[RFC PATCH 06/15] KVM: x86/mmu: Derive page role from parent
 - No changes.

[RFC PATCH 07/15] KVM: x86/mmu: Pass in vcpu->arch.mmu_caches instead of vcpu
 - Drop to adopt Sean's proposed allocation scheme.

[RFC PATCH 08/15] KVM: x86/mmu: Helper method to check for large and present sptes
 - Drop this commit and the helper function [Sean Christopherson]

[RFC PATCH 09/15] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
 - Add Ben's Reviewed-by.

[RFC PATCH 10/15] KVM: x86/mmu: Abstract need_resched logic from tdp_mmu_iter_cond_resched
 - Drop to adopt Sean's proposed allocation scheme.

[RFC PATCH 11/15] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
 - Add Ben's Reviewed-by.

[RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled
 - Add a module parameter to control Eager Page Splitting [Peter Xu]
 - Change level to large_spte_level [Ben Gardon]
 - Get rid of BUG_ONs [Ben Gardon]
 - Change += to |= and add a comment [Ben Gardon]
 - Do not flush TLBs when dropping the MMU lock. [Sean Christopherson]
 - Allocate memory directly from the kernel instead of using mmu_caches [Sean Christopherson]

[RFC PATCH 13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
 - Fix deadlock by refactoring MMU locking and dropping write lock before splitting. [kernel test robot]
 - Did not follow Sean's suggestion of skipping write-protection if splitting
   succeeds, as it would require extra complexity since we aren't splitting
   pages in the shadow MMU yet.

[RFC PATCH 14/15] KVM: x86/mmu: Add tracepoint for splitting large pages
 - No changes.

[RFC PATCH 15/15] KVM: x86/mmu: Update page stats when splitting large pages
 - Squash into patch that first introduces page splitting.

Note: I opted not to change TDP MMU functions to return int instead of
bool per Sean's suggestion. I agree this change should be done, but it
can be left to a separate series.

RFC: https://lore.kernel.org/kvm/20211119235759.1304274-1-dmatlack@google.com/

[1] https://lore.kernel.org/kvm/20211119235759.1304274-1-dmatlack@google.com/
[2] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t

David Matlack (13):
  KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn
  KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect
  KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
  KVM: x86/mmu: Factor out logic to atomically install a new page table
  KVM: x86/mmu: Move restore_acc_track_spte to spte.c
  KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
  KVM: x86/mmu: Derive page role from parent
  KVM: x86/mmu: Refactor TDP MMU child page initialization
  KVM: x86/mmu: Split huge pages when dirty logging is enabled
  KVM: Push MMU locking down into
    kvm_arch_mmu_enable_log_dirty_pt_masked
  KVM: x86/mmu: Split huge pages during CLEAR_DIRTY_LOG
  KVM: x86/mmu: Add tracepoint for splitting huge pages
  KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and
    INITIALLY_SET

 arch/arm64/kvm/mmu.c                          |   2 +
 arch/mips/kvm/mmu.c                           |   5 +-
 arch/riscv/kvm/mmu.c                          |   2 +
 arch/x86/include/asm/kvm_host.h               |   7 +
 arch/x86/kvm/mmu/mmu.c                        |  78 ++--
 arch/x86/kvm/mmu/mmutrace.h                   |  20 ++
 arch/x86/kvm/mmu/spte.c                       |  77 ++++
 arch/x86/kvm/mmu/spte.h                       |   2 +
 arch/x86/kvm/mmu/tdp_iter.c                   |   5 +-
 arch/x86/kvm/mmu/tdp_iter.h                   |  10 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    | 340 ++++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h                    |   5 +
 arch/x86/kvm/x86.c                            |  10 +
 arch/x86/kvm/x86.h                            |   2 +
 .../selftests/kvm/dirty_log_perf_test.c       |  10 +-
 virt/kvm/dirty_ring.c                         |   2 -
 virt/kvm/kvm_main.c                           |   4 -
 17 files changed, 465 insertions(+), 116 deletions(-)


base-commit: 1c10f4b4877ffaed602d12ff8cbbd5009e82c970
-- 
2.34.1.173.g76aa8bc2d0-goog



* [PATCH v1 01/13] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn
  2021-12-13 22:59 [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
@ 2021-12-13 22:59 ` David Matlack
  2022-01-06  0:35   ` Sean Christopherson
  2021-12-13 22:59 ` [PATCH v1 02/13] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect David Matlack
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: David Matlack @ 2021-12-13 22:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

rmap_write_protect is a poor name because we may not even touch the rmap
if the TDP MMU is in use. It is also confusing that rmap_write_protect
is not a simple wrapper around __rmap_write_protect, since that is the
typical relationship between a function and its double-underscore variant.

Rename it to kvm_vcpu_write_protect_gfn to convey that we are
write-protecting a specific gfn in the context of a vCPU.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1ccee4d17481..87c3135222b3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1421,7 +1421,7 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	return write_protected;
 }
 
-static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+static bool kvm_vcpu_write_protect_gfn(struct kvm_vcpu *vcpu, u64 gfn)
 {
 	struct kvm_memory_slot *slot;
 
@@ -2024,7 +2024,7 @@ static int mmu_sync_children(struct kvm_vcpu *vcpu,
 		bool protected = false;
 
 		for_each_sp(pages, sp, parents, i)
-			protected |= rmap_write_protect(vcpu, sp->gfn);
+			protected |= kvm_vcpu_write_protect_gfn(vcpu, sp->gfn);
 
 		if (protected) {
 			kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, true);
@@ -2149,7 +2149,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	hlist_add_head(&sp->hash_link, sp_list);
 	if (!direct) {
 		account_shadowed(vcpu->kvm, sp);
-		if (level == PG_LEVEL_4K && rmap_write_protect(vcpu, gfn))
+		if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
 	trace_kvm_mmu_get_page(sp, true);

base-commit: 1c10f4b4877ffaed602d12ff8cbbd5009e82c970
-- 
2.34.1.173.g76aa8bc2d0-goog



* [PATCH v1 02/13] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect
  2021-12-13 22:59 [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
  2021-12-13 22:59 ` [PATCH v1 01/13] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn David Matlack
@ 2021-12-13 22:59 ` David Matlack
  2022-01-06  0:35   ` Sean Christopherson
  2021-12-13 22:59 ` [PATCH v1 03/13] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails David Matlack
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: David Matlack @ 2021-12-13 22:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Now that rmap_write_protect has been renamed, there is no need for the
double underscores in front of __rmap_write_protect.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 87c3135222b3..8b702f2b6a70 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1229,9 +1229,9 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
 	return mmu_spte_update(sptep, spte);
 }
 
-static bool __rmap_write_protect(struct kvm *kvm,
-				 struct kvm_rmap_head *rmap_head,
-				 bool pt_protect)
+static bool rmap_write_protect(struct kvm *kvm,
+			       struct kvm_rmap_head *rmap_head,
+			       bool pt_protect)
 {
 	u64 *sptep;
 	struct rmap_iterator iter;
@@ -1311,7 +1311,7 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 	while (mask) {
 		rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
 					PG_LEVEL_4K, slot);
-		__rmap_write_protect(kvm, rmap_head, false);
+		rmap_write_protect(kvm, rmap_head, false);
 
 		/* clear the first set bit */
 		mask &= mask - 1;
@@ -1410,7 +1410,7 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	if (kvm_memslots_have_rmaps(kvm)) {
 		for (i = min_level; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
 			rmap_head = gfn_to_rmap(gfn, i, slot);
-			write_protected |= __rmap_write_protect(kvm, rmap_head, true);
+			write_protected |= rmap_write_protect(kvm, rmap_head, true);
 		}
 	}
 
@@ -5787,7 +5787,7 @@ static bool slot_rmap_write_protect(struct kvm *kvm,
 				    struct kvm_rmap_head *rmap_head,
 				    const struct kvm_memory_slot *slot)
 {
-	return __rmap_write_protect(kvm, rmap_head, false);
+	return rmap_write_protect(kvm, rmap_head, false);
 }
 
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
-- 
2.34.1.173.g76aa8bc2d0-goog



* [PATCH v1 03/13] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
  2021-12-13 22:59 [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
  2021-12-13 22:59 ` [PATCH v1 01/13] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn David Matlack
  2021-12-13 22:59 ` [PATCH v1 02/13] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect David Matlack
@ 2021-12-13 22:59 ` David Matlack
  2022-01-04 10:13   ` Peter Xu
  2022-01-06  0:54   ` Sean Christopherson
  2021-12-13 22:59 ` [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table David Matlack
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 55+ messages in thread
From: David Matlack @ 2021-12-13 22:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Consolidate a bunch of code that was manually re-reading the spte when
the cmpxchg failed. There is no extra cost to doing this because we
already have the spte value as a result of the cmpxchg (and in fact this
eliminates re-reading the spte), and none of the call sites depend on
iter->old_spte retaining the stale spte value.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 50 ++++++++++++++++----------------------
 1 file changed, 21 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b69e47e68307..656ebf5b20dc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -492,16 +492,22 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
  * and handle the associated bookkeeping.  Do not mark the page dirty
  * in KVM's dirty bitmaps.
  *
+ * If setting the SPTE fails because it has changed, iter->old_spte will be
+ * updated with the updated value of the spte.
+ *
  * @kvm: kvm instance
  * @iter: a tdp_iter instance currently on the SPTE that should be set
  * @new_spte: The value the SPTE should be set to
  * Returns: true if the SPTE was set, false if it was not. If false is returned,
- *	    this function will have no side-effects.
+ *          this function will have no side-effects other than updating
+ *          iter->old_spte to the latest value of spte.
  */
 static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 					   struct tdp_iter *iter,
 					   u64 new_spte)
 {
+	u64 old_spte;
+
 	lockdep_assert_held_read(&kvm->mmu_lock);
 
 	/*
@@ -515,9 +521,15 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
 	 * does not hold the mmu_lock.
 	 */
-	if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
-		      new_spte) != iter->old_spte)
+	old_spte = cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte, new_spte);
+	if (old_spte != iter->old_spte) {
+		/*
+		 * The cmpxchg failed because the spte was updated by another
+		 * thread so record the updated spte in old_spte.
+		 */
+		iter->old_spte = old_spte;
 		return false;
+	}
 
 	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
 			      new_spte, iter->level, true);
@@ -748,11 +760,6 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			tdp_mmu_set_spte(kvm, &iter, 0);
 			flush = true;
 		} else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
-			/*
-			 * The iter must explicitly re-read the SPTE because
-			 * the atomic cmpxchg failed.
-			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
 			goto retry;
 		}
 	}
@@ -985,6 +992,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			 * path below.
 			 */
 			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+
 		}
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
@@ -1190,14 +1198,9 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
 
-		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte)) {
-			/*
-			 * The iter must explicitly re-read the SPTE because
-			 * the atomic cmpxchg failed.
-			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
 			goto retry;
-		}
+
 		spte_set = true;
 	}
 
@@ -1258,14 +1261,9 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 				continue;
 		}
 
-		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte)) {
-			/*
-			 * The iter must explicitly re-read the SPTE because
-			 * the atomic cmpxchg failed.
-			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+		if (!tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
 			goto retry;
-		}
+
 		spte_set = true;
 	}
 
@@ -1389,14 +1387,8 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 			continue;
 
 		/* Note, a successful atomic zap also does a remote TLB flush. */
-		if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
-			/*
-			 * The iter must explicitly re-read the SPTE because
-			 * the atomic cmpxchg failed.
-			 */
-			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+		if (!tdp_mmu_zap_spte_atomic(kvm, &iter))
 			goto retry;
-		}
 	}
 
 	rcu_read_unlock();
-- 
2.34.1.173.g76aa8bc2d0-goog



* [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table
  2021-12-13 22:59 [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (2 preceding siblings ...)
  2021-12-13 22:59 ` [PATCH v1 03/13] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails David Matlack
@ 2021-12-13 22:59 ` David Matlack
  2022-01-04 10:32   ` Peter Xu
  2022-01-06 20:12   ` Sean Christopherson
  2021-12-13 22:59 ` [PATCH v1 05/13] KVM: x86/mmu: Move restore_acc_track_spte to spte.c David Matlack
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 55+ messages in thread
From: David Matlack @ 2021-12-13 22:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Factor out the logic to atomically replace an SPTE with an SPTE that
points to a new page table. This will be used in a follow-up commit to
split huge page SPTEs into page tables one level lower.

Opportunistically drop the kvm_mmu_get_page tracepoint in
kvm_tdp_mmu_map() since it is redundant with the identical tracepoint in
alloc_tdp_mmu_page().

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 48 +++++++++++++++++++++++++++-----------
 1 file changed, 34 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 656ebf5b20dc..dbd07c10d11a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -950,6 +950,36 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	return ret;
 }
 
+/*
+ * tdp_mmu_install_sp_atomic - Atomically replace the given spte with an
+ * spte pointing to the provided page table.
+ *
+ * @kvm: kvm instance
+ * @iter: a tdp_iter instance currently on the SPTE that should be set
+ * @sp: The new TDP page table to install.
+ * @account_nx: True if this page table is being installed to split a
+ *              non-executable huge page.
+ *
+ * Returns: True if the new page table was installed. False if spte being
+ *          replaced changed, causing the atomic compare-exchange to fail.
+ *          If this function returns false the sp will be freed before
+ *          returning.
+ */
+static bool tdp_mmu_install_sp_atomic(struct kvm *kvm,
+				      struct tdp_iter *iter,
+				      struct kvm_mmu_page *sp,
+				      bool account_nx)
+{
+	u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
+
+	if (!tdp_mmu_set_spte_atomic(kvm, iter, spte))
+		return false;
+
+	tdp_mmu_link_page(kvm, sp, account_nx);
+
+	return true;
+}
+
 /*
  * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
  * page tables and SPTEs to translate the faulting guest physical address.
@@ -959,8 +989,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
-	u64 *child_pt;
-	u64 new_spte;
 	int ret;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -996,6 +1024,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		}
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
+			bool account_nx = fault->huge_page_disallowed &&
+					  fault->req_level >= iter.level;
+
 			/*
 			 * If SPTE has been frozen by another thread, just
 			 * give up and retry, avoiding unnecessary page table
@@ -1005,18 +1036,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 				break;
 
 			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
-			child_pt = sp->spt;
-
-			new_spte = make_nonleaf_spte(child_pt,
-						     !shadow_accessed_mask);
-
-			if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
-				tdp_mmu_link_page(vcpu->kvm, sp,
-						  fault->huge_page_disallowed &&
-						  fault->req_level >= iter.level);
-
-				trace_kvm_mmu_get_page(sp, true);
-			} else {
+			if (!tdp_mmu_install_sp_atomic(vcpu->kvm, &iter, sp, account_nx)) {
 				tdp_mmu_free_sp(sp);
 				break;
 			}
-- 
2.34.1.173.g76aa8bc2d0-goog



* [PATCH v1 05/13] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
  2021-12-13 22:59 [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (3 preceding siblings ...)
  2021-12-13 22:59 ` [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table David Matlack
@ 2021-12-13 22:59 ` David Matlack
  2022-01-04 10:33   ` Peter Xu
  2022-01-06 20:27   ` Sean Christopherson
  2021-12-13 22:59 ` [PATCH v1 06/13] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root David Matlack
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 55+ messages in thread
From: David Matlack @ 2021-12-13 22:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

restore_acc_track_spte is purely an SPTE manipulation, making it a good
fit for spte.c. It is also needed in spte.c in a follow-up commit so we
can construct child SPTEs during huge page splitting.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c  | 18 ------------------
 arch/x86/kvm/mmu/spte.c | 18 ++++++++++++++++++
 arch/x86/kvm/mmu/spte.h |  1 +
 3 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8b702f2b6a70..3c2cb4dd1f11 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -646,24 +646,6 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
 	return __get_spte_lockless(sptep);
 }
 
-/* Restore an acc-track PTE back to a regular PTE */
-static u64 restore_acc_track_spte(u64 spte)
-{
-	u64 new_spte = spte;
-	u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
-			 & SHADOW_ACC_TRACK_SAVED_BITS_MASK;
-
-	WARN_ON_ONCE(spte_ad_enabled(spte));
-	WARN_ON_ONCE(!is_access_track_spte(spte));
-
-	new_spte &= ~shadow_acc_track_mask;
-	new_spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
-		      SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
-	new_spte |= saved_bits;
-
-	return new_spte;
-}
-
 /* Returns the Accessed status of the PTE and resets it at the same time. */
 static bool mmu_spte_age(u64 *sptep)
 {
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 8a7b03207762..fd34ae5d6940 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -268,6 +268,24 @@ u64 mark_spte_for_access_track(u64 spte)
 	return spte;
 }
 
+/* Restore an acc-track PTE back to a regular PTE */
+u64 restore_acc_track_spte(u64 spte)
+{
+	u64 new_spte = spte;
+	u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
+			 & SHADOW_ACC_TRACK_SAVED_BITS_MASK;
+
+	WARN_ON_ONCE(spte_ad_enabled(spte));
+	WARN_ON_ONCE(!is_access_track_spte(spte));
+
+	new_spte &= ~shadow_acc_track_mask;
+	new_spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
+		      SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
+	new_spte |= saved_bits;
+
+	return new_spte;
+}
+
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
 {
 	BUG_ON((u64)(unsigned)access_mask != access_mask);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index a4af2a42695c..9b0c7b27f23f 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -337,6 +337,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
+u64 restore_acc_track_spte(u64 spte);
 u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
 
 void kvm_mmu_reset_all_pte_masks(void);
-- 
2.34.1.173.g76aa8bc2d0-goog



* [PATCH v1 06/13] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
  2021-12-13 22:59 [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (4 preceding siblings ...)
  2021-12-13 22:59 ` [PATCH v1 05/13] KVM: x86/mmu: Move restore_acc_track_spte to spte.c David Matlack
@ 2021-12-13 22:59 ` David Matlack
  2022-01-04 10:35   ` Peter Xu
  2022-01-06 20:34   ` Sean Christopherson
  2021-12-13 22:59 ` [PATCH v1 07/13] KVM: x86/mmu: Derive page role from parent David Matlack
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 55+ messages in thread
From: David Matlack @ 2021-12-13 22:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Instead of passing a pointer to the root page table and the root level
separately, pass in a pointer to the kvm_mmu_page that backs the root.
This reduces the number of arguments by 1, cutting down on line lengths.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_iter.c |  5 ++++-
 arch/x86/kvm/mmu/tdp_iter.h | 10 +++++-----
 arch/x86/kvm/mmu/tdp_mmu.c  | 14 +++++---------
 3 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index b3ed302c1a35..92b3a075525a 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -39,9 +39,12 @@ void tdp_iter_restart(struct tdp_iter *iter)
  * Sets a TDP iterator to walk a pre-order traversal of the paging structure
  * rooted at root_pt, starting with the walk to translate next_last_level_gfn.
  */
-void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
+void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
 		    int min_level, gfn_t next_last_level_gfn)
 {
+	u64 *root_pt = root->spt;
+	int root_level = root->role.level;
+
 	WARN_ON(root_level < 1);
 	WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
 
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index b1748b988d3a..ec1f58013428 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -51,17 +51,17 @@ struct tdp_iter {
  * Iterates over every SPTE mapping the GFN range [start, end) in a
  * preorder traversal.
  */
-#define for_each_tdp_pte_min_level(iter, root, root_level, min_level, start, end) \
-	for (tdp_iter_start(&iter, root, root_level, min_level, start); \
+#define for_each_tdp_pte_min_level(iter, root, min_level, start, end) \
+	for (tdp_iter_start(&iter, root, min_level, start); \
 	     iter.valid && iter.gfn < end;		     \
 	     tdp_iter_next(&iter))
 
-#define for_each_tdp_pte(iter, root, root_level, start, end) \
-	for_each_tdp_pte_min_level(iter, root, root_level, PG_LEVEL_4K, start, end)
+#define for_each_tdp_pte(iter, root, start, end) \
+	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end)
 
 tdp_ptep_t spte_to_child_pt(u64 pte, int level);
 
-void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
+void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
 		    int min_level, gfn_t next_last_level_gfn);
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_restart(struct tdp_iter *iter);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index dbd07c10d11a..2fb2d7677fbf 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -632,7 +632,7 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 }
 
 #define tdp_root_for_each_pte(_iter, _root, _start, _end) \
-	for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)
+	for_each_tdp_pte(_iter, _root, _start, _end)
 
 #define tdp_root_for_each_leaf_pte(_iter, _root, _start, _end)	\
 	tdp_root_for_each_pte(_iter, _root, _start, _end)		\
@@ -642,8 +642,7 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 		else
 
 #define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end)		\
-	for_each_tdp_pte(_iter, __va(_mmu->root_hpa),		\
-			 _mmu->shadow_root_level, _start, _end)
+	for_each_tdp_pte(_iter, to_shadow_page(_mmu->root_hpa), _start, _end)
 
 /*
  * Yield if the MMU lock is contended or this thread needs to return control
@@ -733,8 +732,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	rcu_read_lock();
 
-	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
-				   min_level, start, end) {
+	for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
 retry:
 		if (can_yield &&
 		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
@@ -1205,8 +1203,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
 
-	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
-				   min_level, start, end) {
+	for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
 retry:
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
@@ -1445,8 +1442,7 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	rcu_read_lock();
 
-	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
-				   min_level, gfn, gfn + 1) {
+	for_each_tdp_pte_min_level(iter, root, min_level, gfn, gfn + 1) {
 		if (!is_shadow_present_pte(iter.old_spte) ||
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
-- 
2.34.1.173.g76aa8bc2d0-goog



* [PATCH v1 07/13] KVM: x86/mmu: Derive page role from parent
  2021-12-13 22:59 [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (5 preceding siblings ...)
  2021-12-13 22:59 ` [PATCH v1 06/13] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root David Matlack
@ 2021-12-13 22:59 ` David Matlack
  2022-01-05  7:51   ` Peter Xu
  2022-01-06 20:45   ` Sean Christopherson
  2021-12-13 22:59 ` [PATCH v1 08/13] KVM: x86/mmu: Refactor TDP MMU child page initialization David Matlack
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 55+ messages in thread
From: David Matlack @ 2021-12-13 22:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Derive the page role from the parent shadow page, since the only thing
that changes is the level. This is in preparation for eagerly splitting
huge pages during VM ioctls, which do not have access to the vCPU
MMU context.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 43 ++++++++++++++++++++------------------
 1 file changed, 23 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2fb2d7677fbf..582d9a798899 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -157,23 +157,8 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		if (kvm_mmu_page_as_id(_root) != _as_id) {		\
 		} else
 
-static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
-						   int level)
-{
-	union kvm_mmu_page_role role;
-
-	role = vcpu->arch.mmu->mmu_role.base;
-	role.level = level;
-	role.direct = true;
-	role.has_4_byte_gpte = false;
-	role.access = ACC_ALL;
-	role.ad_disabled = !shadow_accessed_mask;
-
-	return role;
-}
-
 static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					       int level)
+					       union kvm_mmu_page_role role)
 {
 	struct kvm_mmu_page *sp;
 
@@ -181,7 +166,7 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
-	sp->role.word = page_role_for_level(vcpu, level).word;
+	sp->role = role;
 	sp->gfn = gfn;
 	sp->tdp_mmu_page = true;
 
@@ -190,6 +175,19 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return sp;
 }
 
+static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, struct tdp_iter *iter)
+{
+	struct kvm_mmu_page *parent_sp;
+	union kvm_mmu_page_role role;
+
+	parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
+
+	role = parent_sp->role;
+	role.level--;
+
+	return alloc_tdp_mmu_page(vcpu, iter->gfn, role);
+}
+
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 {
 	union kvm_mmu_page_role role;
@@ -198,7 +196,12 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
-	role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
+	role = vcpu->arch.mmu->mmu_role.base;
+	role.level = vcpu->arch.mmu->shadow_root_level;
+	role.direct = true;
+	role.has_4_byte_gpte = false;
+	role.access = ACC_ALL;
+	role.ad_disabled = !shadow_accessed_mask;
 
 	/* Check for an existing root before allocating a new one. */
 	for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
@@ -207,7 +210,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 			goto out;
 	}
 
-	root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
+	root = alloc_tdp_mmu_page(vcpu, 0, role);
 	refcount_set(&root->tdp_mmu_root_count, 1);
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1033,7 +1036,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			if (is_removed_spte(iter.old_spte))
 				break;
 
-			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
+			sp = alloc_child_tdp_mmu_page(vcpu, &iter);
 			if (!tdp_mmu_install_sp_atomic(vcpu->kvm, &iter, sp, account_nx)) {
 				tdp_mmu_free_sp(sp);
 				break;
-- 
2.34.1.173.g76aa8bc2d0-goog



* [PATCH v1 08/13] KVM: x86/mmu: Refactor TDP MMU child page initialization
  2021-12-13 22:59 [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (6 preceding siblings ...)
  2021-12-13 22:59 ` [PATCH v1 07/13] KVM: x86/mmu: Derive page role from parent David Matlack
@ 2021-12-13 22:59 ` David Matlack
  2022-01-05  7:51   ` Peter Xu
  2022-01-06 20:59   ` Sean Christopherson
  2021-12-13 22:59 ` [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled David Matlack
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 55+ messages in thread
From: David Matlack @ 2021-12-13 22:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Separate the allocation of child pages from the initialization. This is
in preparation for doing page splitting outside of the vCPU fault
context which requires a different allocation mechanism.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 582d9a798899..a8354d8578f1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -157,13 +157,18 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		if (kvm_mmu_page_as_id(_root) != _as_id) {		\
 		} else
 
-static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					       union kvm_mmu_page_role role)
+static struct kvm_mmu_page *alloc_tdp_mmu_page_from_caches(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu_page *sp;
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+
+	return sp;
+}
+
+static void init_tdp_mmu_page(struct kvm_mmu_page *sp, gfn_t gfn, union kvm_mmu_page_role role)
+{
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
 	sp->role = role;
@@ -171,11 +176,9 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 	sp->tdp_mmu_page = true;
 
 	trace_kvm_mmu_get_page(sp, true);
-
-	return sp;
 }
 
-static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, struct tdp_iter *iter)
+static void init_child_tdp_mmu_page(struct kvm_mmu_page *child_sp, struct tdp_iter *iter)
 {
 	struct kvm_mmu_page *parent_sp;
 	union kvm_mmu_page_role role;
@@ -185,7 +188,17 @@ static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, stru
 	role = parent_sp->role;
 	role.level--;
 
-	return alloc_tdp_mmu_page(vcpu, iter->gfn, role);
+	init_tdp_mmu_page(child_sp, iter->gfn, role);
+}
+
+static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, struct tdp_iter *iter)
+{
+	struct kvm_mmu_page *child_sp;
+
+	child_sp = alloc_tdp_mmu_page_from_caches(vcpu);
+	init_child_tdp_mmu_page(child_sp, iter);
+
+	return child_sp;
 }
 
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
@@ -210,7 +223,10 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 			goto out;
 	}
 
-	root = alloc_tdp_mmu_page(vcpu, 0, role);
+	root = alloc_tdp_mmu_page_from_caches(vcpu);
+
+	init_tdp_mmu_page(root, 0, role);
+
 	refcount_set(&root->tdp_mmu_root_count, 1);
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
-- 
2.34.1.173.g76aa8bc2d0-goog



* [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled
  2021-12-13 22:59 [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (7 preceding siblings ...)
  2021-12-13 22:59 ` [PATCH v1 08/13] KVM: x86/mmu: Refactor TDP MMU child page initialization David Matlack
@ 2021-12-13 22:59 ` David Matlack
  2022-01-05  7:54   ` Peter Xu
                     ` (2 more replies)
  2021-12-13 22:59 ` [PATCH v1 10/13] KVM: Push MMU locking down into kvm_arch_mmu_enable_log_dirty_pt_masked David Matlack
                   ` (3 subsequent siblings)
  12 siblings, 3 replies; 55+ messages in thread
From: David Matlack @ 2021-12-13 22:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

When dirty logging is enabled without initially-all-set, attempt to
split all huge pages in the memslot down to 4KB pages so that vCPUs
do not have to take expensive write-protection faults to split huge
pages.

Huge page splitting is best-effort only. This commit only adds support
for the TDP MMU, and even there splitting may fail due to out-of-memory
conditions. Failure to split a huge page is fine from a correctness
standpoint because we still always follow it up by write-protecting any
remaining huge pages.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/kvm/mmu/mmu.c          |  14 +++
 arch/x86/kvm/mmu/spte.c         |  59 ++++++++++++
 arch/x86/kvm/mmu/spte.h         |   1 +
 arch/x86/kvm/mmu/tdp_mmu.c      | 165 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      |   5 +
 arch/x86/kvm/x86.c              |  10 ++
 7 files changed, 257 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e863d569c89a..4a507109e886 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1573,6 +1573,9 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 				      const struct kvm_memory_slot *memslot,
 				      int start_level);
+void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
+				       const struct kvm_memory_slot *memslot,
+				       int target_level);
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c2cb4dd1f11..9116c6a4ced1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5807,6 +5807,20 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
+void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
+				       const struct kvm_memory_slot *memslot,
+				       int target_level)
+{
+	u64 start = memslot->base_gfn;
+	u64 end = start + memslot->npages;
+
+	if (is_tdp_mmu_enabled(kvm)) {
+		read_lock(&kvm->mmu_lock);
+		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
+		read_unlock(&kvm->mmu_lock);
+	}
+}
+
 static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 					 struct kvm_rmap_head *rmap_head,
 					 const struct kvm_memory_slot *slot)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index fd34ae5d6940..11d0b3993ba5 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -191,6 +191,65 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	return wrprot;
 }
 
+static u64 mark_spte_executable(u64 spte)
+{
+	bool is_access_track = is_access_track_spte(spte);
+
+	if (is_access_track)
+		spte = restore_acc_track_spte(spte);
+
+	spte &= ~shadow_nx_mask;
+	spte |= shadow_x_mask;
+
+	if (is_access_track)
+		spte = mark_spte_for_access_track(spte);
+
+	return spte;
+}
+
+/*
+ * Construct an SPTE that maps a sub-page of the given huge page SPTE where
+ * `index` identifies which sub-page.
+ *
+ * This is used during huge page splitting to build the SPTEs that make up the
+ * new page table.
+ */
+u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index, unsigned int access)
+{
+	u64 child_spte;
+	int child_level;
+
+	if (WARN_ON(is_mmio_spte(huge_spte)))
+		return 0;
+
+	if (WARN_ON(!is_shadow_present_pte(huge_spte)))
+		return 0;
+
+	if (WARN_ON(!is_large_pte(huge_spte)))
+		return 0;
+
+	child_spte = huge_spte;
+	child_level = huge_level - 1;
+
+	/*
+	 * The child_spte already has the base address of the huge page being
+	 * split. So we just have to OR in the offset to the page at the next
+	 * lower level for the given index.
+	 */
+	child_spte |= (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;
+
+	if (child_level == PG_LEVEL_4K) {
+		child_spte &= ~PT_PAGE_SIZE_MASK;
+
+		/* Allow execution for 4K pages if it was disabled for NX HugePages. */
+		if (is_nx_huge_page_enabled() && access & ACC_EXEC_MASK)
+			child_spte = mark_spte_executable(child_spte);
+	}
+
+	return child_spte;
+}
+
+
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
 {
 	u64 spte = SPTE_MMU_PRESENT_MASK;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 9b0c7b27f23f..e13f335b4fef 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -334,6 +334,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte);
+u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index, unsigned int access);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a8354d8578f1..be5eb74ac053 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1264,6 +1264,171 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 	return spte_set;
 }
 
+static struct kvm_mmu_page *alloc_tdp_mmu_page_from_kernel(gfp_t gfp)
+{
+	struct kvm_mmu_page *sp;
+
+	gfp |= __GFP_ZERO;
+
+	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
+	if (!sp)
+		return NULL;
+
+	sp->spt = (void *)__get_free_page(gfp);
+	if (!sp->spt) {
+		kmem_cache_free(mmu_page_header_cache, sp);
+		return NULL;
+	}
+
+	return sp;
+}
+
+static struct kvm_mmu_page *alloc_tdp_mmu_page_for_split(struct kvm *kvm, bool *dropped_lock)
+{
+	struct kvm_mmu_page *sp;
+
+	lockdep_assert_held_read(&kvm->mmu_lock);
+
+	*dropped_lock = false;
+
+	/*
+	 * Since we are allocating while under the MMU lock we have to be
+	 * careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct
+	 * reclaim and to avoid making any filesystem callbacks (which can end
+	 * up invoking KVM MMU notifiers, resulting in a deadlock).
+	 *
+	 * If this allocation fails we drop the lock and retry with reclaim
+	 * allowed.
+	 */
+	sp = alloc_tdp_mmu_page_from_kernel(GFP_NOWAIT | __GFP_ACCOUNT);
+	if (sp)
+		return sp;
+
+	rcu_read_unlock();
+	read_unlock(&kvm->mmu_lock);
+
+	*dropped_lock = true;
+
+	sp = alloc_tdp_mmu_page_from_kernel(GFP_KERNEL_ACCOUNT);
+
+	read_lock(&kvm->mmu_lock);
+	rcu_read_lock();
+
+	return sp;
+}
+
+static bool
+tdp_mmu_split_huge_page_atomic(struct kvm *kvm, struct tdp_iter *iter, struct kvm_mmu_page *sp)
+{
+	const u64 huge_spte = iter->old_spte;
+	const int level = iter->level;
+	u64 child_spte;
+	int i;
+
+	init_child_tdp_mmu_page(sp, iter);
+
+	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
+		child_spte = make_huge_page_split_spte(huge_spte, level, i, ACC_ALL);
+
+		/*
+		 * No need for atomics since child_sp has not been installed
+		 * in the table yet and thus is not reachable by any other
+		 * thread.
+		 */
+		sp->spt[i] = child_spte;
+	}
+
+	if (!tdp_mmu_install_sp_atomic(kvm, iter, sp, false))
+		return false;
+
+	/*
+	 * tdp_mmu_install_sp_atomic will handle subtracting the split huge
+	 * page from stats, but we have to manually update the new present child
+	 * pages.
+	 */
+	kvm_update_page_stats(kvm, level - 1, PT64_ENT_PER_PAGE);
+
+	return true;
+}
+
+static int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
+					 gfn_t start, gfn_t end, int target_level)
+{
+	struct kvm_mmu_page *sp = NULL;
+	struct tdp_iter iter;
+	bool dropped_lock;
+
+	rcu_read_lock();
+
+	/*
+	 * Traverse the page table splitting all huge pages above the target
+	 * level into one lower level. For example, if we encounter a 1GB page
+	 * we split it into 512 2MB pages.
+	 *
+	 * Since the TDP iterator uses a pre-order traversal, we are guaranteed
+	 * to visit an SPTE before ever visiting its children, which means we
+	 * will correctly recursively split huge pages that are more than one
+	 * level above the target level (e.g. splitting 1GB to 2MB to 4KB).
+	 */
+	for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
+retry:
+		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
+			continue;
+
+		if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
+			continue;
+
+		if (!sp) {
+			sp = alloc_tdp_mmu_page_for_split(kvm, &dropped_lock);
+			if (!sp)
+				return -ENOMEM;
+
+			if (dropped_lock) {
+				tdp_iter_restart(&iter);
+				continue;
+			}
+		}
+
+		if (!tdp_mmu_split_huge_page_atomic(kvm, &iter, sp))
+			goto retry;
+
+		sp = NULL;
+	}
+
+	/*
+	 * It's possible to exit the loop having never used the last sp if, for
+	 * example, a vCPU doing HugePage NX splitting wins the race and
+	 * installs its own sp in place of the last sp we tried to split.
+	 */
+	if (sp)
+		tdp_mmu_free_sp(sp);
+
+	rcu_read_unlock();
+
+	return 0;
+}
+
+int kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
+				     const struct kvm_memory_slot *slot,
+				     gfn_t start, gfn_t end,
+				     int target_level)
+{
+	struct kvm_mmu_page *root;
+	int r = 0;
+
+	lockdep_assert_held_read(&kvm->mmu_lock);
+
+	for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) {
+		r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level);
+		if (r) {
+			kvm_tdp_mmu_put_root(kvm, root, true);
+			break;
+		}
+	}
+
+	return r;
+}
+
 /*
  * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
  * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 3899004a5d91..3557a7fcf927 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -71,6 +71,11 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
 				   int min_level);
 
+int kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
+				     const struct kvm_memory_slot *slot,
+				     gfn_t start, gfn_t end,
+				     int target_level);
+
 static inline void kvm_tdp_mmu_walk_lockless_begin(void)
 {
 	rcu_read_lock();
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 85127b3e3690..fb5592bf2eee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -187,6 +187,9 @@ module_param(force_emulation_prefix, bool, S_IRUGO);
 int __read_mostly pi_inject_timer = -1;
 module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);
 
+static bool __read_mostly eagerly_split_huge_pages_for_dirty_logging = true;
+module_param(eagerly_split_huge_pages_for_dirty_logging, bool, 0644);
+
 /*
  * Restoring the host value for MSRs that are only consumed when running in
  * usermode, e.g. SYSCALL MSRs and TSC_AUX, can be deferred until the CPU
@@ -11837,6 +11840,13 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
 			return;
 
+		/*
+		 * Attempt to split all large pages into 4K pages so that vCPUs
+		 * do not have to take write-protection faults.
+		 */
+		if (READ_ONCE(eagerly_split_huge_pages_for_dirty_logging))
+			kvm_mmu_slot_try_split_huge_pages(kvm, new, PG_LEVEL_4K);
+
 		if (kvm_x86_ops.cpu_dirty_log_size) {
 			kvm_mmu_slot_leaf_clear_dirty(kvm, new);
 			kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M);
-- 
2.34.1.173.g76aa8bc2d0-goog



* [PATCH v1 10/13] KVM: Push MMU locking down into kvm_arch_mmu_enable_log_dirty_pt_masked
  2021-12-13 22:59 [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (8 preceding siblings ...)
  2021-12-13 22:59 ` [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled David Matlack
@ 2021-12-13 22:59 ` David Matlack
  2021-12-13 22:59 ` [PATCH v1 11/13] KVM: x86/mmu: Split huge pages during CLEAR_DIRTY_LOG David Matlack
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: David Matlack @ 2021-12-13 22:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Instead of acquiring the MMU lock in the arch-generic code, force each
implementation of kvm_arch_mmu_enable_log_dirty_pt_masked to acquire the
MMU lock as needed. This is in preparation for performing eager page
splitting in the x86 implementation of
kvm_arch_mmu_enable_log_dirty_pt_masked, which involves dropping the MMU
lock taken in write mode and re-acquiring it in read mode (and possibly
rescheduling) during splitting. Pushing the MMU lock down into the
arch code makes the x86 synchronization much easier to reason about and
does not hurt the readability of the other architectures' implementations.

This should be a safe change because:

* No architecture requires a TLB flush before dropping the MMU lock.
* The MMU lock is not needed to keep the dirty bitmap synchronized with
  changes to the page tables, as evidenced by the fact that x86 already
  modifies the dirty bitmap without holding the MMU lock in
  fast_page_fault.

This change does increase the number of times the MMU lock is acquired
and released during KVM_CLEAR_DIRTY_LOG, but this is not a
performance-critical path, and breaking up the lock hold times may
reduce contention on vCPU threads.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/arm64/kvm/mmu.c   | 2 ++
 arch/mips/kvm/mmu.c    | 5 +++--
 arch/riscv/kvm/mmu.c   | 2 ++
 arch/x86/kvm/mmu/mmu.c | 4 ++++
 virt/kvm/dirty_ring.c  | 2 --
 virt/kvm/kvm_main.c    | 4 ----
 6 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index e65acf35cee3..48085cb534d5 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -749,7 +749,9 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		struct kvm_memory_slot *slot,
 		gfn_t gfn_offset, unsigned long mask)
 {
+	spin_lock(&kvm->mmu_lock);
 	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
+	spin_unlock(&kvm->mmu_lock);
 }
 
 static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index 1bfd1b501d82..7e67edcd5aae 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -409,8 +409,7 @@ int kvm_mips_mkclean_gpa_pt(struct kvm *kvm, gfn_t start_gfn, gfn_t end_gfn)
  * @mask:	The mask of dirty pages at offset 'gfn_offset' in this memory
  *		slot to be write protected
  *
- * Walks bits set in mask write protects the associated pte's. Caller must
- * acquire @kvm->mmu_lock.
+ * Walks bits set in mask write protects the associated pte's.
  */
 void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		struct kvm_memory_slot *slot,
@@ -420,7 +419,9 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 	gfn_t start = base_gfn +  __ffs(mask);
 	gfn_t end = base_gfn + __fls(mask);
 
+	spin_lock(&kvm->mmu_lock);
 	kvm_mips_mkclean_gpa_pt(kvm, start, end);
+	spin_unlock(&kvm->mmu_lock);
 }
 
 /*
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 7d884b15cf5e..d084ac939b0f 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -424,7 +424,9 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 	phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
 	phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
 
+	spin_lock(&kvm->mmu_lock);
 	stage2_wp_range(kvm, start, end);
+	spin_unlock(&kvm->mmu_lock);
 }
 
 void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9116c6a4ced1..c9e5fe290714 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1347,6 +1347,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 				struct kvm_memory_slot *slot,
 				gfn_t gfn_offset, unsigned long mask)
 {
+	write_lock(&kvm->mmu_lock);
+
 	/*
 	 * Huge pages are NOT write protected when we start dirty logging in
 	 * initially-all-set mode; must write protect them here so that they
@@ -1374,6 +1376,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		kvm_mmu_clear_dirty_pt_masked(kvm, slot, gfn_offset, mask);
 	else
 		kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
+
+	write_unlock(&kvm->mmu_lock);
 }
 
 int kvm_cpu_dirty_log_size(void)
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 88f4683198ea..6b26ec60c96a 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -61,9 +61,7 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
 	if (!memslot || (offset + __fls(mask)) >= memslot->npages)
 		return;
 
-	KVM_MMU_LOCK(kvm);
 	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
-	KVM_MMU_UNLOCK(kvm);
 }
 
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3595eddd476a..da4850fb2982 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2048,7 +2048,6 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 		dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot);
 		memset(dirty_bitmap_buffer, 0, n);
 
-		KVM_MMU_LOCK(kvm);
 		for (i = 0; i < n / sizeof(long); i++) {
 			unsigned long mask;
 			gfn_t offset;
@@ -2064,7 +2063,6 @@ static int kvm_get_dirty_log_protect(struct kvm *kvm, struct kvm_dirty_log *log)
 			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
 								offset, mask);
 		}
-		KVM_MMU_UNLOCK(kvm);
 	}
 
 	if (flush)
@@ -2159,7 +2157,6 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 	if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n))
 		return -EFAULT;
 
-	KVM_MMU_LOCK(kvm);
 	for (offset = log->first_page, i = offset / BITS_PER_LONG,
 		 n = DIV_ROUND_UP(log->num_pages, BITS_PER_LONG); n--;
 	     i++, offset += BITS_PER_LONG) {
@@ -2182,7 +2179,6 @@ static int kvm_clear_dirty_log_protect(struct kvm *kvm,
 								offset, mask);
 		}
 	}
-	KVM_MMU_UNLOCK(kvm);
 
 	if (flush)
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
-- 
2.34.1.173.g76aa8bc2d0-goog


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v1 11/13] KVM: x86/mmu: Split huge pages during CLEAR_DIRTY_LOG
  2021-12-13 22:59 [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (9 preceding siblings ...)
  2021-12-13 22:59 ` [PATCH v1 10/13] KVM: Push MMU locking down into kvm_arch_mmu_enable_log_dirty_pt_masked David Matlack
@ 2021-12-13 22:59 ` David Matlack
  2022-01-05  9:02   ` Peter Xu
  2021-12-13 22:59 ` [PATCH v1 12/13] KVM: x86/mmu: Add tracepoint for splitting huge pages David Matlack
  2021-12-13 22:59 ` [PATCH v1 13/13] KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and INITIALLY_SET David Matlack
  12 siblings, 1 reply; 55+ messages in thread
From: David Matlack @ 2021-12-13 22:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

When using initially-all-set, huge pages are not write-protected when
dirty logging is enabled on the memslot. Instead they are
write-protected once userspace invokes CLEAR_DIRTY_LOG for the first
time and only for the specific sub-region being cleared.

Enhance CLEAR_DIRTY_LOG to also try to split huge pages prior to
write-protecting them, to avoid causing write-protection faults on vCPU
threads. This also allows userspace to smear the cost of huge page
splitting across multiple ioctls, rather than splitting the entire
memslot up front as is done when initially-all-set is not in use.
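
For illustration, a rough userspace sketch of driving KVM_CLEAR_DIRTY_LOG
in 64-page chunks, so that each ioctl splits and write-protects only a
small range at a time (the helper name, slot layout, and lack of error
handling are assumptions for the example, not part of this series):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Illustrative only: clear the dirty log for one memslot in 64-page
 * chunks. vm_fd, slot, npages, and bitmap are assumed to be set up by
 * the caller; bitmap is the userspace copy of the slot's dirty bitmap.
 */
static void clear_dirty_log_in_chunks(int vm_fd, __u32 slot, __u64 npages,
				      unsigned char *bitmap)
{
	struct kvm_clear_dirty_log clear = { .slot = slot };
	__u64 chunk = 64;	/* one unsigned long worth of mask bits */
	__u64 first;

	for (first = 0; first < npages; first += chunk) {
		clear.first_page = first;
		clear.num_pages = npages - first < chunk ? npages - first : chunk;
		clear.dirty_bitmap = bitmap + first / 8;

		/* Each call splits and write-protects just this 256KB range. */
		ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
	}
}

Note that KVM requires first_page to be a multiple of 64, and num_pages to
be a multiple of 64 unless the chunk runs to the end of the dirty bitmap,
which the fixed 64-page stride above satisfies.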

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++++
 arch/x86/kvm/mmu/mmu.c          | 36 +++++++++++++++++++++++++++------
 arch/x86/kvm/x86.c              |  2 +-
 arch/x86/kvm/x86.h              |  2 ++
 4 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4a507109e886..3e537e261562 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1576,6 +1576,10 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
 				       const struct kvm_memory_slot *memslot,
 				       int target_level);
+void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
+				  const struct kvm_memory_slot *memslot,
+				  u64 start, u64 end,
+				  int target_level);
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c9e5fe290714..55640d73df5a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1362,6 +1362,20 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
 		gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
 
+		/*
+		 * Try to proactively split any huge pages down to 4KB so that
+		 * vCPUs don't have to take write-protection faults.
+		 *
+		 * Drop the MMU lock since huge page splitting uses its own
+		 * locking scheme and does not require the write lock in all
+		 * cases.
+		 */
+		if (READ_ONCE(eagerly_split_huge_pages_for_dirty_logging)) {
+			write_unlock(&kvm->mmu_lock);
+			kvm_mmu_try_split_huge_pages(kvm, slot, start, end, PG_LEVEL_4K);
+			write_lock(&kvm->mmu_lock);
+		}
+
 		kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
 
 		/* Cross two large pages? */
@@ -5811,13 +5825,11 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
-void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
-				       const struct kvm_memory_slot *memslot,
-				       int target_level)
+void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
+				   const struct kvm_memory_slot *memslot,
+				   u64 start, u64 end,
+				   int target_level)
 {
-	u64 start = memslot->base_gfn;
-	u64 end = start + memslot->npages;
-
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
 		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
@@ -5825,6 +5837,18 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
 	}
 }
 
+void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
+					const struct kvm_memory_slot *memslot,
+					int target_level)
+{
+	u64 start, end;
+
+	start = memslot->base_gfn;
+	end = start + memslot->npages;
+
+	kvm_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
+}
+
 static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 					 struct kvm_rmap_head *rmap_head,
 					 const struct kvm_memory_slot *slot)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fb5592bf2eee..e27a3d6e3978 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -187,7 +187,7 @@ module_param(force_emulation_prefix, bool, S_IRUGO);
 int __read_mostly pi_inject_timer = -1;
 module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);
 
-static bool __read_mostly eagerly_split_huge_pages_for_dirty_logging = true;
+bool __read_mostly eagerly_split_huge_pages_for_dirty_logging = true;
 module_param(eagerly_split_huge_pages_for_dirty_logging, bool, 0644);
 
 /*
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 4abcd8d9836d..825e47451875 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -352,6 +352,8 @@ extern int pi_inject_timer;
 
 extern bool report_ignored_msrs;
 
+extern bool eagerly_split_huge_pages_for_dirty_logging;
+
 static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
 {
 	return pvclock_scale_delta(nsec, vcpu->arch.virtual_tsc_mult,
-- 
2.34.1.173.g76aa8bc2d0-goog


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v1 12/13] KVM: x86/mmu: Add tracepoint for splitting huge pages
  2021-12-13 22:59 [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (10 preceding siblings ...)
  2021-12-13 22:59 ` [PATCH v1 11/13] KVM: x86/mmu: Split huge pages during CLEAR_DIRTY_LOG David Matlack
@ 2021-12-13 22:59 ` David Matlack
  2022-01-05  8:38   ` Peter Xu
  2022-01-06 23:14   ` Sean Christopherson
  2021-12-13 22:59 ` [PATCH v1 13/13] KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and INITIALLY_SET David Matlack
  12 siblings, 2 replies; 55+ messages in thread
From: David Matlack @ 2021-12-13 22:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Add a tracepoint that records whenever KVM eagerly splits a huge page.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmutrace.h | 20 ++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c  |  2 ++
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index de5e8e4e1aa7..4feabf773387 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -416,6 +416,26 @@ TRACE_EVENT(
 	)
 );
 
+TRACE_EVENT(
+	kvm_mmu_split_huge_page,
+	TP_PROTO(u64 gfn, u64 spte, int level),
+	TP_ARGS(gfn, spte, level),
+
+	TP_STRUCT__entry(
+		__field(u64, gfn)
+		__field(u64, spte)
+		__field(int, level)
+	),
+
+	TP_fast_assign(
+		__entry->gfn = gfn;
+		__entry->spte = spte;
+		__entry->level = level;
+	),
+
+	TP_printk("gfn %llx spte %llx level %d", __entry->gfn, __entry->spte, __entry->level)
+);
+
 #endif /* _TRACE_KVMMMU_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index be5eb74ac053..e6910b9b5c12 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1325,6 +1325,8 @@ tdp_mmu_split_huge_page_atomic(struct kvm *kvm, struct tdp_iter *iter, struct kv
 	u64 child_spte;
 	int i;
 
+	trace_kvm_mmu_split_huge_page(iter->gfn, huge_spte, level);
+
 	init_child_tdp_mmu_page(sp, iter);
 
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
-- 
2.34.1.173.g76aa8bc2d0-goog


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v1 13/13] KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and INITIALLY_SET
  2021-12-13 22:59 [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
                   ` (11 preceding siblings ...)
  2021-12-13 22:59 ` [PATCH v1 12/13] KVM: x86/mmu: Add tracepoint for splitting huge pages David Matlack
@ 2021-12-13 22:59 ` David Matlack
  2022-01-05  8:38   ` Peter Xu
  12 siblings, 1 reply; 55+ messages in thread
From: David Matlack @ 2021-12-13 22:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Ben Gardon, Joerg Roedel, Jim Mattson, Wanpeng Li,
	Vitaly Kuznetsov, Sean Christopherson, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania, David Matlack

Add an option to dirty_log_perf_test to disable MANUAL_PROTECT_ENABLE
and INITIALLY_SET so the legacy dirty logging code path can be tested.
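
For reference, a sketch of where dirty_log_manual_caps is consumed in the
test, which shows why zeroing it is sufficient (modeled on the existing
dirty_log_perf_test code; treat the exact helper names as assumptions):

	/* Probe the caps and keep only the two modes the test cares about. */
	dirty_log_manual_caps = kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
	dirty_log_manual_caps &= (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE |
				  KVM_DIRTY_LOG_INITIALLY_SET);

	/* With -g this is skipped, so the cap is never enabled for the VM. */
	if (dirty_log_manual_caps) {
		struct kvm_enable_cap cap = {
			.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2,
			.args[0] = dirty_log_manual_caps,
		};

		vm_enable_cap(vm, &cap);
	}

When the cap is not enabled, KVM_GET_DIRTY_LOG both fetches and clears the
dirty log, which is the legacy path the new -g option exercises.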

Signed-off-by: David Matlack <dmatlack@google.com>
---
 tools/testing/selftests/kvm/dirty_log_perf_test.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index 1954b964d1cf..a0c2247855f6 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -298,12 +298,15 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 static void help(char *name)
 {
 	puts("");
-	printf("usage: %s [-h] [-i iterations] [-p offset] "
+	printf("usage: %s [-h] [-i iterations] [-p offset] [-g]"
 	       "[-m mode] [-b vcpu bytes] [-v vcpus] [-o] [-s mem type]"
 	       "[-x memslots]\n", name);
 	puts("");
 	printf(" -i: specify iteration counts (default: %"PRIu64")\n",
 	       TEST_HOST_LOOP_N);
+	printf(" -g: Use the legacy dirty logging mode where KVM_GET_DIRTY_LOG\n"
+	       "     fetches and *clears* the dirty log. By default the test will\n"
+	       "     use MANUAL_PROTECT_ENABLE and INITIALLY_SET.\n");
 	printf(" -p: specify guest physical test memory offset\n"
 	       "     Warning: a low offset can conflict with the loaded test code.\n");
 	guest_modes_help();
@@ -343,8 +346,11 @@ int main(int argc, char *argv[])
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "hi:p:m:b:f:v:os:x:")) != -1) {
+	while ((opt = getopt(argc, argv, "ghi:p:m:b:f:v:os:x:")) != -1) {
 		switch (opt) {
+		case 'g':
+			dirty_log_manual_caps = 0;
+			break;
 		case 'i':
 			p.iterations = atoi(optarg);
 			break;
-- 
2.34.1.173.g76aa8bc2d0-goog


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 03/13] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
  2021-12-13 22:59 ` [PATCH v1 03/13] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails David Matlack
@ 2022-01-04 10:13   ` Peter Xu
  2022-01-04 17:29     ` Ben Gardon
  2022-01-06  0:54   ` Sean Christopherson
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Xu @ 2022-01-04 10:13 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021 at 10:59:08PM +0000, David Matlack wrote:
> @@ -985,6 +992,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  			 * path below.
>  			 */
>  			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> +

Useless empty line?

Other than that:

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table
  2021-12-13 22:59 ` [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table David Matlack
@ 2022-01-04 10:32   ` Peter Xu
  2022-01-04 18:26     ` David Matlack
  2022-01-06 20:12   ` Sean Christopherson
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Xu @ 2022-01-04 10:32 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021 at 10:59:09PM +0000, David Matlack wrote:
> +/*
> + * tdp_mmu_install_sp_atomic - Atomically replace the given spte with an
> + * spte pointing to the provided page table.
> + *
> + * @kvm: kvm instance
> + * @iter: a tdp_iter instance currently on the SPTE that should be set
> + * @sp: The new TDP page table to install.
> + * @account_nx: True if this page table is being installed to split a
> + *              non-executable huge page.
> + *
> + * Returns: True if the new page table was installed. False if spte being
> + *          replaced changed, causing the atomic compare-exchange to fail.
> + *          If this function returns false the sp will be freed before

s/will/will not/?

> + *          returning.
> + */

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 05/13] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
  2021-12-13 22:59 ` [PATCH v1 05/13] KVM: x86/mmu: Move restore_acc_track_spte to spte.c David Matlack
@ 2022-01-04 10:33   ` Peter Xu
  2022-01-06 20:27   ` Sean Christopherson
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Xu @ 2022-01-04 10:33 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021 at 10:59:10PM +0000, David Matlack wrote:
> restore_acc_track_spte is purely an SPTE manipulation, making it a good
> fit for spte.c. It is also needed in spte.c in a follow-up commit so we
> can construct child SPTEs during large page splitting.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 06/13] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
  2021-12-13 22:59 ` [PATCH v1 06/13] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root David Matlack
@ 2022-01-04 10:35   ` Peter Xu
  2022-01-06 20:34   ` Sean Christopherson
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Xu @ 2022-01-04 10:35 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021 at 10:59:11PM +0000, David Matlack wrote:
> Instead of passing a pointer to the root page table and the root level
> separately, pass in a pointer to the kvm_mmu_page that backs the root.
> This reduces the number of arguments by 1, cutting down on line lengths.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 03/13] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
  2022-01-04 10:13   ` Peter Xu
@ 2022-01-04 17:29     ` Ben Gardon
  0 siblings, 0 replies; 55+ messages in thread
From: Ben Gardon @ 2022-01-04 17:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Matlack, Paolo Bonzini, kvm, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Tue, Jan 4, 2022 at 2:13 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Dec 13, 2021 at 10:59:08PM +0000, David Matlack wrote:
> > @@ -985,6 +992,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >                        * path below.
> >                        */
> >                       iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> > +
>
> Useless empty line?
>
> Other than that:
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> --
> Peter Xu
>

Looks good to me too.

Reviewed-by: Ben Gardon <bgardon@google.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table
  2022-01-04 10:32   ` Peter Xu
@ 2022-01-04 18:26     ` David Matlack
  2022-01-05  1:00       ` Peter Xu
  0 siblings, 1 reply; 55+ messages in thread
From: David Matlack @ 2022-01-04 18:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Tue, Jan 4, 2022 at 2:32 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Dec 13, 2021 at 10:59:09PM +0000, David Matlack wrote:
> > +/*
> > + * tdp_mmu_install_sp_atomic - Atomically replace the given spte with an
> > + * spte pointing to the provided page table.
> > + *
> > + * @kvm: kvm instance
> > + * @iter: a tdp_iter instance currently on the SPTE that should be set
> > + * @sp: The new TDP page table to install.
> > + * @account_nx: True if this page table is being installed to split a
> > + *              non-executable huge page.
> > + *
> > + * Returns: True if the new page table was installed. False if spte being
> > + *          replaced changed, causing the atomic compare-exchange to fail.
> > + *          If this function returns false the sp will be freed before
>
> s/will/will not/?

Good catch. This comment is leftover from the RFC patch where it did
free the sp.

>
> > + *          returning.
> > + */
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table
  2022-01-04 18:26     ` David Matlack
@ 2022-01-05  1:00       ` Peter Xu
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Xu @ 2022-01-05  1:00 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Tue, Jan 04, 2022 at 10:26:15AM -0800, David Matlack wrote:
> On Tue, Jan 4, 2022 at 2:32 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Dec 13, 2021 at 10:59:09PM +0000, David Matlack wrote:
> > > +/*
> > > + * tdp_mmu_install_sp_atomic - Atomically replace the given spte with an
> > > + * spte pointing to the provided page table.
> > > + *
> > > + * @kvm: kvm instance
> > > + * @iter: a tdp_iter instance currently on the SPTE that should be set
> > > + * @sp: The new TDP page table to install.
> > > + * @account_nx: True if this page table is being installed to split a
> > > + *              non-executable huge page.
> > > + *
> > > + * Returns: True if the new page table was installed. False if spte being
> > > + *          replaced changed, causing the atomic compare-exchange to fail.
> > > + *          If this function returns false the sp will be freed before
> >
> > s/will/will not/?
> 
> Good catch. This comment is leftover from the RFC patch where it did
> free the sp.

With that fixed, feel free to add:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 07/13] KVM: x86/mmu: Derive page role from parent
  2021-12-13 22:59 ` [PATCH v1 07/13] KVM: x86/mmu: Derive page role from parent David Matlack
@ 2022-01-05  7:51   ` Peter Xu
  2022-01-06 20:45   ` Sean Christopherson
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Xu @ 2022-01-05  7:51 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021 at 10:59:12PM +0000, David Matlack wrote:
> Derive the page role from the parent shadow page, since the only thing
> that changes is the level. This is in preparation for eagerly splitting
> large pages during VM-ioctls which does not have access to the vCPU
> MMU context.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 08/13] KVM: x86/mmu: Refactor TDP MMU child page initialization
  2021-12-13 22:59 ` [PATCH v1 08/13] KVM: x86/mmu: Refactor TDP MMU child page initialization David Matlack
@ 2022-01-05  7:51   ` Peter Xu
  2022-01-06 20:59   ` Sean Christopherson
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Xu @ 2022-01-05  7:51 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021 at 10:59:13PM +0000, David Matlack wrote:
> Separate the allocation of child pages from the initialization. This is
> in preparation for doing page splitting outside of the vCPU fault
> context which requires a different allocation mechanism.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled
  2021-12-13 22:59 ` [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled David Matlack
@ 2022-01-05  7:54   ` Peter Xu
  2022-01-05 17:49     ` David Matlack
  2022-01-06 21:28   ` Sean Christopherson
  2022-01-07  2:06   ` Peter Xu
  2 siblings, 1 reply; 55+ messages in thread
From: Peter Xu @ 2022-01-05  7:54 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021 at 10:59:14PM +0000, David Matlack wrote:
> When dirty logging is enabled without initially-all-set, attempt to
> split all huge pages in the memslot down to 4KB pages so that vCPUs
> do not have to take expensive write-protection faults to split huge
> pages.
> 
> Huge page splitting is best-effort only. This commit only adds the
> support for the TDP MMU, and even there splitting may fail due to out
> of memory conditions. Failure to split a huge page is fine from a
> correctness standpoint because we still always follow it up by write-
> protecting any remaining huge pages.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

Thanks for adding the knob.

Reviewed-by: Peter Xu <peterx@redhat.com>

One trivial nitpick below:

> +u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index, unsigned int access)
> +{
> +	u64 child_spte;
> +	int child_level;
> +
> +	if (WARN_ON(is_mmio_spte(huge_spte)))
> +		return 0;
> +
> +	if (WARN_ON(!is_shadow_present_pte(huge_spte)))
> +		return 0;
> +
> +	if (WARN_ON(!is_large_pte(huge_spte)))
> +		return 0;
> +
> +	child_spte = huge_spte;
> +	child_level = huge_level - 1;
> +
> +	/*
> +	 * The child_spte already has the base address of the huge page being
> +	 * split. So we just have to OR in the offset to the page at the next
> +	 * lower level for the given index.
> +	 */
> +	child_spte |= (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;
> +
> +	if (child_level == PG_LEVEL_4K) {
> +		child_spte &= ~PT_PAGE_SIZE_MASK;
> +
> +		/* Allow execution for 4K pages if it was disabled for NX HugePages. */
> +		if (is_nx_huge_page_enabled() && access & ACC_EXEC_MASK)

IMHO clearer to use brackets ("A && (B & C)").

I don't even see anywhere that the tdp mmu disables the EXEC bit for 4K.. if
that's true then perhaps we can even drop "access" and this check?  But I could
have missed something.

> +			child_spte = mark_spte_executable(child_spte);
> +	}
> +
> +	return child_spte;
> +}

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 12/13] KVM: x86/mmu: Add tracepoint for splitting huge pages
  2021-12-13 22:59 ` [PATCH v1 12/13] KVM: x86/mmu: Add tracepoint for splitting huge pages David Matlack
@ 2022-01-05  8:38   ` Peter Xu
  2022-01-06 23:14   ` Sean Christopherson
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Xu @ 2022-01-05  8:38 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021 at 10:59:17PM +0000, David Matlack wrote:
> Add a tracepoint that records whenever KVM eagerly splits a huge page.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 13/13] KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and INITIALLY_SET
  2021-12-13 22:59 ` [PATCH v1 13/13] KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and INITIALLY_SET David Matlack
@ 2022-01-05  8:38   ` Peter Xu
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Xu @ 2022-01-05  8:38 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021 at 10:59:18PM +0000, David Matlack wrote:
> Add an option to dirty_log_perf_test to disable MANUAL_PROTECT_ENABLE
> and INITIALLY_SET so the legacy dirty logging code path can be tested.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 11/13] KVM: x86/mmu: Split huge pages during CLEAR_DIRTY_LOG
  2021-12-13 22:59 ` [PATCH v1 11/13] KVM: x86/mmu: Split huge pages during CLEAR_DIRTY_LOG David Matlack
@ 2022-01-05  9:02   ` Peter Xu
  2022-01-05 17:55     ` David Matlack
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Xu @ 2022-01-05  9:02 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021 at 10:59:16PM +0000, David Matlack wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c9e5fe290714..55640d73df5a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1362,6 +1362,20 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  		gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
>  		gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
>  
> +		/*
> +		 * Try to proactively split any huge pages down to 4KB so that
> +		 * vCPUs don't have to take write-protection faults.
> +		 *
> +		 * Drop the MMU lock since huge page splitting uses its own
> +		 * locking scheme and does not require the write lock in all
> +		 * cases.
> +		 */
> +		if (READ_ONCE(eagerly_split_huge_pages_for_dirty_logging)) {
> +			write_unlock(&kvm->mmu_lock);
> +			kvm_mmu_try_split_huge_pages(kvm, slot, start, end, PG_LEVEL_4K);
> +			write_lock(&kvm->mmu_lock);
> +		}
> +
>  		kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);

Would it be easier to just allow passing in shared=true/false for the new
kvm_mmu_try_split_huge_pages(), then previous patch will not be needed?  Or is
it intended to do it for performance reasons?

IOW, I think this patch does two things: (1) support clear-log on eager split,
and (2) allow lock degrade during eager split.

It's just that imho (2) may still need some justification on necessity since
this function only operates on a very small range of guest mem (at most
64 * 4KB = 256KB range), so it's not clear to me whether the extra lock operations
are needed at all; after all it'll make the code slightly harder to follow.
Not to mention the previous patch is preparing for this, and both patches will
add lock operations.

I think dirty_log_perf_test didn't cover lock contention case, because clear
log was run after vcpu threads stopped, so lock access should be mostly hitting
the cachelines there, afaict.  While in real life, clear log is run with vcpus
running.  Not sure whether that'll be a problem, so raising this question up.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled
  2022-01-05  7:54   ` Peter Xu
@ 2022-01-05 17:49     ` David Matlack
  2022-01-06 22:48       ` Sean Christopherson
  0 siblings, 1 reply; 55+ messages in thread
From: David Matlack @ 2022-01-05 17:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Tue, Jan 4, 2022 at 11:55 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Dec 13, 2021 at 10:59:14PM +0000, David Matlack wrote:
> > When dirty logging is enabled without initially-all-set, attempt to
> > split all huge pages in the memslot down to 4KB pages so that vCPUs
> > do not have to take expensive write-protection faults to split huge
> > pages.
> >
> > Huge page splitting is best-effort only. This commit only adds the
> > support for the TDP MMU, and even there splitting may fail due to out
> > of memory conditions. Failure to split a huge page is fine from a
> > correctness standpoint because we still always follow it up by write-
> > protecting any remaining huge pages.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
>
> Thanks for adding the knob.
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> One trivial nitpick below:
>
> > +u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index, unsigned int access)
> > +{
> > +     u64 child_spte;
> > +     int child_level;
> > +
> > +     if (WARN_ON(is_mmio_spte(huge_spte)))
> > +             return 0;
> > +
> > +     if (WARN_ON(!is_shadow_present_pte(huge_spte)))
> > +             return 0;
> > +
> > +     if (WARN_ON(!is_large_pte(huge_spte)))
> > +             return 0;
> > +
> > +     child_spte = huge_spte;
> > +     child_level = huge_level - 1;
> > +
> > +     /*
> > +      * The child_spte already has the base address of the huge page being
> > +      * split. So we just have to OR in the offset to the page at the next
> > +      * lower level for the given index.
> > +      */
> > +     child_spte |= (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;
> > +
> > +     if (child_level == PG_LEVEL_4K) {
> > +             child_spte &= ~PT_PAGE_SIZE_MASK;
> > +
> > +             /* Allow execution for 4K pages if it was disabled for NX HugePages. */
> > +             if (is_nx_huge_page_enabled() && access & ACC_EXEC_MASK)
>
> IMHO clearer to use brackets ("A && (B & C)").

Agreed.

>
> I don't even see anywhere that the tdp mmu disables the EXEC bit for 4K.. if
> that's true then perhaps we can even drop "access" and this check?  But I could
> have missed something.

TDP MMU always passes ACC_ALL so the access check could be omitted
from this patch. But it will be needed to support eager splitting for
the shadow MMU, which does not always allow execution.



>
> > +                     child_spte = mark_spte_executable(child_spte);
> > +     }
> > +
> > +     return child_spte;
> > +}
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 11/13] KVM: x86/mmu: Split huge pages during CLEAR_DIRTY_LOG
  2022-01-05  9:02   ` Peter Xu
@ 2022-01-05 17:55     ` David Matlack
  2022-01-05 17:57       ` David Matlack
  0 siblings, 1 reply; 55+ messages in thread
From: David Matlack @ 2022-01-05 17:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Wed, Jan 5, 2022 at 1:02 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Dec 13, 2021 at 10:59:16PM +0000, David Matlack wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index c9e5fe290714..55640d73df5a 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1362,6 +1362,20 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> >
> > +             /*
> > +              * Try to proactively split any huge pages down to 4KB so that
> > +              * vCPUs don't have to take write-protection faults.
> > +              *
> > +              * Drop the MMU lock since huge page splitting uses its own
> > +              * locking scheme and does not require the write lock in all
> > +              * cases.
> > +              */
> > +             if (READ_ONCE(eagerly_split_huge_pages_for_dirty_logging)) {
> > +                     write_unlock(&kvm->mmu_lock);
> > +                     kvm_mmu_try_split_huge_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > +                     write_lock(&kvm->mmu_lock);
> > +             }
> > +
> >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
>
> Would it be easier to just allow passing in shared=true/false for the new
> kvm_mmu_try_split_huge_pages(), then previous patch will not be needed?  Or is
> it intended to do it for performance reasons?
>
> IOW, I think this patch does two things: (1) support clear-log on eager split,
> and (2) allow lock degrade during eager split.
>
> It's just that imho (2) may still need some justification on necessity since
> this function only operates on a very small range of guest mem (at most
> 64 * 4KB = 256KB range), so it's not clear to me whether the extra lock operations
> are needed at all; after all it'll make the code slightly harder to follow.
> Not to mention the previous patch is preparing for this, and both patches will
> add lock operations.
>
> I think dirty_log_perf_test didn't cover lock contention case, because clear
> log was run after vcpu threads stopped, so lock access should be mostly hitting
> the cachelines there, afaict.  While in real life, clear log is run with vcpus
> running.  Not sure whether that'll be a problem, so raising this question up.

Good point. Dropping the write lock to acquire the read lock is
probably not necessary since we're splitting a small region of memory
here. Plus the splitting code detects contention and will drop the
lock if necessary. And the value of dropping the lock is dubious since
it adds a lot more lock operations.

I'll try your suggestion in v3.

>
> Thanks,


>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 11/13] KVM: x86/mmu: Split huge pages during CLEAR_DIRTY_LOG
  2022-01-05 17:55     ` David Matlack
@ 2022-01-05 17:57       ` David Matlack
  0 siblings, 0 replies; 55+ messages in thread
From: David Matlack @ 2022-01-05 17:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Wed, Jan 5, 2022 at 9:55 AM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Jan 5, 2022 at 1:02 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Dec 13, 2021 at 10:59:16PM +0000, David Matlack wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index c9e5fe290714..55640d73df5a 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1362,6 +1362,20 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> > >
> > > +             /*
> > > +              * Try to proactively split any huge pages down to 4KB so that
> > > +              * vCPUs don't have to take write-protection faults.
> > > +              *
> > > +              * Drop the MMU lock since huge page splitting uses its own
> > > +              * locking scheme and does not require the write lock in all
> > > +              * cases.
> > > +              */
> > > +             if (READ_ONCE(eagerly_split_huge_pages_for_dirty_logging)) {
> > > +                     write_unlock(&kvm->mmu_lock);
> > > +                     kvm_mmu_try_split_huge_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > > +                     write_lock(&kvm->mmu_lock);
> > > +             }
> > > +
> > >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >
> > Would it be easier to just allow passing in shared=true/false for the new
> > kvm_mmu_try_split_huge_pages(), then previous patch will not be needed?  Or is
> > it intended to do it for performance reasons?
> >
> > IOW, I think this patch does two things: (1) support clear-log on eager split,
> > and (2) allow lock degrade during eager split.
> >
> > It's just that imho (2) may still need some justification on necessity since
> > this function only operates on a very small range of guest mem (at most
> > 64 * 4KB = 256KB range), so it's not clear to me whether the extra lock operations
> > are needed at all; after all it'll make the code slightly harder to follow.
> > Not to mention the previous patch is preparing for this, and both patches will
> > add lock operations.
> >
> > I think dirty_log_perf_test didn't cover lock contention case, because clear
> > log was run after vcpu threads stopped, so lock access should be mostly hitting
> > the cachelines there, afaict.  While in real life, clear log is run with vcpus
> > running.  Not sure whether that'll be a problem, so raising this question up.
>
> Good point. Dropping the write lock to acquire the read lock is
> probably not necessary since we're splitting a small region of memory
> here. Plus the splitting code detects contention and will drop the
> lock if necessary. And the value of dropping the lock is dubious since
> it adds a lot more lock operations.

I wasn't very clear here. I meant the value of "dropping the write
lock to switch to the read lock every time we split" is dubious since it
adds more lock operations. Dropping the lock and yielding when there's
contention detected is not dubious :).

>
> I'll try your suggestion in v3.
>
> >
> > Thanks,
>
>
> >
> > --
> > Peter Xu
> >

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 01/13] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn
  2021-12-13 22:59 ` [PATCH v1 01/13] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn David Matlack
@ 2022-01-06  0:35   ` Sean Christopherson
  0 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2022-01-06  0:35 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021, David Matlack wrote:
> rmap_write_protect is a poor name because we may not even touch the rmap
> if the TDP MMU is in use. It is also confusing that rmap_write_protect
> is not a simpler wrapper around __rmap_write_protect, since that is the
> typical flow for functions with double-underscore names.
> 
> Rename it to kvm_vcpu_write_protect_gfn to convey that we are
> write-protecting a specific gfn in the context of a vCPU.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> ---

Hopping on the R-b train...

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 02/13] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect
  2021-12-13 22:59 ` [PATCH v1 02/13] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect David Matlack
@ 2022-01-06  0:35   ` Sean Christopherson
  0 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2022-01-06  0:35 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021, David Matlack wrote:
> Now that rmap_write_protect has been renamed, there is no need for the
> double underscores in front of __rmap_write_protect.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 03/13] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
  2021-12-13 22:59 ` [PATCH v1 03/13] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails David Matlack
  2022-01-04 10:13   ` Peter Xu
@ 2022-01-06  0:54   ` Sean Christopherson
  2022-01-06 18:04     ` David Matlack
  1 sibling, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2022-01-06  0:54 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021, David Matlack wrote:
> Consolidate a bunch of code that was manually re-reading the spte if the
> cmpxchg fails. There is no extra cost of doing this because we already
> have the spte value as a result of the cmpxchg (and in fact this
> eliminates re-reading the spte), and none of the call sites depend on
> iter->old_spte retaining the stale spte value.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 50 ++++++++++++++++----------------------
>  1 file changed, 21 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b69e47e68307..656ebf5b20dc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -492,16 +492,22 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>   * and handle the associated bookkeeping.  Do not mark the page dirty
>   * in KVM's dirty bitmaps.
>   *
> + * If setting the SPTE fails because it has changed, iter->old_spte will be
> + * updated with the updated value of the spte.

First updated=>refreshed, second updated=>current?  More below.

> + *
>   * @kvm: kvm instance
>   * @iter: a tdp_iter instance currently on the SPTE that should be set
>   * @new_spte: The value the SPTE should be set to
>   * Returns: true if the SPTE was set, false if it was not. If false is returned,
> - *	    this function will have no side-effects.
> + *          this function will have no side-effects other than updating

s/updating/setting

> + *          iter->old_spte to the latest value of spte.

Strictly speaking, "latest" may not be true if yet another thread modifies the
SPTE.  Maybe this?

		iter->old_spte to the last known value of the SPTE.

>   */
>  static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  					   struct tdp_iter *iter,
>  					   u64 new_spte)
>  {
> +	u64 old_spte;
> +
>  	lockdep_assert_held_read(&kvm->mmu_lock);
>  
>  	/*
> @@ -515,9 +521,15 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
>  	 * does not hold the mmu_lock.
>  	 */
> -	if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
> -		      new_spte) != iter->old_spte)
> +	old_spte = cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte, new_spte);

To make this a bit easier to read, and to stay under 80 chars, how about
opportunistically doing this as well?

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 656ebf5b20dc..64f1369f8638 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -506,6 +506,7 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
                                           struct tdp_iter *iter,
                                           u64 new_spte)
 {
+       u64 *sptep = rcu_dereference(iter->sptep);
        u64 old_spte;
 
        lockdep_assert_held_read(&kvm->mmu_lock);
@@ -521,7 +522,7 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
         * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
         * does not hold the mmu_lock.
         */
-       old_spte = cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte, new_spte);
+       old_spte = cmpxchg64(sptep, iter->old_spte, new_spte);
        if (old_spte != iter->old_spte) {
                /*
                 * The cmpxchg failed because the spte was updated by another

> +	if (old_spte != iter->old_spte) {
> +		/*
> +		 * The cmpxchg failed because the spte was updated by another
> +		 * thread so record the updated spte in old_spte.
> +		 */

Hmm, this is a bit awkward.  I think it's the double use of "updated" and the
somewhat ambiguous reference to "old_spte".  I'd also avoid "thread", as this
requires interference from not only a different task, but a different logical CPU
since iter->old_spte is refreshed if mmu_lock is dropped and reacquired.  And
"record" is an odd choice of word since it sounds like storing the current value
is only for logging/debugging.

Something like this?

		/*
		 * The entry was modified by a different logical CPU, refresh
		 * iter->old_spte with the current value so the caller operates
		 * on fresh data, e.g. if it retries tdp_mmu_set_spte_atomic().
		 */

Nits aside,

Reviewed-by: Sean Christopherson <seanjc@google.com>

> +		iter->old_spte = old_spte;
>  		return false;
> +	}
>  
>  	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
>  			      new_spte, iter->level, true);

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 03/13] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
  2022-01-06  0:54   ` Sean Christopherson
@ 2022-01-06 18:04     ` David Matlack
  0 siblings, 0 replies; 55+ messages in thread
From: David Matlack @ 2022-01-06 18:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Wed, Jan 5, 2022 at 4:54 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Dec 13, 2021, David Matlack wrote:
> > Consolidate a bunch of code that was manually re-reading the spte if the
> > cmpxchg fails. There is no extra cost of doing this because we already
> > have the spte value as a result of the cmpxchg (and in fact this
> > eliminates re-reading the spte), and none of the call sites depend on
> > iter->old_spte retaining the stale spte value.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 50 ++++++++++++++++----------------------
> >  1 file changed, 21 insertions(+), 29 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index b69e47e68307..656ebf5b20dc 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -492,16 +492,22 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> >   * and handle the associated bookkeeping.  Do not mark the page dirty
> >   * in KVM's dirty bitmaps.
> >   *
> > + * If setting the SPTE fails because it has changed, iter->old_spte will be
> > + * updated with the updated value of the spte.
>
> First updated=>refreshed, second updated=>current?  More below.
>
> > + *
> >   * @kvm: kvm instance
> >   * @iter: a tdp_iter instance currently on the SPTE that should be set
> >   * @new_spte: The value the SPTE should be set to
> >   * Returns: true if the SPTE was set, false if it was not. If false is returned,
> > - *       this function will have no side-effects.
> > + *          this function will have no side-effects other than updating
>
> s/updating/setting
>
> > + *          iter->old_spte to the latest value of spte.
>
> Strictly speaking, "latest" may not be true if yet another thread modifies the
> SPTE.  Maybe this?
>
>                 iter->old_spte to the last known value of the SPTE.
>
> >   */
> >  static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
> >                                          struct tdp_iter *iter,
> >                                          u64 new_spte)
> >  {
> > +     u64 old_spte;
> > +
> >       lockdep_assert_held_read(&kvm->mmu_lock);
> >
> >       /*
> > @@ -515,9 +521,15 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
> >        * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
> >        * does not hold the mmu_lock.
> >        */
> > -     if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
> > -                   new_spte) != iter->old_spte)
> > +     old_spte = cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte, new_spte);
>
> To make this a bit easier to read, and to stay under 80 chars, how about
> opportunistically doing this as well?
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 656ebf5b20dc..64f1369f8638 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -506,6 +506,7 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>                                            struct tdp_iter *iter,
>                                            u64 new_spte)
>  {
> +       u64 *sptep = rcu_dereference(iter->sptep);
>         u64 old_spte;
>
>         lockdep_assert_held_read(&kvm->mmu_lock);
> @@ -521,7 +522,7 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>          * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
>          * does not hold the mmu_lock.
>          */
> -       old_spte = cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte, new_spte);
> +       old_spte = cmpxchg64(sptep, iter->old_spte, new_spte);
>         if (old_spte != iter->old_spte) {
>                 /*
>                  * The cmpxchg failed because the spte was updated by another
>
> > +     if (old_spte != iter->old_spte) {
> > +             /*
> > +              * The cmpxchg failed because the spte was updated by another
> > +              * thread so record the updated spte in old_spte.
> > +              */
>
> Hmm, this is a bit awkward.  I think it's the double use of "updated" and the
> somewhat ambiguous reference to "old_spte".  I'd also avoid "thread", as this
> requires interference from not only a different task, but a different logical CPU
> since iter->old_spte is refreshed if mmu_lock is dropped and reacquired.  And
> "record" is an odd choice of word since it sounds like storing the current value
> is only for logging/debugging.
>
> Something like this?
>
>                 /*
>                  * The entry was modified by a different logical CPU, refresh
>                  * iter->old_spte with the current value so the caller operates
>                  * on fresh data, e.g. if it retries tdp_mmu_set_spte_atomic().
>                  */
>
> Nits aside,
>
> Reviewed-by: Sean Christopherson <seanjc@google.com>

Thanks for the review. I'll incorporate these into v2 (which I'm
holding off on until you have a chance to finish reviewing v1).

>
> > +             iter->old_spte = old_spte;
> >               return false;
> > +     }
> >
> >       __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> >                             new_spte, iter->level, true);

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table
  2021-12-13 22:59 ` [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table David Matlack
  2022-01-04 10:32   ` Peter Xu
@ 2022-01-06 20:12   ` Sean Christopherson
  2022-01-06 22:56     ` David Matlack
  1 sibling, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2022-01-06 20:12 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021, David Matlack wrote:
> Factor out the logic to atomically replace an SPTE with an SPTE that
> points to a new page table. This will be used in a follow-up commit to
> split a large page SPTE into one level lower.
> 
> Opportunistically drop the kvm_mmu_get_page tracepoint in
> kvm_tdp_mmu_map() since it is redundant with the identical tracepoint in
> alloc_tdp_mmu_page().
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 48 +++++++++++++++++++++++++++-----------
>  1 file changed, 34 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 656ebf5b20dc..dbd07c10d11a 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -950,6 +950,36 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>  	return ret;
>  }
>  
> +/*
> + * tdp_mmu_install_sp_atomic - Atomically replace the given spte with an
> + * spte pointing to the provided page table.
> + *
> + * @kvm: kvm instance
> + * @iter: a tdp_iter instance currently on the SPTE that should be set
> + * @sp: The new TDP page table to install.
> + * @account_nx: True if this page table is being installed to split a
> + *              non-executable huge page.
> + *
> + * Returns: True if the new page table was installed. False if spte being
> + *          replaced changed, causing the atomic compare-exchange to fail.

I'd prefer to return an int with 0/-EBUSY on success/fail.  Ditto for the existing
tdp_mmu_set_spte_atomic().  Actually, if you add a prep patch to make that happen,
then this can be:

	u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
	int ret;

	ret = tdp_mmu_set_spte_atomic(kvm, iter, spte);
	if (ret)
		return ret;

	tdp_mmu_link_page(kvm, sp, account_nx);
	return 0;
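
and the call site in kvm_tdp_mmu_map() collapses to something like (untested,
keeping the name from this patch):

			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
			if (tdp_mmu_install_sp_atomic(vcpu->kvm, &iter, sp, account_nx)) {
				tdp_mmu_free_sp(sp);
				break;
			}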



> + *          If this function returns false the sp will be freed before
> + *          returning.

Uh, no it's not?  The call to tdp_mmu_free_sp() is still done by kvm_tdp_mmu_map().

> + */
> +static bool tdp_mmu_install_sp_atomic(struct kvm *kvm,

Hmm, so this helper is the only user of tdp_mmu_link_page(), and _that_ helper
is rather tiny.  And this would also be a good opportunity to clean up the
"(un)link_page" verbiage, as the bare "page" doesn't communicate to the reader
that it's for linking shadow pages, e.g. not struct page.

So, what about folding in tdp_mmu_link_page(), naming this helper either
tdp_mmu_link_sp_atomic() or tdp_mmu_link_shadow_page_atomic(), and then renaming
tdp_mmu_unlink_page() accordingly?  And for bonus points, add a blurb in the
function comment like:

	* Note the lack of a non-atomic variant!  The TDP MMU always builds its
	* page tables while holding mmu_lock for read.
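
Putting the rename, the fold, and the int return together, something like the
below (untested; assumes tdp_mmu_link_page() still just takes
tdp_mmu_pages_lock, adds the page to tdp_mmu_pages, and does the NX
accounting):

  static int tdp_mmu_link_sp_atomic(struct kvm *kvm, struct tdp_iter *iter,
				    struct kvm_mmu_page *sp, bool account_nx)
  {
	u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
	int ret;

	ret = tdp_mmu_set_spte_atomic(kvm, iter, spte);
	if (ret)
		return ret;

	/* This is the guts of the old tdp_mmu_link_page(), folded in. */
	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
	if (account_nx)
		account_huge_nx_page(kvm, sp);
	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

	return 0;
  }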

> +				      struct tdp_iter *iter,
> +				      struct kvm_mmu_page *sp,
> +				      bool account_nx)
> +{
> +	u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
> +
> +	if (!tdp_mmu_set_spte_atomic(kvm, iter, spte))
> +		return false;
> +
> +	tdp_mmu_link_page(kvm, sp, account_nx);
> +
> +	return true;
> +}
> +
>  /*
>   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
>   * page tables and SPTEs to translate the faulting guest physical address.
> @@ -959,8 +989,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
>  	struct tdp_iter iter;
>  	struct kvm_mmu_page *sp;
> -	u64 *child_pt;
> -	u64 new_spte;
>  	int ret;
>  
>  	kvm_mmu_hugepage_adjust(vcpu, fault);
> @@ -996,6 +1024,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  		}
>  
>  		if (!is_shadow_present_pte(iter.old_spte)) {
> +			bool account_nx = fault->huge_page_disallowed &&
> +					  fault->req_level >= iter.level;
> +
>  			/*
>  			 * If SPTE has been frozen by another thread, just
>  			 * give up and retry, avoiding unnecessary page table
> @@ -1005,18 +1036,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  				break;
>  
>  			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
> -			child_pt = sp->spt;
> -
> -			new_spte = make_nonleaf_spte(child_pt,
> -						     !shadow_accessed_mask);
> -
> -			if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
> -				tdp_mmu_link_page(vcpu->kvm, sp,
> -						  fault->huge_page_disallowed &&
> -						  fault->req_level >= iter.level);
> -
> -				trace_kvm_mmu_get_page(sp, true);
> -			} else {
> +			if (!tdp_mmu_install_sp_atomic(vcpu->kvm, &iter, sp, account_nx)) {
>  				tdp_mmu_free_sp(sp);
>  				break;
>  			}
> -- 
> 2.34.1.173.g76aa8bc2d0-goog
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 05/13] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
  2021-12-13 22:59 ` [PATCH v1 05/13] KVM: x86/mmu: Move restore_acc_track_spte to spte.c David Matlack
  2022-01-04 10:33   ` Peter Xu
@ 2022-01-06 20:27   ` Sean Christopherson
  2022-01-06 22:58     ` David Matlack
  1 sibling, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2022-01-06 20:27 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021, David Matlack wrote:
> restore_acc_track_spte is purely an SPTE manipulation, making it a good
> fit for spte.c. It is also needed in spte.c in a follow-up commit so we
> can construct child SPTEs during large page splitting.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c  | 18 ------------------
>  arch/x86/kvm/mmu/spte.c | 18 ++++++++++++++++++
>  arch/x86/kvm/mmu/spte.h |  1 +
>  3 files changed, 19 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8b702f2b6a70..3c2cb4dd1f11 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -646,24 +646,6 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
>  	return __get_spte_lockless(sptep);
>  }
>  
> -/* Restore an acc-track PTE back to a regular PTE */
> -static u64 restore_acc_track_spte(u64 spte)
> -{
> -	u64 new_spte = spte;
> -	u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
> -			 & SHADOW_ACC_TRACK_SAVED_BITS_MASK;
> -
> -	WARN_ON_ONCE(spte_ad_enabled(spte));
> -	WARN_ON_ONCE(!is_access_track_spte(spte));
> -
> -	new_spte &= ~shadow_acc_track_mask;
> -	new_spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
> -		      SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
> -	new_spte |= saved_bits;
> -
> -	return new_spte;
> -}
> -
>  /* Returns the Accessed status of the PTE and resets it at the same time. */
>  static bool mmu_spte_age(u64 *sptep)
>  {
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 8a7b03207762..fd34ae5d6940 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -268,6 +268,24 @@ u64 mark_spte_for_access_track(u64 spte)
>  	return spte;
>  }
>  
> +/* Restore an acc-track PTE back to a regular PTE */
> +u64 restore_acc_track_spte(u64 spte)
> +{
> +	u64 new_spte = spte;
> +	u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
> +			 & SHADOW_ACC_TRACK_SAVED_BITS_MASK;

Obviously not your code, but this could be:

	u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) &
			 SHADOW_ACC_TRACK_SAVED_BITS_MASK;

	WARN_ON_ONCE(spte_ad_enabled(spte));
	WARN_ON_ONCE(!is_access_track_spte(spte));

	spte &= ~shadow_acc_track_mask;
	spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
		  SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
	spte |= saved_bits;

	return spte;

which is really just an excuse to move the ampersand up a line :-)

And looking at the two callers, the WARNs are rather silly.  The spte_ad_enabled()
WARN is especially pointless because that's also checked by is_access_track_spte().
I like paranoid WARNs as much as anyone, but I don't see why this code warrants
extra checking relative to the other SPTE helpers that have more subtle requirements.

At that point, maybe make this an inline helper?

  static inline u64 restore_acc_track_spte(u64 spte)
  {
	u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) &
			 SHADOW_ACC_TRACK_SAVED_BITS_MASK;

	spte &= ~shadow_acc_track_mask;
	spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
		  SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
	spte |= saved_bits;

	return spte;
  }

> +	WARN_ON_ONCE(spte_ad_enabled(spte));
> +	WARN_ON_ONCE(!is_access_track_spte(spte));
> +
> +	new_spte &= ~shadow_acc_track_mask;
> +	new_spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
> +		      SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
> +	new_spte |= saved_bits;
> +
> +	return new_spte;
> +}
> +
>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
>  {
>  	BUG_ON((u64)(unsigned)access_mask != access_mask);
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index a4af2a42695c..9b0c7b27f23f 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -337,6 +337,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
>  u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
>  u64 mark_spte_for_access_track(u64 spte);
> +u64 restore_acc_track_spte(u64 spte);
>  u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
>  
>  void kvm_mmu_reset_all_pte_masks(void);
> -- 
> 2.34.1.173.g76aa8bc2d0-goog
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 06/13] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
  2021-12-13 22:59 ` [PATCH v1 06/13] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root David Matlack
  2022-01-04 10:35   ` Peter Xu
@ 2022-01-06 20:34   ` Sean Christopherson
  2022-01-06 22:57     ` David Matlack
  1 sibling, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2022-01-06 20:34 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021, David Matlack wrote:
> Instead of passing a pointer to the root page table and the root level
> separately, pass in a pointer to the kvm_mmu_page that backs the root.
> This reduces the number of arguments by 1, cutting down on line lengths.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_iter.c |  5 ++++-
>  arch/x86/kvm/mmu/tdp_iter.h | 10 +++++-----
>  arch/x86/kvm/mmu/tdp_mmu.c  | 14 +++++---------
>  3 files changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index b3ed302c1a35..92b3a075525a 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -39,9 +39,12 @@ void tdp_iter_restart(struct tdp_iter *iter)
>   * Sets a TDP iterator to walk a pre-order traversal of the paging structure
>   * rooted at root_pt, starting with the walk to translate next_last_level_gfn.
>   */
> -void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> +void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
>  		    int min_level, gfn_t next_last_level_gfn)
>  {
> +	u64 *root_pt = root->spt;
> +	int root_level = root->role.level;

Uber nit, arch/x86/ prefers reverse fir tree, though I've yet to get Paolo fully
on board :-)

But looking at the usage of root_pt, even better would be

  void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
		      int min_level, gfn_t next_last_level_gfn)
  {
	int root_level = root->role.level;

	WARN_ON(root_level < 1);
	WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);

	iter->next_last_level_gfn = next_last_level_gfn;
	iter->root_level = root_level;
	iter->min_level = min_level;
	iter->pt_path[iter->root_level - 1] = (tdp_ptep_t)root->spt;
	iter->as_id = kvm_mmu_page_as_id(root);

	tdp_iter_restart(iter);
  }

to avoid the pointless sptep_to_sp() and eliminate the motivation for root_pt.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 07/13] KVM: x86/mmu: Derive page role from parent
  2021-12-13 22:59 ` [PATCH v1 07/13] KVM: x86/mmu: Derive page role from parent David Matlack
  2022-01-05  7:51   ` Peter Xu
@ 2022-01-06 20:45   ` Sean Christopherson
  2022-01-06 23:00     ` David Matlack
  1 sibling, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2022-01-06 20:45 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

Please include "TDP MMU" somewhere in the shortlog.  It's a nice-to-have, e.g. not
worth forcing if there's more interesting info to put in the shortlog, but in this
case there are plenty of chars to go around.  E.g.

  KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent

On Mon, Dec 13, 2021, David Matlack wrote:
> Derive the page role from the parent shadow page, since the only thing
> that changes is the level. This is in preparation for eagerly splitting
> large pages during VM-ioctls which does not have access to the vCPU

s/does/do since VM-ioctls is plural.

> MMU context.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 43 ++++++++++++++++++++------------------
>  1 file changed, 23 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 2fb2d7677fbf..582d9a798899 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -157,23 +157,8 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>  		if (kvm_mmu_page_as_id(_root) != _as_id) {		\
>  		} else
>  
> -static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
> -						   int level)
> -{
> -	union kvm_mmu_page_role role;
> -
> -	role = vcpu->arch.mmu->mmu_role.base;
> -	role.level = level;
> -	role.direct = true;
> -	role.has_4_byte_gpte = false;
> -	role.access = ACC_ALL;
> -	role.ad_disabled = !shadow_accessed_mask;
> -
> -	return role;
> -}
> -
>  static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> -					       int level)
> +					       union kvm_mmu_page_role role)
>  {
>  	struct kvm_mmu_page *sp;
>  
> @@ -181,7 +166,7 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
>  	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>  
> -	sp->role.word = page_role_for_level(vcpu, level).word;
> +	sp->role = role;
>  	sp->gfn = gfn;
>  	sp->tdp_mmu_page = true;
>  
> @@ -190,6 +175,19 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  	return sp;
>  }
>  
> +static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, struct tdp_iter *iter)

Newline please, this is well over 80 chars.

static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu,
						     struct tdp_iter *iter)
> +{
> +	struct kvm_mmu_page *parent_sp;
> +	union kvm_mmu_page_role role;
> +
> +	parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
> +
> +	role = parent_sp->role;
> +	role.level--;
> +
> +	return alloc_tdp_mmu_page(vcpu, iter->gfn, role);
> +}
> +
>  hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>  {
>  	union kvm_mmu_page_role role;
> @@ -198,7 +196,12 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>  
>  	lockdep_assert_held_write(&kvm->mmu_lock);
>  
> -	role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
> +	role = vcpu->arch.mmu->mmu_role.base;
> +	role.level = vcpu->arch.mmu->shadow_root_level;
> +	role.direct = true;
> +	role.has_4_byte_gpte = false;
> +	role.access = ACC_ALL;
> +	role.ad_disabled = !shadow_accessed_mask;

Hmm, so _all_ of this is unnecessary, i.e. this can simply be:

	role = vcpu->arch.mmu->mmu_role.base;

Probably better to handle everything except .level in a separate prep commit.

I'm not worried about the cost, I want to avoid potential confusion as to why the
TDP MMU is apparently "overriding" these fields.
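
With that prep commit in place, the root case would collapse to something
like (untested):

	role = vcpu->arch.mmu->mmu_role.base;
	role.level = vcpu->arch.mmu->shadow_root_level;

	root = alloc_tdp_mmu_page(vcpu, 0, role);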

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 08/13] KVM: x86/mmu: Refactor TDP MMU child page initialization
  2021-12-13 22:59 ` [PATCH v1 08/13] KVM: x86/mmu: Refactor TDP MMU child page initialization David Matlack
  2022-01-05  7:51   ` Peter Xu
@ 2022-01-06 20:59   ` Sean Christopherson
  2022-01-06 22:08     ` David Matlack
  1 sibling, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2022-01-06 20:59 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021, David Matlack wrote:
> Separate the allocation of child pages from the initialization. This is

"from their initialization" so that it's not a dangling sentence.

> in preparation for doing page splitting outside of the vCPU fault
> context which requires a different allocation mechanism.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 30 +++++++++++++++++++++++-------
>  1 file changed, 23 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 582d9a798899..a8354d8578f1 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -157,13 +157,18 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>  		if (kvm_mmu_page_as_id(_root) != _as_id) {		\
>  		} else
>  
> -static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> -					       union kvm_mmu_page_role role)
> +static struct kvm_mmu_page *alloc_tdp_mmu_page_from_caches(struct kvm_vcpu *vcpu)

Hrm, this ends up being a rather poor name because the "from_kernel" variant also
allocates from a cache, it's just a different cache:

  static struct kvm_mmu_page *alloc_tdp_mmu_page_from_kernel(gfp_t gfp)
  {
	struct kvm_mmu_page *sp;

	gfp |= __GFP_ZERO;

	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
	if (!sp)
		return NULL;

	...
  }

Given that the !vcpu path is the odd one, and the only user of the from_kernel
variant is the split, maybe this?  I.e. punt on naming until another user of the
"split" variant comes along.

  static struct kvm_mmu_page *__alloc_tdp_mmu_page(struct kvm_vcpu *vcpu)

and

  static struct kvm_mmu_page *__alloc_tdp_mmu_page_for_split(gfp_t gfp)

>  {
>  	struct kvm_mmu_page *sp;
>  
>  	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
>  	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> +
> +	return sp;
> +}
> +
> +static void init_tdp_mmu_page(struct kvm_mmu_page *sp, gfn_t gfn, union kvm_mmu_page_role role)

Newline.  I'm all in favor of running over when doing so improves readability, but
that's not the case here.

> +{
>  	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>  
>  	sp->role = role;
> @@ -171,11 +176,9 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  	sp->tdp_mmu_page = true;
>  
>  	trace_kvm_mmu_get_page(sp, true);
> -
> -	return sp;
>  }
>  
> -static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, struct tdp_iter *iter)
> +static void init_child_tdp_mmu_page(struct kvm_mmu_page *child_sp, struct tdp_iter *iter)

Newline.

>  {
>  	struct kvm_mmu_page *parent_sp;
>  	union kvm_mmu_page_role role;
> @@ -185,7 +188,17 @@ static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, stru
>  	role = parent_sp->role;
>  	role.level--;
>  
> -	return alloc_tdp_mmu_page(vcpu, iter->gfn, role);
> +	init_tdp_mmu_page(child_sp, iter->gfn, role);
> +}
> +
> +static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, struct tdp_iter *iter)

Newline.

> +{
> +	struct kvm_mmu_page *child_sp;
> +
> +	child_sp = alloc_tdp_mmu_page_from_caches(vcpu);
> +	init_child_tdp_mmu_page(child_sp, iter);
> +
> +	return child_sp;
>  }
>  
>  hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> @@ -210,7 +223,10 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>  			goto out;
>  	}
>  
> -	root = alloc_tdp_mmu_page(vcpu, 0, role);
> +	root = alloc_tdp_mmu_page_from_caches(vcpu);
> +
> +	init_tdp_mmu_page(root, 0, role);
> +
>  	refcount_set(&root->tdp_mmu_root_count, 1);
>  
>  	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> -- 
> 2.34.1.173.g76aa8bc2d0-goog
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled
  2021-12-13 22:59 ` [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled David Matlack
  2022-01-05  7:54   ` Peter Xu
@ 2022-01-06 21:28   ` Sean Christopherson
  2022-01-06 22:20     ` David Matlack
  2022-01-07  2:06   ` Peter Xu
  2 siblings, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2022-01-06 21:28 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021, David Matlack wrote:
> When dirty logging is enabled without initially-all-set, attempt to
> split all huge pages in the memslot down to 4KB pages so that vCPUs
> do not have to take expensive write-protection faults to split huge
> pages.
> 
> Huge page splitting is best-effort only. This commit only adds the
> support for the TDP MMU, and even there splitting may fail due to out
> of memory conditions. Failure to split a huge page is fine from a
> correctness standpoint because we still always follow it up by write-
> protecting any remaining huge pages.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   3 +
>  arch/x86/kvm/mmu/mmu.c          |  14 +++
>  arch/x86/kvm/mmu/spte.c         |  59 ++++++++++++
>  arch/x86/kvm/mmu/spte.h         |   1 +
>  arch/x86/kvm/mmu/tdp_mmu.c      | 165 ++++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.h      |   5 +
>  arch/x86/kvm/x86.c              |  10 ++
>  7 files changed, 257 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e863d569c89a..4a507109e886 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1573,6 +1573,9 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  				      const struct kvm_memory_slot *memslot,
>  				      int start_level);
> +void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
> +				       const struct kvm_memory_slot *memslot,
> +				       int target_level);
>  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>  				   const struct kvm_memory_slot *memslot);
>  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c2cb4dd1f11..9116c6a4ced1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5807,6 +5807,20 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
>  }
>  
> +void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
> +				       const struct kvm_memory_slot *memslot,
> +				       int target_level)
> +{
> +	u64 start = memslot->base_gfn;
> +	u64 end = start + memslot->npages;
> +
> +	if (is_tdp_mmu_enabled(kvm)) {
> +		read_lock(&kvm->mmu_lock);
> +		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
> +		read_unlock(&kvm->mmu_lock);
> +	}
> +}
> +
>  static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  					 struct kvm_rmap_head *rmap_head,
>  					 const struct kvm_memory_slot *slot)
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index fd34ae5d6940..11d0b3993ba5 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -191,6 +191,65 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	return wrprot;
>  }
>  
> +static u64 mark_spte_executable(u64 spte)
> +{
> +	bool is_access_track = is_access_track_spte(spte);
> +
> +	if (is_access_track)
> +		spte = restore_acc_track_spte(spte);
> +
> +	spte &= ~shadow_nx_mask;
> +	spte |= shadow_x_mask;
> +
> +	if (is_access_track)
> +		spte = mark_spte_for_access_track(spte);
> +
> +	return spte;
> +}
> +
> +/*
> + * Construct an SPTE that maps a sub-page of the given huge page SPTE where
> + * `index` identifies which sub-page.
> + *
> + * This is used during huge page splitting to build the SPTEs that make up the

Nit, to be consistent with the kernel, s/huge page/hugepage.

> + * new page table.
> + */
> +u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index, unsigned int access)

Newline.  Actually, just drop @access since there's exactly one caller that
unconditionally passes ACC_ALL.

> +{
> +	u64 child_spte;
> +	int child_level;
> +
> +	if (WARN_ON(is_mmio_spte(huge_spte)))
> +		return 0;

Unnecessary, this is covered by is_shadow_present_pte().

> +
> +	if (WARN_ON(!is_shadow_present_pte(huge_spte)))
> +		return 0;
> +
> +	if (WARN_ON(!is_large_pte(huge_spte)))

Probably best to make these WARN_ON_ONCE, I gotta imagine if KVM screws up then this
will flood dmesg.

> +		return 0;
> +
> +	child_spte = huge_spte;
> +	child_level = huge_level - 1;
> +
> +	/*
> +	 * The child_spte already has the base address of the huge page being
> +	 * split. So we just have to OR in the offset to the page at the next
> +	 * lower level for the given index.
> +	 */
> +	child_spte |= (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;
> +
> +	if (child_level == PG_LEVEL_4K) {
> +		child_spte &= ~PT_PAGE_SIZE_MASK;
> +
> +		/* Allow execution for 4K pages if it was disabled for NX HugePages. */

Nit, this just reiterates the "what".  Even though the "why" is fairly obvious,
maybe this instead?

		/*
		 * When splitting to a 4K page, make the new SPTE executable as
		 * the NX hugepage mitigation no longer applies.
		 */
		 
> +		if (is_nx_huge_page_enabled() && access & ACC_EXEC_MASK)

Because the caller always passes @access=ACC_ALL, the "access & ACC_EXEC_MASK"
part goes away (which addresses other good feedback about parentheses).

> +			child_spte = mark_spte_executable(child_spte);

Maybe s/mark/make?  KVM usually uses "mark" when making modifications for tracking
purposes.  This is simply making a SPTE executable, there's no tracking involved.

> +	}
> +
> +	return child_spte;
> +}
> +
> +
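
FWIW, an untested sketch with the above nits folded in (@access dropped,
WARN_ON_ONCE, and make_spte_executable() being the renamed
mark_spte_executable()):

  u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
  {
	u64 child_spte;
	int child_level;

	if (WARN_ON_ONCE(!is_shadow_present_pte(huge_spte)))
		return 0;

	if (WARN_ON_ONCE(!is_large_pte(huge_spte)))
		return 0;

	child_spte = huge_spte;
	child_level = huge_level - 1;

	/*
	 * The child_spte already has the base address of the huge page being
	 * split.  So we just have to OR in the offset to the page at the next
	 * lower level for the given index.
	 */
	child_spte |= (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;

	if (child_level == PG_LEVEL_4K) {
		child_spte &= ~PT_PAGE_SIZE_MASK;

		/*
		 * When splitting to a 4K page, make the new SPTE executable as
		 * the NX hugepage mitigation no longer applies.
		 */
		if (is_nx_huge_page_enabled())
			child_spte = make_spte_executable(child_spte);
	}

	return child_spte;
  }
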
>  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
>  {
>  	u64 spte = SPTE_MMU_PRESENT_MASK;
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 9b0c7b27f23f..e13f335b4fef 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -334,6 +334,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
>  	       u64 old_spte, bool prefetch, bool can_unsync,
>  	       bool host_writable, u64 *new_spte);
> +u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index, unsigned int access);
>  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
>  u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
>  u64 mark_spte_for_access_track(u64 spte);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index a8354d8578f1..be5eb74ac053 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1264,6 +1264,171 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
>  	return spte_set;
>  }
>  
> +static struct kvm_mmu_page *alloc_tdp_mmu_page_from_kernel(gfp_t gfp)
> +{
> +	struct kvm_mmu_page *sp;
> +
> +	gfp |= __GFP_ZERO;
> +
> +	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
> +	if (!sp)
> +		return NULL;
> +
> +	sp->spt = (void *)__get_free_page(gfp);
> +	if (!sp->spt) {
> +		kmem_cache_free(mmu_page_header_cache, sp);
> +		return NULL;
> +	}
> +
> +	return sp;
> +}
> +
> +static struct kvm_mmu_page *alloc_tdp_mmu_page_for_split(struct kvm *kvm, bool *dropped_lock)
> +{
> +	struct kvm_mmu_page *sp;
> +
> +	lockdep_assert_held_read(&kvm->mmu_lock);
> +
> +	*dropped_lock = false;
> +
> +	/*
> +	 * Since we are allocating while under the MMU lock we have to be
> +	 * careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct
> +	 * reclaim and to avoid making any filesystem callbacks (which can end
> +	 * up invoking KVM MMU notifiers, resulting in a deadlock).
> +	 *
> +	 * If this allocation fails we drop the lock and retry with reclaim
> +	 * allowed.
> +	 */
> +	sp = alloc_tdp_mmu_page_from_kernel(GFP_NOWAIT | __GFP_ACCOUNT);
> +	if (sp)
> +		return sp;
> +
> +	rcu_read_unlock();
> +	read_unlock(&kvm->mmu_lock);
> +
> +	*dropped_lock = true;
> +
> +	sp = alloc_tdp_mmu_page_from_kernel(GFP_KERNEL_ACCOUNT);
> +
> +	read_lock(&kvm->mmu_lock);
> +	rcu_read_lock();
> +
> +	return sp;
> +}
> +
> +static bool

Please put the return type and attributes on the same line as the function name,
splitting them makes grep sad.  And separating these rarely saves more than a line,
e.g. even with conservative wrapping, this goes from 2=>3 lines.

static bool tdp_mmu_split_huge_page_atomic(struct kvm *kvm,
					   struct tdp_iter *iter,
					   struct kvm_mmu_page *sp)

And it doesn't save anything if we want to run over by a few chars, which IMO
isn't worth it in this case, but that's not a sticking point.

static bool tdp_mmu_split_huge_page_atomic(struct kvm *kvm, struct tdp_iter *iter,
					   struct kvm_mmu_page *sp)

> +tdp_mmu_split_huge_page_atomic(struct kvm *kvm, struct tdp_iter *iter, struct kvm_mmu_page *sp)
> +{
> +	const u64 huge_spte = iter->old_spte;
> +	const int level = iter->level;
> +	u64 child_spte;
> +	int i;
> +
> +	init_child_tdp_mmu_page(sp, iter);
> +
> +	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> +		child_spte = make_huge_page_split_spte(huge_spte, level, i, ACC_ALL);

No need for child_spte, not saving any chars versus setting sp->spt[i] directly.
And then if you hoist the comment above the for-loop you can drop the braces and
shave a line from the comment:

	/*
	 * No need for atomics since child_sp has not been installed in the
	 * table yet and thus is not reachable by any other thread.
	 */
	for (i = 0; i < PT64_ENT_PER_PAGE; i++)
		sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i, ACC_ALL);

> +
> +		/*
> +		 * No need for atomics since child_sp has not been installed
> +		 * in the table yet and thus is not reachable by any other
> +		 * thread.
> +		 */
> +		sp->spt[i] = child_spte;
> +	}
> +
> +	if (!tdp_mmu_install_sp_atomic(kvm, iter, sp, false))
> +		return false;
> +
> +	/*
> +	 * tdp_mmu_install_sp_atomic will handle subtracting the split huge

Please add () when referencing functions, it helps readers visually identify that
the comment refers to a different function.

> +	 * page from stats, but we have to manually update the new present child
> +	 * pages.
> +	 */
> +	kvm_update_page_stats(kvm, level - 1, PT64_ENT_PER_PAGE);
> +
> +	return true;
> +}
> +
> +static int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
> +					 gfn_t start, gfn_t end, int target_level)
> +{
> +	struct kvm_mmu_page *sp = NULL;
> +	struct tdp_iter iter;
> +	bool dropped_lock;
> +
> +	rcu_read_lock();
> +
> +	/*
> +	 * Traverse the page table splitting all huge pages above the target
> +	 * level into one lower level. For example, if we encounter a 1GB page
> +	 * we split it into 512 2MB pages.
> +	 *
> +	 * Since the TDP iterator uses a pre-order traversal, we are guaranteed
> +	 * to visit an SPTE before ever visiting its children, which means we
> +	 * will correctly recursively split huge pages that are more than one
> +	 * level above the target level (e.g. splitting 1GB to 2MB to 4KB).
> +	 */
> +	for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
> +retry:
> +		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> +			continue;
> +
> +		if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
> +			continue;
> +
> +		if (!sp) {
> +			sp = alloc_tdp_mmu_page_for_split(kvm, &dropped_lock);
> +			if (!sp)
> +				return -ENOMEM;
> +
> +			if (dropped_lock) {
> +				tdp_iter_restart(&iter);

Ah, this needs to be rebased to play nice with commit 3a0f64de479c ("KVM: x86/mmu:
Don't advance iterator after restart due to yielding").

With that in play, I think the best approach would be to drop dropped_lock (ha!)
and just do:

			sp = alloc_tdp_mmu_page_for_split(kvm, &iter);
			if (!sp)
				return -ENOMEM;

			if (iter.yielded)
				continue;
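
e.g. with the allocation helper setting iter->yielded itself when it has to
drop mmu_lock, something like (untested):

  static struct kvm_mmu_page *alloc_tdp_mmu_page_for_split(struct kvm *kvm,
							   struct tdp_iter *iter)
  {
	struct kvm_mmu_page *sp;

	lockdep_assert_held_read(&kvm->mmu_lock);

	/* Try to allocate without dropping mmu_lock, see the GFP_NOWAIT comment. */
	sp = alloc_tdp_mmu_page_from_kernel(GFP_NOWAIT | __GFP_ACCOUNT);
	if (sp)
		return sp;

	rcu_read_unlock();
	read_unlock(&kvm->mmu_lock);

	/*
	 * Dropping mmu_lock means the iterator must restart, exactly as if
	 * tdp_mmu_iter_cond_resched() had yielded.
	 */
	iter->yielded = true;

	sp = alloc_tdp_mmu_page_from_kernel(GFP_KERNEL_ACCOUNT);

	read_lock(&kvm->mmu_lock);
	rcu_read_lock();

	return sp;
  }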

> +				continue;
> +			}
> +		}
> +
> +		if (!tdp_mmu_split_huge_page_atomic(kvm, &iter, sp))
> +			goto retry;
> +
> +		sp = NULL;
> +	}
> +
> +	/*
> +	 * It's possible to exit the loop having never used the last sp if, for
> +	 * example, a vCPU doing HugePage NX splitting wins the race and
> +	 * installs its own sp in place of the last sp we tried to split.
> +	 */
> +	if (sp)
> +		tdp_mmu_free_sp(sp);
> +
> +	rcu_read_unlock();

Uber nit, an unused sp can be freed after dropping RCU.
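
i.e. same code, just reordered:

	rcu_read_unlock();

	/*
	 * It's possible to exit the loop having never used the last sp if, for
	 * example, a vCPU doing HugePage NX splitting wins the race and
	 * installs its own sp in place of the last sp we tried to split.
	 */
	if (sp)
		tdp_mmu_free_sp(sp);

	return 0;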

> +
> +	return 0;
> +}
> +
> +int kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
> +				     const struct kvm_memory_slot *slot,
> +				     gfn_t start, gfn_t end,
> +				     int target_level)
> +{
> +	struct kvm_mmu_page *root;
> +	int r = 0;
> +
> +	lockdep_assert_held_read(&kvm->mmu_lock);
> +
> +	for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) {
> +		r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level);
> +		if (r) {
> +			kvm_tdp_mmu_put_root(kvm, root, true);
> +			break;
> +		}
> +	}
> +
> +	return r;
> +}
> +
>  /*
>   * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
>   * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 3899004a5d91..3557a7fcf927 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -71,6 +71,11 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>  				   struct kvm_memory_slot *slot, gfn_t gfn,
>  				   int min_level);
>  
> +int kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
> +				     const struct kvm_memory_slot *slot,
> +				     gfn_t start, gfn_t end,
> +				     int target_level);
> +
>  static inline void kvm_tdp_mmu_walk_lockless_begin(void)
>  {
>  	rcu_read_lock();
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 85127b3e3690..fb5592bf2eee 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -187,6 +187,9 @@ module_param(force_emulation_prefix, bool, S_IRUGO);
>  int __read_mostly pi_inject_timer = -1;
>  module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);
>  
> +static bool __read_mostly eagerly_split_huge_pages_for_dirty_logging = true;
> +module_param(eagerly_split_huge_pages_for_dirty_logging, bool, 0644);

Heh, can we use a shorter name for the module param?  There's 0% chance I'll ever
type that correctly.  Maybe eager_hugepage_splitting?  Though even that is a bit
too long for my tastes.

This should also be documented somewhere, which is where we can/should explain
exactly what the param does instead of trying to shove all that info into the name.
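
e.g. something like this, name purely illustrative:

	static bool __read_mostly eager_page_split = true;
	module_param(eager_page_split, bool, 0644);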

> +
>  /*
>   * Restoring the host value for MSRs that are only consumed when running in
>   * usermode, e.g. SYSCALL MSRs and TSC_AUX, can be deferred until the CPU
> @@ -11837,6 +11840,13 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
>  		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
>  			return;
>  
> +		/*
> +		 * Attempt to split all large pages into 4K pages so that vCPUs
> +		 * do not have to take write-protection faults.
> +		 */
> +		if (READ_ONCE(eagerly_split_huge_pages_for_dirty_logging))
> +			kvm_mmu_slot_try_split_huge_pages(kvm, new, PG_LEVEL_4K);
> +
>  		if (kvm_x86_ops.cpu_dirty_log_size) {
>  			kvm_mmu_slot_leaf_clear_dirty(kvm, new);
>  			kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M);
> -- 
> 2.34.1.173.g76aa8bc2d0-goog
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 08/13] KVM: x86/mmu: Refactor TDP MMU child page initialization
  2022-01-06 20:59   ` Sean Christopherson
@ 2022-01-06 22:08     ` David Matlack
  2022-01-06 23:02       ` Sean Christopherson
  0 siblings, 1 reply; 55+ messages in thread
From: David Matlack @ 2022-01-06 22:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Thu, Jan 6, 2022 at 12:59 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Dec 13, 2021, David Matlack wrote:
> > Separate the allocation of child pages from the initialization. This is
>
> "from their initialization" so that it's not a dangling sentence.
>
> > in preparation for doing page splitting outside of the vCPU fault
> > context which requires a different allocation mechanism.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 30 +++++++++++++++++++++++-------
> >  1 file changed, 23 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 582d9a798899..a8354d8578f1 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -157,13 +157,18 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> >               if (kvm_mmu_page_as_id(_root) != _as_id) {              \
> >               } else
> >
> > -static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > -                                            union kvm_mmu_page_role role)
> > +static struct kvm_mmu_page *alloc_tdp_mmu_page_from_caches(struct kvm_vcpu *vcpu)
>
> Hrm, this ends up being a rather poor name because the "from_kernel" variant also
> allocates from a cache, it's just a different cache:
>
>   static struct kvm_mmu_page *alloc_tdp_mmu_page_from_kernel(gfp_t gfp)
>   {
>         struct kvm_mmu_page *sp;
>
>         gfp |= __GFP_ZERO;
>
>         sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
>         if (!sp)
>                 return NULL;
>
>         ...
>   }
>
> Given that the !vcpu path is the odd one, and the only user of the from_kernel
> variant is the split, maybe this?  I.e. punt on naming until another user of the
> "split" variant comes along.
>
>   static struct kvm_mmu_page *__alloc_tdp_mmu_page(struct kvm_vcpu *vcpu)
>
> and
>
>   static struct kvm_mmu_page *__alloc_tdp_mmu_page_for_split(gfp_t gfp)

Will do.

>
> >  {
> >       struct kvm_mmu_page *sp;
> >
> >       sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> >       sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > +
> > +     return sp;
> > +}
> > +
> > +static void init_tdp_mmu_page(struct kvm_mmu_page *sp, gfn_t gfn, union kvm_mmu_page_role role)
>
> Newline.  I'm all in favor of running over when doing so improves readability, but
> that's not the case here.

Ah shoot. I had configured my editor to use a 100 char line limit for
kernel code, but reading the kernel style guide more closely I see
that 80 is still the preferred limit. I'll go back to preferring 80 and
only go over when it explicitly makes the code more readable.


>
> > +{
> >       set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> >
> >       sp->role = role;
> > @@ -171,11 +176,9 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> >       sp->tdp_mmu_page = true;
> >
> >       trace_kvm_mmu_get_page(sp, true);
> > -
> > -     return sp;
> >  }
> >
> > -static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, struct tdp_iter *iter)
> > +static void init_child_tdp_mmu_page(struct kvm_mmu_page *child_sp, struct tdp_iter *iter)
>
> Newline.
>
> >  {
> >       struct kvm_mmu_page *parent_sp;
> >       union kvm_mmu_page_role role;
> > @@ -185,7 +188,17 @@ static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, stru
> >       role = parent_sp->role;
> >       role.level--;
> >
> > -     return alloc_tdp_mmu_page(vcpu, iter->gfn, role);
> > +     init_tdp_mmu_page(child_sp, iter->gfn, role);
> > +}
> > +
> > +static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, struct tdp_iter *iter)
>
> Newline.
>
> > +{
> > +     struct kvm_mmu_page *child_sp;
> > +
> > +     child_sp = alloc_tdp_mmu_page_from_caches(vcpu);
> > +     init_child_tdp_mmu_page(child_sp, iter);
> > +
> > +     return child_sp;
> >  }
> >
> >  hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > @@ -210,7 +223,10 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> >                       goto out;
> >       }
> >
> > -     root = alloc_tdp_mmu_page(vcpu, 0, role);
> > +     root = alloc_tdp_mmu_page_from_caches(vcpu);
> > +
> > +     init_tdp_mmu_page(root, 0, role);
> > +
> >       refcount_set(&root->tdp_mmu_root_count, 1);
> >
> >       spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > --
> > 2.34.1.173.g76aa8bc2d0-goog
> >

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled
  2022-01-06 21:28   ` Sean Christopherson
@ 2022-01-06 22:20     ` David Matlack
  2022-01-06 22:56       ` Sean Christopherson
  2022-01-07  2:02       ` Peter Xu
  0 siblings, 2 replies; 55+ messages in thread
From: David Matlack @ 2022-01-06 22:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Thu, Jan 6, 2022 at 1:28 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Dec 13, 2021, David Matlack wrote:
> > When dirty logging is enabled without initially-all-set, attempt to
> > split all huge pages in the memslot down to 4KB pages so that vCPUs
> > do not have to take expensive write-protection faults to split huge
> > pages.
> >
> > Huge page splitting is best-effort only. This commit only adds the
> > support for the TDP MMU, and even there splitting may fail due to out
> > of memory conditions. Failure to split a huge page is fine from a
> > correctness standpoint because we still always follow it up by write-
> > protecting any remaining huge pages.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |   3 +
> >  arch/x86/kvm/mmu/mmu.c          |  14 +++
> >  arch/x86/kvm/mmu/spte.c         |  59 ++++++++++++
> >  arch/x86/kvm/mmu/spte.h         |   1 +
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 165 ++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/mmu/tdp_mmu.h      |   5 +
> >  arch/x86/kvm/x86.c              |  10 ++
> >  7 files changed, 257 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index e863d569c89a..4a507109e886 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1573,6 +1573,9 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
> >  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >                                     const struct kvm_memory_slot *memslot,
> >                                     int start_level);
> > +void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
> > +                                    const struct kvm_memory_slot *memslot,
> > +                                    int target_level);
> >  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
> >                                  const struct kvm_memory_slot *memslot);
> >  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 3c2cb4dd1f11..9116c6a4ced1 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5807,6 +5807,20 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >               kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> >  }
> >
> > +void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
> > +                                    const struct kvm_memory_slot *memslot,
> > +                                    int target_level)
> > +{
> > +     u64 start = memslot->base_gfn;
> > +     u64 end = start + memslot->npages;
> > +
> > +     if (is_tdp_mmu_enabled(kvm)) {
> > +             read_lock(&kvm->mmu_lock);
> > +             kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
> > +             read_unlock(&kvm->mmu_lock);
> > +     }
> > +}
> > +
> >  static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >                                        struct kvm_rmap_head *rmap_head,
> >                                        const struct kvm_memory_slot *slot)
> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index fd34ae5d6940..11d0b3993ba5 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -191,6 +191,65 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >       return wrprot;
> >  }
> >
> > +static u64 mark_spte_executable(u64 spte)
> > +{
> > +     bool is_access_track = is_access_track_spte(spte);
> > +
> > +     if (is_access_track)
> > +             spte = restore_acc_track_spte(spte);
> > +
> > +     spte &= ~shadow_nx_mask;
> > +     spte |= shadow_x_mask;
> > +
> > +     if (is_access_track)
> > +             spte = mark_spte_for_access_track(spte);
> > +
> > +     return spte;
> > +}
> > +
> > +/*
> > + * Construct an SPTE that maps a sub-page of the given huge page SPTE where
> > + * `index` identifies which sub-page.
> > + *
> > + * This is used during huge page splitting to build the SPTEs that make up the
>
> Nit, to be consistent with the kernel, s/huge page/hugepage.

The kernel seems pretty split on which to use actually:

$ git grep --extended "huge[ _]page" | wc -l
1663
$ git grep --extended "hugepage" | wc -l
1558

The former has a slight edge so I went with that throughout the series.

>
> > + * new page table.
> > + */
> > +u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index, unsigned int access)
>
> Newline.  Actually, just drop @access since there's exactly one caller that
> unconditionally passes ACC_ALL.
>
> > +{
> > +     u64 child_spte;
> > +     int child_level;
> > +
> > +     if (WARN_ON(is_mmio_spte(huge_spte)))
> > +             return 0;
>
> Unnecessary, this is covered by is_shadow_present_pte().
>
> > +
> > +     if (WARN_ON(!is_shadow_present_pte(huge_spte)))
> > +             return 0;
> > +
> > +     if (WARN_ON(!is_large_pte(huge_spte)))
>
> Probably best to make these WARN_ON_ONCE, I gotta imagine if KVM screws up then this
> will flood dmesg.
>
> > +             return 0;
> > +
> > +     child_spte = huge_spte;
> > +     child_level = huge_level - 1;
> > +
> > +     /*
> > +      * The child_spte already has the base address of the huge page being
> > +      * split. So we just have to OR in the offset to the page at the next
> > +      * lower level for the given index.
> > +      */
> > +     child_spte |= (index * KVM_PAGES_PER_HPAGE(child_level)) << PAGE_SHIFT;
> > +
> > +     if (child_level == PG_LEVEL_4K) {
> > +             child_spte &= ~PT_PAGE_SIZE_MASK;
> > +
> > +             /* Allow execution for 4K pages if it was disabled for NX HugePages. */
>
> Nit, this just reiterates the "what".  Even though the "why" is fairly obvious,
> maybe this instead?
>
>                 /*
>                  * When splitting to a 4K page, make the new SPTE executable as
>                  * the NX hugepage mitigation no longer applies.
>                  */
>
> > +             if (is_nx_huge_page_enabled() && access & ACC_EXEC_MASK)
>
> Because the caller always passes @access=ACC_ALL, the "access & ACC_EXEC_MASK"
> part goes away (which addresses other good feedback about parentheses).
>
> > +                     child_spte = mark_spte_executable(child_spte);
>
> Maybe s/mark/make?  KVM usually uses "mark" when making modifications for tracking
> purposes.  This is simply making a SPTE executable, there's no tracking involved.
>
> > +     }
> > +
> > +     return child_spte;
> > +}
> > +
> > +
> >  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
> >  {
> >       u64 spte = SPTE_MMU_PRESENT_MASK;
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index 9b0c7b27f23f..e13f335b4fef 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -334,6 +334,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >              unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
> >              u64 old_spte, bool prefetch, bool can_unsync,
> >              bool host_writable, u64 *new_spte);
> > +u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index, unsigned int access);
> >  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
> >  u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
> >  u64 mark_spte_for_access_track(u64 spte);
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index a8354d8578f1..be5eb74ac053 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1264,6 +1264,171 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
> >       return spte_set;
> >  }
> >
> > +static struct kvm_mmu_page *alloc_tdp_mmu_page_from_kernel(gfp_t gfp)
> > +{
> > +     struct kvm_mmu_page *sp;
> > +
> > +     gfp |= __GFP_ZERO;
> > +
> > +     sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
> > +     if (!sp)
> > +             return NULL;
> > +
> > +     sp->spt = (void *)__get_free_page(gfp);
> > +     if (!sp->spt) {
> > +             kmem_cache_free(mmu_page_header_cache, sp);
> > +             return NULL;
> > +     }
> > +
> > +     return sp;
> > +}
> > +
> > +static struct kvm_mmu_page *alloc_tdp_mmu_page_for_split(struct kvm *kvm, bool *dropped_lock)
> > +{
> > +     struct kvm_mmu_page *sp;
> > +
> > +     lockdep_assert_held_read(&kvm->mmu_lock);
> > +
> > +     *dropped_lock = false;
> > +
> > +     /*
> > +      * Since we are allocating while under the MMU lock we have to be
> > +      * careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct
> > +      * reclaim and to avoid making any filesystem callbacks (which can end
> > +      * up invoking KVM MMU notifiers, resulting in a deadlock).
> > +      *
> > +      * If this allocation fails we drop the lock and retry with reclaim
> > +      * allowed.
> > +      */
> > +     sp = alloc_tdp_mmu_page_from_kernel(GFP_NOWAIT | __GFP_ACCOUNT);
> > +     if (sp)
> > +             return sp;
> > +
> > +     rcu_read_unlock();
> > +     read_unlock(&kvm->mmu_lock);
> > +
> > +     *dropped_lock = true;
> > +
> > +     sp = alloc_tdp_mmu_page_from_kernel(GFP_KERNEL_ACCOUNT);
> > +
> > +     read_lock(&kvm->mmu_lock);
> > +     rcu_read_lock();
> > +
> > +     return sp;
> > +}
> > +
> > +static bool
>
> Please put the return type and attributes on the same line as the function name,
> splitting them makes grep sad.

Sure will do. Out of curiosity, what sort of workflow expects to be
able to grep the return type, attributes, and function name on the
same line?

>  And separating these rarely saves more than a line,
> e.g. even with conservative wrapping, this goes from 2=>3 lines.
>
> static bool tdp_mmu_split_huge_page_atomic(struct kvm *kvm,
>                                            struct tdp_iter *iter,
>                                            struct kvm_mmu_page *sp)
>
> And it doesn't save anything if we want to run over by a few chars, which IMO
> isn't worth it in this case, but that's not a sticking point.
>
> static bool tdp_mmu_split_huge_page_atomic(struct kvm *kvm, struct tdp_iter *iter,
>                                            struct kvm_mmu_page *sp)
>
> > +tdp_mmu_split_huge_page_atomic(struct kvm *kvm, struct tdp_iter *iter, struct kvm_mmu_page *sp)
> > +{
> > +     const u64 huge_spte = iter->old_spte;
> > +     const int level = iter->level;
> > +     u64 child_spte;
> > +     int i;
> > +
> > +     init_child_tdp_mmu_page(sp, iter);
> > +
> > +     for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> > +             child_spte = make_huge_page_split_spte(huge_spte, level, i, ACC_ALL);
>
> No need for child_spte, not saving any chars versus setting sp->spt[i] directly.
> And then if you hoist the comment above the for-loop you can drop the braces and
> shave a line from the comment:
>
>         /*
>          * No need for atomics since child_sp has not been installed in the
>          * table yet and thus is not reachable by any other thread.
>          */
>         for (i = 0; i < PT64_ENT_PER_PAGE; i++)
>                 sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i, ACC_ALL);
>
> > +
> > +             /*
> > +              * No need for atomics since child_sp has not been installed
> > +              * in the table yet and thus is not reachable by any other
> > +              * thread.
> > +              */
> > +             sp->spt[i] = child_spte;
> > +     }
> > +
> > +     if (!tdp_mmu_install_sp_atomic(kvm, iter, sp, false))
> > +             return false;
> > +
> > +     /*
> > +      * tdp_mmu_install_sp_atomic will handle subtracting the split huge
>
> Please add () when referencing functions, it helps readers visually identify that
> the comment refers to a different function.
>
> > +      * page from stats, but we have to manually update the new present child
> > +      * pages.
> > +      */
> > +     kvm_update_page_stats(kvm, level - 1, PT64_ENT_PER_PAGE);
> > +
> > +     return true;
> > +}
> > +
> > +static int tdp_mmu_split_huge_pages_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > +                                      gfn_t start, gfn_t end, int target_level)
> > +{
> > +     struct kvm_mmu_page *sp = NULL;
> > +     struct tdp_iter iter;
> > +     bool dropped_lock;
> > +
> > +     rcu_read_lock();
> > +
> > +     /*
> > +      * Traverse the page table splitting all huge pages above the target
> > +      * level into one lower level. For example, if we encounter a 1GB page
> > +      * we split it into 512 2MB pages.
> > +      *
> > +      * Since the TDP iterator uses a pre-order traversal, we are guaranteed
> > +      * to visit an SPTE before ever visiting its children, which means we
> > +      * will correctly recursively split huge pages that are more than one
> > +      * level above the target level (e.g. splitting 1GB to 2MB to 4KB).
> > +      */
> > +     for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
> > +retry:
> > +             if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> > +                     continue;
> > +
> > +             if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
> > +                     continue;
> > +
> > +             if (!sp) {
> > +                     sp = alloc_tdp_mmu_page_for_split(kvm, &dropped_lock);
> > +                     if (!sp)
> > +                             return -ENOMEM;
> > +
> > +                     if (dropped_lock) {
> > +                             tdp_iter_restart(&iter);
>
> Ah, this needs to be rebased to play nice with commit 3a0f64de479c ("KVM: x86/mmu:
> Don't advance iterator after restart due to yielding").
>
> With that in play, I think the best approach would be to drop dropped_lock (ha!)
> and just do:
>
>                         sp = alloc_tdp_mmu_page_for_split(kvm, &iter);
>                         if (!sp)
>                                 return -ENOMEM;
>
>                         if (iter.yielded)
>                                 continue;
>
> > +                             continue;
> > +                     }
> > +             }
> > +
> > +             if (!tdp_mmu_split_huge_page_atomic(kvm, &iter, sp))
> > +                     goto retry;
> > +
> > +             sp = NULL;
> > +     }
> > +
> > +     /*
> > +      * It's possible to exit the loop having never used the last sp if, for
> > +      * example, a vCPU doing HugePage NX splitting wins the race and
> > +      * installs its own sp in place of the last sp we tried to split.
> > +      */
> > +     if (sp)
> > +             tdp_mmu_free_sp(sp);
> > +
> > +     rcu_read_unlock();
>
> Uber nit, an unused sp can be freed after dropping RCU.
>
> > +
> > +     return 0;
> > +}
> > +
> > +int kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
> > +                                  const struct kvm_memory_slot *slot,
> > +                                  gfn_t start, gfn_t end,
> > +                                  int target_level)
> > +{
> > +     struct kvm_mmu_page *root;
> > +     int r = 0;
> > +
> > +     lockdep_assert_held_read(&kvm->mmu_lock);
> > +
> > +     for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) {
> > +             r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level);
> > +             if (r) {
> > +                     kvm_tdp_mmu_put_root(kvm, root, true);
> > +                     break;
> > +             }
> > +     }
> > +
> > +     return r;
> > +}
> > +
> >  /*
> >   * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
> >   * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index 3899004a5d91..3557a7fcf927 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -71,6 +71,11 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
> >                                  struct kvm_memory_slot *slot, gfn_t gfn,
> >                                  int min_level);
> >
> > +int kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
> > +                                  const struct kvm_memory_slot *slot,
> > +                                  gfn_t start, gfn_t end,
> > +                                  int target_level);
> > +
> >  static inline void kvm_tdp_mmu_walk_lockless_begin(void)
> >  {
> >       rcu_read_lock();
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 85127b3e3690..fb5592bf2eee 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -187,6 +187,9 @@ module_param(force_emulation_prefix, bool, S_IRUGO);
> >  int __read_mostly pi_inject_timer = -1;
> >  module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);
> >
> > +static bool __read_mostly eagerly_split_huge_pages_for_dirty_logging = true;
> > +module_param(eagerly_split_huge_pages_for_dirty_logging, bool, 0644);
>
> Heh, can we use a shorter name for the module param?  There's 0% chance I'll ever
> type that correctly.  Maybe eager_hugepage_splitting?  Though even that is a bit
> too long for my tastes.

Yeah I'll pick a shorter name :). I was going back and forth on a few.
The other contender was "eager_page_splitting", since that's what I've
been calling this feature throughout the discussion of this series.
Although I can see the argument for adding "huge" in there.

>
> This should also be documented somewhere, which is where we can/should explain
> exactly what the param does instead of trying to shove all that info into the name.

Ack will do.

And all the other feedback SGTM. I'll incorporate them in the next
version. Thanks for the review!

>
> > +
> >  /*
> >   * Restoring the host value for MSRs that are only consumed when running in
> >   * usermode, e.g. SYSCALL MSRs and TSC_AUX, can be deferred until the CPU
> > @@ -11837,6 +11840,13 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
> >               if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> >                       return;
> >
> > +             /*
> > +              * Attempt to split all large pages into 4K pages so that vCPUs
> > +              * do not have to take write-protection faults.
> > +              */
> > +             if (READ_ONCE(eagerly_split_huge_pages_for_dirty_logging))
> > +                     kvm_mmu_slot_try_split_huge_pages(kvm, new, PG_LEVEL_4K);
> > +
> >               if (kvm_x86_ops.cpu_dirty_log_size) {
> >                       kvm_mmu_slot_leaf_clear_dirty(kvm, new);
> >                       kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M);
> > --
> > 2.34.1.173.g76aa8bc2d0-goog
> >
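
For reference, one possible shape for the allocation helper after rebasing onto
commit 3a0f64de479c, where yielding is tracked via iter->yielded instead of a
separate dropped_lock out-param. This is only a sketch of the approach discussed
above, not the actual next version of the patch:

static struct kvm_mmu_page *alloc_tdp_mmu_page_for_split(struct kvm *kvm,
							 struct tdp_iter *iter)
{
	struct kvm_mmu_page *sp;

	lockdep_assert_held_read(&kvm->mmu_lock);

	/* First try a non-blocking allocation without dropping the lock. */
	sp = alloc_tdp_mmu_page_from_kernel(GFP_NOWAIT | __GFP_ACCOUNT);
	if (sp)
		return sp;

	rcu_read_unlock();
	read_unlock(&kvm->mmu_lock);

	/*
	 * Signal that the walk yielded so the caller does not advance past
	 * the current SPTE before retrying the split.
	 */
	iter->yielded = true;

	sp = alloc_tdp_mmu_page_from_kernel(GFP_KERNEL_ACCOUNT);

	read_lock(&kvm->mmu_lock);
	rcu_read_lock();

	return sp;
}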

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled
  2022-01-05 17:49     ` David Matlack
@ 2022-01-06 22:48       ` Sean Christopherson
  0 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2022-01-06 22:48 UTC (permalink / raw)
  To: David Matlack
  Cc: Peter Xu, Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel,
	Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Wed, Jan 05, 2022, David Matlack wrote:
> > I don't even see anywhere that the tdp mmu disables the EXEC bit for 4K.. if
> > that's true then perhaps we can even drop "access" and this check?  But I could
> > have missed something.
> 
> TDP MMU always passes ACC_ALL so the access check could be omitted
> from this patch. But it will be needed to support eager splitting for
> the shadow MMU, which does not always allow execution.

Sure, but let's add that complexity when it's needed.  That also has the advantage
of documenting, in git history, why the param is needed.
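
For illustration, with @access dropped (and with the s/mark/make/ rename suggested
earlier), the check in make_huge_page_split_spte() reduces to something like this
sketch; the TDP MMU always maps with ACC_ALL, so executability is governed solely
by the NX huge page mitigation:

	/* Restore EXEC on the child if NX huge pages forced it off the parent. */
	if (is_nx_huge_page_enabled())
		child_spte = make_spte_executable(child_spte);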

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled
  2022-01-06 22:20     ` David Matlack
@ 2022-01-06 22:56       ` Sean Christopherson
  2022-01-07  2:02       ` Peter Xu
  1 sibling, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2022-01-06 22:56 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Thu, Jan 06, 2022, David Matlack wrote:
> > Nit, to be consistent with the kernel, s/huge page/hugepage.
> 
> The kernel seems pretty split on which to use actually:
> 
> $ git grep --extended "huge[ _]page" | wc -l
> 1663
> $ git grep --extended "hugepage" | wc -l
> 1558
> 
> The former has a slight edge so I went with that throughout the series.

Ha, and KVM leans even more heavily huge_page.  huge_page it is.

> > > +
> > > +static bool
> >
> > Please put the return type and attributes on the same line as the function name,
> > splitting them makes grep sad.
> 
> Sure will do. Out of curiosity, what sort of workflow expects to be
> able to grep the return type, attributes, and function name on the
> same line?

Start here[*], there are several good examples further down that thread.

[*] https://lore.kernel.org/all/CAHk-=wjS-Jg7sGMwUPpDsjv392nDOOs0CtUtVkp=S6Q7JzFJRw@mail.gmail.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table
  2022-01-06 20:12   ` Sean Christopherson
@ 2022-01-06 22:56     ` David Matlack
  2022-01-07 18:24       ` David Matlack
  0 siblings, 1 reply; 55+ messages in thread
From: David Matlack @ 2022-01-06 22:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Thu, Jan 6, 2022 at 12:12 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Dec 13, 2021, David Matlack wrote:
> > Factor out the logic to atomically replace an SPTE with an SPTE that
> > points to a new page table. This will be used in a follow-up commit to
> > split a large page SPTE into one level lower.
> >
> > Opportunistically drop the kvm_mmu_get_page tracepoint in
> > kvm_tdp_mmu_map() since it is redundant with the identical tracepoint in
> > alloc_tdp_mmu_page().
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 48 +++++++++++++++++++++++++++-----------
> >  1 file changed, 34 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 656ebf5b20dc..dbd07c10d11a 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -950,6 +950,36 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> >       return ret;
> >  }
> >
> > +/*
> > + * tdp_mmu_install_sp_atomic - Atomically replace the given spte with an
> > + * spte pointing to the provided page table.
> > + *
> > + * @kvm: kvm instance
> > + * @iter: a tdp_iter instance currently on the SPTE that should be set
> > + * @sp: The new TDP page table to install.
> > + * @account_nx: True if this page table is being installed to split a
> > + *              non-executable huge page.
> > + *
> > + * Returns: True if the new page table was installed. False if spte being
> > + *          replaced changed, causing the atomic compare-exchange to fail.
>
> I'd prefer to return an int with 0/-EBUSY on success/fail.  Ditto for the existing
> tdp_mmu_set_spte_atomic().  Actually, if you add a prep patch to make that happen,
> then this can be:
>
>         u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
>         int ret;
>
>         ret = tdp_mmu_set_spte_atomic(kvm, iter, spte);
>         if (ret)
>                 return ret;
>
>         tdp_mmu_link_page(kvm, sp, account_nx);
>         return 0;

Will do.

>
>
>
> > + *          If this function returns false the sp will be freed before
> > + *          returning.
>
> Uh, no it's not?  The call to tdp_mmu_free_sp() is still done by kvm_tdp_mmu_map().

Correct. I missed cleaning up this comment after I pulled the
tdp_mmu_free_sp() call up a level from where it was in the RFC.

>
> > + */
> > +static bool tdp_mmu_install_sp_atomic(struct kvm *kvm,
>
> Hmm, so this helper is the only user of tdp_mmu_link_page(), and _that_ helper
> is rather tiny.  And this would also be a good opportunity to clean up the
> "(un)link_page" verbiage, as the bare "page" doesn't communicate to the reader
> that it's for linking shadow pages, e.g. not struct page.
>
> So, what about folding in tdp_mmu_link_page(), naming this helper either
> tdp_mmu_link_sp_atomic() or tdp_mmu_link_shadow_page_atomic(), and then renaming
> tdp_mmu_unlink_page() accordingly?  And for bonus points, add a blurb in the
> function comment like:
>
>         * Note the lack of a non-atomic variant!  The TDP MMU always builds its
>         * page tables while holding mmu_lock for read.

Sure, I'll include that cleanup as part of the next version of this series.

>
> > +                                   struct tdp_iter *iter,
> > +                                   struct kvm_mmu_page *sp,
> > +                                   bool account_nx)
> > +{
> > +     u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
> > +
> > +     if (!tdp_mmu_set_spte_atomic(kvm, iter, spte))
> > +             return false;
> > +
> > +     tdp_mmu_link_page(kvm, sp, account_nx);
> > +
> > +     return true;
> > +}
> > +
> >  /*
> >   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
> >   * page tables and SPTEs to translate the faulting guest physical address.
> > @@ -959,8 +989,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >       struct kvm_mmu *mmu = vcpu->arch.mmu;
> >       struct tdp_iter iter;
> >       struct kvm_mmu_page *sp;
> > -     u64 *child_pt;
> > -     u64 new_spte;
> >       int ret;
> >
> >       kvm_mmu_hugepage_adjust(vcpu, fault);
> > @@ -996,6 +1024,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >               }
> >
> >               if (!is_shadow_present_pte(iter.old_spte)) {
> > +                     bool account_nx = fault->huge_page_disallowed &&
> > +                                       fault->req_level >= iter.level;
> > +
> >                       /*
> >                        * If SPTE has been frozen by another thread, just
> >                        * give up and retry, avoiding unnecessary page table
> > @@ -1005,18 +1036,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >                               break;
> >
> >                       sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
> > -                     child_pt = sp->spt;
> > -
> > -                     new_spte = make_nonleaf_spte(child_pt,
> > -                                                  !shadow_accessed_mask);
> > -
> > -                     if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
> > -                             tdp_mmu_link_page(vcpu->kvm, sp,
> > -                                               fault->huge_page_disallowed &&
> > -                                               fault->req_level >= iter.level);
> > -
> > -                             trace_kvm_mmu_get_page(sp, true);
> > -                     } else {
> > +                     if (!tdp_mmu_install_sp_atomic(vcpu->kvm, &iter, sp, account_nx)) {
> >                               tdp_mmu_free_sp(sp);
> >                               break;
> >                       }
> > --
> > 2.34.1.173.g76aa8bc2d0-goog
> >

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 06/13] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
  2022-01-06 20:34   ` Sean Christopherson
@ 2022-01-06 22:57     ` David Matlack
  0 siblings, 0 replies; 55+ messages in thread
From: David Matlack @ 2022-01-06 22:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Thu, Jan 6, 2022 at 12:34 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Dec 13, 2021, David Matlack wrote:
> > Instead of passing a pointer to the root page table and the root level
> > separately, pass in a pointer to the kvm_mmu_page that backs the root.
> > This reduces the number of arguments by 1, cutting down on line lengths.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > Reviewed-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_iter.c |  5 ++++-
> >  arch/x86/kvm/mmu/tdp_iter.h | 10 +++++-----
> >  arch/x86/kvm/mmu/tdp_mmu.c  | 14 +++++---------
> >  3 files changed, 14 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> > index b3ed302c1a35..92b3a075525a 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.c
> > +++ b/arch/x86/kvm/mmu/tdp_iter.c
> > @@ -39,9 +39,12 @@ void tdp_iter_restart(struct tdp_iter *iter)
> >   * Sets a TDP iterator to walk a pre-order traversal of the paging structure
> >   * rooted at root_pt, starting with the walk to translate next_last_level_gfn.
> >   */
> > -void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> > +void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
> >                   int min_level, gfn_t next_last_level_gfn)
> >  {
> > +     u64 *root_pt = root->spt;
> > +     int root_level = root->role.level;
>
> Uber nit, arch/x86/ prefers reverse fir tree, though I've yet to get Paolo fully
> on board :-)
>
> But looking at the usage of root_pt, even better would be
>
>   void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
>                       int min_level, gfn_t next_last_level_gfn)
>   {
>         int root_level = root->role.level;
>
>         WARN_ON(root_level < 1);
>         WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
>
>         iter->next_last_level_gfn = next_last_level_gfn;
>         iter->root_level = root_level;
>         iter->min_level = min_level;
>         iter->pt_path[iter->root_level - 1] = (tdp_ptep_t)root->spt;
>         iter->as_id = kvm_mmu_page_as_id(root);
>
>         tdp_iter_restart(iter);
>   }
>
> to avoid the pointless sptep_to_sp() and eliminate the motivation for root_pt.

Will do.
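
For context, a before/after sketch of a call site with this refactor (the exact
call sites and argument names vary):

	/* Before: root page table pointer and level passed separately. */
	tdp_iter_start(&iter, root->spt, root->role.level, min_level, next_last_level_gfn);

	/* After: the root kvm_mmu_page carries both. */
	tdp_iter_start(&iter, root, min_level, next_last_level_gfn);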

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 05/13] KVM: x86/mmu: Move restore_acc_track_spte to spte.c
  2022-01-06 20:27   ` Sean Christopherson
@ 2022-01-06 22:58     ` David Matlack
  0 siblings, 0 replies; 55+ messages in thread
From: David Matlack @ 2022-01-06 22:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Thu, Jan 6, 2022 at 12:27 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Dec 13, 2021, David Matlack wrote:
> > restore_acc_track_spte is purely an SPTE manipulation, making it a good
> > fit for spte.c. It is also needed in spte.c in a follow-up commit so we
> > can construct child SPTEs during large page splitting.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > Reviewed-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c  | 18 ------------------
> >  arch/x86/kvm/mmu/spte.c | 18 ++++++++++++++++++
> >  arch/x86/kvm/mmu/spte.h |  1 +
> >  3 files changed, 19 insertions(+), 18 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 8b702f2b6a70..3c2cb4dd1f11 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -646,24 +646,6 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
> >       return __get_spte_lockless(sptep);
> >  }
> >
> > -/* Restore an acc-track PTE back to a regular PTE */
> > -static u64 restore_acc_track_spte(u64 spte)
> > -{
> > -     u64 new_spte = spte;
> > -     u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
> > -                      & SHADOW_ACC_TRACK_SAVED_BITS_MASK;
> > -
> > -     WARN_ON_ONCE(spte_ad_enabled(spte));
> > -     WARN_ON_ONCE(!is_access_track_spte(spte));
> > -
> > -     new_spte &= ~shadow_acc_track_mask;
> > -     new_spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
> > -                   SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
> > -     new_spte |= saved_bits;
> > -
> > -     return new_spte;
> > -}
> > -
> >  /* Returns the Accessed status of the PTE and resets it at the same time. */
> >  static bool mmu_spte_age(u64 *sptep)
> >  {
> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index 8a7b03207762..fd34ae5d6940 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -268,6 +268,24 @@ u64 mark_spte_for_access_track(u64 spte)
> >       return spte;
> >  }
> >
> > +/* Restore an acc-track PTE back to a regular PTE */
> > +u64 restore_acc_track_spte(u64 spte)
> > +{
> > +     u64 new_spte = spte;
> > +     u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
> > +                      & SHADOW_ACC_TRACK_SAVED_BITS_MASK;
>
> Obviously not your code, but this could be:
>
>         u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) &
>                          SHADOW_ACC_TRACK_SAVED_BITS_MASK;
>
>         WARN_ON_ONCE(spte_ad_enabled(spte));
>         WARN_ON_ONCE(!is_access_track_spte(spte));
>
>         spte &= ~shadow_acc_track_mask;
>         spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
>                   SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
>         spte |= saved_bits;
>
>         return spte;
>
> which is really just an excuse to move the ampersand up a line :-)
>
> And looking at the two callers, the WARNs are rather silly.  The spte_ad_enabled()
> WARN is especially pointless because that's also checked by is_access_track_spte().
> I like paranoid WARNs as much as anyone, but I don't see why this code warrants
> extra checking relative to the other SPTE helpers that have more subtle requirements.
>
> At that point, maybe make this an inline helper?
>
>   static inline u64 restore_acc_track_spte(u64 spte)
>   {
>         u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) &
>                          SHADOW_ACC_TRACK_SAVED_BITS_MASK;
>
>         spte &= ~shadow_acc_track_mask;
>         spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
>                   SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
>         spte |= saved_bits;
>
>         return spte;
>   }

That all sounds reasonable. I'll include some additional patches in
the next version to include these cleanups.
>
> > +     WARN_ON_ONCE(spte_ad_enabled(spte));
> > +     WARN_ON_ONCE(!is_access_track_spte(spte));
> > +
> > +     new_spte &= ~shadow_acc_track_mask;
> > +     new_spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
> > +                   SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
> > +     new_spte |= saved_bits;
> > +
> > +     return new_spte;
> > +}
> > +
> >  void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> >  {
> >       BUG_ON((u64)(unsigned)access_mask != access_mask);
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index a4af2a42695c..9b0c7b27f23f 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -337,6 +337,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
> >  u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
> >  u64 mark_spte_for_access_track(u64 spte);
> > +u64 restore_acc_track_spte(u64 spte);
> >  u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
> >
> >  void kvm_mmu_reset_all_pte_masks(void);
> > --
> > 2.34.1.173.g76aa8bc2d0-goog
> >

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 07/13] KVM: x86/mmu: Derive page role from parent
  2022-01-06 20:45   ` Sean Christopherson
@ 2022-01-06 23:00     ` David Matlack
  0 siblings, 0 replies; 55+ messages in thread
From: David Matlack @ 2022-01-06 23:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Thu, Jan 6, 2022 at 12:45 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Please include "TDP MMU" somewhere in the shortlog.  It's a nice-to-have, e.g. not
> worth forcing if there's more interesting info to put in the shortlog, but in this
> case there are plenty of chars to go around.  E.g.
>
>   KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent
>
> On Mon, Dec 13, 2021, David Matlack wrote:
> > Derive the page role from the parent shadow page, since the only thing
> > that changes is the level. This is in preparation for eagerly splitting
> > large pages during VM-ioctls which does not have access to the vCPU
>
> s/does/do since VM-ioctls is plural.
>
> > MMU context.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 43 ++++++++++++++++++++------------------
> >  1 file changed, 23 insertions(+), 20 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 2fb2d7677fbf..582d9a798899 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -157,23 +157,8 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> >               if (kvm_mmu_page_as_id(_root) != _as_id) {              \
> >               } else
> >
> > -static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
> > -                                                int level)
> > -{
> > -     union kvm_mmu_page_role role;
> > -
> > -     role = vcpu->arch.mmu->mmu_role.base;
> > -     role.level = level;
> > -     role.direct = true;
> > -     role.has_4_byte_gpte = false;
> > -     role.access = ACC_ALL;
> > -     role.ad_disabled = !shadow_accessed_mask;
> > -
> > -     return role;
> > -}
> > -
> >  static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > -                                            int level)
> > +                                            union kvm_mmu_page_role role)
> >  {
> >       struct kvm_mmu_page *sp;
> >
> > @@ -181,7 +166,7 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> >       sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> >       set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> >
> > -     sp->role.word = page_role_for_level(vcpu, level).word;
> > +     sp->role = role;
> >       sp->gfn = gfn;
> >       sp->tdp_mmu_page = true;
> >
> > @@ -190,6 +175,19 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> >       return sp;
> >  }
> >
> > +static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu, struct tdp_iter *iter)
>
> Newline please, this is well over 80 chars.
>
> static struct kvm_mmu_page *alloc_child_tdp_mmu_page(struct kvm_vcpu *vcpu,
>                                                      struct tdp_iter *iter)
> > +{
> > +     struct kvm_mmu_page *parent_sp;
> > +     union kvm_mmu_page_role role;
> > +
> > +     parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
> > +
> > +     role = parent_sp->role;
> > +     role.level--;
> > +
> > +     return alloc_tdp_mmu_page(vcpu, iter->gfn, role);
> > +}
> > +
> >  hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> >  {
> >       union kvm_mmu_page_role role;
> > @@ -198,7 +196,12 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> >
> >       lockdep_assert_held_write(&kvm->mmu_lock);
> >
> > -     role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
> > +     role = vcpu->arch.mmu->mmu_role.base;
> > +     role.level = vcpu->arch.mmu->shadow_root_level;
> > +     role.direct = true;
> > +     role.has_4_byte_gpte = false;
> > +     role.access = ACC_ALL;
> > +     role.ad_disabled = !shadow_accessed_mask;
>
> Hmm, so _all_ of this unnecessary, i.e. this can simply be:
>
>         role = vcpu->arch.mmu->mmu_role.base;
>
> Probably better to handle everything except .level in a separate prep commit.
>
> I'm not worried about the cost; I want to avoid potential confusion as to why the
> TDP MMU is apparently "overriding" these fields.

All great suggestions. I'll include these changes in the next version,
including an additional patch to eliminate the redundant role
overrides.
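
To make the prep-commit split concrete, the root's role setup could end up as
simple as the following sketch (assuming only .level still needs special
handling after the prep commit; the surrounding allocation call is illustrative):

	role = vcpu->arch.mmu->mmu_role.base;
	role.level = vcpu->arch.mmu->shadow_root_level;

	root = alloc_tdp_mmu_page(vcpu, 0, role);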

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 08/13] KVM: x86/mmu: Refactor TDP MMU child page initialization
  2022-01-06 22:08     ` David Matlack
@ 2022-01-06 23:02       ` Sean Christopherson
  0 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2022-01-06 23:02 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Thu, Jan 06, 2022, David Matlack wrote:
> On Thu, Jan 6, 2022 at 12:59 PM Sean Christopherson <seanjc@google.com> wrote:
> > Newline.  I'm all in favor of running over when doing so improves readability, but
> > that's not the case here.
> 
> Ah shoot. I had configured my editor to use a 100 char line limit for
> kernel code, but reading the kernel style guide more closely I see
> that 80 is still the preferred limit. I'll go back to preferring 80 and
> only go over when it explicitly makes the code more readable.

Yeah, checkpatch was modified to warn at 100 chars so that people would stop
interpreting 80 as a hard limit, e.g. wrapping due to being one character over,
but 80 is still the soft limit.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 12/13] KVM: x86/mmu: Add tracepoint for splitting huge pages
  2021-12-13 22:59 ` [PATCH v1 12/13] KVM: x86/mmu: Add tracepoint for splitting huge pages David Matlack
  2022-01-05  8:38   ` Peter Xu
@ 2022-01-06 23:14   ` Sean Christopherson
  2022-01-07  0:54     ` David Matlack
  1 sibling, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2022-01-06 23:14 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021, David Matlack wrote:
> Add a tracepoint that records whenever KVM eagerly splits a huge page.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmutrace.h | 20 ++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.c  |  2 ++
>  2 files changed, 22 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> index de5e8e4e1aa7..4feabf773387 100644
> --- a/arch/x86/kvm/mmu/mmutrace.h
> +++ b/arch/x86/kvm/mmu/mmutrace.h
> @@ -416,6 +416,26 @@ TRACE_EVENT(
>  	)
>  );
>  
> +TRACE_EVENT(
> +	kvm_mmu_split_huge_page,
> +	TP_PROTO(u64 gfn, u64 spte, int level),
> +	TP_ARGS(gfn, spte, level),
> +
> +	TP_STRUCT__entry(
> +		__field(u64, gfn)
> +		__field(u64, spte)
> +		__field(int, level)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->gfn = gfn;
> +		__entry->spte = spte;
> +		__entry->level = level;
> +	),
> +
> +	TP_printk("gfn %llx spte %llx level %d", __entry->gfn, __entry->spte, __entry->level)
> +);
> +
>  #endif /* _TRACE_KVMMMU_H */
>  
>  #undef TRACE_INCLUDE_PATH
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index be5eb74ac053..e6910b9b5c12 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1325,6 +1325,8 @@ tdp_mmu_split_huge_page_atomic(struct kvm *kvm, struct tdp_iter *iter, struct kv
>  	u64 child_spte;
>  	int i;
>  
> +	trace_kvm_mmu_split_huge_page(iter->gfn, huge_spte, level);

This should either be called iff splitting is successful, or it should record
whether or not the split was successful.  The latter is probably useful info,
and easy to do, e.g. assuming this is changed to return an int like the lower
helpers:


	ret = tdp_mmu_install_sp_atomic(kvm, iter, sp, false);

	/*
	 * tdp_mmu_install_sp_atomic will handle subtracting the split huge
	 * page from stats, but we have to manually update the new present child
	 * pages on success.
	 */
	if (!ret)
		kvm_update_page_stats(kvm, level - 1, PT64_ENT_PER_PAGE);

	trace_kvm_mmu_split_huge_page(iter->gfn, huge_spte, level, ret);

	return ret;

> and then the tracepoint can do 'ret ? "failed" : "succeeded"' or something.

> +
>  	init_child_tdp_mmu_page(sp, iter);
>  
>  	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> -- 
> 2.34.1.173.g76aa8bc2d0-goog
> 
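
A sketch of what the tracepoint could look like with the result recorded (the
extra field name and the exact formatting are illustrative only):

TRACE_EVENT(
	kvm_mmu_split_huge_page,
	TP_PROTO(u64 gfn, u64 spte, int level, int errno),
	TP_ARGS(gfn, spte, level, errno),

	TP_STRUCT__entry(
		__field(u64, gfn)
		__field(u64, spte)
		__field(int, level)
		__field(int, errno)
	),

	TP_fast_assign(
		__entry->gfn = gfn;
		__entry->spte = spte;
		__entry->level = level;
		__entry->errno = errno;
	),

	TP_printk("gfn %llx spte %llx level %d errno %d (%s)",
		  __entry->gfn, __entry->spte, __entry->level, __entry->errno,
		  __entry->errno ? "failed" : "succeeded")
);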

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 12/13] KVM: x86/mmu: Add tracepoint for splitting huge pages
  2022-01-06 23:14   ` Sean Christopherson
@ 2022-01-07  0:54     ` David Matlack
  0 siblings, 0 replies; 55+ messages in thread
From: David Matlack @ 2022-01-07  0:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Thu, Jan 6, 2022 at 3:14 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Dec 13, 2021, David Matlack wrote:
> > Add a tracepoint that records whenever KVM eagerly splits a huge page.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmutrace.h | 20 ++++++++++++++++++++
> >  arch/x86/kvm/mmu/tdp_mmu.c  |  2 ++
> >  2 files changed, 22 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> > index de5e8e4e1aa7..4feabf773387 100644
> > --- a/arch/x86/kvm/mmu/mmutrace.h
> > +++ b/arch/x86/kvm/mmu/mmutrace.h
> > @@ -416,6 +416,26 @@ TRACE_EVENT(
> >       )
> >  );
> >
> > +TRACE_EVENT(
> > +     kvm_mmu_split_huge_page,
> > +     TP_PROTO(u64 gfn, u64 spte, int level),
> > +     TP_ARGS(gfn, spte, level),
> > +
> > +     TP_STRUCT__entry(
> > +             __field(u64, gfn)
> > +             __field(u64, spte)
> > +             __field(int, level)
> > +     ),
> > +
> > +     TP_fast_assign(
> > +             __entry->gfn = gfn;
> > +             __entry->spte = spte;
> > +             __entry->level = level;
> > +     ),
> > +
> > +     TP_printk("gfn %llx spte %llx level %d", __entry->gfn, __entry->spte, __entry->level)
> > +);
> > +
> >  #endif /* _TRACE_KVMMMU_H */
> >
> >  #undef TRACE_INCLUDE_PATH
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index be5eb74ac053..e6910b9b5c12 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1325,6 +1325,8 @@ tdp_mmu_split_huge_page_atomic(struct kvm *kvm, struct tdp_iter *iter, struct kv
> >       u64 child_spte;
> >       int i;
> >
> > +     trace_kvm_mmu_split_huge_page(iter->gfn, huge_spte, level);
>
> This should either be called iff splitting is successful, or it should record
> whether or not the split was successful.

Blegh. My intention was to do the former but it's obviously wrong if
the cmpxchg fails.

> The latter is probably useful info,
> and easy to do, e.g. assuming this is changed to return an int like the lower
> helpers:
>
>
>         ret = tdp_mmu_install_sp_atomic(kvm, iter, sp, false);
>
>         /*
>          * tdp_mmu_install_sp_atomic will handle subtracting the split huge
>          * page from stats, but we have to manually update the new present child
>          * pages on success.
>          */
>         if (!ret)
>                 kvm_update_page_stats(kvm, level - 1, PT64_ENT_PER_PAGE);
>
>         trace_kvm_mmu_split_huge_page(iter->gfn, huge_spte, level, ret);
>
>         return ret;
>
> and then the tracepoint can do 'ret ? "failed" : "succeeded"' or something.

If we do this we should capture all the reasons why splitting might
fail. cmpxchg races are one, and the other is failing to allocate the
sp memory. I'll take a look at doing this in the next version. It
doesn't look too difficult.

>
> > +
> >       init_child_tdp_mmu_page(sp, iter);
> >
> >       for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> > --
> > 2.34.1.173.g76aa8bc2d0-goog
> >

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled
  2022-01-06 22:20     ` David Matlack
  2022-01-06 22:56       ` Sean Christopherson
@ 2022-01-07  2:02       ` Peter Xu
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Xu @ 2022-01-07  2:02 UTC (permalink / raw)
  To: David Matlack
  Cc: Sean Christopherson, Paolo Bonzini, kvm list, Ben Gardon,
	Joerg Roedel, Jim Mattson, Wanpeng Li, Vitaly Kuznetsov,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Thu, Jan 06, 2022 at 02:20:25PM -0800, David Matlack wrote:
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 85127b3e3690..fb5592bf2eee 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -187,6 +187,9 @@ module_param(force_emulation_prefix, bool, S_IRUGO);
> > >  int __read_mostly pi_inject_timer = -1;
> > >  module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);
> > >
> > > +static bool __read_mostly eagerly_split_huge_pages_for_dirty_logging = true;
> > > +module_param(eagerly_split_huge_pages_for_dirty_logging, bool, 0644);
> >
> > Heh, can we use a shorter name for the module param?  There's 0% chance I'll ever
> > type that correctly.  Maybe eager_hugepage_splitting?  Though even that is a bit
> > too long for my tastes.
> 
> Yeah I'll pick a shorter name :). I was going back and forth on a few.
> The other contender was "eager_page_splitting", since that's what I've
> been calling this feature throughout the discussion of this series.
> Although I can see the argument for adding "huge" in there.

I didn't raise this question when reviewing but I agree. :) I'd even go with
the shorter "eager_page_split", since the "-ting" suffix doesn't add anything
to the meaning, imho; meanwhile "huge" is implied by "split" (small pages don't
need splitting anyway).

-- 
Peter Xu
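
With that name, the declaration shrinks to something like the following sketch
(the final name, default, and documentation are of course up to the next version):

static bool __read_mostly eager_page_split = true;
module_param(eager_page_split, bool, 0644);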


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled
  2021-12-13 22:59 ` [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled David Matlack
  2022-01-05  7:54   ` Peter Xu
  2022-01-06 21:28   ` Sean Christopherson
@ 2022-01-07  2:06   ` Peter Xu
  2 siblings, 0 replies; 55+ messages in thread
From: Peter Xu @ 2022-01-07  2:06 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
	Janis Schoetterl-Glausch, Junaid Shahid, Oliver Upton,
	Harish Barathvajasankar, Peter Shier, Nikunj A . Dadhania

On Mon, Dec 13, 2021 at 10:59:14PM +0000, David Matlack wrote:
> When dirty logging is enabled without initially-all-set, attempt to
> split all huge pages in the memslot down to 4KB pages so that vCPUs
> do not have to take expensive write-protection faults to split huge
> pages.
> 
> Huge page splitting is best-effort only. This commit only adds the
> support for the TDP MMU, and even there splitting may fail due to out
> of memory conditions. Failure to split a huge page is fine from a
> correctness standpoint because we still always follow it up by write-
> protecting any remaining huge pages.

One more thing: how about briefly documenting the TLB flush skipping behavior
in the commit message, or in the function that does the split?  I don't see any
problem with it, but I'm just not sure it's super obvious to everyone.

-- 
Peter Xu
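
One way such a comment could read, e.g. above the split path (a sketch of the
rationale only, not the actual wording):

	/*
	 * Note, no TLB flush is required after splitting: the huge SPTE is
	 * replaced by child SPTEs that map the same PFNs with the same
	 * permissions, so stale huge-page TLB entries still yield correct
	 * translations.  Any flush needed for write-protection or dirty-bit
	 * clearing happens in the existing paths that follow the split.
	 */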


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table
  2022-01-06 22:56     ` David Matlack
@ 2022-01-07 18:24       ` David Matlack
  2022-01-07 21:39         ` Sean Christopherson
  0 siblings, 1 reply; 55+ messages in thread
From: David Matlack @ 2022-01-07 18:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Thu, Jan 6, 2022 at 2:56 PM David Matlack <dmatlack@google.com> wrote:
>
> On Thu, Jan 6, 2022 at 12:12 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Dec 13, 2021, David Matlack wrote:
> > > Factor out the logic to atomically replace an SPTE with an SPTE that
> > > points to a new page table. This will be used in a follow-up commit to
> > > split a large page SPTE into one level lower.
> > >
> > > Opportunistically drop the kvm_mmu_get_page tracepoint in
> > > kvm_tdp_mmu_map() since it is redundant with the identical tracepoint in
> > > alloc_tdp_mmu_page().
> > >
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/tdp_mmu.c | 48 +++++++++++++++++++++++++++-----------
> > >  1 file changed, 34 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 656ebf5b20dc..dbd07c10d11a 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -950,6 +950,36 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> > >       return ret;
> > >  }
> > >
> > > +/*
> > > + * tdp_mmu_install_sp_atomic - Atomically replace the given spte with an
> > > + * spte pointing to the provided page table.
> > > + *
> > > + * @kvm: kvm instance
> > > + * @iter: a tdp_iter instance currently on the SPTE that should be set
> > > + * @sp: The new TDP page table to install.
> > > + * @account_nx: True if this page table is being installed to split a
> > > + *              non-executable huge page.
> > > + *
> > > + * Returns: True if the new page table was installed. False if spte being
> > > + *          replaced changed, causing the atomic compare-exchange to fail.
> >
> > I'd prefer to return an int with 0/-EBUSY on success/fail.  Ditto for the existing
> > tdp_mmu_set_spte_atomic().  Actually, if you add a prep patch to make that happen,
> > then this can be:
> >
> >         u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
> >         int ret;
> >
> >         ret = tdp_mmu_set_spte_atomic(kvm, iter, spte);
> >         if (ret)
> >                 return ret;
> >
> >         tdp_mmu_link_page(kvm, sp, account_nx);
> >         return 0;
>
> Will do.
>
> >
> >
> >
> > > + *          If this function returns false the sp will be freed before
> > > + *          returning.
> >
> > Uh, no it's not?  The call to tdp_mmu_free_sp() is still done by kvm_tdp_mmu_map().
>
> Correct. I missed cleaning up this comment after I pulled the
> tdp_mmu_free_sp() call up a level from where it was in the RFC.
>
> >
> > > + */
> > > +static bool tdp_mmu_install_sp_atomic(struct kvm *kvm,
> >
> > Hmm, so this helper is the only user of tdp_mmu_link_page(), and _that_ helper
> > is rather tiny.  And this would also be a good opportunity to clean up the
> > "(un)link_page" verbiage, as the bare "page" doesn't communicate to the reader
> > that it's for linking shadow pages, e.g. not struct page.
> >
> > So, what about folding in tdp_mmu_link_page(), naming this helper either
> > tdp_mmu_link_sp_atomic() or tdp_mmu_link_shadow_page_atomic(), and then renaming
> > tdp_mmu_unlink_page() accordingly?  And for bonus points, add a blurb in the
> > function comment like:
> >
> >         * Note the lack of a non-atomic variant!  The TDP MMU always builds its
> >         * page tables while holding mmu_lock for read.
>
> Sure, I'll include that cleanup as part of the next version of this series.

While I'm here, how do you feel about renaming alloc_tdp_mmu_page() to
tdp_mmu_alloc_sp()? First, for consistency, so that "tdp_mmu" is a prefix
before the verb, and second, to clarify that we are allocating a shadow
page.
>
> >
> > > +                                   struct tdp_iter *iter,
> > > +                                   struct kvm_mmu_page *sp,
> > > +                                   bool account_nx)
> > > +{
> > > +     u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
> > > +
> > > +     if (!tdp_mmu_set_spte_atomic(kvm, iter, spte))
> > > +             return false;
> > > +
> > > +     tdp_mmu_link_page(kvm, sp, account_nx);
> > > +
> > > +     return true;
> > > +}
> > > +
> > >  /*
> > >   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
> > >   * page tables and SPTEs to translate the faulting guest physical address.
> > > @@ -959,8 +989,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >       struct kvm_mmu *mmu = vcpu->arch.mmu;
> > >       struct tdp_iter iter;
> > >       struct kvm_mmu_page *sp;
> > > -     u64 *child_pt;
> > > -     u64 new_spte;
> > >       int ret;
> > >
> > >       kvm_mmu_hugepage_adjust(vcpu, fault);
> > > @@ -996,6 +1024,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >               }
> > >
> > >               if (!is_shadow_present_pte(iter.old_spte)) {
> > > +                     bool account_nx = fault->huge_page_disallowed &&
> > > +                                       fault->req_level >= iter.level;
> > > +
> > >                       /*
> > >                        * If SPTE has been frozen by another thread, just
> > >                        * give up and retry, avoiding unnecessary page table
> > > @@ -1005,18 +1036,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >                               break;
> > >
> > >                       sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
> > > -                     child_pt = sp->spt;
> > > -
> > > -                     new_spte = make_nonleaf_spte(child_pt,
> > > -                                                  !shadow_accessed_mask);
> > > -
> > > -                     if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
> > > -                             tdp_mmu_link_page(vcpu->kvm, sp,
> > > -                                               fault->huge_page_disallowed &&
> > > -                                               fault->req_level >= iter.level);
> > > -
> > > -                             trace_kvm_mmu_get_page(sp, true);
> > > -                     } else {
> > > +                     if (!tdp_mmu_install_sp_atomic(vcpu->kvm, &iter, sp, account_nx)) {
> > >                               tdp_mmu_free_sp(sp);
> > >                               break;
> > >                       }
> > > --
> > > 2.34.1.173.g76aa8bc2d0-goog
> > >

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table
  2022-01-07 18:24       ` David Matlack
@ 2022-01-07 21:39         ` Sean Christopherson
  0 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2022-01-07 21:39 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, kvm list, Ben Gardon, Joerg Roedel, Jim Mattson,
	Wanpeng Li, Vitaly Kuznetsov, Janis Schoetterl-Glausch,
	Junaid Shahid, Oliver Upton, Harish Barathvajasankar, Peter Xu,
	Peter Shier, Nikunj A . Dadhania

On Fri, Jan 07, 2022, David Matlack wrote:
> While I'm here, how do you feel about renaming alloc_tdp_mmu_page() to
> tdp_mmu_alloc_sp()? First, for consistency, so that "tdp_mmu" is a prefix
> before the verb, and second, to clarify that we are allocating a shadow
> page.

I like that idea.

Ben, any objections to the suggested renames?  I know it's a bit weird calling TDP
pages "shadow" pages, but having consistent and unique terminology is very helpful
for discussions.

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2022-01-07 21:39 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-13 22:59 [PATCH v1 00/13] KVM: x86/mmu: Eager Page Splitting for the TDP MMU David Matlack
2021-12-13 22:59 ` [PATCH v1 01/13] KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn David Matlack
2022-01-06  0:35   ` Sean Christopherson
2021-12-13 22:59 ` [PATCH v1 02/13] KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect David Matlack
2022-01-06  0:35   ` Sean Christopherson
2021-12-13 22:59 ` [PATCH v1 03/13] KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails David Matlack
2022-01-04 10:13   ` Peter Xu
2022-01-04 17:29     ` Ben Gardon
2022-01-06  0:54   ` Sean Christopherson
2022-01-06 18:04     ` David Matlack
2021-12-13 22:59 ` [PATCH v1 04/13] KVM: x86/mmu: Factor out logic to atomically install a new page table David Matlack
2022-01-04 10:32   ` Peter Xu
2022-01-04 18:26     ` David Matlack
2022-01-05  1:00       ` Peter Xu
2022-01-06 20:12   ` Sean Christopherson
2022-01-06 22:56     ` David Matlack
2022-01-07 18:24       ` David Matlack
2022-01-07 21:39         ` Sean Christopherson
2021-12-13 22:59 ` [PATCH v1 05/13] KVM: x86/mmu: Move restore_acc_track_spte to spte.c David Matlack
2022-01-04 10:33   ` Peter Xu
2022-01-06 20:27   ` Sean Christopherson
2022-01-06 22:58     ` David Matlack
2021-12-13 22:59 ` [PATCH v1 06/13] KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root David Matlack
2022-01-04 10:35   ` Peter Xu
2022-01-06 20:34   ` Sean Christopherson
2022-01-06 22:57     ` David Matlack
2021-12-13 22:59 ` [PATCH v1 07/13] KVM: x86/mmu: Derive page role from parent David Matlack
2022-01-05  7:51   ` Peter Xu
2022-01-06 20:45   ` Sean Christopherson
2022-01-06 23:00     ` David Matlack
2021-12-13 22:59 ` [PATCH v1 08/13] KVM: x86/mmu: Refactor TDP MMU child page initialization David Matlack
2022-01-05  7:51   ` Peter Xu
2022-01-06 20:59   ` Sean Christopherson
2022-01-06 22:08     ` David Matlack
2022-01-06 23:02       ` Sean Christopherson
2021-12-13 22:59 ` [PATCH v1 09/13] KVM: x86/mmu: Split huge pages when dirty logging is enabled David Matlack
2022-01-05  7:54   ` Peter Xu
2022-01-05 17:49     ` David Matlack
2022-01-06 22:48       ` Sean Christopherson
2022-01-06 21:28   ` Sean Christopherson
2022-01-06 22:20     ` David Matlack
2022-01-06 22:56       ` Sean Christopherson
2022-01-07  2:02       ` Peter Xu
2022-01-07  2:06   ` Peter Xu
2021-12-13 22:59 ` [PATCH v1 10/13] KVM: Push MMU locking down into kvm_arch_mmu_enable_log_dirty_pt_masked David Matlack
2021-12-13 22:59 ` [PATCH v1 11/13] KVM: x86/mmu: Split huge pages during CLEAR_DIRTY_LOG David Matlack
2022-01-05  9:02   ` Peter Xu
2022-01-05 17:55     ` David Matlack
2022-01-05 17:57       ` David Matlack
2021-12-13 22:59 ` [PATCH v1 12/13] KVM: x86/mmu: Add tracepoint for splitting huge pages David Matlack
2022-01-05  8:38   ` Peter Xu
2022-01-06 23:14   ` Sean Christopherson
2022-01-07  0:54     ` David Matlack
2021-12-13 22:59 ` [PATCH v1 13/13] KVM: selftests: Add an option to disable MANUAL_PROTECT_ENABLE and INITIALLY_SET David Matlack
2022-01-05  8:38   ` Peter Xu
