* [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU
@ 2022-02-03  1:00 David Matlack
  2022-02-03  1:00 ` [PATCH 01/23] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs David Matlack
                   ` (23 more replies)
  0 siblings, 24 replies; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

This series extends KVM's Eager Page Splitting to also split huge pages
mapped by the shadow MMU, i.e. huge pages present in the memslot rmaps.
This will be useful for configurations that use Nested Virtualization,
disable the TDP MMU, or disable/lack TDP hardware support.

For background on Eager Page Splitting, see:
 - Proposal: https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/
 - TDP MMU support: https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/

Splitting huge pages mapped by the shadow MMU is more complicated than
in the TDP MMU, but it is also more important for performance, as the
shadow MMU handles huge page write-protection faults under the write
lock.  See the Performance section for more details.

The extra complexity of splitting huge pages mapped by the shadow MMU
comes from a few places (a rough sketch of the checks implied by (1)
and (2) follows the list):

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Huge pages may be mapped by indirect shadow pages.

    - Indirect shadow pages have the possibility of being unsync. As a
      policy we opt not to split such pages as their translation may no
      longer be valid.
    - Huge pages on indirect shadow pages may have access permission
      constraints from the guest (unlike the TDP MMU which is ACC_ALL
      by default).

(3) Splitting a huge page may end up re-using an existing lower level
    shadow page table. This is unlike the TDP MMU, which always allocates
    new shadow page tables when splitting.

(4) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.
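
As a rough illustration of the checks implied by (1) and (2), the
eligibility test ends up looking something like the sketch below. This
is not the exact code added by this series (the helper itself is
hypothetical); kvm_mmu_available_pages(), KVM_MIN_FREE_MMU_PAGES and
sp->unsync are existing KVM constructs.

static bool eager_split_allowed(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	/* (1) Respect the limit on the number of shadow pages. */
	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
		return false;

	/* (2) Skip unsync SPs; their translation may no longer be valid. */
	if (sp->unsync)
		return false;

	return true;
}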

In Google's internal implementation of Eager Page Splitting, we do not
handle cases (3) and (4), and instead opt to skip splitting entirely
(case 3) or to only partially split (case 4). This series handles the
additional cases (patches 19-22), which comes with some additional
complexity and an additional 4KiB of memory per VM to store the extra
pte_list_desc cache. However, it also avoids the need for TLB flushes
in most cases.

About half of this series, patches 1-13, is just refactoring the
existing MMU code in preparation for splitting. The bulk of the
refactoring is to make it possible to operate on the MMU outside of a
vCPU context.

Performance
-----------

Eager page splitting moves the cost of splitting huge pages off of the
vCPU thread and onto the thread invoking VM-ioctls to configure dirty
logging. This is useful because:

 - Splitting on the vCPU thread interrupts vCPU execution and is
   disruptive to customers, whereas splitting on VM ioctl threads can
   run in parallel with vCPU execution.

 - Splitting on the VM ioctl thread is more efficient because it does
   not require performing VM-exit handling and page table walks for every
   4KiB page.

To measure the performance impact of Eager Page Splitting, I ran
dirty_log_perf_test with tdp_mmu=N, various vCPU counts, and 1GiB per
vCPU backed by 1GiB HugeTLB memory.

To measure the impact on customer performance, we can look at the time
it takes all vCPUs to dirty memory after dirty logging has been enabled.
Without Eager Page Splitting enabled, such dirtying must take faults to
split huge pages, which bottlenecks on the MMU lock.

             | "Iteration 1 dirty memory time"             |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.310786549s         | 0.058731929s         |
4            | 0.419165587s         | 0.059615316s         |
8            | 1.061233860s         | 0.060945457s         |
16           | 2.852955595s         | 0.067069980s         |
32           | 7.032750509s         | 0.078623606s         |
64           | 16.501287504s        | 0.083914116s         |

Eager Page Splitting does increase the time it takes to enable dirty
logging when not using initially-all-set, since that's when KVM splits
huge pages. However, this runs in parallel with vCPU execution and does
not bottleneck on the MMU lock.

             | "Enabling dirty logging time"               |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.001581619s         |  0.025699730s        |
4            | 0.003138664s         |  0.051510208s        |
8            | 0.006247177s         |  0.102960379s        |
16           | 0.012603892s         |  0.206949435s        |
32           | 0.026428036s         |  0.435855597s        |
64           | 0.103826796s         |  1.199686530s        |

Similarly, Eager Page Splitting increases the time it takes to clear the
dirty log when using initially-all-set. The first time userspace clears
the dirty log, KVM will split huge pages (a rough userspace sketch of
both trigger points follows the table below):

             | "Iteration 1 clear dirty log time"          |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.001544730s         | 0.055327916s         |
4            | 0.003145920s         | 0.111887354s         |
8            | 0.006306964s         | 0.223920530s         |
16           | 0.012681628s         | 0.447849488s         |
32           | 0.026827560s         | 0.943874520s         |
64           | 0.090461490s         | 2.664388025s         |
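
As a reference for the two userspace-visible trigger points above
(enabling dirty logging vs. the first clear), a minimal sketch might
look like the following. The fd, addresses, slot geometry, and dirty
bitmap are assumed to exist, error handling is omitted, and none of
this is code from this series:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Without initially-all-set, KVM eagerly splits huge pages here... */
static void enable_dirty_logging(int vm_fd, __u32 slot, __u64 gpa,
				 __u64 size, __u64 hva)
{
	struct kvm_userspace_memory_region region = {
		.slot = slot,
		.flags = KVM_MEM_LOG_DIRTY_PAGES,
		.guest_phys_addr = gpa,
		.memory_size = size,
		.userspace_addr = hva,
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}

/*
 * ...whereas with KVM_DIRTY_LOG_INITIALLY_SET, each range is instead
 * split the first time userspace clears it.
 */
static void clear_dirty_log(int vm_fd, __u32 slot, __u32 num_pages,
			    void *bitmap)
{
	struct kvm_clear_dirty_log clear = {
		.slot = slot,
		.num_pages = num_pages,
		.first_page = 0,
		.dirty_bitmap = bitmap,
	};

	ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
}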

Subsequent calls to clear the dirty log incur almost no additional cost
since KVM can very quickly determine there are no more huge pages to
split via the rmap, as illustrated by the sketch after the table below.
This is unlike the TDP MMU, which must re-traverse the entire page table
to check for huge pages.

             | "Iteration 2 clear dirty log time"          |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.015613726s         | 0.015771982s         |
4            | 0.031456620s         | 0.031911594s         |
8            | 0.063341572s         | 0.063837403s         |
16           | 0.128409332s         | 0.127484064s         |
32           | 0.255635696s         | 0.268837996s         |
64           | 0.695572818s         | 0.700420727s         |
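
To illustrate why subsequent clears are cheap, the "anything left to
split?" check boils down to walking the memslot rmaps at the huge page
levels, roughly as sketched below. This is illustrative only
(gfn_to_rmap() is an mmu.c-internal helper and this particular function
is hypothetical): once everything in the range has been split, the 2MiB
and 1GiB rmap heads are empty and the walk does no further work.

static bool range_has_huge_pages(const struct kvm_memory_slot *slot,
				 gfn_t start, gfn_t end)
{
	gfn_t gfn;
	int level;

	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
		/* start/end are assumed aligned to the huge page size. */
		for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level)) {
			if (gfn_to_rmap(gfn, level, slot)->val)
				return true;
		}
	}

	return false;
}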

Eager Page Splitting also improves performance for shadow paging
configurations, as measured with ept=N. The absolute gains are smaller,
though, since ept=N requires taking the MMU lock to track writes to 4KiB
pages (i.e. no fast_page_fault() or PML), which dominates the dirty
memory time.

             | "Iteration 1 dirty memory time"             |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.373022770s         | 0.348926043s         |
4            | 0.563697483s         | 0.453022037s         |
8            | 1.588492808s         | 1.524962010s         |
16           | 3.988934732s         | 3.369129917s         |
32           | 9.470333115s         | 8.292953856s         |
64           | 20.086419186s        | 18.531840021s        |

Testing
-------

- Ran all kvm-unit-tests and KVM selftests with all combinations of
  ept=[NY] and tdp_mmu=[NY].
- Tested VM live migration [*] with ept=N and ept=Y and observed pages
  being split via tracepoint and the pages_* stats.

[*] The live migration setup consisted of an 8 vCPU 8 GiB VM running
    on an Intel Cascade Lake host and backed by 1GiB HugeTLBFS memory.
    The VM was running Debian 10 and a workload that consisted of 16
    independent processes that each dirty memory. The tests were run
    with ept=N to exercise the interaction of Eager Page Splitting and
    shadow paging.

David Matlack (23):
  KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  KVM: x86/mmu: Derive shadow MMU page role from parent
  KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  KVM: x86/mmu: Pass memslot to kvm_mmu_create_sp()
  KVM: x86/mmu: Separate shadow MMU sp allocation from initialization
  KVM: x86/mmu: Move huge page split sp allocation code to mmu.c
  KVM: x86/mmu: Use common code to free kvm_mmu_page structs
  KVM: x86/mmu: Use common code to allocate kvm_mmu_page structs from
    vCPU caches
  KVM: x86/mmu: Pass const memslot to rmap_add()
  KVM: x86/mmu: Pass const memslot to kvm_mmu_init_sp() and descendants
  KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  KVM: x86/mmu: Update page stats in __rmap_add()
  KVM: x86/mmu: Cache the access bits of shadowed translations
  KVM: x86/mmu: Pass access information to make_huge_page_split_spte()
  KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  KVM: x86/mmu: Pass bool flush parameter to drop_large_spte()
  KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
  KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  KVM: Allow GFP flags to be passed when topping up MMU caches
  KVM: x86/mmu: Fully split huge pages that require extra pte_list_desc
    structs
  KVM: x86/mmu: Split huge pages aliased by multiple SPTEs
  KVM: selftests: Map x86_64 guest virtual memory with huge pages

 .../admin-guide/kernel-parameters.txt         |   3 -
 arch/arm64/include/asm/kvm_host.h             |   2 +-
 arch/arm64/kvm/mmu.c                          |  12 +-
 arch/mips/include/asm/kvm_host.h              |   2 +-
 arch/x86/include/asm/kvm_host.h               |  19 +-
 arch/x86/include/asm/kvm_page_track.h         |   2 +-
 arch/x86/kvm/mmu/mmu.c                        | 744 +++++++++++++++---
 arch/x86/kvm/mmu/mmu_internal.h               |  22 +-
 arch/x86/kvm/mmu/page_track.c                 |   4 +-
 arch/x86/kvm/mmu/paging_tmpl.h                |  25 +-
 arch/x86/kvm/mmu/spte.c                       |  10 +-
 arch/x86/kvm/mmu/spte.h                       |   3 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  37 +-
 arch/x86/kvm/mmu/tdp_mmu.h                    |   2 +-
 include/linux/kvm_host.h                      |   1 +
 include/linux/kvm_types.h                     |  24 +-
 .../selftests/kvm/include/x86_64/processor.h  |   6 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |   4 +-
 .../selftests/kvm/lib/x86_64/processor.c      |  31 +
 virt/kvm/kvm_main.c                           |  17 +-
 20 files changed, 765 insertions(+), 205 deletions(-)


base-commit: f02ccc0f669341de1a831dfa7ca843ebbdbc8bd7
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH 01/23] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-19  0:57   ` Sean Christopherson
  2022-02-03  1:00 ` [PATCH 02/23] KVM: x86/mmu: Derive shadow MMU page role from parent David Matlack
                   ` (22 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for
fully direct MMUs") skipped the unsync checks and write flood clearing
for fully direct MMUs. We can extend this further and skip the checks
for all direct shadow pages. Direct shadow pages are never marked unsync
and never have a non-zero write-flooding count.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 296f8723f9ae..6ca38277f2ab 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2052,7 +2052,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     int direct,
 					     unsigned int access)
 {
-	bool direct_mmu = vcpu->arch.mmu->direct_map;
 	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
 	unsigned quadrant;
@@ -2093,7 +2092,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			continue;
 		}
 
-		if (direct_mmu)
+		/* unsync and write-flooding only apply to indirect SPs. */
+		if (sp->role.direct)
 			goto trace_get_page;
 
 		if (sp->unsync) {
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH 02/23] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
  2022-02-03  1:00 ` [PATCH 01/23] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-19  1:14   ` Sean Christopherson
  2022-02-03  1:00 ` [PATCH 03/23] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions David Matlack
                   ` (21 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Instead of computing the shadow page role from scratch for every new
page, we can derive most of the information from the parent shadow page.
This avoids redundant calculations such as the quadrant, and reduces the
number of parameters to kvm_mmu_get_page().

Preemptively split out the role calculation into a separate function for
use in a following commit.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c         | 71 ++++++++++++++++++++++------------
 arch/x86/kvm/mmu/paging_tmpl.h |  9 +++--
 2 files changed, 51 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6ca38277f2ab..fc9a4d9c0ddd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2045,30 +2045,14 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
-					     gfn_t gfn,
-					     gva_t gaddr,
-					     unsigned level,
-					     int direct,
-					     unsigned int access)
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     union kvm_mmu_page_role role)
 {
-	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
-	unsigned quadrant;
 	struct kvm_mmu_page *sp;
 	int collisions = 0;
 	LIST_HEAD(invalid_list);
 
-	role = vcpu->arch.mmu->mmu_role.base;
-	role.level = level;
-	role.direct = direct;
-	role.access = access;
-	if (role.has_4_byte_gpte) {
-		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
-		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
-		role.quadrant = quadrant;
-	}
-
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
@@ -2086,7 +2070,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			 * Unsync pages must not be left as is, because the new
 			 * upper-level page will be write-protected.
 			 */
-			if (level > PG_LEVEL_4K && sp->unsync)
+			if (role.level > PG_LEVEL_4K && sp->unsync)
 				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 							 &invalid_list);
 			continue;
@@ -2125,14 +2109,14 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, direct);
+	sp = kvm_mmu_alloc_page(vcpu, role.direct);
 
 	sp->gfn = gfn;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
-	if (!direct) {
+	if (!role.direct) {
 		account_shadowed(vcpu->kvm, sp);
-		if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
+		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
 	trace_kvm_mmu_get_page(sp, true);
@@ -2144,6 +2128,31 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+static union kvm_mmu_page_role kvm_mmu_child_role(struct kvm_mmu_page *parent_sp,
+						  bool direct, u32 access)
+{
+	union kvm_mmu_page_role role;
+
+	role = parent_sp->role;
+	role.level--;
+	role.access = access;
+	role.direct = direct;
+
+	return role;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
+						 u64 *sptep, gfn_t gfn,
+						 bool direct, u32 access)
+{
+	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
+	union kvm_mmu_page_role role;
+
+	role = kvm_mmu_child_role(parent_sp, direct, access);
+
+	return kvm_mmu_get_page(vcpu, gfn, role);
+}
+
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
 					struct kvm_vcpu *vcpu, hpa_t root,
 					u64 addr)
@@ -2942,8 +2951,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (is_shadow_present_pte(*it.sptep))
 			continue;
 
-		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
-				      it.level - 1, true, ACC_ALL);
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
 
 		link_shadow_page(vcpu, it.sptep, sp);
 		if (fault->is_tdp && fault->huge_page_disallowed &&
@@ -3325,9 +3333,22 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
 static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 			    u8 level, bool direct)
 {
+	union kvm_mmu_page_role role;
 	struct kvm_mmu_page *sp;
+	unsigned int quadrant;
+
+	role = vcpu->arch.mmu->mmu_role.base;
+	role.level = level;
+	role.direct = direct;
+	role.access = ACC_ALL;
+
+	if (role.has_4_byte_gpte) {
+		quadrant = gva >> (PAGE_SHIFT + (PT64_PT_BITS * level));
+		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
+		role.quadrant = quadrant;
+	}
 
-	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
+	sp = kvm_mmu_get_page(vcpu, gfn, role);
 	++sp->root_count;
 
 	return __pa(sp->spt);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 5b5bdac97c7b..f93d4423a067 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -683,8 +683,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		if (!is_shadow_present_pte(*it.sptep)) {
 			table_gfn = gw->table_gfn[it.level - 2];
 			access = gw->pt_access[it.level - 2];
-			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
-					      it.level-1, false, access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+						  false, access);
+
 			/*
 			 * We must synchronize the pagetable before linking it
 			 * because the guest doesn't need to flush tlb when
@@ -740,8 +741,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		drop_large_spte(vcpu, it.sptep);
 
 		if (!is_shadow_present_pte(*it.sptep)) {
-			sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
-					      it.level - 1, true, direct_access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
+						  true, direct_access);
 			link_shadow_page(vcpu, it.sptep, sp);
 			if (fault->huge_page_disallowed &&
 			    fault->req_level >= it.level)
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH 03/23] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
  2022-02-03  1:00 ` [PATCH 01/23] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs David Matlack
  2022-02-03  1:00 ` [PATCH 02/23] KVM: x86/mmu: Derive shadow MMU page role from parent David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-19  1:25   ` Sean Christopherson
  2022-02-03  1:00 ` [PATCH 04/23] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages David Matlack
                   ` (20 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Decompose kvm_mmu_get_page() into separate helper functions to increase
readability and prepare for allocating shadow pages without a vcpu
pointer.

Specifically, pull the guts of kvm_mmu_get_page() into 3 helper
functions:

kvm_mmu_get_existing_sp_maybe_unsync() -
  Walks the page hash checking for any existing mmu pages that match the
  given gfn and role. Does not attempt to synchronize the page if it is
  unsync.

kvm_mmu_get_existing_sp() -
  Gets an existing page from the page hash if it exists and guarantees
  the page, if one is returned, is synced.  Implemented as a thin wrapper
  around kvm_mmu_get_existing_sp_maybe_unsync(). Requires access to a vcpu
  pointer in order to sync the page.

kvm_mmu_create_sp() -
  Allocates an entirely new kvm_mmu_page. This currently requires a
  vcpu pointer for allocation and looking up the memslot, but that will
  be removed in a future commit.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c         | 132 ++++++++++++++++++++++++---------
 arch/x86/kvm/mmu/paging_tmpl.h |   5 +-
 arch/x86/kvm/mmu/spte.c        |   5 +-
 3 files changed, 101 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fc9a4d9c0ddd..24b3cf53aa12 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2045,16 +2045,25 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					     union kvm_mmu_page_role role)
+/*
+ * Looks up an existing SP for the given gfn and role. Makes no attempt to
+ * sync the SP if it is marked unsync.
+ *
+ * If creating an upper-level page table, zaps unsynced pages for the same
+ * gfn and adds them to the invalid_list. It's the caller's responsibility
+ * to call kvm_mmu_commit_zap_page() on invalid_list.
+ */
+static struct kvm_mmu_page *kvm_mmu_get_existing_sp_maybe_unsync(struct kvm *kvm,
+								 gfn_t gfn,
+								 union kvm_mmu_page_role role,
+								 struct list_head *invalid_list)
 {
 	struct hlist_head *sp_list;
 	struct kvm_mmu_page *sp;
 	int collisions = 0;
-	LIST_HEAD(invalid_list);
 
-	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
-	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
+	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+	for_each_valid_sp(kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
 			collisions++;
 			continue;
@@ -2071,60 +2080,109 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 			 * upper-level page will be write-protected.
 			 */
 			if (role.level > PG_LEVEL_4K && sp->unsync)
-				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
-							 &invalid_list);
+				kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+
 			continue;
 		}
 
-		/* unsync and write-flooding only apply to indirect SPs. */
-		if (sp->role.direct)
-			goto trace_get_page;
+		/* Write-flooding is only tracked for indirect SPs. */
+		if (!sp->role.direct)
+			__clear_sp_write_flooding_count(sp);
 
-		if (sp->unsync) {
-			/*
-			 * The page is good, but is stale.  kvm_sync_page does
-			 * get the latest guest state, but (unlike mmu_unsync_children)
-			 * it doesn't write-protect the page or mark it synchronized!
-			 * This way the validity of the mapping is ensured, but the
-			 * overhead of write protection is not incurred until the
-			 * guest invalidates the TLB mapping.  This allows multiple
-			 * SPs for a single gfn to be unsync.
-			 *
-			 * If the sync fails, the page is zapped.  If so, break
-			 * in order to rebuild it.
-			 */
-			if (!kvm_sync_page(vcpu, sp, &invalid_list))
-				break;
+		goto out;
+	}
 
-			WARN_ON(!list_empty(&invalid_list));
-			kvm_flush_remote_tlbs(vcpu->kvm);
-		}
+	sp = NULL;
 
-		__clear_sp_write_flooding_count(sp);
+out:
+	if (collisions > kvm->stat.max_mmu_page_hash_collisions)
+		kvm->stat.max_mmu_page_hash_collisions = collisions;
+
+	return sp;
+}
 
-trace_get_page:
-		trace_kvm_mmu_get_page(sp, false);
+/*
+ * Looks up an existing SP for the given gfn and role if one exists. The
+ * returned SP is guaranteed to be synced.
+ */
+static struct kvm_mmu_page *kvm_mmu_get_existing_sp(struct kvm_vcpu *vcpu,
+						    gfn_t gfn,
+						    union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *sp;
+	LIST_HEAD(invalid_list);
+
+	sp = kvm_mmu_get_existing_sp_maybe_unsync(vcpu->kvm, gfn, role, &invalid_list);
+	if (!sp)
 		goto out;
+
+	if (sp->unsync) {
+		/*
+		 * The page is good, but is stale.  kvm_sync_page does
+		 * get the latest guest state, but (unlike mmu_unsync_children)
+		 * it doesn't write-protect the page or mark it synchronized!
+		 * This way the validity of the mapping is ensured, but the
+		 * overhead of write protection is not incurred until the
+		 * guest invalidates the TLB mapping.  This allows multiple
+		 * SPs for a single gfn to be unsync.
+		 *
+		 * If the sync fails, the page is zapped and added to the
+		 * invalid_list.
+		 */
+		if (!kvm_sync_page(vcpu, sp, &invalid_list)) {
+			sp = NULL;
+			goto out;
+		}
+
+		WARN_ON(!list_empty(&invalid_list));
+		kvm_flush_remote_tlbs(vcpu->kvm);
 	}
 
+out:
+	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+	return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_create_sp(struct kvm_vcpu *vcpu,
+					      gfn_t gfn,
+					      union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *sp;
+	struct hlist_head *sp_list;
+
 	++vcpu->kvm->stat.mmu_cache_miss;
 
 	sp = kvm_mmu_alloc_page(vcpu, role.direct);
-
 	sp->gfn = gfn;
 	sp->role = role;
+
+	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	hlist_add_head(&sp->hash_link, sp_list);
+
 	if (!role.direct) {
 		account_shadowed(vcpu->kvm, sp);
 		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
-	trace_kvm_mmu_get_page(sp, true);
-out:
-	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 
-	if (collisions > vcpu->kvm->stat.max_mmu_page_hash_collisions)
-		vcpu->kvm->stat.max_mmu_page_hash_collisions = collisions;
+	return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *sp;
+	bool created = false;
+
+	sp = kvm_mmu_get_existing_sp(vcpu, gfn, role);
+	if (sp)
+		goto out;
+
+	created = true;
+	sp = kvm_mmu_create_sp(vcpu, gfn, role);
+
+out:
+	trace_kvm_mmu_get_page(sp, created);
 	return sp;
 }
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index f93d4423a067..c533c191925e 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -692,8 +692,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 			 * the gpte is changed from non-present to present.
 			 * Otherwise, the guest may use the wrong mapping.
 			 *
-			 * For PG_LEVEL_4K, kvm_mmu_get_page() has already
-			 * synchronized it transiently via kvm_sync_page().
+			 * For PG_LEVEL_4K, kvm_mmu_get_existing_sp() has
+			 * already synchronized it transiently via
+			 * kvm_sync_page().
 			 *
 			 * For higher level pagetable, we synchronize it via
 			 * the slower mmu_sync_children().  If it needs to
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 8b5309faf5b9..20cf9e0d45dd 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -149,8 +149,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		/*
 		 * Optimization: for pte sync, if spte was writable the hash
 		 * lookup is unnecessary (and expensive). Write protection
-		 * is responsibility of kvm_mmu_get_page / kvm_mmu_sync_roots.
-		 * Same reasoning can be applied to dirty page accounting.
+		 * is responsibility of kvm_mmu_create_sp() and
+		 * kvm_mmu_sync_roots(). Same reasoning can be applied to dirty
+		 * page accounting.
 		 */
 		if (is_writable_pte(old_spte))
 			goto out;
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH 04/23] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (2 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 03/23] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-03  1:00 ` [PATCH 05/23] KVM: x86/mmu: Pass memslot to kvm_mmu_create_sp() David Matlack
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Rename 3 functions:

  kvm_mmu_get_page()   -> kvm_mmu_get_sp()
  kvm_mmu_alloc_page() -> kvm_mmu_alloc_sp()
  kvm_mmu_free_page()  -> kvm_mmu_free_sp()

This change makes it clear that these functions deal with shadow pages
rather than struct pages.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 24b3cf53aa12..6f55af9c66db 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1679,7 +1679,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+static void kvm_mmu_free_sp(struct kvm_mmu_page *sp)
 {
 	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
@@ -1717,7 +1717,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, int direct)
 {
 	struct kvm_mmu_page *sp;
 
@@ -2152,7 +2152,7 @@ static struct kvm_mmu_page *kvm_mmu_create_sp(struct kvm_vcpu *vcpu,
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, role.direct);
+	sp = kvm_mmu_alloc_sp(vcpu, role.direct);
 	sp->gfn = gfn;
 	sp->role = role;
 
@@ -2168,8 +2168,8 @@ static struct kvm_mmu_page *kvm_mmu_create_sp(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					     union kvm_mmu_page_role role)
+static struct kvm_mmu_page *kvm_mmu_get_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
+					   union kvm_mmu_page_role role)
 {
 	struct kvm_mmu_page *sp;
 	bool created = false;
@@ -2208,7 +2208,7 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
 
 	role = kvm_mmu_child_role(parent_sp, direct, access);
 
-	return kvm_mmu_get_page(vcpu, gfn, role);
+	return kvm_mmu_get_sp(vcpu, gfn, role);
 }
 
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
@@ -2478,7 +2478,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
 	list_for_each_entry_safe(sp, nsp, invalid_list, link) {
 		WARN_ON(!sp->role.invalid || sp->root_count);
-		kvm_mmu_free_page(sp);
+		kvm_mmu_free_sp(sp);
 	}
 }
 
@@ -3406,7 +3406,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 		role.quadrant = quadrant;
 	}
 
-	sp = kvm_mmu_get_page(vcpu, gfn, role);
+	sp = kvm_mmu_get_sp(vcpu, gfn, role);
 	++sp->root_count;
 
 	return __pa(sp->spt);
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH 05/23] KVM: x86/mmu: Pass memslot to kvm_mmu_create_sp()
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (3 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 04/23] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-03  1:00 ` [PATCH 06/23] KVM: x86/mmu: Separate shadow MMU sp allocation from initialization David Matlack
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Passing the memslot to kvm_mmu_create_sp() avoids the need for the vCPU
pointer when write-protecting indirect 4K shadow pages. This moves us
closer to being able to create new shadow pages during VM ioctls for
eager page splitting, where there is no vCPU pointer.

This change does not negatively impact "Populate memory time" for ept=Y
or ept=N configurations since kvm_vcpu_gfn_to_memslot() caches the last
used slot. So even though we now look up the slot more often, it is a
very cheap check.

Opportunistically move the code to write-protect GFNs shadowed by
PG_LEVEL_4K shadow pages into account_shadowed() to reduce indentation
and consolidate the code. This also eliminates a memslot lookup.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f55af9c66db..49f82addf4b5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -804,16 +804,14 @@ void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
 	update_gfn_disallow_lpage_count(slot, gfn, -1);
 }
 
-static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
+static void account_shadowed(struct kvm *kvm,
+			     struct kvm_memory_slot *slot,
+			     struct kvm_mmu_page *sp)
 {
-	struct kvm_memslots *slots;
-	struct kvm_memory_slot *slot;
 	gfn_t gfn;
 
 	kvm->arch.indirect_shadow_pages++;
 	gfn = sp->gfn;
-	slots = kvm_memslots_for_spte_role(kvm, sp->role);
-	slot = __gfn_to_memslot(slots, gfn);
 
 	/* the non-leaf shadow pages are keeping readonly. */
 	if (sp->role.level > PG_LEVEL_4K)
@@ -821,6 +819,9 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 						    KVM_PAGE_TRACK_WRITE);
 
 	kvm_mmu_gfn_disallow_lpage(slot, gfn);
+
+	if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
+		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
 }
 
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -2144,6 +2145,7 @@ static struct kvm_mmu_page *kvm_mmu_get_existing_sp(struct kvm_vcpu *vcpu,
 }
 
 static struct kvm_mmu_page *kvm_mmu_create_sp(struct kvm_vcpu *vcpu,
+					      struct kvm_memory_slot *slot,
 					      gfn_t gfn,
 					      union kvm_mmu_page_role role)
 {
@@ -2159,11 +2161,8 @@ static struct kvm_mmu_page *kvm_mmu_create_sp(struct kvm_vcpu *vcpu,
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	hlist_add_head(&sp->hash_link, sp_list);
 
-	if (!role.direct) {
-		account_shadowed(vcpu->kvm, sp);
-		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
-			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
-	}
+	if (!role.direct)
+		account_shadowed(vcpu->kvm, slot, sp);
 
 	return sp;
 }
@@ -2171,6 +2170,7 @@ static struct kvm_mmu_page *kvm_mmu_create_sp(struct kvm_vcpu *vcpu,
 static struct kvm_mmu_page *kvm_mmu_get_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
 					   union kvm_mmu_page_role role)
 {
+	struct kvm_memory_slot *slot;
 	struct kvm_mmu_page *sp;
 	bool created = false;
 
@@ -2179,7 +2179,8 @@ static struct kvm_mmu_page *kvm_mmu_get_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
 		goto out;
 
 	created = true;
-	sp = kvm_mmu_create_sp(vcpu, gfn, role);
+	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+	sp = kvm_mmu_create_sp(vcpu, slot, gfn, role);
 
 out:
 	trace_kvm_mmu_get_page(sp, created);
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH 06/23] KVM: x86/mmu: Separate shadow MMU sp allocation from initialization
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (4 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 05/23] KVM: x86/mmu: Pass memslot to kvm_mmu_create_sp() David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-16 19:37   ` Ben Gardon
  2022-02-03  1:00 ` [PATCH 07/23] KVM: x86/mmu: Move huge page split sp allocation code to mmu.c David Matlack
                   ` (17 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Separate the code that allocates a new shadow page from the vCPU caches
from the code that initializes it. This is in preparation for creating
new shadow pages from VM ioctls for eager page splitting, where we do
not have access to the vCPU caches.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++---------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 49f82addf4b5..d4f90a10b652 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1718,7 +1718,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, int direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, bool direct)
 {
 	struct kvm_mmu_page *sp;
 
@@ -1726,16 +1726,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, int direct)
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
 	if (!direct)
 		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
-	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
-	/*
-	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
-	 * depends on valid pages being added to the head of the list.  See
-	 * comments in kvm_zap_obsolete_pages().
-	 */
-	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
-	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
-	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
 	return sp;
 }
 
@@ -2144,27 +2135,34 @@ static struct kvm_mmu_page *kvm_mmu_get_existing_sp(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
-static struct kvm_mmu_page *kvm_mmu_create_sp(struct kvm_vcpu *vcpu,
-					      struct kvm_memory_slot *slot,
-					      gfn_t gfn,
-					      union kvm_mmu_page_role role)
+
+static void kvm_mmu_init_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
+			    struct kvm_memory_slot *slot, gfn_t gfn,
+			    union kvm_mmu_page_role role)
 {
-	struct kvm_mmu_page *sp;
 	struct hlist_head *sp_list;
 
-	++vcpu->kvm->stat.mmu_cache_miss;
+	++kvm->stat.mmu_cache_miss;
+
+	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
-	sp = kvm_mmu_alloc_sp(vcpu, role.direct);
 	sp->gfn = gfn;
 	sp->role = role;
+	sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
 
-	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+	/*
+	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
+	 * depends on valid pages being added to the head of the list.  See
+	 * comments in kvm_zap_obsolete_pages().
+	 */
+	list_add(&sp->link, &kvm->arch.active_mmu_pages);
+	kvm_mod_used_mmu_pages(kvm, 1);
+
+	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	hlist_add_head(&sp->hash_link, sp_list);
 
 	if (!role.direct)
-		account_shadowed(vcpu->kvm, slot, sp);
-
-	return sp;
+		account_shadowed(kvm, slot, sp);
 }
 
 static struct kvm_mmu_page *kvm_mmu_get_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
@@ -2179,8 +2177,10 @@ static struct kvm_mmu_page *kvm_mmu_get_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
 		goto out;
 
 	created = true;
+	sp = kvm_mmu_alloc_sp(vcpu, role.direct);
+
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-	sp = kvm_mmu_create_sp(vcpu, slot, gfn, role);
+	kvm_mmu_init_sp(vcpu->kvm, sp, slot, gfn, role);
 
 out:
 	trace_kvm_mmu_get_page(sp, created);
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH 07/23] KVM: x86/mmu: Move huge page split sp allocation code to mmu.c
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (5 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 06/23] KVM: x86/mmu: Separate shadow MMU sp allocation from initialization David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-03  1:00 ` [PATCH 08/23] KVM: x86/mmu: Use common code to free kvm_mmu_page structs David Matlack
                   ` (16 subsequent siblings)
  23 siblings, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Move the code that allocates a new shadow page for splitting huge pages
into mmu.c. Currently this code is only used by the TDP MMU but it will
be reused in subsequent commits to also split huge pages mapped by the
shadow MMU.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 26 ++++++++++++++++++++++++++
 arch/x86/kvm/mmu/mmu_internal.h |  2 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 23 ++---------------------
 3 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d4f90a10b652..3acdf372fa9a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1730,6 +1730,32 @@ static struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, bool direct)
 	return sp;
 }
 
+/*
+ * Allocate a new shadow page using the provided GFP flags to split a huge page.
+ *
+ * Huge page splitting always uses direct shadow pages since the huge page is
+ * being mapped directly with a lower level page table. Thus there's no need to
+ * allocate the gfns array.
+ */
+struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(gfp_t gfp)
+{
+	struct kvm_mmu_page *sp;
+
+	gfp |= __GFP_ZERO;
+
+	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
+	if (!sp)
+		return NULL;
+
+	sp->spt = (void *)__get_free_page(gfp);
+	if (!sp->spt) {
+		kmem_cache_free(mmu_page_header_cache, sp);
+		return NULL;
+	}
+
+	return sp;
+}
+
 static void mark_unsync(u64 *spte);
 static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
 {
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index da6166b5c377..2c80028695ca 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -160,4 +160,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(gfp_t gfp);
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8def8f810cb0..0d58c3d15894 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1263,25 +1263,6 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 	return spte_set;
 }
 
-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
-{
-	struct kvm_mmu_page *sp;
-
-	gfp |= __GFP_ZERO;
-
-	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
-	if (!sp)
-		return NULL;
-
-	sp->spt = (void *)__get_free_page(gfp);
-	if (!sp->spt) {
-		kmem_cache_free(mmu_page_header_cache, sp);
-		return NULL;
-	}
-
-	return sp;
-}
-
 static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 						       struct tdp_iter *iter,
 						       bool shared)
@@ -1297,7 +1278,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 	 * If this allocation fails we drop the lock and retry with reclaim
 	 * allowed.
 	 */
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+	sp = kvm_mmu_alloc_direct_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
 	if (sp)
 		return sp;
 
@@ -1309,7 +1290,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 		write_unlock(&kvm->mmu_lock);
 
 	iter->yielded = true;
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
+	sp = kvm_mmu_alloc_direct_sp_for_split(GFP_KERNEL_ACCOUNT);
 
 	if (shared)
 		read_lock(&kvm->mmu_lock);
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH 08/23] KVM: x86/mmu: Use common code to free kvm_mmu_page structs
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (6 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 07/23] KVM: x86/mmu: Move huge page split sp allocation code to mmu.c David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-03  1:00 ` [PATCH 09/23] KVM: x86/mmu: Use common code to allocate kvm_mmu_page structs from vCPU caches David Matlack
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Use a common function to free kvm_mmu_page structs in the TDP MMU and
the shadow MMU. This reduces the amount of duplicate code and is needed
in subsequent commits that allocate and free kvm_mmu_pages for eager
page splitting.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 8 ++++----
 arch/x86/kvm/mmu/mmu_internal.h | 2 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 3 +--
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3acdf372fa9a..09a178e64a04 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1680,11 +1680,8 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_sp(struct kvm_mmu_page *sp)
+void kvm_mmu_free_sp(struct kvm_mmu_page *sp)
 {
-	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
-	hlist_del(&sp->hash_link);
-	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
 		free_page((unsigned long)sp->gfns);
@@ -2505,6 +2502,9 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
 	list_for_each_entry_safe(sp, nsp, invalid_list, link) {
 		WARN_ON(!sp->role.invalid || sp->root_count);
+		MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
+		hlist_del(&sp->hash_link);
+		list_del(&sp->link);
 		kvm_mmu_free_sp(sp);
 	}
 }
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 2c80028695ca..c68f45c4a745 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -162,4 +162,6 @@ void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
 struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(gfp_t gfp);
 
+void kvm_mmu_free_sp(struct kvm_mmu_page *sp);
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0d58c3d15894..60bb29cd2b96 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -59,8 +59,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
 {
-	free_page((unsigned long)sp->spt);
-	kmem_cache_free(mmu_page_header_cache, sp);
+	kvm_mmu_free_sp(sp);
 }
 
 /*
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH 09/23] KVM: x86/mmu: Use common code to allocate kvm_mmu_page structs from vCPU caches
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (7 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 08/23] KVM: x86/mmu: Use common code to free kvm_mmu_page structs David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-03  1:00 ` [PATCH 10/23] KVM: x86/mmu: Pass const memslot to rmap_add() David Matlack
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Now that allocating a kvm_mmu_page struct is isolated to a helper
function, it can be re-used in the TDP MMU.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 2 +-
 arch/x86/kvm/mmu/mmu_internal.h | 1 +
 arch/x86/kvm/mmu/tdp_mmu.c      | 7 +------
 3 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 09a178e64a04..48ebf2bebb90 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1715,7 +1715,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, bool direct)
+struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, bool direct)
 {
 	struct kvm_mmu_page *sp;
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index c68f45c4a745..c5f2c0b9177d 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -162,6 +162,7 @@ void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
 struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(gfp_t gfp);
 
+struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, bool direct);
 void kvm_mmu_free_sp(struct kvm_mmu_page *sp);
 
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 60bb29cd2b96..4ff1af24b5aa 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -172,12 +172,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 
 static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
 {
-	struct kvm_mmu_page *sp;
-
-	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
-
-	return sp;
+	return kvm_mmu_alloc_sp(vcpu, true);
 }
 
 static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, gfn_t gfn,
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH 10/23] KVM: x86/mmu: Pass const memslot to rmap_add()
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (8 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 09/23] KVM: x86/mmu: Use common code to allocate kvm_mmu_page structs from vCPU caches David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-23 23:25   ` Ben Gardon
  2022-02-03  1:00 ` [PATCH 11/23] KVM: x86/mmu: Pass const memslot to kvm_mmu_init_sp() and descendants David Matlack
                   ` (13 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

rmap_add() only uses the slot to call gfn_to_rmap() which takes a const
memslot.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 48ebf2bebb90..a5e3bb632542 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1607,7 +1607,7 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 #define RMAP_RECYCLE_THRESHOLD 1000
 
-static void rmap_add(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 		     u64 *spte, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
-- 
2.35.0.rc2.247.g8bbb082509-goog



* [PATCH 11/23] KVM: x86/mmu: Pass const memslot to kvm_mmu_init_sp() and descendants
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (9 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 10/23] KVM: x86/mmu: Pass const memslot to rmap_add() David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-23 23:27   ` Ben Gardon
  2022-02-03  1:00 ` [PATCH 12/23] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu David Matlack
                   ` (12 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Use a const pointer so that kvm_mmu_init_sp() can be called from
contexts where we have a const pointer.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_page_track.h | 2 +-
 arch/x86/kvm/mmu/mmu.c                | 7 +++----
 arch/x86/kvm/mmu/mmu_internal.h       | 2 +-
 arch/x86/kvm/mmu/page_track.c         | 4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c            | 2 +-
 arch/x86/kvm/mmu/tdp_mmu.h            | 2 +-
 6 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a..3a2dc183ae9a 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -58,7 +58,7 @@ int kvm_page_track_create_memslot(struct kvm *kvm,
 				  unsigned long npages);
 
 void kvm_slot_page_track_add_page(struct kvm *kvm,
-				  struct kvm_memory_slot *slot, gfn_t gfn,
+				  const struct kvm_memory_slot *slot, gfn_t gfn,
 				  enum kvm_page_track_mode mode);
 void kvm_slot_page_track_remove_page(struct kvm *kvm,
 				     struct kvm_memory_slot *slot, gfn_t gfn,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a5e3bb632542..de7c47ee0def 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -805,7 +805,7 @@ void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
 }
 
 static void account_shadowed(struct kvm *kvm,
-			     struct kvm_memory_slot *slot,
+			     const struct kvm_memory_slot *slot,
 			     struct kvm_mmu_page *sp)
 {
 	gfn_t gfn;
@@ -1384,7 +1384,7 @@ int kvm_cpu_dirty_log_size(void)
 }
 
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
-				    struct kvm_memory_slot *slot, u64 gfn,
+				    const struct kvm_memory_slot *slot, u64 gfn,
 				    int min_level)
 {
 	struct kvm_rmap_head *rmap_head;
@@ -2158,9 +2158,8 @@ static struct kvm_mmu_page *kvm_mmu_get_existing_sp(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
-
 static void kvm_mmu_init_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
-			    struct kvm_memory_slot *slot, gfn_t gfn,
+			    const struct kvm_memory_slot *slot, gfn_t gfn,
 			    union kvm_mmu_page_role role)
 {
 	struct hlist_head *sp_list;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index c5f2c0b9177d..e6bcea5a0aa9 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -123,7 +123,7 @@ int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
 void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
-				    struct kvm_memory_slot *slot, u64 gfn,
+				    const struct kvm_memory_slot *slot, u64 gfn,
 				    int min_level);
 void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
 					u64 start_gfn, u64 pages);
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 68eb1fb548b6..ebd704946a35 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -83,7 +83,7 @@ int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot)
 	return 0;
 }
 
-static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
+static void update_gfn_track(const struct kvm_memory_slot *slot, gfn_t gfn,
 			     enum kvm_page_track_mode mode, short count)
 {
 	int index, val;
@@ -111,7 +111,7 @@ static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
  * @mode: tracking mode, currently only write track is supported.
  */
 void kvm_slot_page_track_add_page(struct kvm *kvm,
-				  struct kvm_memory_slot *slot, gfn_t gfn,
+				  const struct kvm_memory_slot *slot, gfn_t gfn,
 				  enum kvm_page_track_mode mode)
 {
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4ff1af24b5aa..34c451f1eac9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1645,7 +1645,7 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
  * Returns true if an SPTE was set and a TLB flush is needed.
  */
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
-				   struct kvm_memory_slot *slot, gfn_t gfn,
+				   const struct kvm_memory_slot *slot, gfn_t gfn,
 				   int min_level)
 {
 	struct kvm_mmu_page *root;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 3f987785702a..b1265149a05d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -64,7 +64,7 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				       const struct kvm_memory_slot *slot);
 
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
-				   struct kvm_memory_slot *slot, gfn_t gfn,
+				   const struct kvm_memory_slot *slot, gfn_t gfn,
 				   int min_level);
 
 void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 12/23] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (10 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 11/23] KVM: x86/mmu: Pass const memslot to kvm_mmu_init_sp() and descendants David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-23 23:30   ` Ben Gardon
  2022-02-03  1:00 ` [PATCH 13/23] KVM: x86/mmu: Update page stats in __rmap_add() David Matlack
                   ` (11 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Allow adding new entries to the rmap and linking shadow pages without a
struct kvm_vcpu pointer by moving the implementation of rmap_add() and
link_shadow_page() into inner helper functions.
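
To illustrate the shape of the refactor, here is a standalone userspace
sketch of the pattern (not KVM code; the structs and names are made up
for the example): the vCPU-facing wrapper just forwards its kvm pointer
and per-vCPU cache to an inner helper that knows nothing about vCPUs.

#include <stdio.h>

/* Hypothetical stand-ins for struct kvm, struct kvm_vcpu and a memory cache. */
struct cache { int nobjs; };
struct kvm { int id; };
struct vcpu { struct kvm *kvm; struct cache cache; };

/* Inner helper: usable from any context that has a kvm and a cache. */
static void __add_entry(struct kvm *kvm, struct cache *cache)
{
	cache->nobjs--;
	printf("added entry for VM %d, %d objects left\n", kvm->id, cache->nobjs);
}

/* Thin vCPU wrapper: preserves the existing call sites. */
static void add_entry(struct vcpu *vcpu)
{
	__add_entry(vcpu->kvm, &vcpu->cache);
}

int main(void)
{
	struct kvm kvm = { .id = 0 };
	struct vcpu vcpu = { .kvm = &kvm, .cache = { .nobjs = 2 } };

	add_entry(&vcpu);               /* vCPU context, as before */
	__add_entry(&kvm, &vcpu.cache); /* non-vCPU context, newly possible */
	return 0;
}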

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 43 +++++++++++++++++++++++++++---------------
 1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index de7c47ee0def..c2f7f026d414 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -736,9 +736,9 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
-static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
+static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
 {
-	return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
+	return kvm_mmu_memory_cache_alloc(cache);
 }
 
 static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
@@ -885,7 +885,7 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
  */
-static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
+static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 			struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
@@ -896,7 +896,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		rmap_head->val = (unsigned long)spte;
 	} else if (!(rmap_head->val & 1)) {
 		rmap_printk("%p %llx 1->many\n", spte, *spte);
-		desc = mmu_alloc_pte_list_desc(vcpu);
+		desc = mmu_alloc_pte_list_desc(cache);
 		desc->sptes[0] = (u64 *)rmap_head->val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
@@ -908,7 +908,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		while (desc->spte_count == PTE_LIST_EXT) {
 			count += PTE_LIST_EXT;
 			if (!desc->more) {
-				desc->more = mmu_alloc_pte_list_desc(vcpu);
+				desc->more = mmu_alloc_pte_list_desc(cache);
 				desc = desc->more;
 				desc->spte_count = 0;
 				break;
@@ -1607,8 +1607,10 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 #define RMAP_RECYCLE_THRESHOLD 1000
 
-static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
-		     u64 *spte, gfn_t gfn)
+static void __rmap_add(struct kvm *kvm,
+		       struct kvm_mmu_memory_cache *cache,
+		       const struct kvm_memory_slot *slot,
+		       u64 *spte, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_head *rmap_head;
@@ -1617,15 +1619,21 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 	sp = sptep_to_sp(spte);
 	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
-	rmap_count = pte_list_add(vcpu, spte, rmap_head);
+	rmap_count = pte_list_add(cache, spte, rmap_head);
 
 	if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
-		kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
+		kvm_unmap_rmapp(kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
 		kvm_flush_remote_tlbs_with_address(
-				vcpu->kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
+				kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
 	}
 }
 
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
+		     u64 *spte, gfn_t gfn)
+{
+	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
+}
+
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
@@ -1693,13 +1701,13 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
 	return hash_64(gfn, KVM_MMU_HASH_SHIFT);
 }
 
-static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
+static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
 				    struct kvm_mmu_page *sp, u64 *parent_pte)
 {
 	if (!parent_pte)
 		return;
 
-	pte_list_add(vcpu, parent_pte, &sp->parent_ptes);
+	pte_list_add(cache, parent_pte, &sp->parent_ptes);
 }
 
 static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
@@ -2297,8 +2305,8 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	__shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
-			     struct kvm_mmu_page *sp)
+static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep,
+			       struct kvm_mmu_page *sp)
 {
 	u64 spte;
 
@@ -2308,12 +2316,17 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
 
 	mmu_spte_set(sptep, spte);
 
-	mmu_page_add_parent_pte(vcpu, sp, sptep);
+	mmu_page_add_parent_pte(cache, sp, sptep);
 
 	if (sp->unsync_children || sp->unsync)
 		mark_unsync(sptep);
 }
 
+static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp)
+{
+	__link_shadow_page(&vcpu->arch.mmu_pte_list_desc_cache, sptep, sp);
+}
+
 static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 				   unsigned direct_access)
 {
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 13/23] KVM: x86/mmu: Update page stats in __rmap_add()
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (11 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 12/23] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-23 23:32   ` Ben Gardon
  2022-02-03  1:00 ` [PATCH 14/23] KVM: x86/mmu: Cache the access bits of shadowed translations David Matlack
                   ` (10 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Update the page stats in __rmap_add() rather than at the call site. This
will avoid having to manually update page stats when splitting huge
pages in a subsequent commit.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c2f7f026d414..ae1564e67e49 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1621,6 +1621,8 @@ static void __rmap_add(struct kvm *kvm,
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
 	rmap_count = pte_list_add(cache, spte, rmap_head);
 
+	kvm_update_page_stats(kvm, sp->role.level, 1);
+
 	if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
 		kvm_unmap_rmapp(kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
 		kvm_flush_remote_tlbs_with_address(
@@ -2831,7 +2833,6 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 
 	if (!was_rmapped) {
 		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
-		kvm_update_page_stats(vcpu->kvm, level, 1);
 		rmap_add(vcpu, slot, sptep, gfn);
 	}
 
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 14/23] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (12 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 13/23] KVM: x86/mmu: Update page stats in __rmap_add() David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-28 20:30   ` Ben Gardon
  2022-02-03  1:00 ` [PATCH 15/23] KVM: x86/mmu: Pass access information to make_huge_page_split_spte() David Matlack
                   ` (9 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

In order to split a huge page we need to know what access bits to assign
to the role of the new child page table. This can't be easily derived
from the huge page SPTE itself since KVM applies its own access policies
on top, such as for HugePage NX.

We could walk the guest page tables to determine the correct access
bits, but that is difficult to plumb outside of a vCPU fault context.
Instead, we can store the original access bits for each leaf SPTE
alongside the GFN in the gfns array. The access bits only take up 3
bits, leaving 61 bits for the GFN, which is more than
enough. So this change does not require any additional memory.
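
For illustration only, a standalone sketch of packing 3 access bits next
to a GFN in a single u64 (the mask/shift layout here is an assumption
for the example; the actual patch uses a C bitfield, see the diff below):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define ENTRY_ACCESS_BITS  3
#define ENTRY_ACCESS_MASK  ((UINT64_C(1) << ENTRY_ACCESS_BITS) - 1)

static uint64_t entry_pack(uint64_t gfn, uint64_t access)
{
	assert((access & ~ENTRY_ACCESS_MASK) == 0);
	return (gfn << ENTRY_ACCESS_BITS) | access;
}

static uint64_t entry_gfn(uint64_t entry)
{
	return entry >> ENTRY_ACCESS_BITS;
}

static uint64_t entry_access(uint64_t entry)
{
	return entry & ENTRY_ACCESS_MASK;
}

int main(void)
{
	uint64_t entry = entry_pack(0x123456, 0x5);

	/* Both fields round-trip from the same u64. */
	printf("gfn=0x%llx access=0x%llx\n",
	       (unsigned long long)entry_gfn(entry),
	       (unsigned long long)entry_access(entry));
	return 0;
}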

In order to keep the access bit cache in sync with the guest, we have to
extend FNAME(sync_page) to also update the access bits.

Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          | 32 +++++++++++++++++++-------------
 arch/x86/kvm/mmu/mmu_internal.h | 15 +++++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h  |  7 +++++--
 4 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c371ee7e45f7..f00004c13ccf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -686,7 +686,7 @@ struct kvm_vcpu_arch {
 
 	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
-	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
+	struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ae1564e67e49..e2306a39526a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -719,7 +719,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 	if (r)
 		return r;
 	if (maybe_indirect) {
-		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_translation_cache,
 					       PT64_ROOT_MAX_LEVEL);
 		if (r)
 			return r;
@@ -732,7 +732,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_translation_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
@@ -749,15 +749,17 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
 	if (!sp->role.direct)
-		return sp->gfns[index];
+		return sp->shadowed_translation[index].gfn;
 
 	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
 }
 
-static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
+static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
+					gfn_t gfn, u32 access)
 {
 	if (!sp->role.direct) {
-		sp->gfns[index] = gfn;
+		sp->shadowed_translation[index].gfn = gfn;
+		sp->shadowed_translation[index].access = access;
 		return;
 	}
 
@@ -1610,14 +1612,14 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 static void __rmap_add(struct kvm *kvm,
 		       struct kvm_mmu_memory_cache *cache,
 		       const struct kvm_memory_slot *slot,
-		       u64 *spte, gfn_t gfn)
+		       u64 *spte, gfn_t gfn, u32 access)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_head *rmap_head;
 	int rmap_count;
 
 	sp = sptep_to_sp(spte);
-	kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
+	kvm_mmu_page_set_gfn_access(sp, spte - sp->spt, gfn, access);
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
 	rmap_count = pte_list_add(cache, spte, rmap_head);
 
@@ -1631,9 +1633,9 @@ static void __rmap_add(struct kvm *kvm,
 }
 
 static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
-		     u64 *spte, gfn_t gfn)
+		     u64 *spte, gfn_t gfn, u32 access)
 {
-	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
+	__rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn, access);
 }
 
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -1694,7 +1696,7 @@ void kvm_mmu_free_sp(struct kvm_mmu_page *sp)
 {
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
-		free_page((unsigned long)sp->gfns);
+		free_page((unsigned long)sp->shadowed_translation);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -1731,8 +1733,12 @@ struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, bool direct)
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+
+	BUILD_BUG_ON(sizeof(sp->shadowed_translation[0]) != sizeof(u64));
+
 	if (!direct)
-		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
+		sp->shadowed_translation =
+			kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadowed_translation_cache);
 
 	return sp;
 }
@@ -1742,7 +1748,7 @@ struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, bool direct)
  *
  * Huge page splitting always uses direct shadow pages since the huge page is
  * being mapped directly with a lower level page table. Thus there's no need to
- * allocate the gfns array.
+ * allocate the shadowed_translation array.
  */
 struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(gfp_t gfp)
 {
@@ -2833,7 +2839,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 
 	if (!was_rmapped) {
 		WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
-		rmap_add(vcpu, slot, sptep, gfn);
+		rmap_add(vcpu, slot, sptep, gfn, pte_access);
 	}
 
 	return ret;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index e6bcea5a0aa9..9ee175adcc12 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -30,6 +30,11 @@ extern bool dbg;
 #define INVALID_PAE_ROOT	0
 #define IS_VALID_PAE_ROOT(x)	(!!(x))
 
+struct shadowed_translation_entry {
+	u64 access:3;
+	u64 gfn:56;
+};
+
 struct kvm_mmu_page {
 	/*
 	 * Note, "link" through "spt" fit in a single 64 byte cache line on
@@ -51,8 +56,14 @@ struct kvm_mmu_page {
 	gfn_t gfn;
 
 	u64 *spt;
-	/* hold the gfn of each spte inside spt */
-	gfn_t *gfns;
+	/*
+	 * For indirect shadow pages, caches the result of the intermediate
+	 * guest translation being shadowed by each SPTE.
+	 *
+	 * NULL for direct shadow pages.
+	 */
+	struct shadowed_translation_entry *shadowed_translation;
+
 	/* Currently serving as active root */
 	union {
 		int root_count;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c533c191925e..703dfb062bf0 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1016,7 +1016,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 }
 
 /*
- * Using the cached information from sp->gfns is safe because:
+ * Using the information in sp->shadowed_translation is safe because:
  * - The spte has a reference to the struct page, so the pfn for a given gfn
  *   can't change unless all sptes pointing to it are nuked first.
  *
@@ -1090,12 +1090,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
 			continue;
 
-		if (gfn != sp->gfns[i]) {
+		if (gfn != sp->shadowed_translation[i].gfn) {
 			drop_spte(vcpu->kvm, &sp->spt[i]);
 			flush = true;
 			continue;
 		}
 
+		if (pte_access != sp->shadowed_translation[i].access)
+			sp->shadowed_translation[i].access = pte_access;
+
 		sptep = &sp->spt[i];
 		spte = *sptep;
 		host_writable = spte & shadow_host_writable_mask;
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 15/23] KVM: x86/mmu: Pass access information to make_huge_page_split_spte()
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (13 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 14/23] KVM: x86/mmu: Cache the access bits of shadowed translations David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-28 20:32   ` Ben Gardon
  2022-02-03  1:00 ` [PATCH 16/23] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU David Matlack
                   ` (8 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Currently make_huge_page_split_spte() assumes execute permissions can be
granted to any 4K SPTE when splitting huge pages. This is true for the
TDP MMU but is not necessarily true for the shadow MMU. Huge pages
mapped by the shadow MMU may be shadowing huge pages for which the guest
has disallowed execute permissions.
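
A minimal standalone sketch of the resulting decision (the
ACC_EXEC_MASK value below is made up for the example; in KVM the guest
access bits come from the shadowed translation cached by the previous
patch):

#include <stdbool.h>
#include <stdio.h>

#define ACC_EXEC_MASK 0x1	/* illustrative value, not KVM's definition */

/*
 * A split 4K SPTE may only be made executable if the NX hugepage
 * mitigation is the reason exec was stripped AND the guest's own
 * mapping grants exec.
 */
static bool child_may_exec(bool nx_huge_pages_enabled, unsigned int access)
{
	return nx_huge_pages_enabled && (access & ACC_EXEC_MASK);
}

int main(void)
{
	printf("%d\n", child_may_exec(true, ACC_EXEC_MASK));	/* 1 */
	printf("%d\n", child_may_exec(true, 0));		/* 0: guest disallows exec */
	return 0;
}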

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/spte.c    | 5 +++--
 arch/x86/kvm/mmu/spte.h    | 3 ++-
 arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 20cf9e0d45dd..7cba5cffc240 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -215,7 +215,8 @@ static u64 make_spte_executable(u64 spte)
  * This is used during huge page splitting to build the SPTEs that make up the
  * new page table.
  */
-u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
+u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index,
+			      unsigned int access)
 {
 	u64 child_spte;
 	int child_level;
@@ -243,7 +244,7 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
 		 * When splitting to a 4K page, mark the page executable as the
 		 * NX hugepage mitigation no longer applies.
 		 */
-		if (is_nx_huge_page_enabled())
+		if (is_nx_huge_page_enabled() && (access & ACC_EXEC_MASK))
 			child_spte = make_spte_executable(child_spte);
 	}
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 73f12615416f..c7ccdd5c440d 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -415,7 +415,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte);
-u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index);
+u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index,
+			      unsigned int access);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 34c451f1eac9..02bfbc1bebbe 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1310,7 +1310,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	 * not been linked in yet and thus is not reachable from any other CPU.
 	 */
 	for (i = 0; i < PT64_ENT_PER_PAGE; i++)
-		sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i);
+		sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i, ACC_ALL);
 
 	/*
 	 * Replace the huge spte with a pointer to the populated lower level
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 16/23] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (14 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 15/23] KVM: x86/mmu: Pass access information to make_huge_page_split_spte() David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-28 20:39   ` Ben Gardon
  2022-02-03  1:00 ` [PATCH 17/23] KVM: x86/mmu: Pass bool flush parameter to drop_large_spte() David Matlack
                   ` (7 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU (i.e.
in the rmap). This leads to correct behavior because KVM never creates
intermediate huge pages during dirty logging. For example, a 1GiB page
is never partially split into 2MiB pages.

However this behavior will stop being correct once the shadow MMU
participates in eager page splitting, which can in fact leave behind
partially split huge pages. In preparation for that change, change the
shadow MMU to iterate over all levels when zapping collapsible SPTEs.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e2306a39526a..99ad7cc8683f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6038,18 +6038,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 	return need_tlb_flush;
 }
 
+static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
+					   const struct kvm_memory_slot *slot)
+{
+	bool flush;
+
+	flush = slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
+				  PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL, true);
+
+	if (flush)
+		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+
+}
+
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *slot)
 {
 	if (kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
-		/*
-		 * Zap only 4k SPTEs since the legacy MMU only supports dirty
-		 * logging at a 4k granularity and never creates collapsible
-		 * 2m SPTEs during dirty logging.
-		 */
-		if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true))
-			kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+		kvm_rmap_zap_collapsible_sptes(kvm, slot);
 		write_unlock(&kvm->mmu_lock);
 	}
 
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 17/23] KVM: x86/mmu: Pass bool flush parameter to drop_large_spte()
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (15 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 16/23] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-28 20:47   ` Ben Gardon
  2022-02-03  1:00 ` [PATCH 18/23] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (6 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

drop_large_spte() drops a large SPTE if it exists and then flushes TLBs.
Its helper function, __drop_large_spte(), does the drop without the
flush. This difference is not obvious from the name.

To make the code more readable, pass an explicit flush parameter. Also
replace the vCPU pointer with a KVM pointer so we can get rid of the
double-underscore helper function.

This is also in preparation for a future commit that will conditionally
flush after dropping a large SPTE.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c         | 25 +++++++++++--------------
 arch/x86/kvm/mmu/paging_tmpl.h |  4 ++--
 2 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 99ad7cc8683f..2d47a54e62a5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1162,23 +1162,20 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 }
 
 
-static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
+static void drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)
 {
-	if (is_large_pte(*sptep)) {
-		WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
-		drop_spte(kvm, sptep);
-		return true;
-	}
+	struct kvm_mmu_page *sp;
 
-	return false;
-}
+	if (!is_large_pte(*sptep))
+		return;
 
-static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
-{
-	if (__drop_large_spte(vcpu->kvm, sptep)) {
-		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+	sp = sptep_to_sp(sptep);
+	WARN_ON(sp->role.level == PG_LEVEL_4K);
+
+	drop_spte(kvm, sptep);
 
-		kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
+	if (flush) {
+		kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
 			KVM_PAGES_PER_HPAGE(sp->role.level));
 	}
 }
@@ -3051,7 +3048,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (it.level == fault->goal_level)
 			break;
 
-		drop_large_spte(vcpu, it.sptep);
+		drop_large_spte(vcpu->kvm, it.sptep, true);
 		if (is_shadow_present_pte(*it.sptep))
 			continue;
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 703dfb062bf0..ba61de29f2e5 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -677,7 +677,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		gfn_t table_gfn;
 
 		clear_sp_write_flooding_count(it.sptep);
-		drop_large_spte(vcpu, it.sptep);
+		drop_large_spte(vcpu->kvm, it.sptep, true);
 
 		sp = NULL;
 		if (!is_shadow_present_pte(*it.sptep)) {
@@ -739,7 +739,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 
 		validate_direct_spte(vcpu, it.sptep, direct_access);
 
-		drop_large_spte(vcpu, it.sptep);
+		drop_large_spte(vcpu->kvm, it.sptep, true);
 
 		if (!is_shadow_present_pte(*it.sptep)) {
 			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 18/23] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (16 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 17/23] KVM: x86/mmu: Pass bool flush parameter to drop_large_spte() David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-28 21:09   ` Ben Gardon
  2022-02-03  1:00 ` [PATCH 19/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs David Matlack
                   ` (5 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Extend KVM's eager page splitting to also split huge pages that are
mapped by the shadow MMU. Specifically, walk through the rmap splitting
all 1GiB pages to 2MiB pages, and splitting all 2MiB pages to 4KiB
pages.

Splitting huge pages mapped by the shadow MMU requires dealing with some
extra complexity beyond that of the TDP MMU:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Huge pages may be mapped by indirect shadow pages which have the
    possibility of being unsync. As a policy we opt not to split such
    pages as their translation may no longer be valid.

(3) Splitting a huge page may end up re-using an existing lower level
    shadow page table. This is unlike the TDP MMU, which always allocates
    new shadow page tables when splitting.  This commit does *not*
    handle such aliasing and opts not to split such huge pages.

(4) When installing the lower level SPTEs, they must be added to the
    rmap, which may require allocating additional pte_list_desc structs.
    This commit does *not* handle such cases and instead opts to leave
    such lower-level SPTEs non-present. In this situation TLBs must be
    flushed before dropping the MMU lock as a portion of the huge page
    region is being unmapped. See the sketch after this list for the
    rmap encoding behind this constraint.
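
For background on why case (4) can require an allocation at all, here is
a standalone model of the rmap head encoding this series relies on (a
simplification for illustration, not the kernel code): the head holds
either a single SPTE pointer or a chain of fixed-size pte_list_desc
entries, so adding an entry sometimes needs a fresh desc.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define PTE_LIST_EXT 14	/* illustrative fixed desc size */

struct pte_list_desc {
	int spte_count;
	struct pte_list_desc *more;
};

struct rmap_head {
	void *single;			/* one SPTE, no desc needed */
	struct pte_list_desc *descs;	/* otherwise, a chain of descs */
};

/* Would adding one more SPTE to this rmap require allocating a new desc? */
static bool need_new_desc(const struct rmap_head *head)
{
	const struct pte_list_desc *desc;

	if (!head->single && !head->descs)
		return false;			/* empty: store the SPTE inline */
	if (head->single)
		return true;			/* 1 -> many: first desc needed */

	for (desc = head->descs; desc->spte_count == PTE_LIST_EXT; desc = desc->more) {
		if (!desc->more)
			return true;		/* all descs full: chain must grow */
	}
	return false;				/* room in an existing desc */
}

int main(void)
{
	int dummy;
	struct rmap_head empty = { 0 };
	struct rmap_head one = { .single = &dummy };

	printf("empty: %d, single: %d\n", need_new_desc(&empty), need_new_desc(&one));
	return 0;
}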

Suggested-by: Peter Feiner <pfeiner@google.com>
[ This commit is based off of the original implementation of Eager Page
  Splitting from Peter in Google's kernel from 2016. ]
Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../admin-guide/kernel-parameters.txt         |   3 -
 arch/x86/kvm/mmu/mmu.c                        | 349 ++++++++++++++++++
 2 files changed, 349 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1b54e410e206..09d236cb15d6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2351,9 +2351,6 @@
 			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
 			cleared.
 
-			Eager page splitting currently only supports splitting
-			huge pages mapped by the TDP MMU.
-
 			Default is Y (on).
 
 	kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2d47a54e62a5..825cfdec589b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -738,6 +738,11 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 
 static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
 {
+	static const gfp_t gfp_nocache = GFP_ATOMIC | __GFP_ACCOUNT | __GFP_ZERO;
+
+	if (WARN_ON_ONCE(!cache))
+		return kmem_cache_alloc(pte_list_desc_cache, gfp_nocache);
+
 	return kvm_mmu_memory_cache_alloc(cache);
 }
 
@@ -754,6 +759,28 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
 }
 
+static gfn_t sptep_to_gfn(u64 *sptep)
+{
+	struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+
+	return kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
+}
+
+static unsigned int kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
+{
+	if (!sp->role.direct)
+		return sp->shadowed_translation[index].access;
+
+	return sp->role.access;
+}
+
+static unsigned int sptep_to_access(u64 *sptep)
+{
+	struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+
+	return kvm_mmu_page_get_access(sp, sptep - sp->spt);
+}
+
 static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
 					gfn_t gfn, u32 access)
 {
@@ -923,6 +950,41 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 	return count;
 }
 
+static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
+					 const struct kvm_memory_slot *slot);
+
+static bool pte_list_need_new_desc(struct kvm_rmap_head *rmap_head)
+{
+	struct pte_list_desc *desc;
+
+	if (!rmap_head->val)
+		return false;
+
+	if (!(rmap_head->val & 1))
+		return true;
+
+	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	while (desc->spte_count == PTE_LIST_EXT) {
+		if (!desc->more)
+			return true;
+		desc = desc->more;
+	}
+
+	return false;
+}
+
+/*
+ * Return true if the rmap for the given gfn and level needs a new
+ * pte_list_desc struct allocated to add a new spte.
+ */
+static bool rmap_need_new_pte_list_desc(const struct kvm_memory_slot *slot,
+					gfn_t gfn, int level)
+{
+	struct kvm_rmap_head *rmap_head = gfn_to_rmap(gfn, level, slot);
+
+	return pte_list_need_new_desc(rmap_head);
+}
+
 static void
 pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
 			   struct pte_list_desc *desc, int i,
@@ -2129,6 +2191,24 @@ static struct kvm_mmu_page *kvm_mmu_get_existing_sp_maybe_unsync(struct kvm *kvm
 	return sp;
 }
 
+static struct kvm_mmu_page *kvm_mmu_get_existing_direct_sp(struct kvm *kvm,
+							   gfn_t gfn,
+							   union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *sp;
+	LIST_HEAD(invalid_list);
+
+	BUG_ON(!role.direct);
+
+	sp = kvm_mmu_get_existing_sp_maybe_unsync(kvm, gfn, role, &invalid_list);
+
+	/* Direct SPs are never unsync. */
+	WARN_ON_ONCE(sp && sp->unsync);
+
+	kvm_mmu_commit_zap_page(kvm, &invalid_list);
+	return sp;
+}
+
 /*
  * Looks up an existing SP for the given gfn and role if one exists. The
  * return SP is guaranteed to be synced.
@@ -5955,12 +6035,275 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
+
+static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp, gfp_t gfp)
+{
+	if (*spp)
+		return 0;
+
+	*spp = kvm_mmu_alloc_direct_sp_for_split(gfp);
+
+	return *spp ? 0 : -ENOMEM;
+}
+
+static int prepare_to_split_huge_page(struct kvm *kvm,
+				      const struct kvm_memory_slot *slot,
+				      u64 *huge_sptep,
+				      struct kvm_mmu_page **spp,
+				      bool *flush,
+				      bool *dropped_lock)
+{
+	int r = 0;
+
+	*dropped_lock = false;
+
+	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
+		return -ENOSPC;
+
+	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+		goto drop_lock;
+
+	r = alloc_memory_for_split(kvm, spp, GFP_NOWAIT | __GFP_ACCOUNT);
+	if (r)
+		goto drop_lock;
+
+	return 0;
+
+drop_lock:
+	if (*flush)
+		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+
+	*flush = false;
+	*dropped_lock = true;
+
+	write_unlock(&kvm->mmu_lock);
+	cond_resched();
+	r = alloc_memory_for_split(kvm, spp, GFP_KERNEL_ACCOUNT);
+	write_lock(&kvm->mmu_lock);
+
+	return r;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
+						     const struct kvm_memory_slot *slot,
+						     u64 *huge_sptep,
+						     struct kvm_mmu_page **spp)
+{
+	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+	struct kvm_mmu_page *split_sp;
+	union kvm_mmu_page_role role;
+	unsigned int access;
+	gfn_t gfn;
+
+	gfn = sptep_to_gfn(huge_sptep);
+	access = sptep_to_access(huge_sptep);
+
+	/*
+	 * Huge page splitting always uses direct shadow pages since we are
+	 * directly mapping the huge page GFN region with smaller pages.
+	 */
+	role = kvm_mmu_child_role(huge_sp, true, access);
+	split_sp = kvm_mmu_get_existing_direct_sp(kvm, gfn, role);
+
+	/*
+	 * Opt not to split if the lower-level SP already exists. This requires
+	 * more complex handling as the SP may be already partially filled in
+	 * and may need extra pte_list_desc structs to update parent_ptes.
+	 */
+	if (split_sp)
+		return NULL;
+
+	swap(split_sp, *spp);
+	kvm_mmu_init_sp(kvm, split_sp, slot, gfn, role);
+	trace_kvm_mmu_get_page(split_sp, true);
+
+	return split_sp;
+}
+
+static int kvm_mmu_split_huge_page(struct kvm *kvm,
+				   const struct kvm_memory_slot *slot,
+				   u64 *huge_sptep, struct kvm_mmu_page **spp,
+				   bool *flush)
+
+{
+	struct kvm_mmu_page *split_sp;
+	u64 huge_spte, split_spte;
+	int split_level, index;
+	unsigned int access;
+	u64 *split_sptep;
+	gfn_t split_gfn;
+
+	split_sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
+	if (!split_sp)
+		return -EOPNOTSUPP;
+
+	/*
+	 * Since we did not allocate pte_list_desc structs for the split, we
+	 * cannot add a new parent SPTE to parent_ptes. This should never happen
+	 * in practice though since this is a fresh SP.
+	 *
+	 * Note, this makes it safe to pass NULL to __link_shadow_page() below.
+	 */
+	if (WARN_ON_ONCE(pte_list_need_new_desc(&split_sp->parent_ptes)))
+		return -EINVAL;
+
+	huge_spte = READ_ONCE(*huge_sptep);
+
+	split_level = split_sp->role.level;
+	access = split_sp->role.access;
+
+	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
+		split_sptep = &split_sp->spt[index];
+		split_gfn = kvm_mmu_page_get_gfn(split_sp, index);
+
+		BUG_ON(is_shadow_present_pte(*split_sptep));
+
+		/*
+		 * Since we did not allocate pte_list_desc structs for the
+		 * split, we can't add a new SPTE that maps this GFN.
+		 * Skipping this SPTE means we're only partially mapping the
+		 * huge page, which means we'll need to flush TLBs before
+		 * dropping the MMU lock.
+		 *
+		 * Note, this makes it safe to pass NULL to __rmap_add() below.
+		 */
+		if (rmap_need_new_pte_list_desc(slot, split_gfn, split_level)) {
+			*flush = true;
+			continue;
+		}
+
+		split_spte = make_huge_page_split_spte(
+				huge_spte, split_level + 1, index, access);
+
+		mmu_spte_set(split_sptep, split_spte);
+		__rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);
+	}
+
+	/*
+	 * Replace the huge spte with a pointer to the populated lower level
+	 * page table. Since we are making this change without a TLB flush vCPUs
+	 * will see a mix of the split mappings and the original huge mapping,
+	 * depending on what's currently in their TLB. This is fine from a
+	 * correctness standpoint since the translation will be the same either
+	 * way.
+	 */
+	drop_large_spte(kvm, huge_sptep, false);
+	__link_shadow_page(NULL, huge_sptep, split_sp);
+
+	return 0;
+}
+
+static bool should_split_huge_page(u64 *huge_sptep)
+{
+	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+
+	if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
+		return false;
+
+	if (huge_sp->role.invalid)
+		return false;
+
+	/*
+	 * As a policy, do not split huge pages if SP on which they reside
+	 * is unsync. Unsync means the guest is modifying the page table being
+	 * shadowed by huge_sp, so splitting may be a waste of cycles and
+	 * memory.
+	 */
+	if (huge_sp->unsync)
+		return false;
+
+	return true;
+}
+
+static bool rmap_try_split_huge_pages(struct kvm *kvm,
+				      struct kvm_rmap_head *rmap_head,
+				      const struct kvm_memory_slot *slot)
+{
+	struct kvm_mmu_page *sp = NULL;
+	struct rmap_iterator iter;
+	u64 *huge_sptep, spte;
+	bool flush = false;
+	bool dropped_lock;
+	int level;
+	gfn_t gfn;
+	int r;
+
+restart:
+	for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
+		if (!should_split_huge_page(huge_sptep))
+			continue;
+
+		spte = *huge_sptep;
+		level = sptep_to_sp(huge_sptep)->role.level;
+		gfn = sptep_to_gfn(huge_sptep);
+
+		r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &flush, &dropped_lock);
+		if (r) {
+			trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
+			break;
+		}
+
+		if (dropped_lock)
+			goto restart;
+
+		r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp, &flush);
+
+		trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
+
+		/*
+		 * If splitting is successful we must restart the iterator
+		 * because huge_sptep has just been removed from it.
+		 */
+		if (!r)
+			goto restart;
+	}
+
+	if (sp)
+		kvm_mmu_free_sp(sp);
+
+	return flush;
+}
+
+static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
+					  const struct kvm_memory_slot *slot,
+					  gfn_t start, gfn_t end,
+					  int target_level)
+{
+	bool flush = false;
+	int level;
+
+	/*
+	 * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
+	 * down to the target level. This ensures pages are recursively split
+	 * all the way to the target level. There's no need to split pages
+	 * already at the target level.
+	 *
+	 * Note that TLB flushes must be done before dropping the MMU lock since
+	 * rmap_try_split_huge_pages() may partially split any given huge page,
+	 * i.e. it may effectively unmap (make non-present) a portion of the
+	 * huge page.
+	 */
+	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
+		flush = slot_handle_level_range(kvm, slot,
+						rmap_try_split_huge_pages,
+						level, level, start, end - 1,
+						true, flush);
+	}
+
+	if (flush)
+		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+}
+
 /* Must be called with the mmu_lock held in write-mode. */
 void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot,
 				   u64 start, u64 end,
 				   int target_level)
 {
+	if (kvm_memslots_have_rmaps(kvm))
+		kvm_rmap_try_split_huge_pages(kvm, memslot, start, end,
+					      target_level);
+
 	if (is_tdp_mmu_enabled(kvm))
 		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
 						 target_level, false);
@@ -5978,6 +6321,12 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
 	u64 start = memslot->base_gfn;
 	u64 end = start + memslot->npages;
 
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
+		kvm_rmap_try_split_huge_pages(kvm, memslot, start, end, target_level);
+		write_unlock(&kvm->mmu_lock);
+	}
+
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
 		kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 19/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (17 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 18/23] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-24 11:28   ` Marc Zyngier
  2022-02-03  1:00 ` [PATCH 20/23] KVM: Allow GFP flags to be passed when topping up MMU caches David Matlack
                   ` (4 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
declaration time rather than being fixed for all declarations. This will
be used in a follow-up commit to declare a cache in x86 with a capacity
of 512+ objects without having to increase the capacity of all caches in
KVM.
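
The idea is the classic "struct plus trailing storage" pattern. A
standalone sketch under simplified assumptions (the struct, macro names
and default capacity are made up for the example; the real macros are in
the diff below):

#include <stdio.h>

/* Simplified stand-in for struct kvm_mmu_memory_cache. */
struct cache {
	int nobjs;
	int capacity;		/* 0 means "use the default" */
	void *objects[0];	/* storage is provided by the wrapper below */
};

#define DEFAULT_CAPACITY 40

/* Declare a cache together with backing storage for 'cap' object pointers. */
#define DEFINE_CACHE(name, cap)			\
	struct {				\
		struct cache name;		\
		void *name##_objects[cap];	\
	}

int main(void)
{
	DEFINE_CACHE(small, DEFAULT_CAPACITY) a = { { 0 } };
	DEFINE_CACHE(big, 512) b = { { 0 } };

	/* A larger-than-default cache must set .capacity at runtime. */
	b.big.capacity = 512;

	printf("default slots: %zu, big slots: %zu\n",
	       sizeof(a.small_objects) / sizeof(void *),
	       sizeof(b.big_objects) / sizeof(void *));
	return 0;
}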

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  2 +-
 arch/arm64/kvm/mmu.c              | 12 ++++++------
 arch/mips/include/asm/kvm_host.h  |  2 +-
 arch/x86/include/asm/kvm_host.h   |  8 ++++----
 include/linux/kvm_types.h         | 24 ++++++++++++++++++++++--
 virt/kvm/kvm_main.c               |  8 +++++++-
 6 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 3b44ea17af88..a450b91cc2d9 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -357,7 +357,7 @@ struct kvm_vcpu_arch {
 	bool pause;
 
 	/* Cache some mmu pages needed inside spinlock regions */
-	struct kvm_mmu_memory_cache mmu_page_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
 
 	/* Target CPU and feature flags */
 	int target;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index bc2aba953299..9c853c529b49 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -765,7 +765,8 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 {
 	phys_addr_t addr;
 	int ret = 0;
-	struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
+	DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {};
+	struct kvm_mmu_memory_cache *cache = &page_cache.cache;
 	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
 				     KVM_PGTABLE_PROT_R |
@@ -774,18 +775,17 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 	if (is_protected_kvm_enabled())
 		return -EPERM;
 
+	cache->gfp_zero = __GFP_ZERO;
 	size += offset_in_page(guest_ipa);
 	guest_ipa &= PAGE_MASK;
 
 	for (addr = guest_ipa; addr < guest_ipa + size; addr += PAGE_SIZE) {
-		ret = kvm_mmu_topup_memory_cache(&cache,
-						 kvm_mmu_cache_min_pages(kvm));
+		ret = kvm_mmu_topup_memory_cache(cache, kvm_mmu_cache_min_pages(kvm));
 		if (ret)
 			break;
 
 		spin_lock(&kvm->mmu_lock);
-		ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
-					     &cache);
+		ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot, cache);
 		spin_unlock(&kvm->mmu_lock);
 		if (ret)
 			break;
@@ -793,7 +793,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 		pa += PAGE_SIZE;
 	}
 
-	kvm_mmu_free_memory_cache(&cache);
+	kvm_mmu_free_memory_cache(cache);
 	return ret;
 }
 
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 72b90d45a46e..82bbcbc3ead6 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -346,7 +346,7 @@ struct kvm_vcpu_arch {
 	unsigned long pending_exceptions_clr;
 
 	/* Cache some mmu pages needed inside spinlock regions */
-	struct kvm_mmu_memory_cache mmu_page_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
 
 	/* vcpu's vzguestid is different on each host cpu in an smp system */
 	u32 vzguestid[NR_CPUS];
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f00004c13ccf..d0b12bfe5818 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -684,10 +684,10 @@ struct kvm_vcpu_arch {
 	 */
 	struct kvm_mmu *walk_mmu;
 
-	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
-	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
-	struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;
-	struct kvm_mmu_memory_cache mmu_page_header_cache;
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_pte_list_desc_cache);
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadow_page_cache);
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_shadowed_translation_cache);
+	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_header_cache);
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index dceac12c1ce5..9575fb8d333f 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -78,14 +78,34 @@ struct gfn_to_pfn_cache {
  * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
  * holding MMU locks.  Note, these caches act more like prefetch buffers than
  * classical caches, i.e. objects are not returned to the cache on being freed.
+ *
+ * The storage for the cache objects is laid out after the struct to allow
+ * different declarations to choose different capacities. If the capacity field
+ * is 0, the capacity is assumed to be KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE.
  */
 struct kvm_mmu_memory_cache {
 	int nobjs;
+	int capacity;
 	gfp_t gfp_zero;
 	struct kmem_cache *kmem_cache;
-	void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
+	void *objects[0];
 };
-#endif
+
+/*
+ * Note, if defining a memory cache with a non-default capacity, you must
+ * initialize the capacity field at runtime.
+ */
+#define __DEFINE_KVM_MMU_MEMORY_CACHE(_name, _capacity)	\
+	struct {						\
+		struct kvm_mmu_memory_cache _name;		\
+		void *_name##_objects[_capacity];		\
+	}
+
+/* Define a memory cache with the default capacity. */
+#define DEFINE_KVM_MMU_MEMORY_CACHE(_name) \
+	__DEFINE_KVM_MMU_MEMORY_CACHE(_name, KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE)
+
+#endif /* KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */
 
 #define HALT_POLL_HIST_COUNT			32
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 034c567a680c..afa4bdb6481e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -373,11 +373,17 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
 
 int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
 {
+	int capacity;
 	void *obj;
 
+	if (mc->capacity)
+		capacity = mc->capacity;
+	else
+		capacity = KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+
 	if (mc->nobjs >= min)
 		return 0;
-	while (mc->nobjs < ARRAY_SIZE(mc->objects)) {
+	while (mc->nobjs < capacity) {
 		obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
 		if (!obj)
 			return mc->nobjs >= min ? 0 : -ENOMEM;
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 20/23] KVM: Allow GFP flags to be passed when topping up MMU caches
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (18 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 19/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-28 21:12   ` Ben Gardon
  2022-02-03  1:00 ` [PATCH 21/23] KVM: x86/mmu: Fully split huge pages that require extra pte_list_desc structs David Matlack
                   ` (3 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

This will be used in a subsequent commit to top up MMU caches under the
MMU lock with GFP_NOWAIT as part of eager page splitting.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c      | 9 +++++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b3810976a27f..128f4c5a8122 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1329,6 +1329,7 @@ void kvm_reload_remote_mmus(struct kvm *kvm);
 
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
 int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
+int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min, gfp_t gfp);
 int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
 void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index afa4bdb6481e..c39e7ba21fab 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -371,7 +371,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
 		return (void *)__get_free_page(gfp_flags);
 }
 
-int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
+int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min, gfp_t gfp)
 {
 	int capacity;
 	void *obj;
@@ -384,7 +384,7 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
 	if (mc->nobjs >= min)
 		return 0;
 	while (mc->nobjs < capacity) {
-		obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
+		obj = mmu_memory_cache_alloc_obj(mc, gfp);
 		if (!obj)
 			return mc->nobjs >= min ? 0 : -ENOMEM;
 		mc->objects[mc->nobjs++] = obj;
@@ -392,6 +392,11 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
 	return 0;
 }
 
+int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
+{
+	return __kvm_mmu_topup_memory_cache(mc, min, GFP_KERNEL_ACCOUNT);
+}
+
 int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
 {
 	return mc->nobjs;
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 21/23] KVM: x86/mmu: Fully split huge pages that require extra pte_list_desc structs
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (19 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 20/23] KVM: Allow GFP flags to be passed when topping up MMU caches David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-28 21:22   ` Ben Gardon
  2022-02-03  1:00 ` [PATCH 22/23] KVM: x86/mmu: Split huge pages aliased by multiple SPTEs David Matlack
                   ` (2 subsequent siblings)
  23 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

When splitting a huge page, we need to add all of the lower level SPTEs
to the memslot rmap. The current implementation of eager page splitting
bails if adding an SPTE would require allocating an extra pte_list_desc
struct. Fix this limitation by allocating enough pte_list_desc structs
before splitting the huge page.

This eliminates the need for TLB flushing under the MMU lock because the
huge page is always entirely split (no subregion of the huge page is
unmapped).

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  10 ++++
 arch/x86/kvm/mmu/mmu.c          | 101 ++++++++++++++++++--------------
 2 files changed, 67 insertions(+), 44 deletions(-)
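
For a back-of-the-envelope check of the worst case (illustrative only, not
part of the patch; the numbers follow from the 512-entry shadow page tables
used by 64-bit KVM):

/*
 * Splitting one huge page installs PT64_ENT_PER_PAGE (512) new leaf SPTEs,
 * and in the worst case every one of them needs a fresh pte_list_desc for
 * its rmap chain, hence the cache capacity of 512 below.  Because the cache
 * is topped up before any SPTE is installed, a split either happens in full
 * or not at all, so no portion of the huge page is ever left unmapped and
 * no TLB flush is needed while the MMU lock is held.
 */
static_assert(HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY >= PT64_ENT_PER_PAGE);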

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d0b12bfe5818..a0f7578f7a26 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1232,6 +1232,16 @@ struct kvm_arch {
 	hpa_t	hv_root_tdp;
 	spinlock_t hv_root_tdp_lock;
 #endif
+
+	/*
+	 * Memory cache used to allocate pte_list_desc structs while splitting
+	 * huge pages. In the worst case, to split one huge page we need 512
+	 * pte_list_desc structs to add each new lower level leaf sptep to the
+	 * memslot rmap.
+	 */
+#define HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY 512
+	__DEFINE_KVM_MMU_MEMORY_CACHE(huge_page_split_desc_cache,
+				      HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY);
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 825cfdec589b..c7981a934237 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5905,6 +5905,11 @@ void kvm_mmu_init_vm(struct kvm *kvm)
 	node->track_write = kvm_mmu_pte_write;
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
+
+	kvm->arch.huge_page_split_desc_cache.capacity =
+		HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY;
+	kvm->arch.huge_page_split_desc_cache.kmem_cache = pte_list_desc_cache;
+	kvm->arch.huge_page_split_desc_cache.gfp_zero = __GFP_ZERO;
 }
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
@@ -6035,9 +6040,42 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
+static int min_descs_for_split(const struct kvm_memory_slot *slot, u64 *huge_sptep)
+{
+	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+	int split_level = huge_sp->role.level - 1;
+	int i, min = 0;
+	gfn_t gfn;
+
+	gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
 
-static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp, gfp_t gfp)
+	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
+		if (rmap_need_new_pte_list_desc(slot, gfn, split_level))
+			min++;
+
+		gfn += KVM_PAGES_PER_HPAGE(split_level);
+	}
+
+	return min;
+}
+
+static int topup_huge_page_split_desc_cache(struct kvm *kvm, int min, gfp_t gfp)
+{
+	struct kvm_mmu_memory_cache *cache =
+		&kvm->arch.huge_page_split_desc_cache;
+
+	return __kvm_mmu_topup_memory_cache(cache, min, gfp);
+}
+
+static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp,
+				  int min_descs, gfp_t gfp)
 {
+	int r;
+
+	r = topup_huge_page_split_desc_cache(kvm, min_descs, gfp);
+	if (r)
+		return r;
+
 	if (*spp)
 		return 0;
 
@@ -6050,9 +6088,9 @@ static int prepare_to_split_huge_page(struct kvm *kvm,
 				      const struct kvm_memory_slot *slot,
 				      u64 *huge_sptep,
 				      struct kvm_mmu_page **spp,
-				      bool *flush,
 				      bool *dropped_lock)
 {
+	int min_descs = min_descs_for_split(slot, huge_sptep);
 	int r = 0;
 
 	*dropped_lock = false;
@@ -6063,22 +6101,18 @@ static int prepare_to_split_huge_page(struct kvm *kvm,
 	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
 		goto drop_lock;
 
-	r = alloc_memory_for_split(kvm, spp, GFP_NOWAIT | __GFP_ACCOUNT);
+	r = alloc_memory_for_split(kvm, spp, min_descs, GFP_NOWAIT | __GFP_ACCOUNT);
 	if (r)
 		goto drop_lock;
 
 	return 0;
 
 drop_lock:
-	if (*flush)
-		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
-
-	*flush = false;
 	*dropped_lock = true;
 
 	write_unlock(&kvm->mmu_lock);
 	cond_resched();
-	r = alloc_memory_for_split(kvm, spp, GFP_KERNEL_ACCOUNT);
+	r = alloc_memory_for_split(kvm, spp, min_descs, GFP_KERNEL_ACCOUNT);
 	write_lock(&kvm->mmu_lock);
 
 	return r;
@@ -6122,10 +6156,10 @@ static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
 
 static int kvm_mmu_split_huge_page(struct kvm *kvm,
 				   const struct kvm_memory_slot *slot,
-				   u64 *huge_sptep, struct kvm_mmu_page **spp,
-				   bool *flush)
+				   u64 *huge_sptep, struct kvm_mmu_page **spp)
 
 {
+	struct kvm_mmu_memory_cache *cache;
 	struct kvm_mmu_page *split_sp;
 	u64 huge_spte, split_spte;
 	int split_level, index;
@@ -6138,9 +6172,9 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 		return -EOPNOTSUPP;
 
 	/*
-	 * Since we did not allocate pte_list_desc_structs for the split, we
-	 * cannot add a new parent SPTE to parent_ptes. This should never happen
-	 * in practice though since this is a fresh SP.
+	 * We did not allocate an extra pte_list_desc struct to add huge_sptep
+	 * to split_sp->parent_ptes. An extra pte_list_desc struct should never
+	 * be necessary in practice though since split_sp is brand new.
 	 *
 	 * Note, this makes it safe to pass NULL to __link_shadow_page() below.
 	 */
@@ -6151,6 +6185,7 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 
 	split_level = split_sp->role.level;
 	access = split_sp->role.access;
+	cache = &kvm->arch.huge_page_split_desc_cache;
 
 	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
 		split_sptep = &split_sp->spt[index];
@@ -6158,25 +6193,11 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 
 		BUG_ON(is_shadow_present_pte(*split_sptep));
 
-		/*
-		 * Since we did not allocate pte_list_desc structs for the
-		 * split, we can't add a new SPTE that maps this GFN.
-		 * Skipping this SPTE means we're only partially mapping the
-		 * huge page, which means we'll need to flush TLBs before
-		 * dropping the MMU lock.
-		 *
-		 * Note, this make it safe to pass NULL to __rmap_add() below.
-		 */
-		if (rmap_need_new_pte_list_desc(slot, split_gfn, split_level)) {
-			*flush = true;
-			continue;
-		}
-
 		split_spte = make_huge_page_split_spte(
 				huge_spte, split_level + 1, index, access);
 
 		mmu_spte_set(split_sptep, split_spte);
-		__rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);
+		__rmap_add(kvm, cache, slot, split_sptep, split_gfn, access);
 	}
 
 	/*
@@ -6222,7 +6243,6 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
 	struct kvm_mmu_page *sp = NULL;
 	struct rmap_iterator iter;
 	u64 *huge_sptep, spte;
-	bool flush = false;
 	bool dropped_lock;
 	int level;
 	gfn_t gfn;
@@ -6237,7 +6257,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
 		level = sptep_to_sp(huge_sptep)->role.level;
 		gfn = sptep_to_gfn(huge_sptep);
 
-		r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &flush, &dropped_lock);
+		r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &dropped_lock);
 		if (r) {
 			trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
 			break;
@@ -6246,7 +6266,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
 		if (dropped_lock)
 			goto restart;
 
-		r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp, &flush);
+		r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp);
 
 		trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
 
@@ -6261,7 +6281,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
 	if (sp)
 		kvm_mmu_free_sp(sp);
 
-	return flush;
+	return false;
 }
 
 static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
@@ -6269,7 +6289,6 @@ static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
 					  gfn_t start, gfn_t end,
 					  int target_level)
 {
-	bool flush;
 	int level;
 
 	/*
@@ -6277,21 +6296,15 @@ static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
 	 * down to the target level. This ensures pages are recursively split
 	 * all the way to the target level. There's no need to split pages
 	 * already at the target level.
-	 *
-	 * Note that TLB flushes must be done before dropping the MMU lock since
-	 * rmap_try_split_huge_pages() may partially split any given huge page,
-	 * i.e. it may effectively unmap (make non-present) a portion of the
-	 * huge page.
 	 */
 	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
-		flush = slot_handle_level_range(kvm, slot,
-						rmap_try_split_huge_pages,
-						level, level, start, end - 1,
-						true, flush);
+		slot_handle_level_range(kvm, slot,
+					rmap_try_split_huge_pages,
+					level, level, start, end - 1,
+					true, false);
 	}
 
-	if (flush)
-		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+	kvm_mmu_free_memory_cache(&kvm->arch.huge_page_split_desc_cache);
 }
 
 /* Must be called with the mmu_lock held in write-mode. */
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 22/23] KVM: x86/mmu: Split huge pages aliased by multiple SPTEs
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (20 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 21/23] KVM: x86/mmu: Fully split huge pages that require extra pte_list_desc structs David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-02-03  1:00 ` [PATCH 23/23] KVM: selftests: Map x86_64 guest virtual memory with huge pages David Matlack
  2022-03-07  5:21 ` [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU Peter Xu
  23 siblings, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

The existing huge page splitting code bails if it encounters a huge page
that is aliased by another SPTE that has already been split (either due
to NX huge pages or eager page splitting). Extend the huge page
splitting code to also handle such aliases.

The thing we have to be careful about is dealing with what's already in
the lower level page table. If eager page splitting were the only
operation that split huge pages, this would be fine. However, huge pages
can also be split due to NX huge pages. This means the lower level page
table may only be partially filled in and may point to even lower level
page tables that are themselves partially filled in. We can fill in the
rest of the page table, but dealing with those lower level page tables
would be too complex.

To handle this we flush TLBs after dropping the huge SPTE whenever we
are about to install a lower level page table that was partially filled
in (*). We can skip the TLB flush if the lower level page table was
empty (no aliasing) or identical to what we were already going to
populate it with (aliased huge page that was just eagerly split).

(*) This TLB flush could probably be delayed until we're about to drop
the MMU lock, which would also let us batch flushes for multiple splits.
However, such scenarios should be rare in practice (a huge page must be
aliased by multiple SPTEs and have been split for NX huge pages in only
some of them). Flushing immediately is simpler to plumb and also reduces
the chances of tripping over a CPU bug (e.g. see iTLB multi-hit).

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  5 ++-
 arch/x86/kvm/mmu/mmu.c          | 77 +++++++++++++++------------------
 2 files changed, 38 insertions(+), 44 deletions(-)
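
The flush rule above boils down to a per-entry check while populating the
lower level page table; roughly (a paraphrase of the mmu.c hunk below, not
additional code in the patch):

/*
 * For each entry of the lower level page table being installed:
 *  - not present:        install the split SPTE; no flush needed.
 *  - present leaf SPTE:  the alias was already split and maps the same 4K
 *                        page, so leave it alone; no flush needed.
 *  - present non-leaf:   it points at an even lower level page table that
 *                        may be partially filled in (e.g. by NX huge page
 *                        splitting), so installing this table can unmap
 *                        part of the huge page; flush before vCPUs rely
 *                        on it.
 */
if (is_shadow_present_pte(*split_sptep)) {
	flush |= !is_last_spte(*split_sptep, split_level);
	continue;
}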

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a0f7578f7a26..c11f27f38981 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1237,9 +1237,10 @@ struct kvm_arch {
 	 * Memory cache used to allocate pte_list_desc structs while splitting
 	 * huge pages. In the worst case, to split one huge page we need 512
 	 * pte_list_desc structs to add each new lower level leaf sptep to the
-	 * memslot rmap.
+	 * memslot rmap plus 1 to extend the parent_ptes rmap of the new lower
+	 * level page table.
 	 */
-#define HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY 512
+#define HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY 513
 	__DEFINE_KVM_MMU_MEMORY_CACHE(huge_page_split_desc_cache,
 				      HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY);
 };
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c7981a934237..62fbff8979ba 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6056,7 +6056,8 @@ static int min_descs_for_split(const struct kvm_memory_slot *slot, u64 *huge_spt
 		gfn += KVM_PAGES_PER_HPAGE(split_level);
 	}
 
-	return min;
+	/* Plus 1 to extend the parent_ptes rmap of the lower level SP. */
+	return min + 1;
 }
 
 static int topup_huge_page_split_desc_cache(struct kvm *kvm, int min, gfp_t gfp)
@@ -6126,6 +6127,7 @@ static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
 	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
 	struct kvm_mmu_page *split_sp;
 	union kvm_mmu_page_role role;
+	bool created = false;
 	unsigned int access;
 	gfn_t gfn;
 
@@ -6138,25 +6140,21 @@ static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
 	 */
 	role = kvm_mmu_child_role(huge_sp, true, access);
 	split_sp = kvm_mmu_get_existing_direct_sp(kvm, gfn, role);
-
-	/*
-	 * Opt not to split if the lower-level SP already exists. This requires
-	 * more complex handling as the SP may be already partially filled in
-	 * and may need extra pte_list_desc structs to update parent_ptes.
-	 */
 	if (split_sp)
-		return NULL;
+		goto out;
 
+	created = true;
 	swap(split_sp, *spp);
 	kvm_mmu_init_sp(kvm, split_sp, slot, gfn, role);
-	trace_kvm_mmu_get_page(split_sp, true);
 
+out:
+	trace_kvm_mmu_get_page(split_sp, created);
 	return split_sp;
 }
 
-static int kvm_mmu_split_huge_page(struct kvm *kvm,
-				   const struct kvm_memory_slot *slot,
-				   u64 *huge_sptep, struct kvm_mmu_page **spp)
+static void kvm_mmu_split_huge_page(struct kvm *kvm,
+				    const struct kvm_memory_slot *slot,
+				    u64 *huge_sptep, struct kvm_mmu_page **spp)
 
 {
 	struct kvm_mmu_memory_cache *cache;
@@ -6164,22 +6162,11 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 	u64 huge_spte, split_spte;
 	int split_level, index;
 	unsigned int access;
+	bool flush = false;
 	u64 *split_sptep;
 	gfn_t split_gfn;
 
 	split_sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
-	if (!split_sp)
-		return -EOPNOTSUPP;
-
-	/*
-	 * We did not allocate an extra pte_list_desc struct to add huge_sptep
-	 * to split_sp->parent_ptes. An extra pte_list_desc struct should never
-	 * be necessary in practice though since split_sp is brand new.
-	 *
-	 * Note, this makes it safe to pass NULL to __link_shadow_page() below.
-	 */
-	if (WARN_ON_ONCE(pte_list_need_new_desc(&split_sp->parent_ptes)))
-		return -EINVAL;
 
 	huge_spte = READ_ONCE(*huge_sptep);
 
@@ -6191,7 +6178,20 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 		split_sptep = &split_sp->spt[index];
 		split_gfn = kvm_mmu_page_get_gfn(split_sp, index);
 
-		BUG_ON(is_shadow_present_pte(*split_sptep));
+		/*
+		 * split_sp may have populated page table entries if this huge
+		 * page is aliased in multiple shadow page table entries. We
+		 * know the existing SP will be mapping the same GFN->PFN
+		 * translation since this is a direct SP. However, the SPTE may
+		 * point to an even lower level page table that may only be
+		 * partially filled in (e.g. for NX huge pages). In other words,
+		 * we may be unmapping a portion of the huge page, which
+		 * requires a TLB flush.
+		 */
+		if (is_shadow_present_pte(*split_sptep)) {
+			flush |= !is_last_spte(*split_sptep, split_level);
+			continue;
+		}
 
 		split_spte = make_huge_page_split_spte(
 				huge_spte, split_level + 1, index, access);
@@ -6202,16 +6202,12 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
 
 	/*
 	 * Replace the huge spte with a pointer to the populated lower level
-	 * page table. Since we are making this change without a TLB flush vCPUs
-	 * will see a mix of the split mappings and the original huge mapping,
-	 * depending on what's currently in their TLB. This is fine from a
-	 * correctness standpoint since the translation will be the same either
-	 * way.
+	 * page table. If the lower-level page table identically maps the huge
+	 * page, there's no need for a TLB flush. Otherwise, flush TLBs after
+	 * dropping the huge page and before installing the shadow page table.
 	 */
-	drop_large_spte(kvm, huge_sptep, false);
-	__link_shadow_page(NULL, huge_sptep, split_sp);
-
-	return 0;
+	drop_large_spte(kvm, huge_sptep, flush);
+	__link_shadow_page(cache, huge_sptep, split_sp);
 }
 
 static bool should_split_huge_page(u64 *huge_sptep)
@@ -6266,16 +6262,13 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
 		if (dropped_lock)
 			goto restart;
 
-		r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp);
-
-		trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
-
 		/*
-		 * If splitting is successful we must restart the iterator
-		 * because huge_sptep has just been removed from it.
+		 * After splitting we must restart the iterator because
+		 * huge_sptep has just been removed from it.
 		 */
-		if (!r)
-			goto restart;
+		kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp);
+		trace_kvm_mmu_split_huge_page(gfn, spte, level, 0);
+		goto restart;
 	}
 
 	if (sp)
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 23/23] KVM: selftests: Map x86_64 guest virtual memory with huge pages
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (21 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 22/23] KVM: x86/mmu: Split huge pages aliased by multiple SPTEs David Matlack
@ 2022-02-03  1:00 ` David Matlack
  2022-03-07  5:21 ` [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU Peter Xu
  23 siblings, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-02-03  1:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm, David Matlack

Override virt_map() in x86_64 selftests to use the largest page size
possible when mapping guest virtual memory. This enables testing eager
page splitting with shadow paging (e.g. kvm_intel.ept=N), as it allows
KVM to shadow guest memory with huge pages.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 .../selftests/kvm/include/x86_64/processor.h  |  6 ++++
 tools/testing/selftests/kvm/lib/kvm_util.c    |  4 +--
 .../selftests/kvm/lib/x86_64/processor.c      | 31 +++++++++++++++++++
 3 files changed, 39 insertions(+), 2 deletions(-)
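
As an example of the effect (hypothetical test snippet, not from this patch;
it assumes an existing struct kvm_vm *vm with the usual 4KiB guest page
size), a 1GiB-aligned, 1GiB-sized region passed to virt_map() is now backed
by a single 1G mapping instead of 262144 4K mappings:

uint64_t gva = 1UL << 30;			/* 1GiB-aligned GVA */
uint64_t gpa = 1UL << 30;			/* 1GiB-aligned GPA */
unsigned int npages = (1UL << 30) / 4096;	/* 262144 4KiB pages */

/* The overridden virt_map() picks X86_PAGE_SIZE_1G for this region. */
virt_map(vm, gva, gpa, npages);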

diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 8a470da7b71a..0d6014b7eaf0 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -465,6 +465,12 @@ enum x86_page_size {
 	X86_PAGE_SIZE_2M,
 	X86_PAGE_SIZE_1G,
 };
+
+static inline size_t page_size_bytes(enum x86_page_size page_size)
+{
+	return 1UL << (page_size * 9 + 12);
+}
+
 void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
 		   enum x86_page_size page_size);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index d8cf851ab119..33c4a43bffcd 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1393,8 +1393,8 @@ vm_vaddr_t vm_vaddr_alloc_page(struct kvm_vm *vm)
  * Within the VM given by @vm, creates a virtual translation for
  * @npages starting at @vaddr to the page range starting at @paddr.
  */
-void virt_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
-	      unsigned int npages)
+void __weak virt_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
+		     unsigned int npages)
 {
 	size_t page_size = vm->page_size;
 	size_t size = npages * page_size;
diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index 9f000dfb5594..7df84292d5de 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -282,6 +282,37 @@ void virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr)
 	__virt_pg_map(vm, vaddr, paddr, X86_PAGE_SIZE_4K);
 }
 
+void virt_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, unsigned int npages)
+{
+	size_t size = (size_t) npages * vm->page_size;
+	size_t vend = vaddr + size;
+	enum x86_page_size page_size;
+	size_t stride;
+
+	TEST_ASSERT(vaddr + size > vaddr, "Vaddr overflow");
+	TEST_ASSERT(paddr + size > paddr, "Paddr overflow");
+
+	/*
+	 * Map the region with all 1G pages if possible, falling back to all
+	 * 2M pages, and finally all 4K pages. This could be improved to use
+	 * a mix of page sizes so that more of the region is mapped with large
+	 * pages.
+	 */
+	for (page_size = X86_PAGE_SIZE_1G; page_size >= X86_PAGE_SIZE_4K; page_size--) {
+		stride = page_size_bytes(page_size);
+
+		if (!(vaddr % stride) && !(paddr % stride) && !(size % stride))
+			break;
+	}
+
+	TEST_ASSERT(page_size >= X86_PAGE_SIZE_4K,
+		    "Cannot map unaligned region: vaddr 0x%lx paddr 0x%lx npages 0x%x\n",
+		    vaddr, paddr, npages);
+
+	for (; vaddr < vend; vaddr += stride, paddr += stride)
+		__virt_pg_map(vm, vaddr, paddr, page_size);
+}
+
 static struct pageTableEntry *_vm_get_page_table_entry(struct kvm_vm *vm, int vcpuid,
 						       uint64_t vaddr)
 {
-- 
2.35.0.rc2.247.g8bbb082509-goog


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH 06/23] KVM: x86/mmu: Separate shadow MMU sp allocation from initialization
  2022-02-03  1:00 ` [PATCH 06/23] KVM: x86/mmu: Separate shadow MMU sp allocation from initialization David Matlack
@ 2022-02-16 19:37   ` Ben Gardon
  2022-02-16 21:42     ` David Matlack
  0 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2022-02-16 19:37 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Wed, Feb 2, 2022 at 5:02 PM David Matlack <dmatlack@google.com> wrote:
>
> Separate the code that allocates a new shadow page from the vCPU caches
> from the code that initializes it. This is in preparation for creating
> new shadow pages from VM ioctls for eager page splitting, where we do
> not have access to the vCPU caches.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++---------------------
>  1 file changed, 22 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 49f82addf4b5..d4f90a10b652 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1718,7 +1718,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
>         mmu_spte_clear_no_track(parent_pte);
>  }
>
> -static struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, int direct)
> +static struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, bool direct)
>  {
>         struct kvm_mmu_page *sp;
>
> @@ -1726,16 +1726,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, int direct)
>         sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
>         if (!direct)
>                 sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> -       set_page_private(virt_to_page(sp->spt), (unsigned long)sp);

I'd be inclined to leave this in the allocation function instead of
moving it to the init function. It might not be any less code, but if
you're doing the sp -> page link here, you might as well do the page
-> sp link too.
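
I.e. something like this (just a sketch of the suggestion, not a tested
diff; the mmu_page_header_cache allocation is recalled from the surrounding
code that isn't quoted above):

static struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, bool direct)
{
	struct kvm_mmu_page *sp;

	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
	if (!direct)
		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);

	/* Keep the page -> sp back-link next to the allocation of sp->spt. */
	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);

	return sp;
}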

>
>
> -       /*
> -        * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> -        * depends on valid pages being added to the head of the list.  See
> -        * comments in kvm_zap_obsolete_pages().
> -        */
> -       sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> -       list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
> -       kvm_mod_used_mmu_pages(vcpu->kvm, +1);
>         return sp;
>  }
>
> @@ -2144,27 +2135,34 @@ static struct kvm_mmu_page *kvm_mmu_get_existing_sp(struct kvm_vcpu *vcpu,
>         return sp;
>  }
>
> -static struct kvm_mmu_page *kvm_mmu_create_sp(struct kvm_vcpu *vcpu,
> -                                             struct kvm_memory_slot *slot,
> -                                             gfn_t gfn,
> -                                             union kvm_mmu_page_role role)
> +
> +static void kvm_mmu_init_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
> +                           struct kvm_memory_slot *slot, gfn_t gfn,
> +                           union kvm_mmu_page_role role)
>  {
> -       struct kvm_mmu_page *sp;
>         struct hlist_head *sp_list;
>
> -       ++vcpu->kvm->stat.mmu_cache_miss;
> +       ++kvm->stat.mmu_cache_miss;
> +
> +       set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>
> -       sp = kvm_mmu_alloc_sp(vcpu, role.direct);
>         sp->gfn = gfn;
>         sp->role = role;
> +       sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
>
> -       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> +       /*
> +        * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> +        * depends on valid pages being added to the head of the list.  See
> +        * comments in kvm_zap_obsolete_pages().
> +        */
> +       list_add(&sp->link, &kvm->arch.active_mmu_pages);
> +       kvm_mod_used_mmu_pages(kvm, 1);
> +
> +       sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>         hlist_add_head(&sp->hash_link, sp_list);
>
>         if (!role.direct)
> -               account_shadowed(vcpu->kvm, slot, sp);
> -
> -       return sp;
> +               account_shadowed(kvm, slot, sp);
>  }
>
>  static struct kvm_mmu_page *kvm_mmu_get_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
> @@ -2179,8 +2177,10 @@ static struct kvm_mmu_page *kvm_mmu_get_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
>                 goto out;
>
>         created = true;
> +       sp = kvm_mmu_alloc_sp(vcpu, role.direct);
> +
>         slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> -       sp = kvm_mmu_create_sp(vcpu, slot, gfn, role);
> +       kvm_mmu_init_sp(vcpu->kvm, sp, slot, gfn, role);
>
>  out:
>         trace_kvm_mmu_get_page(sp, created);
> --
> 2.35.0.rc2.247.g8bbb082509-goog
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 06/23] KVM: x86/mmu: Separate shadow MMU sp allocation from initialization
  2022-02-16 19:37   ` Ben Gardon
@ 2022-02-16 21:42     ` David Matlack
  0 siblings, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-02-16 21:42 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Wed, Feb 16, 2022 at 11:37 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Feb 2, 2022 at 5:02 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Separate the code that allocates a new shadow page from the vCPU caches
> > from the code that initializes it. This is in preparation for creating
> > new shadow pages from VM ioctls for eager page splitting, where we do
> > not have access to the vCPU caches.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++---------------------
> >  1 file changed, 22 insertions(+), 22 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 49f82addf4b5..d4f90a10b652 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1718,7 +1718,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
> >         mmu_spte_clear_no_track(parent_pte);
> >  }
> >
> > -static struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, int direct)
> > +static struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, bool direct)
> >  {
> >         struct kvm_mmu_page *sp;
> >
> > @@ -1726,16 +1726,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, int direct)
> >         sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> >         if (!direct)
> >                 sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> > -       set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>
> I'd be inclined to leave this in the allocation function instead of
> moving it to the init function. It might not be any less code, but if
> you're doing the sp -> page link here, you might as well do the page
> -> sp link too.

Good suggestion. I'll include that change in the next version.
>
> >
> >
> > -       /*
> > -        * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> > -        * depends on valid pages being added to the head of the list.  See
> > -        * comments in kvm_zap_obsolete_pages().
> > -        */
> > -       sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> > -       list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
> > -       kvm_mod_used_mmu_pages(vcpu->kvm, +1);
> >         return sp;
> >  }
> >
> > @@ -2144,27 +2135,34 @@ static struct kvm_mmu_page *kvm_mmu_get_existing_sp(struct kvm_vcpu *vcpu,
> >         return sp;
> >  }
> >
> > -static struct kvm_mmu_page *kvm_mmu_create_sp(struct kvm_vcpu *vcpu,
> > -                                             struct kvm_memory_slot *slot,
> > -                                             gfn_t gfn,
> > -                                             union kvm_mmu_page_role role)
> > +
> > +static void kvm_mmu_init_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
> > +                           struct kvm_memory_slot *slot, gfn_t gfn,
> > +                           union kvm_mmu_page_role role)
> >  {
> > -       struct kvm_mmu_page *sp;
> >         struct hlist_head *sp_list;
> >
> > -       ++vcpu->kvm->stat.mmu_cache_miss;
> > +       ++kvm->stat.mmu_cache_miss;
> > +
> > +       set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> >
> > -       sp = kvm_mmu_alloc_sp(vcpu, role.direct);
> >         sp->gfn = gfn;
> >         sp->role = role;
> > +       sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
> >
> > -       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> > +       /*
> > +        * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> > +        * depends on valid pages being added to the head of the list.  See
> > +        * comments in kvm_zap_obsolete_pages().
> > +        */
> > +       list_add(&sp->link, &kvm->arch.active_mmu_pages);
> > +       kvm_mod_used_mmu_pages(kvm, 1);
> > +
> > +       sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> >         hlist_add_head(&sp->hash_link, sp_list);
> >
> >         if (!role.direct)
> > -               account_shadowed(vcpu->kvm, slot, sp);
> > -
> > -       return sp;
> > +               account_shadowed(kvm, slot, sp);
> >  }
> >
> >  static struct kvm_mmu_page *kvm_mmu_get_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
> > @@ -2179,8 +2177,10 @@ static struct kvm_mmu_page *kvm_mmu_get_sp(struct kvm_vcpu *vcpu, gfn_t gfn,
> >                 goto out;
> >
> >         created = true;
> > +       sp = kvm_mmu_alloc_sp(vcpu, role.direct);
> > +
> >         slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > -       sp = kvm_mmu_create_sp(vcpu, slot, gfn, role);
> > +       kvm_mmu_init_sp(vcpu->kvm, sp, slot, gfn, role);
> >
> >  out:
> >         trace_kvm_mmu_get_page(sp, created);
> > --
> > 2.35.0.rc2.247.g8bbb082509-goog
> >

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 01/23] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  2022-02-03  1:00 ` [PATCH 01/23] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs David Matlack
@ 2022-02-19  0:57   ` Sean Christopherson
  0 siblings, 0 replies; 65+ messages in thread
From: Sean Christopherson @ 2022-02-19  0:57 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Vitaly Kuznetsov, Peter Xu, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Peter Feiner, Andrew Jones, maciej.szmigiero, kvm

On Thu, Feb 03, 2022, David Matlack wrote:
> Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for
> fully direct MMUs") skipped the unsync checks and write flood clearing
> for full direct MMUs. We can extend this further and skip the checks for
> all direct shadow pages. Direct shadow pages are never marked unsynced
> or have a non-zero write-flooding count.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

>  arch/x86/kvm/mmu/mmu.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 296f8723f9ae..6ca38277f2ab 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2052,7 +2052,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  					     int direct,
>  					     unsigned int access)
>  {
> -	bool direct_mmu = vcpu->arch.mmu->direct_map;
>  	union kvm_mmu_page_role role;
>  	struct hlist_head *sp_list;
>  	unsigned quadrant;
> @@ -2093,7 +2092,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  			continue;
>  		}
>  
> -		if (direct_mmu)
> +		/* unsync and write-flooding only apply to indirect SPs. */
> +		if (sp->role.direct)

Because I spent waaaay too much time over-analyzing this... checking sp->role.direct
actually generates better code than checking @direct.  Because of register pressure,
@direct has to get shoved onto the stack and then pulled back off.  Not that it
matters, at all, because this code runs exactly once...

>  			goto trace_get_page;
>  
>  		if (sp->unsync) {
> -- 
> 2.35.0.rc2.247.g8bbb082509-goog
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 02/23] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-02-03  1:00 ` [PATCH 02/23] KVM: x86/mmu: Derive shadow MMU page role from parent David Matlack
@ 2022-02-19  1:14   ` Sean Christopherson
  2022-02-24 18:45     ` David Matlack
  2022-03-04  0:22     ` David Matlack
  0 siblings, 2 replies; 65+ messages in thread
From: Sean Christopherson @ 2022-02-19  1:14 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Vitaly Kuznetsov, Peter Xu, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Peter Feiner, Andrew Jones, maciej.szmigiero, kvm

On Thu, Feb 03, 2022, David Matlack wrote:
> Instead of computing the shadow page role from scratch for every new
> page, we can derive most of the information from the parent shadow page.
> This avoids redundant calculations such as the quadrant, and reduces the

Uh, calculating quadrant isn't redundant.  The quadrant forces KVM to use different
(multiple) shadow pages to shadow a single guest PTE when the guest is using 32-bit
paging (1024 guest PTEs per guest page table vs. 512 SPTEs per shadow page
table).  The reason quadrant
is "quad" and not more or less is because 32-bit paging has two levels.  First-level
PTEs can have quadrant=0/1, and that gets doubled for second-level PTEs because we
need to use four PTEs (two to handle 2x guest PTEs, and each of those needs to be
unique for the first-level PTEs they point at).
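
For anyone following along, the arithmetic behind the quadrants
(illustrative, not from the original mail):

/*
 * 32-bit non-PAE guest on a 64-bit host:
 *   guest PTEs per 4KiB guest page table:  4096 / 4 = 1024
 *   SPTEs per 4KiB shadow page table:      4096 / 8 =  512
 *   => each guest page table needs 1024 / 512 = 2 shadow pages, picked by
 *      quadrant 0/1, and the second level needs up to 4 (quadrant 0-3).
 */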

Indeed, this fails spectacularly when attempting to boot a 32-bit non-PAE kernel
with shadow paging enabled.

 BUG: unable to handle page fault for address: ff9fa81c
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 *pde = 00000000
 Oops: 0000 [#1] SMP
 CPU: 0 PID: 0 Comm: swapper G        W         5.12.0 #10
 EIP: memblock_add_range.isra.18.constprop.23
 Code: <83> 79 04 00 75 2c 83 38 01 75 06 83 78 08 00 74 02 0f 0b 89 11 8b
 EAX: c2af24bc EBX: fdffffff ECX: ff9fa818 EDX: 02000000
 ESI: 02000000 EDI: 00000000 EBP: c2909f30 ESP: c2909f0c
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210006
 CR0: 80050033 CR2: ff9fa81c CR3: 02b76000 CR4: 00040600
 Call Trace:
  ? printk
  memblock_reserve
  ? 0xc2000000
  setup_arch
  ? vprintk_default
  ? vprintk
  start_kernel
  i386_start_kernel
  startup_32_smp

 CR2: 00000000ff9fa81c

 EIP: memblock_add_range.isra.18.constprop.23
 Code: <83> 79 04 00 75 2c 83 38 01 75 06 83 78 08 00 74 02 0f 0b 89 11 8b
 EAX: c2af24bc EBX: fdffffff ECX: ff9fa818 EDX: 02000000
 ESI: 02000000 EDI: 00000000 EBP: c2909f30 ESP: c2909f0c
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210006
 CR0: 80050033 CR2: ff9fa81c CR3: 02b76000 CR4: 00040600

> number of parameters to kvm_mmu_get_page().
> 
> Preemptivel split out the role calculation to a separate function for

Preemptively.

> use in a following commit.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 03/23] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  2022-02-03  1:00 ` [PATCH 03/23] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions David Matlack
@ 2022-02-19  1:25   ` Sean Christopherson
  2022-02-24 18:54     ` David Matlack
  0 siblings, 1 reply; 65+ messages in thread
From: Sean Christopherson @ 2022-02-19  1:25 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Vitaly Kuznetsov, Peter Xu, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Peter Feiner, Andrew Jones, maciej.szmigiero, kvm

On Thu, Feb 03, 2022, David Matlack wrote:
> Decompose kvm_mmu_get_page() into separate helper functions to increase
> readability and prepare for allocating shadow pages without a vcpu
> pointer.
> 
> Specifically, pull the guts of kvm_mmu_get_page() into 3 helper
> functions:
> 
> kvm_mmu_get_existing_sp_mabye_unsync() -

Heh, this ain't Java.   Just add two underscores to whatever its primary caller
ends up being named; that succinctly documents the relationship _and_ suggests
that there's some "danger" in using the inner helper.

>   Walks the page hash checking for any existing mmu pages that match the
>   given gfn and role. Does not attempt to synchronize the page if it is
>   unsync.
> 
> kvm_mmu_get_existing_sp() -

Meh.  We should really be able to distill this down to something like
kvm_mmu_find_sp().  I'm also tempted to say we go with shadow_page instead of
"sp" for these helpers, so long as the line lengths don't get too brutal.  KVM
uses "sp" and "spte" in lots of places, but I suspect it would be helpful to
KVM newbies if the core routines actually spell out shadow_page, a la
to_shadow_page().

>   Gets an existing page from the page hash if it exists and guarantees
>   the page, if one is returned, is synced.  Implemented as a thin wrapper
>   around kvm_mmu_get_existing_page_mabye_unsync. Requres access to a vcpu
>   pointer in order to sync the page.
> 
> kvm_mmu_create_sp()

Probably prefer s/create/alloc to match existing terminology for allocating roots.
Though looking through the series, there's going to be a lot of juggling of names.

It probably makes sense to figure out what names we want to end up with and then
work back from there.  I'll be back next week for a proper bikeshed session. :-)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 10/23] KVM: x86/mmu: Pass const memslot to rmap_add()
  2022-02-03  1:00 ` [PATCH 10/23] KVM: x86/mmu: Pass const memslot to rmap_add() David Matlack
@ 2022-02-23 23:25   ` Ben Gardon
  0 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2022-02-23 23:25 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Wed, Feb 2, 2022 at 5:02 PM David Matlack <dmatlack@google.com> wrote:
>
> rmap_add() only uses the slot to call gfn_to_rmap() which takes a const
> memslot.
>
> No functional change intended.
>

Reviewed-by: Ben Gardon <bgardon@google.com>

> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 48ebf2bebb90..a5e3bb632542 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1607,7 +1607,7 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>
>  #define RMAP_RECYCLE_THRESHOLD 1000
>
> -static void rmap_add(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> +static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
>                      u64 *spte, gfn_t gfn)
>  {
>         struct kvm_mmu_page *sp;
> --
> 2.35.0.rc2.247.g8bbb082509-goog
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 11/23] KVM: x86/mmu: Pass const memslot to kvm_mmu_init_sp() and descendants
  2022-02-03  1:00 ` [PATCH 11/23] KVM: x86/mmu: Pass const memslot to kvm_mmu_init_sp() and descendants David Matlack
@ 2022-02-23 23:27   ` Ben Gardon
  0 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2022-02-23 23:27 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Wed, Feb 2, 2022 at 5:02 PM David Matlack <dmatlack@google.com> wrote:
>
> Use a const pointer so that kvm_mmu_init_sp() can be called from
> contexts where we have a const pointer.
>
> No functional change intended.
>

Reviewed-by: Ben Gardon <bgardon@google.com>

> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/include/asm/kvm_page_track.h | 2 +-
>  arch/x86/kvm/mmu/mmu.c                | 7 +++----
>  arch/x86/kvm/mmu/mmu_internal.h       | 2 +-
>  arch/x86/kvm/mmu/page_track.c         | 4 ++--
>  arch/x86/kvm/mmu/tdp_mmu.c            | 2 +-
>  arch/x86/kvm/mmu/tdp_mmu.h            | 2 +-
>  6 files changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
> index eb186bc57f6a..3a2dc183ae9a 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -58,7 +58,7 @@ int kvm_page_track_create_memslot(struct kvm *kvm,
>                                   unsigned long npages);
>
>  void kvm_slot_page_track_add_page(struct kvm *kvm,
> -                                 struct kvm_memory_slot *slot, gfn_t gfn,
> +                                 const struct kvm_memory_slot *slot, gfn_t gfn,
>                                   enum kvm_page_track_mode mode);
>  void kvm_slot_page_track_remove_page(struct kvm *kvm,
>                                      struct kvm_memory_slot *slot, gfn_t gfn,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a5e3bb632542..de7c47ee0def 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -805,7 +805,7 @@ void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
>  }
>
>  static void account_shadowed(struct kvm *kvm,
> -                            struct kvm_memory_slot *slot,
> +                            const struct kvm_memory_slot *slot,
>                              struct kvm_mmu_page *sp)
>  {
>         gfn_t gfn;
> @@ -1384,7 +1384,7 @@ int kvm_cpu_dirty_log_size(void)
>  }
>
>  bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
> -                                   struct kvm_memory_slot *slot, u64 gfn,
> +                                   const struct kvm_memory_slot *slot, u64 gfn,
>                                     int min_level)
>  {
>         struct kvm_rmap_head *rmap_head;
> @@ -2158,9 +2158,8 @@ static struct kvm_mmu_page *kvm_mmu_get_existing_sp(struct kvm_vcpu *vcpu,
>         return sp;
>  }
>
> -
>  static void kvm_mmu_init_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
> -                           struct kvm_memory_slot *slot, gfn_t gfn,
> +                           const struct kvm_memory_slot *slot, gfn_t gfn,
>                             union kvm_mmu_page_role role)
>  {
>         struct hlist_head *sp_list;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index c5f2c0b9177d..e6bcea5a0aa9 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -123,7 +123,7 @@ int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
>  void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
>  void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
>  bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
> -                                   struct kvm_memory_slot *slot, u64 gfn,
> +                                   const struct kvm_memory_slot *slot, u64 gfn,
>                                     int min_level);
>  void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
>                                         u64 start_gfn, u64 pages);
> diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> index 68eb1fb548b6..ebd704946a35 100644
> --- a/arch/x86/kvm/mmu/page_track.c
> +++ b/arch/x86/kvm/mmu/page_track.c
> @@ -83,7 +83,7 @@ int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot)
>         return 0;
>  }
>
> -static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
> +static void update_gfn_track(const struct kvm_memory_slot *slot, gfn_t gfn,
>                              enum kvm_page_track_mode mode, short count)
>  {
>         int index, val;
> @@ -111,7 +111,7 @@ static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
>   * @mode: tracking mode, currently only write track is supported.
>   */
>  void kvm_slot_page_track_add_page(struct kvm *kvm,
> -                                 struct kvm_memory_slot *slot, gfn_t gfn,
> +                                 const struct kvm_memory_slot *slot, gfn_t gfn,
>                                   enum kvm_page_track_mode mode)
>  {
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 4ff1af24b5aa..34c451f1eac9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1645,7 +1645,7 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
>   * Returns true if an SPTE was set and a TLB flush is needed.
>   */
>  bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
> -                                  struct kvm_memory_slot *slot, gfn_t gfn,
> +                                  const struct kvm_memory_slot *slot, gfn_t gfn,
>                                    int min_level)
>  {
>         struct kvm_mmu_page *root;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 3f987785702a..b1265149a05d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -64,7 +64,7 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
>                                        const struct kvm_memory_slot *slot);
>
>  bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
> -                                  struct kvm_memory_slot *slot, gfn_t gfn,
> +                                  const struct kvm_memory_slot *slot, gfn_t gfn,
>                                    int min_level);
>
>  void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
> --
> 2.35.0.rc2.247.g8bbb082509-goog
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 12/23] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  2022-02-03  1:00 ` [PATCH 12/23] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu David Matlack
@ 2022-02-23 23:30   ` Ben Gardon
  0 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2022-02-23 23:30 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Wed, Feb 2, 2022 at 5:02 PM David Matlack <dmatlack@google.com> wrote:
>
> Allow adding new entries to the rmap and linking shadow pages without a
> struct kvm_vcpu pointer by moving the implementation of rmap_add() and
> link_shadow_page() into inner helper functions.
>
> No functional change intended.
>

Reviewed-by: Ben Gardon <bgardon@google.com>

> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 43 +++++++++++++++++++++++++++---------------
>  1 file changed, 28 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index de7c47ee0def..c2f7f026d414 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -736,9 +736,9 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
>  }
>
> -static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
> +static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
>  {
> -       return kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache);
> +       return kvm_mmu_memory_cache_alloc(cache);
>  }
>
>  static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
> @@ -885,7 +885,7 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
>  /*
>   * Returns the number of pointers in the rmap chain, not counting the new one.
>   */
> -static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
> +static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
>                         struct kvm_rmap_head *rmap_head)
>  {
>         struct pte_list_desc *desc;
> @@ -896,7 +896,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
>                 rmap_head->val = (unsigned long)spte;
>         } else if (!(rmap_head->val & 1)) {
>                 rmap_printk("%p %llx 1->many\n", spte, *spte);
> -               desc = mmu_alloc_pte_list_desc(vcpu);
> +               desc = mmu_alloc_pte_list_desc(cache);
>                 desc->sptes[0] = (u64 *)rmap_head->val;
>                 desc->sptes[1] = spte;
>                 desc->spte_count = 2;
> @@ -908,7 +908,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
>                 while (desc->spte_count == PTE_LIST_EXT) {
>                         count += PTE_LIST_EXT;
>                         if (!desc->more) {
> -                               desc->more = mmu_alloc_pte_list_desc(vcpu);
> +                               desc->more = mmu_alloc_pte_list_desc(cache);
>                                 desc = desc->more;
>                                 desc->spte_count = 0;
>                                 break;
> @@ -1607,8 +1607,10 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>
>  #define RMAP_RECYCLE_THRESHOLD 1000
>
> -static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
> -                    u64 *spte, gfn_t gfn)
> +static void __rmap_add(struct kvm *kvm,
> +                      struct kvm_mmu_memory_cache *cache,
> +                      const struct kvm_memory_slot *slot,
> +                      u64 *spte, gfn_t gfn)
>  {
>         struct kvm_mmu_page *sp;
>         struct kvm_rmap_head *rmap_head;
> @@ -1617,15 +1619,21 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
>         sp = sptep_to_sp(spte);
>         kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
>         rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
> -       rmap_count = pte_list_add(vcpu, spte, rmap_head);
> +       rmap_count = pte_list_add(cache, spte, rmap_head);
>
>         if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
> -               kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
> +               kvm_unmap_rmapp(kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
>                 kvm_flush_remote_tlbs_with_address(
> -                               vcpu->kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
> +                               kvm, sp->gfn, KVM_PAGES_PER_HPAGE(sp->role.level));
>         }
>  }
>
> +static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
> +                    u64 *spte, gfn_t gfn)
> +{
> +       __rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
> +}
> +
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>         bool young = false;
> @@ -1693,13 +1701,13 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
>         return hash_64(gfn, KVM_MMU_HASH_SHIFT);
>  }
>
> -static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
> +static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
>                                     struct kvm_mmu_page *sp, u64 *parent_pte)
>  {
>         if (!parent_pte)
>                 return;
>
> -       pte_list_add(vcpu, parent_pte, &sp->parent_ptes);
> +       pte_list_add(cache, parent_pte, &sp->parent_ptes);
>  }
>
>  static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
> @@ -2297,8 +2305,8 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
>         __shadow_walk_next(iterator, *iterator->sptep);
>  }
>
> -static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
> -                            struct kvm_mmu_page *sp)
> +static void __link_shadow_page(struct kvm_mmu_memory_cache *cache, u64 *sptep,
> +                              struct kvm_mmu_page *sp)
>  {
>         u64 spte;
>
> @@ -2308,12 +2316,17 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
>
>         mmu_spte_set(sptep, spte);
>
> -       mmu_page_add_parent_pte(vcpu, sp, sptep);
> +       mmu_page_add_parent_pte(cache, sp, sptep);
>
>         if (sp->unsync_children || sp->unsync)
>                 mark_unsync(sptep);
>  }
>
> +static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp)
> +{
> +       __link_shadow_page(&vcpu->arch.mmu_pte_list_desc_cache, sptep, sp);
> +}
> +
>  static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>                                    unsigned direct_access)
>  {
> --
> 2.35.0.rc2.247.g8bbb082509-goog
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 13/23] KVM: x86/mmu: Update page stats in __rmap_add()
  2022-02-03  1:00 ` [PATCH 13/23] KVM: x86/mmu: Update page stats in __rmap_add() David Matlack
@ 2022-02-23 23:32   ` Ben Gardon
  2022-02-23 23:35     ` Ben Gardon
  0 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2022-02-23 23:32 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Wed, Feb 2, 2022 at 5:02 PM David Matlack <dmatlack@google.com> wrote:
>
> Update the page stats in __rmap_add() rather than at the call site. This
> will avoid having to manually update page stats when splitting huge
> pages in a subsequent commit.
>
> No functional change intended.
>

Reviewed-by: Ben Gardon <bgardon@google.com>

> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c2f7f026d414..ae1564e67e49 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1621,6 +1621,8 @@ static void __rmap_add(struct kvm *kvm,
>         rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
>         rmap_count = pte_list_add(cache, spte, rmap_head);
>
> +       kvm_update_page_stats(kvm, sp->role.level, 1);
> +
>         if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
>                 kvm_unmap_rmapp(kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
>                 kvm_flush_remote_tlbs_with_address(
> @@ -2831,7 +2833,6 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
>
>         if (!was_rmapped) {
>                 WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
> -               kvm_update_page_stats(vcpu->kvm, level, 1);
>                 rmap_add(vcpu, slot, sptep, gfn);
>         }
>
> --
> 2.35.0.rc2.247.g8bbb082509-goog
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 13/23] KVM: x86/mmu: Update page stats in __rmap_add()
  2022-02-23 23:32   ` Ben Gardon
@ 2022-02-23 23:35     ` Ben Gardon
  0 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2022-02-23 23:35 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Wed, Feb 23, 2022 at 3:32 PM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Feb 2, 2022 at 5:02 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Update the page stats in __rmap_add() rather than at the call site. This
> > will avoid having to manually update page stats when splitting huge
> > pages in a subsequent commit.
> >
> > No functional change intended.
> >
>
> Reviewed-by: Ben Gardon <bgardon@google.com>
>
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index c2f7f026d414..ae1564e67e49 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1621,6 +1621,8 @@ static void __rmap_add(struct kvm *kvm,
> >         rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
> >         rmap_count = pte_list_add(cache, spte, rmap_head);
> >
> > +       kvm_update_page_stats(kvm, sp->role.level, 1);
> > +

Strictly speaking, this is a functional change since you're moving the
stat update after the rmap update, but there's no synchronization on
the stats anyway, so I don't think it matters if it's updated before
or after.
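
For reference, the stat update itself is (if I remember mmu.h correctly) just
an atomic add on the per-level counter, so there's no ordering to preserve
relative to the rmap insert anyway:

	static inline void kvm_update_page_stats(struct kvm *kvm, int level, int count)
	{
		atomic64_add(count, &kvm->stat.pages[level - 1]);
	}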

> >         if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
> >                 kvm_unmap_rmapp(kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0));
> >                 kvm_flush_remote_tlbs_with_address(
> > @@ -2831,7 +2833,6 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> >
> >         if (!was_rmapped) {
> >                 WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
> > -               kvm_update_page_stats(vcpu->kvm, level, 1);
> >                 rmap_add(vcpu, slot, sptep, gfn);
> >         }
> >
> > --
> > 2.35.0.rc2.247.g8bbb082509-goog
> >

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 19/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-02-03  1:00 ` [PATCH 19/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs David Matlack
@ 2022-02-24 11:28   ` Marc Zyngier
  2022-02-24 19:20     ` David Matlack
  0 siblings, 1 reply; 65+ messages in thread
From: Marc Zyngier @ 2022-02-24 11:28 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	maciej.szmigiero, kvm

On Thu, 03 Feb 2022 01:00:47 +0000,
David Matlack <dmatlack@google.com> wrote:
> 
> Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> declaration time rather than being fixed for all declarations. This will
> be used in a follow-up commit to declare a cache in x86 with a capacity
> of 512+ objects without having to increase the capacity of all caches in
> KVM.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h |  2 +-
>  arch/arm64/kvm/mmu.c              | 12 ++++++------
>  arch/mips/include/asm/kvm_host.h  |  2 +-
>  arch/x86/include/asm/kvm_host.h   |  8 ++++----
>  include/linux/kvm_types.h         | 24 ++++++++++++++++++++++--
>  virt/kvm/kvm_main.c               |  8 +++++++-
>  6 files changed, 41 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 3b44ea17af88..a450b91cc2d9 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -357,7 +357,7 @@ struct kvm_vcpu_arch {
>  	bool pause;
>  
>  	/* Cache some mmu pages needed inside spinlock regions */
> -	struct kvm_mmu_memory_cache mmu_page_cache;
> +	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);

I must say I'm really not a fan of the anonymous structure trick. I
can see why you are doing it that way, but it feels pretty brittle.

>  
>  	/* Target CPU and feature flags */
>  	int target;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index bc2aba953299..9c853c529b49 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -765,7 +765,8 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  {
>  	phys_addr_t addr;
>  	int ret = 0;
> -	struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> +	DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {};
> +	struct kvm_mmu_memory_cache *cache = &page_cache.cache;
>  	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>  	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
>  				     KVM_PGTABLE_PROT_R |
> @@ -774,18 +775,17 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  	if (is_protected_kvm_enabled())
>  		return -EPERM;
>  
> +	cache->gfp_zero = __GFP_ZERO;

nit: consider this instead, which preserves the existing flow:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 26d6c53be083..86a7ebd03a44 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -764,7 +764,9 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 {
 	phys_addr_t addr;
 	int ret = 0;
-	DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {};
+	DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
+		.cache = { .gfp_zero = __GFP_ZERO},
+	};
 	struct kvm_mmu_memory_cache *cache = &page_cache.cache;
 	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
@@ -774,7 +776,6 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 	if (is_protected_kvm_enabled())
 		return -EPERM;
 
-	cache->gfp_zero = __GFP_ZERO;
 	size += offset_in_page(guest_ipa);
 	guest_ipa &= PAGE_MASK;
 
but the whole "declare the outer structure and just use the inner one"
hack is... huh... :-/

This hunk also conflicts with what currently sits in -next. Not a big
deal, but just so you know.

> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index dceac12c1ce5..9575fb8d333f 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -78,14 +78,34 @@ struct gfn_to_pfn_cache {
>   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
>   * holding MMU locks.  Note, these caches act more like prefetch buffers than
>   * classical caches, i.e. objects are not returned to the cache on being freed.
> + *
> + * The storage for the cache objects is laid out after the struct to allow
> + * different declarations to choose different capacities. If the capacity field
> + * is 0, the capacity is assumed to be KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE.
>   */
>  struct kvm_mmu_memory_cache {
>  	int nobjs;
> +	int capacity;
>  	gfp_t gfp_zero;
>  	struct kmem_cache *kmem_cache;
> -	void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> +	void *objects[0];

The VLA police is going to track you down ([0] vs []).

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH 02/23] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-02-19  1:14   ` Sean Christopherson
@ 2022-02-24 18:45     ` David Matlack
  2022-03-04  0:22     ` David Matlack
  1 sibling, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-02-24 18:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Vitaly Kuznetsov, Peter Xu, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Peter Feiner, Andrew Jones, Maciej S. Szmigiero,
	kvm list

On Fri, Feb 18, 2022 at 5:14 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Feb 03, 2022, David Matlack wrote:
> > Instead of computing the shadow page role from scratch for every new
> > page, we can derive most of the information from the parent shadow page.
> > This avoids redundant calculations such as the quadrant, and reduces the
>
> Uh, calculating quadrant isn't redundant.  The quadrant forces KVM to use different
> (multiple) shadow pages to shadow a single guest PTE when the guest is using 32-bit
> paging (1024 PTEs per page table vs. 512 PTEs per page table).  The reason quadrant
> is "quad" and not more or less is because 32-bit paging has two levels.  First-level
> PTEs can have quadrant=0/1, and that gets doubled for second-level PTEs because we
> need to use four PTEs (two to handle 2x guest PTEs, and each of those needs to be
> unique for the first-level PTEs they point at).
>
> Indeed, this fails spectacularly when attempting to boot a 32-bit non-PAE kernel
> with shadow paging enabled.

*facepalm*

Thanks for catching this. I'll fix this up in v2 and add 32-bit
non-PAE guests with shadow paging to my test matrix.
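
To spell out the arithmetic for anyone following along (a sketch with a
made-up helper name, not code from this series): a 32-bit non-PAE guest page
table has 1024 4-byte entries mapping 4MiB, while a shadow page holds only
512 8-byte SPTEs mapping 2MiB, so a single guest PT gfn has to be shadowed by
two SPs and a guest PD gfn by four. The quadrant is the role bit that keeps
those SPs from hashing to the same (gfn, role):

	/*
	 * Illustrative only: which 512-SPTE chunk of the larger guest table
	 * a shadow page at the given level covers.
	 *
	 *   level 1: 1 bit  -> quadrant 0-1
	 *   level 2: 2 bits -> quadrant 0-3
	 */
	static unsigned int quadrant_for(unsigned long gaddr, int level)
	{
		return (gaddr >> (12 + 9 * level)) & ((1 << level) - 1);
	}

Deriving the child role purely from the parent drops exactly this state, so
both halves of a guest page table end up sharing one shadow page.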

>
>  BUG: unable to handle page fault for address: ff9fa81c
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  *pde = 00000000
>  Oops: 0000 [#1] SMP
>  CPU: 0 PID: 0 Comm: swapper G        W         5.12.0 #10
>  EIP: memblock_add_range.isra.18.constprop.23
>  Code: <83> 79 04 00 75 2c 83 38 01 75 06 83 78 08 00 74 02 0f 0b 89 11 8b
>  EAX: c2af24bc EBX: fdffffff ECX: ff9fa818 EDX: 02000000
>  ESI: 02000000 EDI: 00000000 EBP: c2909f30 ESP: c2909f0c
>  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210006
>  CR0: 80050033 CR2: ff9fa81c CR3: 02b76000 CR4: 00040600
>  Call Trace:
>   ? printk
>   memblock_reserve
>   ? 0xc2000000
>   setup_arch
>   ? vprintk_default
>   ? vprintk
>   start_kernel
>   i386_start_kernel
>   startup_32_smp
>
>  CR2: 00000000ff9fa81c
>
>  EIP: memblock_add_range.isra.18.constprop.23
>  Code: <83> 79 04 00 75 2c 83 38 01 75 06 83 78 08 00 74 02 0f 0b 89 11 8b
>  EAX: c2af24bc EBX: fdffffff ECX: ff9fa818 EDX: 02000000
>  ESI: 02000000 EDI: 00000000 EBP: c2909f30 ESP: c2909f0c
>  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210006
>  CR0: 80050033 CR2: ff9fa81c CR3: 02b76000 CR4: 00040600
>
> > number of parameters to kvm_mmu_get_page().
> >
> > Preemptivel split out the role calculation to a separate function for
>
> Preemptively.
>
> > use in a following commit.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 03/23] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  2022-02-19  1:25   ` Sean Christopherson
@ 2022-02-24 18:54     ` David Matlack
  0 siblings, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-02-24 18:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Vitaly Kuznetsov, Peter Xu, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Peter Feiner, Andrew Jones, Maciej S. Szmigiero,
	kvm list

On Fri, Feb 18, 2022 at 5:25 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Feb 03, 2022, David Matlack wrote:
> > Decompose kvm_mmu_get_page() into separate helper functions to increase
> > readability and prepare for allocating shadow pages without a vcpu
> > pointer.
> >
> > Specifically, pull the guts of kvm_mmu_get_page() into 3 helper
> > functions:
> >
> > kvm_mmu_get_existing_sp_maybe_unsync() -
>
> Heh, this ain't Java.   Just add two underscores to whatever its primary caller
> ends up being named; that succinctly documents the relationship _and_ suggests
> that there's some "danger" in using the inner helper.
>
> >   Walks the page hash checking for any existing mmu pages that match the
> >   given gfn and role. Does not attempt to synchronize the page if it is
> >   unsync.
> >
> > kvm_mmu_get_existing_sp() -
>
> Meh.  We should really be able to distill this down to something like
> kvm_mmu_find_sp().  I'm also tempted to say we go with shadow_page instead of
> "sp" for these helpers, so long as the line lengths don't get too brutal.  KVM
> uses "sp" and "spte" in lots of places, but I suspect it would be helpful to
> KVM newbies if the core routines actually spell out shadow_page, a la
> to_shadow_page().

s/get_existing/find/ sounds good to me.

I'll play around with s/sp/shadow_page/ but I suspect it will make the
line lengths quite long. But if I also replace "maybe_unsync" with
double-underscores it might work out.

>
> >   Gets an existing page from the page hash if it exists and guarantees
> >   the page, if one is returned, is synced.  Implemented as a thin wrapper
> >   around kvm_mmu_get_existing_page_maybe_unsync. Requires access to a vcpu
> >   pointer in order to sync the page.
> >
> > kvm_mmu_create_sp()
>
> Probably prefer s/create/alloc to match existing terminology for allocating roots.
> Though looking through the series, there's going to be a lot of juggling of names.
>
> It probably makes sense to figure out what names we want to end up with and then
> work back from there.  I'll be back next week for a proper bikeshed session. :-)

kvm_mmu_create_sp() is temporary anyway. It goes away after patch 6
and we just have kvm_mmu_alloc_sp() and kvm_mmu_init_sp().

I'll see what I can do about using kvm_mmu_alloc_sp() as the temporary
name, but the next patch renames kvm_mmu_alloc_page() to
kvm_mmu_alloc_sp() so it will take some juggling for sure.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 19/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-02-24 11:28   ` Marc Zyngier
@ 2022-02-24 19:20     ` David Matlack
  2022-03-04 21:59       ` David Matlack
  0 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-24 19:20 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Paolo Bonzini, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S. Szmigiero, kvm list

On Thu, Feb 24, 2022 at 3:29 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Thu, 03 Feb 2022 01:00:47 +0000,
> David Matlack <dmatlack@google.com> wrote:
> >
> > Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
> > declaration time rather than being fixed for all declarations. This will
> > be used in a follow-up commit to declare a cache in x86 with a capacity
> > of 512+ objects without having to increase the capacity of all caches in
> > KVM.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h |  2 +-
> >  arch/arm64/kvm/mmu.c              | 12 ++++++------
> >  arch/mips/include/asm/kvm_host.h  |  2 +-
> >  arch/x86/include/asm/kvm_host.h   |  8 ++++----
> >  include/linux/kvm_types.h         | 24 ++++++++++++++++++++++--
> >  virt/kvm/kvm_main.c               |  8 +++++++-
> >  6 files changed, 41 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 3b44ea17af88..a450b91cc2d9 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -357,7 +357,7 @@ struct kvm_vcpu_arch {
> >       bool pause;
> >
> >       /* Cache some mmu pages needed inside spinlock regions */
> > -     struct kvm_mmu_memory_cache mmu_page_cache;
> > +     DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
>
> I must say I'm really not a fan of the anonymous structure trick. I
> can see why you are doing it that way, but it feels pretty brittle.

Yeah I don't love it. It's really optimizing for minimizing the patch diff.

The alternative I considered was to dynamically allocate the
kvm_mmu_memory_cache structs. This would get rid of the anonymous
struct and the objects array, and also eliminate the rather gross
capacity hack in kvm_mmu_topup_memory_cache().

The downsides of this approach are more code and more failure paths if
the allocation fails.
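
Roughly the shape I had in mind, for the record (kvm_mmu_alloc_memory_cache()
is a made-up name, just to sketch it):

	struct kvm_mmu_memory_cache *kvm_mmu_alloc_memory_cache(int capacity,
								 gfp_t gfp_zero)
	{
		struct kvm_mmu_memory_cache *mc;

		mc = kzalloc(struct_size(mc, objects, capacity), GFP_KERNEL_ACCOUNT);
		if (!mc)
			return NULL;	/* ... and every caller grows a failure path */

		mc->capacity = capacity;
		mc->gfp_zero = gfp_zero;
		return mc;
	}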

>
> >
> >       /* Target CPU and feature flags */
> >       int target;
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index bc2aba953299..9c853c529b49 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -765,7 +765,8 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >  {
> >       phys_addr_t addr;
> >       int ret = 0;
> > -     struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> > +     DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {};
> > +     struct kvm_mmu_memory_cache *cache = &page_cache.cache;
> >       struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> >       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> >                                    KVM_PGTABLE_PROT_R |
> > @@ -774,18 +775,17 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >       if (is_protected_kvm_enabled())
> >               return -EPERM;
> >
> > +     cache->gfp_zero = __GFP_ZERO;
>
> nit: consider this instead, which preserves the existing flow:

Will do.

>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 26d6c53be083..86a7ebd03a44 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -764,7 +764,9 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  {
>         phys_addr_t addr;
>         int ret = 0;
> -       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {};
> +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> +               .cache = { .gfp_zero = __GFP_ZERO},
> +       };
>         struct kvm_mmu_memory_cache *cache = &page_cache.cache;
>         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> @@ -774,7 +776,6 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>         if (is_protected_kvm_enabled())
>                 return -EPERM;
>
> -       cache->gfp_zero = __GFP_ZERO;
>         size += offset_in_page(guest_ipa);
>         guest_ipa &= PAGE_MASK;
>
> but the whole "declare the outer structure and just use the inner one"
> hack is... huh... :-/

Yeah it's not great. Unfortunately (or maybe fortunately?) anonymous
structs cannot be defined in functions. So naming the outer struct is
necessary even though we only need to use the inner one.
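
To make that concrete, the pattern is roughly the following (the macro body
here is paraphrased, not copied from the patch):

	#define DEFINE_KVM_MMU_MEMORY_CACHE(_name)				\
		struct {							\
			struct kvm_mmu_memory_cache _name;			\
			void *_name##_objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE]; \
		}

	/*
	 * As a struct member the outer struct is anonymous, so existing
	 * vcpu->arch.mmu_page_cache users keep working unchanged:
	 */
	DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);

	/*
	 * At function scope there is no enclosing struct, so the outer object
	 * has to get a name and the cache is reached through it:
	 */
	DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {};
	struct kvm_mmu_memory_cache *cache = &page_cache.cache;

The brittleness is that the storage array has to land immediately after the
struct so that the trailing objects[] member lines up with it.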

>
> This hunk also conflicts with what currently sits in -next. Not a big
> deal, but just so you know.

Ack.

>
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index dceac12c1ce5..9575fb8d333f 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -78,14 +78,34 @@ struct gfn_to_pfn_cache {
> >   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
> >   * holding MMU locks.  Note, these caches act more like prefetch buffers than
> >   * classical caches, i.e. objects are not returned to the cache on being freed.
> > + *
> > + * The storage for the cache objects is laid out after the struct to allow
> > + * different declarations to choose different capacities. If the capacity field
> > + * is 0, the capacity is assumed to be KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE.
> >   */
> >  struct kvm_mmu_memory_cache {
> >       int nobjs;
> > +     int capacity;
> >       gfp_t gfp_zero;
> >       struct kmem_cache *kmem_cache;
> > -     void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> > +     void *objects[0];
>
> The VLA police is going to track you down ([0] vs []).

Thanks!
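
(For anyone following along: the fix is just the flexible-array spelling of
the same struct, i.e. something like

	struct kvm_mmu_memory_cache {
		int nobjs;
		int capacity;
		gfp_t gfp_zero;
		struct kmem_cache *kmem_cache;
		void *objects[];
	};

instead of the objects[0] form above.)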


>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 14/23] KVM: x86/mmu: Cache the access bits of shadowed translations
  2022-02-03  1:00 ` [PATCH 14/23] KVM: x86/mmu: Cache the access bits of shadowed translations David Matlack
@ 2022-02-28 20:30   ` Ben Gardon
  0 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2022-02-28 20:30 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Wed, Feb 2, 2022 at 5:02 PM David Matlack <dmatlack@google.com> wrote:
>
> In order to split a huge page we need to know what access bits to assign
> to the role of the new child page table. This can't be easily derived
> from the huge page SPTE itself since KVM applies its own access policies
> on top, such as for HugePage NX.
>
> We could walk the guest page tables to determine the correct access
> bits, but that is difficult to plumb outside of a vCPU fault context.
> Instead, we can store the original access bits for each leaf SPTE
> alongside the GFN in the gfns array. The access bits only take up 3
> bits, which leaves 61 bits left over for gfns, which is more than
> enough. So this change does not require any additional memory.
>
> In order to keep the access bit cache in sync with the guest, we have to
> extend FNAME(sync_page) to also update the access bits.
>
> Now that the gfns array caches more information than just GFNs, rename
> it to shadowed_translation.
>
> No functional change intended.

This sounds like a functional change, but otherwise seems reasonable.


>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +-
>  arch/x86/kvm/mmu/mmu.c          | 32 +++++++++++++++++++-------------
>  arch/x86/kvm/mmu/mmu_internal.h | 15 +++++++++++++--
>  arch/x86/kvm/mmu/paging_tmpl.h  |  7 +++++--
>  4 files changed, 38 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c371ee7e45f7..f00004c13ccf 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -686,7 +686,7 @@ struct kvm_vcpu_arch {
>
>         struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
>         struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> -       struct kvm_mmu_memory_cache mmu_gfn_array_cache;
> +       struct kvm_mmu_memory_cache mmu_shadowed_translation_cache;
>         struct kvm_mmu_memory_cache mmu_page_header_cache;
>
>         /*
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ae1564e67e49..e2306a39526a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -719,7 +719,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>         if (r)
>                 return r;
>         if (maybe_indirect) {
> -               r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_gfn_array_cache,
> +               r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_translation_cache,
>                                                PT64_ROOT_MAX_LEVEL);
>                 if (r)
>                         return r;
> @@ -732,7 +732,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>  {
>         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
>         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> -       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
> +       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_translation_cache);
>         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
>  }
>
> @@ -749,15 +749,17 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
>  static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>  {
>         if (!sp->role.direct)
> -               return sp->gfns[index];
> +               return sp->shadowed_translation[index].gfn;
>
>         return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
>  }
>
> -static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
> +static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
> +                                       gfn_t gfn, u32 access)
>  {
>         if (!sp->role.direct) {
> -               sp->gfns[index] = gfn;
> +               sp->shadowed_translation[index].gfn = gfn;
> +               sp->shadowed_translation[index].access = access;
>                 return;
>         }
>
> @@ -1610,14 +1612,14 @@ static bool kvm_test_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>  static void __rmap_add(struct kvm *kvm,
>                        struct kvm_mmu_memory_cache *cache,
>                        const struct kvm_memory_slot *slot,
> -                      u64 *spte, gfn_t gfn)
> +                      u64 *spte, gfn_t gfn, u32 access)
>  {
>         struct kvm_mmu_page *sp;
>         struct kvm_rmap_head *rmap_head;
>         int rmap_count;
>
>         sp = sptep_to_sp(spte);
> -       kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn);
> +       kvm_mmu_page_set_gfn_access(sp, spte - sp->spt, gfn, access);
>         rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
>         rmap_count = pte_list_add(cache, spte, rmap_head);
>
> @@ -1631,9 +1633,9 @@ static void __rmap_add(struct kvm *kvm,
>  }
>
>  static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
> -                    u64 *spte, gfn_t gfn)
> +                    u64 *spte, gfn_t gfn, u32 access)
>  {
> -       __rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn);
> +       __rmap_add(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, slot, spte, gfn, access);
>  }
>
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> @@ -1694,7 +1696,7 @@ void kvm_mmu_free_sp(struct kvm_mmu_page *sp)
>  {
>         free_page((unsigned long)sp->spt);
>         if (!sp->role.direct)
> -               free_page((unsigned long)sp->gfns);
> +               free_page((unsigned long)sp->shadowed_translation);
>         kmem_cache_free(mmu_page_header_cache, sp);
>  }
>
> @@ -1731,8 +1733,12 @@ struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, bool direct)
>
>         sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
>         sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> +
> +       BUILD_BUG_ON(sizeof(sp->shadowed_translation[0]) != sizeof(u64));
> +
>         if (!direct)
> -               sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> +               sp->shadowed_translation =
> +                       kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadowed_translation_cache);
>
>         return sp;
>  }
> @@ -1742,7 +1748,7 @@ struct kvm_mmu_page *kvm_mmu_alloc_sp(struct kvm_vcpu *vcpu, bool direct)
>   *
>   * Huge page splitting always uses direct shadow pages since the huge page is
>   * being mapped directly with a lower level page table. Thus there's no need to
> - * allocate the gfns array.
> + * allocate the shadowed_translation array.
>   */
>  struct kvm_mmu_page *kvm_mmu_alloc_direct_sp_for_split(gfp_t gfp)
>  {
> @@ -2833,7 +2839,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
>
>         if (!was_rmapped) {
>                 WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
> -               rmap_add(vcpu, slot, sptep, gfn);
> +               rmap_add(vcpu, slot, sptep, gfn, pte_access);
>         }
>
>         return ret;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index e6bcea5a0aa9..9ee175adcc12 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -30,6 +30,11 @@ extern bool dbg;
>  #define INVALID_PAE_ROOT       0
>  #define IS_VALID_PAE_ROOT(x)   (!!(x))
>
> +struct shadowed_translation_entry {
> +       u64 access:3;
> +       u64 gfn:56;
> +};
> +
>  struct kvm_mmu_page {
>         /*
>          * Note, "link" through "spt" fit in a single 64 byte cache line on
> @@ -51,8 +56,14 @@ struct kvm_mmu_page {
>         gfn_t gfn;
>
>         u64 *spt;
> -       /* hold the gfn of each spte inside spt */
> -       gfn_t *gfns;
> +       /*
> +        * For indirect shadow pages, caches the result of the intermediate
> +        * guest translation being shadowed by each SPTE.
> +        *
> +        * NULL for direct shadow pages.
> +        */
> +       struct shadowed_translation_entry *shadowed_translation;
> +
>         /* Currently serving as active root */
>         union {
>                 int root_count;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index c533c191925e..703dfb062bf0 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1016,7 +1016,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  }
>
>  /*
> - * Using the cached information from sp->gfns is safe because:
> + * Using the information in sp->shadowed_translation is safe because:
>   * - The spte has a reference to the struct page, so the pfn for a given gfn
>   *   can't change unless all sptes pointing to it are nuked first.
>   *
> @@ -1090,12 +1090,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>                 if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
>                         continue;
>
> -               if (gfn != sp->gfns[i]) {
> +               if (gfn != sp->shadowed_translation[i].gfn) {
>                         drop_spte(vcpu->kvm, &sp->spt[i]);
>                         flush = true;
>                         continue;
>                 }
>
> +               if (pte_access != sp->shadowed_translation[i].access)
> +                       sp->shadowed_translation[i].access = pte_access;
> +
>                 sptep = &sp->spt[i];
>                 spte = *sptep;
>                 host_writable = spte & shadow_host_writable_mask;
> --
> 2.35.0.rc2.247.g8bbb082509-goog
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 15/23] KVM: x86/mmu: Pass access information to make_huge_page_split_spte()
  2022-02-03  1:00 ` [PATCH 15/23] KVM: x86/mmu: Pass access information to make_huge_page_split_spte() David Matlack
@ 2022-02-28 20:32   ` Ben Gardon
  0 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2022-02-28 20:32 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Wed, Feb 2, 2022 at 5:02 PM David Matlack <dmatlack@google.com> wrote:
>
> Currently make_huge_page_split_spte() assumes execute permissions can be
> granted to any 4K SPTE when splitting huge pages. This is true for the
> TDP MMU but is not necessarily true for the shadow MMU. Huge pages
> mapped by the shadow MMU may be shadowing huge pages for which the guest has
> disallowed execute permissions.
>
> No functional change intended.
>

Reviewed-by: Ben Gardon <bgardon@google.com>

> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/spte.c    | 5 +++--
>  arch/x86/kvm/mmu/spte.h    | 3 ++-
>  arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
>  3 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 20cf9e0d45dd..7cba5cffc240 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -215,7 +215,8 @@ static u64 make_spte_executable(u64 spte)
>   * This is used during huge page splitting to build the SPTEs that make up the
>   * new page table.
>   */
> -u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
> +u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index,
> +                             unsigned int access)
>  {
>         u64 child_spte;
>         int child_level;
> @@ -243,7 +244,7 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
>                  * When splitting to a 4K page, mark the page executable as the
>                  * NX hugepage mitigation no longer applies.
>                  */
> -               if (is_nx_huge_page_enabled())
> +               if (is_nx_huge_page_enabled() && (access & ACC_EXEC_MASK))
>                         child_spte = make_spte_executable(child_spte);
>         }
>
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 73f12615416f..c7ccdd5c440d 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -415,7 +415,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>                unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
>                u64 old_spte, bool prefetch, bool can_unsync,
>                bool host_writable, u64 *new_spte);
> -u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index);
> +u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index,
> +                             unsigned int access);
>  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
>  u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
>  u64 mark_spte_for_access_track(u64 spte);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 34c451f1eac9..02bfbc1bebbe 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1310,7 +1310,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>          * not been linked in yet and thus is not reachable from any other CPU.
>          */
>         for (i = 0; i < PT64_ENT_PER_PAGE; i++)
> -               sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i);
> +               sp->spt[i] = make_huge_page_split_spte(huge_spte, level, i, ACC_ALL);
>
>         /*
>          * Replace the huge spte with a pointer to the populated lower level
> --
> 2.35.0.rc2.247.g8bbb082509-goog
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 16/23] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  2022-02-03  1:00 ` [PATCH 16/23] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU David Matlack
@ 2022-02-28 20:39   ` Ben Gardon
  2022-03-03 19:42     ` David Matlack
  0 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2022-02-28 20:39 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Wed, Feb 2, 2022 at 5:02 PM David Matlack <dmatlack@google.com> wrote:
>
> Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU (i.e.
> in the rmap). This leads to correct behavior because KVM never creates
> intermediate huge pages during dirty logging. For example, a 1GiB page
> is never partially split to a 2MiB page.
>
> However this behavior will stop being correct once the shadow MMU
> participates in eager page splitting, which can in fact leave behind
> partially split huge pages. In preparation for that change, change the
> shadow MMU to iterate over all levels when zapping collapsible SPTEs.
>
> No functional change intended.
>

Reviewed-by: Ben Gardon <bgardon@google.com>

> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e2306a39526a..99ad7cc8683f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6038,18 +6038,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>         return need_tlb_flush;
>  }
>
> +static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
> +                                          const struct kvm_memory_slot *slot)
> +{
> +       bool flush;
> +
> +       flush = slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
> +                                 PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL, true);

The max level here only needs to be 2M since a 1G page wouldn't be
split. I think the upper limit can be lowered to
KVM_MAX_HUGEPAGE_LEVEL - 1.
Not a significant performance difference though.
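
Concretely, the tweak would just be (untested sketch):

	flush = slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
				  PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true);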

> +
> +       if (flush)
> +               kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> +
> +}
> +
>  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>                                    const struct kvm_memory_slot *slot)
>  {
>         if (kvm_memslots_have_rmaps(kvm)) {
>                 write_lock(&kvm->mmu_lock);
> -               /*
> -                * Zap only 4k SPTEs since the legacy MMU only supports dirty
> -                * logging at a 4k granularity and never creates collapsible
> -                * 2m SPTEs during dirty logging.
> -                */
> -               if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true))
> -                       kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> +               kvm_rmap_zap_collapsible_sptes(kvm, slot);
>                 write_unlock(&kvm->mmu_lock);
>         }
>
> --
> 2.35.0.rc2.247.g8bbb082509-goog
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 17/23] KVM: x86/mmu: Pass bool flush parameter to drop_large_spte()
  2022-02-03  1:00 ` [PATCH 17/23] KVM: x86/mmu: Pass bool flush parameter to drop_large_spte() David Matlack
@ 2022-02-28 20:47   ` Ben Gardon
  2022-03-03 19:52     ` David Matlack
  0 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2022-02-28 20:47 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Wed, Feb 2, 2022 at 5:02 PM David Matlack <dmatlack@google.com> wrote:
>
> drop_large_spte() drops a large SPTE if it exists and then flushes TLBs.
> Its helper function, __drop_large_spte(), does the drop without the
> flush. This difference is not obvious from the name.
>
> To make the code more readable, pass an explicit flush parameter. Also
> replace the vCPU pointer with a KVM pointer so we can get rid of the
> double-underscore helper function.
>
> This is also in preparation for a future commit that will conditionally
> flush after dropping a large SPTE.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c         | 25 +++++++++++--------------
>  arch/x86/kvm/mmu/paging_tmpl.h |  4 ++--
>  2 files changed, 13 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 99ad7cc8683f..2d47a54e62a5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1162,23 +1162,20 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
>  }
>
>
> -static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
> +static void drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)

Since there are no callers of __drop_large_spte, I'd be inclined to
hold off on adding the flush parameter in this commit and just add it
when it's needed, or better yet after you add the new user with the
conditional flush so that there's a commit explaining why it's safe to
not always flush in that case.

>  {
> -       if (is_large_pte(*sptep)) {
> -               WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
> -               drop_spte(kvm, sptep);
> -               return true;
> -       }
> +       struct kvm_mmu_page *sp;
>
> -       return false;
> -}
> +       if (!is_large_pte(*sptep))
> +               return;
>
> -static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
> -{
> -       if (__drop_large_spte(vcpu->kvm, sptep)) {
> -               struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> +       sp = sptep_to_sp(sptep);
> +       WARN_ON(sp->role.level == PG_LEVEL_4K);
> +
> +       drop_spte(kvm, sptep);
>
> -               kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
> +       if (flush) {
> +               kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
>                         KVM_PAGES_PER_HPAGE(sp->role.level));
>         }
>  }
> @@ -3051,7 +3048,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                 if (it.level == fault->goal_level)
>                         break;
>
> -               drop_large_spte(vcpu, it.sptep);
> +               drop_large_spte(vcpu->kvm, it.sptep, true);
>                 if (is_shadow_present_pte(*it.sptep))
>                         continue;
>
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 703dfb062bf0..ba61de29f2e5 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -677,7 +677,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>                 gfn_t table_gfn;
>
>                 clear_sp_write_flooding_count(it.sptep);
> -               drop_large_spte(vcpu, it.sptep);
> +               drop_large_spte(vcpu->kvm, it.sptep, true);
>
>                 sp = NULL;
>                 if (!is_shadow_present_pte(*it.sptep)) {
> @@ -739,7 +739,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>
>                 validate_direct_spte(vcpu, it.sptep, direct_access);
>
> -               drop_large_spte(vcpu, it.sptep);
> +               drop_large_spte(vcpu->kvm, it.sptep, true);
>
>                 if (!is_shadow_present_pte(*it.sptep)) {
>                         sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
> --
> 2.35.0.rc2.247.g8bbb082509-goog
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 18/23] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
  2022-02-03  1:00 ` [PATCH 18/23] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU David Matlack
@ 2022-02-28 21:09   ` Ben Gardon
  2022-02-28 23:29     ` David Matlack
  0 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2022-02-28 21:09 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm


On Wed, Feb 2, 2022 at 5:03 PM David Matlack <dmatlack@google.com> wrote:
>
> Extend KVM's eager page splitting to also split huge pages that are
> mapped by the shadow MMU. Specifically, walk through the rmap splitting
> all 1GiB pages to 2MiB pages, and splitting all 2MiB pages to 4KiB
> pages.
>
> Splitting huge pages mapped by the shadow MMU requires dealing with some
> extra complexity beyond that of the TDP MMU:
>
> (1) The shadow MMU has a limit on the number of shadow pages that are
>     allowed to be allocated. So, as a policy, Eager Page Splitting
>     refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
>     pages available.
>
> (2) Huge pages may be mapped by indirect shadow pages which have the
>     possibility of being unsync. As a policy we opt not to split such
>     pages as their translation may no longer be valid.
>
> (3) Splitting a huge page may end up re-using an existing lower level
>     shadow page tables. This is unlike the TDP MMU which always allocates
>     new shadow page tables when splitting.  This commit does *not*
>     handle such aliasing and opts not to split such huge pages.
>
> (4) When installing the lower level SPTEs, they must be added to the
>     rmap which may require allocating additional pte_list_desc structs.
>     This commit does *not* handle such cases and instead opts to leave
>     such lower-level SPTEs non-present. In this situation TLBs must be
>     flushed before dropping the MMU lock as a portion of the huge page
>     region is being unmapped.
>
> Suggested-by: Peter Feiner <pfeiner@google.com>
> [ This commit is based off of the original implementation of Eager Page
>   Splitting from Peter in Google's kernel from 2016. ]
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |   3 -
>  arch/x86/kvm/mmu/mmu.c                        | 349 ++++++++++++++++++
>  2 files changed, 349 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 1b54e410e206..09d236cb15d6 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2351,9 +2351,6 @@
>                         the KVM_CLEAR_DIRTY ioctl, and only for the pages being
>                         cleared.
>
> -                       Eager page splitting currently only supports splitting
> -                       huge pages mapped by the TDP MMU.
> -
>                         Default is Y (on).
>
>         kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2d47a54e62a5..825cfdec589b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -738,6 +738,11 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>
>  static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
>  {
> +       static const gfp_t gfp_nocache = GFP_ATOMIC | __GFP_ACCOUNT | __GFP_ZERO;
> +
> +       if (WARN_ON_ONCE(!cache))
> +               return kmem_cache_alloc(pte_list_desc_cache, gfp_nocache);
> +
>         return kvm_mmu_memory_cache_alloc(cache);
>  }

Is this change needed in this commit? In the description it says we're
just skipping the split if a pte_list_desc needs to be allocated.

>
> @@ -754,6 +759,28 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>         return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
>  }
>
> +static gfn_t sptep_to_gfn(u64 *sptep)
> +{
> +       struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> +
> +       return kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> +}
> +
> +static unsigned int kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
> +{
> +       if (!sp->role.direct)
> +               return sp->shadowed_translation[index].access;
> +
> +       return sp->role.access;
> +}
> +
> +static unsigned int sptep_to_access(u64 *sptep)
> +{
> +       struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> +
> +       return kvm_mmu_page_get_access(sp, sptep - sp->spt);
> +}
> +
>  static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
>                                         gfn_t gfn, u32 access)
>  {
> @@ -923,6 +950,41 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
>         return count;
>  }
>
> +static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
> +                                        const struct kvm_memory_slot *slot);
> +
> +static bool pte_list_need_new_desc(struct kvm_rmap_head *rmap_head)
> +{
> +       struct pte_list_desc *desc;
> +
> +       if (!rmap_head->val)
> +               return false;
> +
> +       if (!(rmap_head->val & 1))
> +               return true;
> +
> +       desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
> +       while (desc->spte_count == PTE_LIST_EXT) {
> +               if (!desc->more)
> +                       return true;
> +               desc = desc->more;
> +       }
> +
> +       return false;
> +}
> +
> +/*
> + * Return true if the rmap for the given gfn and level needs a new
> + * pte_list_desc struct allocated to add a new spte.
> + */
> +static bool rmap_need_new_pte_list_desc(const struct kvm_memory_slot *slot,
> +                                       gfn_t gfn, int level)
> +{
> +       struct kvm_rmap_head *rmap_head = gfn_to_rmap(gfn, level, slot);
> +
> +       return pte_list_need_new_desc(rmap_head);
> +}
> +
>  static void
>  pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
>                            struct pte_list_desc *desc, int i,
> @@ -2129,6 +2191,24 @@ static struct kvm_mmu_page *kvm_mmu_get_existing_sp_maybe_unsync(struct kvm *kvm
>         return sp;
>  }
>
> +static struct kvm_mmu_page *kvm_mmu_get_existing_direct_sp(struct kvm *kvm,
> +                                                          gfn_t gfn,
> +                                                          union kvm_mmu_page_role role)
> +{
> +       struct kvm_mmu_page *sp;
> +       LIST_HEAD(invalid_list);
> +
> +       BUG_ON(!role.direct);
> +
> +       sp = kvm_mmu_get_existing_sp_maybe_unsync(kvm, gfn, role, &invalid_list);
> +
> +       /* Direct SPs are never unsync. */
> +       WARN_ON_ONCE(sp && sp->unsync);
> +
> +       kvm_mmu_commit_zap_page(kvm, &invalid_list);

This should be unnecessary since the page can't be unsync right?
I'd be inclined to also add an assertion that invalid_list is empty
and then BUG or terminate the VM if it's not.
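
Something like this, perhaps (illustrative; assuming KVM_BUG_ON() is the
right name for the terminate-the-VM flavor of BUG_ON):

	sp = kvm_mmu_get_existing_sp_maybe_unsync(kvm, gfn, role, &invalid_list);

	/* Direct SPs are never unsync, so nothing should have been zapped. */
	WARN_ON_ONCE(sp && sp->unsync);
	KVM_BUG_ON(!list_empty(&invalid_list), kvm);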

> +       return sp;
> +}
> +
>  /*
>   * Looks up an existing SP for the given gfn and role if one exists. The
>   * return SP is guaranteed to be synced.
> @@ -5955,12 +6035,275 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>                 kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
>  }
>
> +
> +static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp, gfp_t gfp)
> +{
> +       if (*spp)
> +               return 0;
> +
> +       *spp = kvm_mmu_alloc_direct_sp_for_split(gfp);
> +
> +       return *spp ? 0 : -ENOMEM;
> +}

I assume this is preparation for a more complicated allocation scheme
in a future commit. I'd be inclined to wait on that until it's needed
as this looks unnecessarily complicated.

> +
> +static int prepare_to_split_huge_page(struct kvm *kvm,
> +                                     const struct kvm_memory_slot *slot,
> +                                     u64 *huge_sptep,
> +                                     struct kvm_mmu_page **spp,
> +                                     bool *flush,
> +                                     bool *dropped_lock)
> +{
> +       int r = 0;
> +
> +       *dropped_lock = false;
> +
> +       if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
> +               return -ENOSPC;
> +
> +       if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> +               goto drop_lock;
> +
> +       r = alloc_memory_for_split(kvm, spp, GFP_NOWAIT | __GFP_ACCOUNT);
> +       if (r)
> +               goto drop_lock;
> +
> +       return 0;
> +
> +drop_lock:
> +       if (*flush)
> +               kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> +
> +       *flush = false;
> +       *dropped_lock = true;
> +
> +       write_unlock(&kvm->mmu_lock);
> +       cond_resched();
> +       r = alloc_memory_for_split(kvm, spp, GFP_KERNEL_ACCOUNT);

You're using different sets of flags in these allocations. Is that
intentional? I understand the NOWAIT, but there's also a difference
between GFP_KERNEL_ACCOUNT and __GFP_ACCOUNT which I'm not sure about.
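
(For reference, if I remember gfp.h right, GFP_KERNEL_ACCOUNT is just a
shorthand:

	#define GFP_KERNEL_ACCOUNT	(GFP_KERNEL | __GFP_ACCOUNT)

so both call sites ask for memcg accounting and the real difference is only
GFP_KERNEL vs GFP_NOWAIT.)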

> +       write_lock(&kvm->mmu_lock);
> +
> +       return r;
> +}
> +
> +static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
> +                                                    const struct kvm_memory_slot *slot,
> +                                                    u64 *huge_sptep,
> +                                                    struct kvm_mmu_page **spp)
> +{
> +       struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> +       struct kvm_mmu_page *split_sp;
> +       union kvm_mmu_page_role role;
> +       unsigned int access;
> +       gfn_t gfn;
> +
> +       gfn = sptep_to_gfn(huge_sptep);
> +       access = sptep_to_access(huge_sptep);
> +
> +       /*
> +        * Huge page splitting always uses direct shadow pages since we are
> +        * directly mapping the huge page GFN region with smaller pages.
> +        */
> +       role = kvm_mmu_child_role(huge_sp, true, access);
> +       split_sp = kvm_mmu_get_existing_direct_sp(kvm, gfn, role);
> +
> +       /*
> +        * Opt not to split if the lower-level SP already exists. This requires
> +        * more complex handling as the SP may be already partially filled in
> +        * and may need extra pte_list_desc structs to update parent_ptes.
> +        */
> +       if (split_sp)
> +               return NULL;
> +
> +       swap(split_sp, *spp);
> +       kvm_mmu_init_sp(kvm, split_sp, slot, gfn, role);
> +       trace_kvm_mmu_get_page(split_sp, true);
> +
> +       return split_sp;
> +}
> +
> +static int kvm_mmu_split_huge_page(struct kvm *kvm,
> +                                  const struct kvm_memory_slot *slot,
> +                                  u64 *huge_sptep, struct kvm_mmu_page **spp,
> +                                  bool *flush)
> +
> +{
> +       struct kvm_mmu_page *split_sp;
> +       u64 huge_spte, split_spte;
> +       int split_level, index;
> +       unsigned int access;
> +       u64 *split_sptep;
> +       gfn_t split_gfn;
> +
> +       split_sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
> +       if (!split_sp)
> +               return -EOPNOTSUPP;
> +
> +       /*
> +        * Since we did not allocate pte_list_desc_structs for the split, we
> +        * cannot add a new parent SPTE to parent_ptes. This should never happen
> +        * in practice though since this is a fresh SP.
> +        *
> +        * Note, this makes it safe to pass NULL to __link_shadow_page() below.
> +        */
> +       if (WARN_ON_ONCE(pte_list_need_new_desc(&split_sp->parent_ptes)))
> +               return -EINVAL;
> +
> +       huge_spte = READ_ONCE(*huge_sptep);
> +
> +       split_level = split_sp->role.level;
> +       access = split_sp->role.access;
> +
> +       for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> +               split_sptep = &split_sp->spt[index];
> +               split_gfn = kvm_mmu_page_get_gfn(split_sp, index);
> +
> +               BUG_ON(is_shadow_present_pte(*split_sptep));
> +
> +               /*
> +                * Since we did not allocate pte_list_desc structs for the
> +                * split, we can't add a new SPTE that maps this GFN.
> +                * Skipping this SPTE means we're only partially mapping the
> +                * huge page, which means we'll need to flush TLBs before
> +                * dropping the MMU lock.
> +                *
> +                * Note, this makes it safe to pass NULL to __rmap_add() below.
> +                */
> +               if (rmap_need_new_pte_list_desc(slot, split_gfn, split_level)) {
> +                       *flush = true;
> +                       continue;
> +               }
> +
> +               split_spte = make_huge_page_split_spte(
> +                               huge_spte, split_level + 1, index, access);
> +
> +               mmu_spte_set(split_sptep, split_spte);
> +               __rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);
> +       }
> +
> +       /*
> +        * Replace the huge spte with a pointer to the populated lower level
> +        * page table. Since we are making this change without a TLB flush vCPUs
> +        * will see a mix of the split mappings and the original huge mapping,
> +        * depending on what's currently in their TLB. This is fine from a
> +        * correctness standpoint since the translation will be the same either
> +        * way.
> +        */
> +       drop_large_spte(kvm, huge_sptep, false);
> +       __link_shadow_page(NULL, huge_sptep, split_sp);
> +
> +       return 0;
> +}
> +
> +static bool should_split_huge_page(u64 *huge_sptep)
> +{
> +       struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> +
> +       if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
> +               return false;
> +
> +       if (huge_sp->role.invalid)
> +               return false;
> +
> +       /*
> +        * As a policy, do not split huge pages if the SP on which they reside
> +        * is unsync. Unsync means the guest is modifying the page table being
> +        * shadowed by huge_sp, so splitting may be a waste of cycles and
> +        * memory.
> +        */
> +       if (huge_sp->unsync)
> +               return false;
> +
> +       return true;
> +}
> +
> +static bool rmap_try_split_huge_pages(struct kvm *kvm,
> +                                     struct kvm_rmap_head *rmap_head,
> +                                     const struct kvm_memory_slot *slot)
> +{
> +       struct kvm_mmu_page *sp = NULL;
> +       struct rmap_iterator iter;
> +       u64 *huge_sptep, spte;
> +       bool flush = false;
> +       bool dropped_lock;
> +       int level;
> +       gfn_t gfn;
> +       int r;
> +
> +restart:
> +       for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
> +               if (!should_split_huge_page(huge_sptep))
> +                       continue;
> +
> +               spte = *huge_sptep;
> +               level = sptep_to_sp(huge_sptep)->role.level;
> +               gfn = sptep_to_gfn(huge_sptep);
> +
> +               r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &flush, &dropped_lock);
> +               if (r) {
> +                       trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
> +                       break;
> +               }
> +
> +               if (dropped_lock)
> +                       goto restart;
> +
> +               r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp, &flush);
> +
> +               trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
> +
> +               /*
> +                * If splitting is successful we must restart the iterator
> +                * because huge_sptep has just been removed from it.
> +                */
> +               if (!r)
> +                       goto restart;
> +       }
> +
> +       if (sp)
> +               kvm_mmu_free_sp(sp);
> +
> +       return flush;
> +}
> +
> +static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
> +                                         const struct kvm_memory_slot *slot,
> +                                         gfn_t start, gfn_t end,
> +                                         int target_level)
> +{
> +       bool flush;
> +       int level;
> +
> +       /*
> +        * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
> +        * down to the target level. This ensures pages are recursively split
> +        * all the way to the target level. There's no need to split pages
> +        * already at the target level.
> +        *
> +        * Note that TLB flushes must be done before dropping the MMU lock since
> +        * rmap_try_split_huge_pages() may partially split any given huge page,
> +        * i.e. it may effectively unmap (make non-present) a portion of the
> +        * huge page.
> +        */
> +       for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
> +               flush = slot_handle_level_range(kvm, slot,
> +                                               rmap_try_split_huge_pages,
> +                                               level, level, start, end - 1,
> +                                               true, flush);
> +       }
> +
> +       if (flush)
> +               kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> +}
> +
>  /* Must be called with the mmu_lock held in write-mode. */
>  void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
>                                    const struct kvm_memory_slot *memslot,
>                                    u64 start, u64 end,
>                                    int target_level)
>  {
> +       if (kvm_memslots_have_rmaps(kvm))
> +               kvm_rmap_try_split_huge_pages(kvm, memslot, start, end,
> +                                             target_level);
> +
>         if (is_tdp_mmu_enabled(kvm))
>                 kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
>                                                  target_level, false);
> @@ -5978,6 +6321,12 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
>         u64 start = memslot->base_gfn;
>         u64 end = start + memslot->npages;
>
> +       if (kvm_memslots_have_rmaps(kvm)) {
> +               write_lock(&kvm->mmu_lock);
> +               kvm_rmap_try_split_huge_pages(kvm, memslot, start, end, target_level);
> +               write_unlock(&kvm->mmu_lock);
> +       }
> +
>         if (is_tdp_mmu_enabled(kvm)) {
>                 read_lock(&kvm->mmu_lock);
>                 kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
> --
> 2.35.0.rc2.247.g8bbb082509-goog
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 20/23] KVM: Allow GFP flags to be passed when topping up MMU caches
  2022-02-03  1:00 ` [PATCH 20/23] KVM: Allow GFP flags to be passed when topping up MMU caches David Matlack
@ 2022-02-28 21:12   ` Ben Gardon
  0 siblings, 0 replies; 65+ messages in thread
From: Ben Gardon @ 2022-02-28 21:12 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Wed, Feb 2, 2022 at 5:03 PM David Matlack <dmatlack@google.com> wrote:
>
> This will be used in a subsequent commit to top-up MMU caches under the
> MMU lock with GFP_NOWAIT as part of eager page splitting.
>
> No functional change intended.
>

Reviewed-by: Ben Gardon <bgardon@google.com>

> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  include/linux/kvm_host.h | 1 +
>  virt/kvm/kvm_main.c      | 9 +++++++--
>  2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index b3810976a27f..128f4c5a8122 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1329,6 +1329,7 @@ void kvm_reload_remote_mmus(struct kvm *kvm);
>
>  #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
>  int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> +int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min, gfp_t gfp);
>  int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
>  void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
>  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index afa4bdb6481e..c39e7ba21fab 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -371,7 +371,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
>                 return (void *)__get_free_page(gfp_flags);
>  }
>
> -int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> +int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min, gfp_t gfp)
>  {
>         int capacity;
>         void *obj;
> @@ -384,7 +384,7 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
>         if (mc->nobjs >= min)
>                 return 0;
>         while (mc->nobjs < capacity) {
> -               obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
> +               obj = mmu_memory_cache_alloc_obj(mc, gfp);
>                 if (!obj)
>                         return mc->nobjs >= min ? 0 : -ENOMEM;
>                 mc->objects[mc->nobjs++] = obj;
> @@ -392,6 +392,11 @@ int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
>         return 0;
>  }
>
> +int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
> +{
> +       return __kvm_mmu_topup_memory_cache(mc, min, GFP_KERNEL_ACCOUNT);
> +}
> +
>  int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
>  {
>         return mc->nobjs;
> --
> 2.35.0.rc2.247.g8bbb082509-goog
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 21/23] KVM: x86/mmu: Fully split huge pages that require extra pte_list_desc structs
  2022-02-03  1:00 ` [PATCH 21/23] KVM: x86/mmu: Fully split huge pages that require extra pte_list_desc structs David Matlack
@ 2022-02-28 21:22   ` Ben Gardon
  2022-02-28 23:41     ` David Matlack
  0 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2022-02-28 21:22 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Wed, Feb 2, 2022 at 5:03 PM David Matlack <dmatlack@google.com> wrote:
>
> When splitting a huge page we need to add all of the lower level SPTEs
> to the memslot rmap. The current implementation of eager page splitting
> bails if adding an SPTE would require allocating an extra pte_list_desc
> struct. Fix this limitation by allocating enough pte_list_desc structs
> before splitting the huge page.
>
> This eliminates the need for TLB flushing under the MMU lock because the
> huge page is always entirely split (no subregion of the huge page is
> unmapped).
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  10 ++++
>  arch/x86/kvm/mmu/mmu.c          | 101 ++++++++++++++++++--------------
>  2 files changed, 67 insertions(+), 44 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d0b12bfe5818..a0f7578f7a26 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1232,6 +1232,16 @@ struct kvm_arch {
>         hpa_t   hv_root_tdp;
>         spinlock_t hv_root_tdp_lock;
>  #endif
> +
> +       /*
> +        * Memory cache used to allocate pte_list_desc structs while splitting
> +        * huge pages. In the worst case, to split one huge page we need 512
> +        * pte_list_desc structs to add each new lower level leaf sptep to the
> +        * memslot rmap.
> +        */
> +#define HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY 512
> +       __DEFINE_KVM_MMU_MEMORY_CACHE(huge_page_split_desc_cache,
> +                                     HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY);
>  };
>
>  struct kvm_vm_stat {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 825cfdec589b..c7981a934237 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5905,6 +5905,11 @@ void kvm_mmu_init_vm(struct kvm *kvm)
>         node->track_write = kvm_mmu_pte_write;
>         node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
>         kvm_page_track_register_notifier(kvm, node);
> +
> +       kvm->arch.huge_page_split_desc_cache.capacity =
> +               HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY;
> +       kvm->arch.huge_page_split_desc_cache.kmem_cache = pte_list_desc_cache;
> +       kvm->arch.huge_page_split_desc_cache.gfp_zero = __GFP_ZERO;
>  }
>
>  void kvm_mmu_uninit_vm(struct kvm *kvm)
> @@ -6035,9 +6040,42 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>                 kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
>  }
>
> +static int min_descs_for_split(const struct kvm_memory_slot *slot, u64 *huge_sptep)
> +{
> +       struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> +       int split_level = huge_sp->role.level - 1;
> +       int i, min = 0;
> +       gfn_t gfn;
> +
> +       gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
>
> -static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp, gfp_t gfp)
> +       for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> +               if (rmap_need_new_pte_list_desc(slot, gfn, split_level))
> +                       min++;
> +
> +               gfn += KVM_PAGES_PER_HPAGE(split_level);
> +       }
> +
> +       return min;
> +}

Is this calculation worth doing? It seems like we're doing a lot of
work here to calculate exactly how many pages we need to allocate, but
if eager splitting we'll be doing this over and over again. It seems
like it would be more efficient to just always fill the cache since
any extra pages allocated to split one page can be used to split the
next one.

> +
> +static int topup_huge_page_split_desc_cache(struct kvm *kvm, int min, gfp_t gfp)
> +{
> +       struct kvm_mmu_memory_cache *cache =
> +               &kvm->arch.huge_page_split_desc_cache;
> +
> +       return __kvm_mmu_topup_memory_cache(cache, min, gfp);
> +}
> +
> +static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp,
> +                                 int min_descs, gfp_t gfp)
>  {
> +       int r;
> +
> +       r = topup_huge_page_split_desc_cache(kvm, min_descs, gfp);
> +       if (r)
> +               return r;
> +
>         if (*spp)
>                 return 0;
>
> @@ -6050,9 +6088,9 @@ static int prepare_to_split_huge_page(struct kvm *kvm,
>                                       const struct kvm_memory_slot *slot,
>                                       u64 *huge_sptep,
>                                       struct kvm_mmu_page **spp,
> -                                     bool *flush,
>                                       bool *dropped_lock)
>  {
> +       int min_descs = min_descs_for_split(slot, huge_sptep);
>         int r = 0;
>
>         *dropped_lock = false;
> @@ -6063,22 +6101,18 @@ static int prepare_to_split_huge_page(struct kvm *kvm,
>         if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
>                 goto drop_lock;
>
> -       r = alloc_memory_for_split(kvm, spp, GFP_NOWAIT | __GFP_ACCOUNT);
> +       r = alloc_memory_for_split(kvm, spp, min_descs, GFP_NOWAIT | __GFP_ACCOUNT);
>         if (r)
>                 goto drop_lock;
>
>         return 0;
>
>  drop_lock:
> -       if (*flush)
> -               kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> -
> -       *flush = false;
>         *dropped_lock = true;
>
>         write_unlock(&kvm->mmu_lock);
>         cond_resched();
> -       r = alloc_memory_for_split(kvm, spp, GFP_KERNEL_ACCOUNT);
> +       r = alloc_memory_for_split(kvm, spp, min_descs, GFP_KERNEL_ACCOUNT);
>         write_lock(&kvm->mmu_lock);
>
>         return r;
> @@ -6122,10 +6156,10 @@ static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
>
>  static int kvm_mmu_split_huge_page(struct kvm *kvm,
>                                    const struct kvm_memory_slot *slot,
> -                                  u64 *huge_sptep, struct kvm_mmu_page **spp,
> -                                  bool *flush)
> +                                  u64 *huge_sptep, struct kvm_mmu_page **spp)
>
>  {
> +       struct kvm_mmu_memory_cache *cache;
>         struct kvm_mmu_page *split_sp;
>         u64 huge_spte, split_spte;
>         int split_level, index;
> @@ -6138,9 +6172,9 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
>                 return -EOPNOTSUPP;
>
>         /*
> -        * Since we did not allocate pte_list_desc structs for the split, we
> -        * cannot add a new parent SPTE to parent_ptes. This should never happen
> -        * in practice though since this is a fresh SP.
> +        * We did not allocate an extra pte_list_desc struct to add huge_sptep
> +        * to split_sp->parent_ptes. An extra pte_list_desc struct should never
> +        * be necessary in practice though since split_sp is brand new.
>          *
>          * Note, this makes it safe to pass NULL to __link_shadow_page() below.
>          */
> @@ -6151,6 +6185,7 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
>
>         split_level = split_sp->role.level;
>         access = split_sp->role.access;
> +       cache = &kvm->arch.huge_page_split_desc_cache;
>
>         for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
>                 split_sptep = &split_sp->spt[index];
> @@ -6158,25 +6193,11 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
>
>                 BUG_ON(is_shadow_present_pte(*split_sptep));
>
> -               /*
> -                * Since we did not allocate pte_list_desc structs for the
> -                * split, we can't add a new SPTE that maps this GFN.
> -                * Skipping this SPTE means we're only partially mapping the
> -                * huge page, which means we'll need to flush TLBs before
> -                * dropping the MMU lock.
> -                *
> -                * Note, this makes it safe to pass NULL to __rmap_add() below.
> -                */
> -               if (rmap_need_new_pte_list_desc(slot, split_gfn, split_level)) {
> -                       *flush = true;
> -                       continue;
> -               }
> -
>                 split_spte = make_huge_page_split_spte(
>                                 huge_spte, split_level + 1, index, access);
>
>                 mmu_spte_set(split_sptep, split_spte);
> -               __rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);
> +               __rmap_add(kvm, cache, slot, split_sptep, split_gfn, access);
>         }
>
>         /*
> @@ -6222,7 +6243,6 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
>         struct kvm_mmu_page *sp = NULL;
>         struct rmap_iterator iter;
>         u64 *huge_sptep, spte;
> -       bool flush = false;
>         bool dropped_lock;
>         int level;
>         gfn_t gfn;
> @@ -6237,7 +6257,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
>                 level = sptep_to_sp(huge_sptep)->role.level;
>                 gfn = sptep_to_gfn(huge_sptep);
>
> -               r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &flush, &dropped_lock);
> +               r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &dropped_lock);
>                 if (r) {
>                         trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
>                         break;
> @@ -6246,7 +6266,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
>                 if (dropped_lock)
>                         goto restart;
>
> -               r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp, &flush);
> +               r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp);
>
>                 trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
>
> @@ -6261,7 +6281,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
>         if (sp)
>                 kvm_mmu_free_sp(sp);
>
> -       return flush;
> +       return false;
>  }
>
>  static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
> @@ -6269,7 +6289,6 @@ static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
>                                           gfn_t start, gfn_t end,
>                                           int target_level)
>  {
> -       bool flush;
>         int level;
>
>         /*
> @@ -6277,21 +6296,15 @@ static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
>          * down to the target level. This ensures pages are recursively split
>          * all the way to the target level. There's no need to split pages
>          * already at the target level.
> -        *
> -        * Note that TLB flushes must be done before dropping the MMU lock since
> -        * rmap_try_split_huge_pages() may partially split any given huge page,
> -        * i.e. it may effectively unmap (make non-present) a portion of the
> -        * huge page.
>          */
>         for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
> -               flush = slot_handle_level_range(kvm, slot,
> -                                               rmap_try_split_huge_pages,
> -                                               level, level, start, end - 1,
> -                                               true, flush);
> +               slot_handle_level_range(kvm, slot,
> +                                       rmap_try_split_huge_pages,
> +                                       level, level, start, end - 1,
> +                                       true, false);
>         }
>
> -       if (flush)
> -               kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> +       kvm_mmu_free_memory_cache(&kvm->arch.huge_page_split_desc_cache);
>  }
>
>  /* Must be called with the mmu_lock held in write-mode. */
> --
> 2.35.0.rc2.247.g8bbb082509-goog
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 18/23] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
  2022-02-28 21:09   ` Ben Gardon
@ 2022-02-28 23:29     ` David Matlack
  0 siblings, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-02-28 23:29 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Mon, Feb 28, 2022 at 1:09 PM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Feb 2, 2022 at 5:03 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Extend KVM's eager page splitting to also split huge pages that are
> > mapped by the shadow MMU. Specifically, walk through the rmap splitting
> > all 1GiB pages to 2MiB pages, and splitting all 2MiB pages to 4KiB
> > pages.
> >
> > Splitting huge pages mapped by the shadow MMU requries dealing with some
> > extra complexity beyond that of the TDP MMU:
> >
> > (1) The shadow MMU has a limit on the number of shadow pages that are
> >     allowed to be allocated. So, as a policy, Eager Page Splitting
> >     refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
> >     pages available.
> >
> > (2) Huge pages may be mapped by indirect shadow pages which have the
> >     possibility of being unsync. As a policy we opt not to split such
> >     pages as their translation may no longer be valid.
> >
> > (3) Splitting a huge page may end up re-using an existing lower level
> >     shadow page tables. This is unlike the TDP MMU which always allocates
> >     new shadow page tables when splitting.  This commit does *not*
> >     handle such aliasing and opts not to split such huge pages.
> >
> > (4) When installing the lower level SPTEs, they must be added to the
> >     rmap which may require allocating additional pte_list_desc structs.
> >     This commit does *not* handle such cases and instead opts to leave
> >     such lower-level SPTEs non-present. In this situation TLBs must be
> >     flushed before dropping the MMU lock as a portion of the huge page
> >     region is being unmapped.
> >
> > Suggested-by: Peter Feiner <pfeiner@google.com>
> > [ This commit is based off of the original implementation of Eager Page
> >   Splitting from Peter in Google's kernel from 2016. ]
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  .../admin-guide/kernel-parameters.txt         |   3 -
> >  arch/x86/kvm/mmu/mmu.c                        | 349 ++++++++++++++++++
> >  2 files changed, 349 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 1b54e410e206..09d236cb15d6 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -2351,9 +2351,6 @@
> >                         the KVM_CLEAR_DIRTY ioctl, and only for the pages being
> >                         cleared.
> >
> > -                       Eager page splitting currently only supports splitting
> > -                       huge pages mapped by the TDP MMU.
> > -
> >                         Default is Y (on).
> >
> >         kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 2d47a54e62a5..825cfdec589b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -738,6 +738,11 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> >
> >  static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
> >  {
> > +       static const gfp_t gfp_nocache = GFP_ATOMIC | __GFP_ACCOUNT | __GFP_ZERO;
> > +
> > +       if (WARN_ON_ONCE(!cache))
> > +               return kmem_cache_alloc(pte_list_desc_cache, gfp_nocache);
> > +
> >         return kvm_mmu_memory_cache_alloc(cache);
> >  }
>
> Is this change needed in this commit? In the description it says we're
> just skipping the split if a pte_list_desc needs to be allocated.

I made this change out of an abundance of caution since this commit
passes NULL to __rmap_add() and __link_shadow_page(). But yes, you are
right, this code should never be hit in practice (hence the WARN_ON).

>
> >
> > @@ -754,6 +759,28 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
> >         return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
> >  }
> >
> > +static gfn_t sptep_to_gfn(u64 *sptep)
> > +{
> > +       struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> > +
> > +       return kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> > +}
> > +
> > +static unsigned int kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
> > +{
> > +       if (!sp->role.direct)
> > +               return sp->shadowed_translation[index].access;
> > +
> > +       return sp->role.access;
> > +}
> > +
> > +static unsigned int sptep_to_access(u64 *sptep)
> > +{
> > +       struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> > +
> > +       return kvm_mmu_page_get_access(sp, sptep - sp->spt);
> > +}
> > +
> >  static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
> >                                         gfn_t gfn, u32 access)
> >  {
> > @@ -923,6 +950,41 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
> >         return count;
> >  }
> >
> > +static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
> > +                                        const struct kvm_memory_slot *slot);
> > +
> > +static bool pte_list_need_new_desc(struct kvm_rmap_head *rmap_head)
> > +{
> > +       struct pte_list_desc *desc;
> > +
> > +       if (!rmap_head->val)
> > +               return false;
> > +
> > +       if (!(rmap_head->val & 1))
> > +               return true;
> > +
> > +       desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
> > +       while (desc->spte_count == PTE_LIST_EXT) {
> > +               if (!desc->more)
> > +                       return true;
> > +               desc = desc->more;
> > +       }
> > +
> > +       return false;
> > +}
> > +
> > +/*
> > + * Return true if the rmap for the given gfn and level needs a new
> > + * pte_list_desc struct allocated to add a new spte.
> > + */
> > +static bool rmap_need_new_pte_list_desc(const struct kvm_memory_slot *slot,
> > +                                       gfn_t gfn, int level)
> > +{
> > +       struct kvm_rmap_head *rmap_head = gfn_to_rmap(gfn, level, slot);
> > +
> > +       return pte_list_need_new_desc(rmap_head);
> > +}
> > +
> >  static void
> >  pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
> >                            struct pte_list_desc *desc, int i,
> > @@ -2129,6 +2191,24 @@ static struct kvm_mmu_page *kvm_mmu_get_existing_sp_maybe_unsync(struct kvm *kvm
> >         return sp;
> >  }
> >
> > +static struct kvm_mmu_page *kvm_mmu_get_existing_direct_sp(struct kvm *kvm,
> > +                                                          gfn_t gfn,
> > +                                                          union kvm_mmu_page_role role)
> > +{
> > +       struct kvm_mmu_page *sp;
> > +       LIST_HEAD(invalid_list);
> > +
> > +       BUG_ON(!role.direct);
> > +
> > +       sp = kvm_mmu_get_existing_sp_maybe_unsync(kvm, gfn, role, &invalid_list);
> > +
> > +       /* Direct SPs are never unsync. */
> > +       WARN_ON_ONCE(sp && sp->unsync);
> > +
> > +       kvm_mmu_commit_zap_page(kvm, &invalid_list);
>
> This should be unnecessary since the page can't be unsync, right?
> I'd be inclined to also add an assertion that invalid_list is empty
> and then BUG or terminate the VM if it's not.

You might be right in practice but the code in kvm_mmu_get_page() (aka
kvm_mmu_get_existing_sp() in this series) does not read that way.
Specifically, KVM zaps unsync SPs that match the same GFN, even if the
target SP is not unsync.
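
For reference, a rough paraphrase of the lookup behavior I'm describing (a
simplified sketch, not the literal code; the level checks and sync handling
are omitted, and sp_list stands in for the mmu_page_hash bucket for this gfn):

        LIST_HEAD(invalid_list);

        for_each_valid_sp(kvm, sp, sp_list) {
                if (sp->gfn != gfn)
                        continue;

                if (sp->role.word != role.word) {
                        /*
                         * Same GFN but a different role: an unsync SP gets
                         * zapped here even though the SP we ultimately return
                         * is not unsync itself.
                         */
                        if (sp->unsync)
                                kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
                        continue;
                }

                /* Role matches: reuse this SP (syncing it first if needed). */
                break;
        }
        kvm_mmu_commit_zap_page(kvm, &invalid_list);

That's why kvm_mmu_get_existing_direct_sp() still commits the invalid_list
even though the SP it returns can never be unsync.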

>
> > +       return sp;
> > +}
> > +
> >  /*
> >   * Looks up an existing SP for the given gfn and role if one exists. The
> >   * return SP is guaranteed to be synced.
> > @@ -5955,12 +6035,275 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >                 kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> >  }
> >
> > +
> > +static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp, gfp_t gfp)
> > +{
> > +       if (*spp)
> > +               return 0;
> > +
> > +       *spp = kvm_mmu_alloc_direct_sp_for_split(gfp);
> > +
> > +       return *spp ? 0 : -ENOMEM;
> > +}
>
> I assume this is preparation for a more complicated allocation scheme
> in a future commit. I'd be inclined to wait on that until it's needed
> as this looks unnecessarily complicated.

Ack.

>
> > +
> > +static int prepare_to_split_huge_page(struct kvm *kvm,
> > +                                     const struct kvm_memory_slot *slot,
> > +                                     u64 *huge_sptep,
> > +                                     struct kvm_mmu_page **spp,
> > +                                     bool *flush,
> > +                                     bool *dropped_lock)
> > +{
> > +       int r = 0;
> > +
> > +       *dropped_lock = false;
> > +
> > +       if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
> > +               return -ENOSPC;
> > +
> > +       if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > +               goto drop_lock;
> > +
> > +       r = alloc_memory_for_split(kvm, spp, GFP_NOWAIT | __GFP_ACCOUNT);
> > +       if (r)
> > +               goto drop_lock;
> > +
> > +       return 0;
> > +
> > +drop_lock:
> > +       if (*flush)
> > +               kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > +
> > +       *flush = false;
> > +       *dropped_lock = true;
> > +
> > +       write_unlock(&kvm->mmu_lock);
> > +       cond_resched();
> > +       r = alloc_memory_for_split(kvm, spp, GFP_KERNEL_ACCOUNT);
>
> You're using different sets of flags in these allocations. Is that
> intentional? I understand the NOWAIT, but there's also a difference
> between GFP_KERNEL_ACCOUNT and __GFP_ACCOUNT which I'm not sure about.

Yes this is intentional. GFP_KERNEL_ACCOUNT is just a convenience
macro for GFP_KERNEL | __GFP_ACCOUNT.

We want allocations to be charged the same way, hence we always use
__GFP_ACCOUNT. But when allocating under the lock we don't want to
block on filesystem callbacks and reclaim, hence GFP_NOWAIT in place
of GFP_KERNEL.
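
To spell that out (GFP_KERNEL_ACCOUNT is defined in include/linux/gfp.h; the
two call sites below are just the ones from this patch, shown side by side
purely for illustration):

        /* GFP_KERNEL_ACCOUNT is shorthand for GFP_KERNEL | __GFP_ACCOUNT. */
        #define GFP_KERNEL_ACCOUNT      (GFP_KERNEL | __GFP_ACCOUNT)

        /* Under the MMU lock: never sleep, but still charge the allocation. */
        r = alloc_memory_for_split(kvm, spp, GFP_NOWAIT | __GFP_ACCOUNT);

        /* After dropping the lock: normal sleeping allocation, still charged. */
        r = alloc_memory_for_split(kvm, spp, GFP_KERNEL_ACCOUNT);

So the only difference between the two attempts is GFP_NOWAIT vs GFP_KERNEL;
the memcg accounting behavior is identical.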

>
> > +       write_lock(&kvm->mmu_lock);
> > +
> > +       return r;
> > +}
> > +
> > +static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
> > +                                                    const struct kvm_memory_slot *slot,
> > +                                                    u64 *huge_sptep,
> > +                                                    struct kvm_mmu_page **spp)
> > +{
> > +       struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> > +       struct kvm_mmu_page *split_sp;
> > +       union kvm_mmu_page_role role;
> > +       unsigned int access;
> > +       gfn_t gfn;
> > +
> > +       gfn = sptep_to_gfn(huge_sptep);
> > +       access = sptep_to_access(huge_sptep);
> > +
> > +       /*
> > +        * Huge page splitting always uses direct shadow pages since we are
> > +        * directly mapping the huge page GFN region with smaller pages.
> > +        */
> > +       role = kvm_mmu_child_role(huge_sp, true, access);
> > +       split_sp = kvm_mmu_get_existing_direct_sp(kvm, gfn, role);
> > +
> > +       /*
> > +        * Opt not to split if the lower-level SP already exists. This requires
> > +        * more complex handling as the SP may be already partially filled in
> > +        * and may need extra pte_list_desc structs to update parent_ptes.
> > +        */
> > +       if (split_sp)
> > +               return NULL;
> > +
> > +       swap(split_sp, *spp);
> > +       kvm_mmu_init_sp(kvm, split_sp, slot, gfn, role);
> > +       trace_kvm_mmu_get_page(split_sp, true);
> > +
> > +       return split_sp;
> > +}
> > +
> > +static int kvm_mmu_split_huge_page(struct kvm *kvm,
> > +                                  const struct kvm_memory_slot *slot,
> > +                                  u64 *huge_sptep, struct kvm_mmu_page **spp,
> > +                                  bool *flush)
> > +
> > +{
> > +       struct kvm_mmu_page *split_sp;
> > +       u64 huge_spte, split_spte;
> > +       int split_level, index;
> > +       unsigned int access;
> > +       u64 *split_sptep;
> > +       gfn_t split_gfn;
> > +
> > +       split_sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
> > +       if (!split_sp)
> > +               return -EOPNOTSUPP;
> > +
> > +       /*
> > +        * Since we did not allocate pte_list_desc structs for the split, we
> > +        * cannot add a new parent SPTE to parent_ptes. This should never happen
> > +        * in practice though since this is a fresh SP.
> > +        *
> > +        * Note, this makes it safe to pass NULL to __link_shadow_page() below.
> > +        */
> > +       if (WARN_ON_ONCE(pte_list_need_new_desc(&split_sp->parent_ptes)))
> > +               return -EINVAL;
> > +
> > +       huge_spte = READ_ONCE(*huge_sptep);
> > +
> > +       split_level = split_sp->role.level;
> > +       access = split_sp->role.access;
> > +
> > +       for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> > +               split_sptep = &split_sp->spt[index];
> > +               split_gfn = kvm_mmu_page_get_gfn(split_sp, index);
> > +
> > +               BUG_ON(is_shadow_present_pte(*split_sptep));
> > +
> > +               /*
> > +                * Since we did not allocate pte_list_desc structs for the
> > +                * split, we can't add a new SPTE that maps this GFN.
> > +                * Skipping this SPTE means we're only partially mapping the
> > +                * huge page, which means we'll need to flush TLBs before
> > +                * dropping the MMU lock.
> > +                *
> > +                * Note, this makes it safe to pass NULL to __rmap_add() below.
> > +                */
> > +               if (rmap_need_new_pte_list_desc(slot, split_gfn, split_level)) {
> > +                       *flush = true;
> > +                       continue;
> > +               }
> > +
> > +               split_spte = make_huge_page_split_spte(
> > +                               huge_spte, split_level + 1, index, access);
> > +
> > +               mmu_spte_set(split_sptep, split_spte);
> > +               __rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);
> > +       }
> > +
> > +       /*
> > +        * Replace the huge spte with a pointer to the populated lower level
> > +        * page table. Since we are making this change without a TLB flush vCPUs
> > +        * will see a mix of the split mappings and the original huge mapping,
> > +        * depending on what's currently in their TLB. This is fine from a
> > +        * correctness standpoint since the translation will be the same either
> > +        * way.
> > +        */
> > +       drop_large_spte(kvm, huge_sptep, false);
> > +       __link_shadow_page(NULL, huge_sptep, split_sp);
> > +
> > +       return 0;
> > +}
> > +
> > +static bool should_split_huge_page(u64 *huge_sptep)
> > +{
> > +       struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> > +
> > +       if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
> > +               return false;
> > +
> > +       if (huge_sp->role.invalid)
> > +               return false;
> > +
> > +       /*
> > +        * As a policy, do not split huge pages if the SP on which they reside
> > +        * is unsync. Unsync means the guest is modifying the page table being
> > +        * shadowed by huge_sp, so splitting may be a waste of cycles and
> > +        * memory.
> > +        */
> > +       if (huge_sp->unsync)
> > +               return false;
> > +
> > +       return true;
> > +}
> > +
> > +static bool rmap_try_split_huge_pages(struct kvm *kvm,
> > +                                     struct kvm_rmap_head *rmap_head,
> > +                                     const struct kvm_memory_slot *slot)
> > +{
> > +       struct kvm_mmu_page *sp = NULL;
> > +       struct rmap_iterator iter;
> > +       u64 *huge_sptep, spte;
> > +       bool flush = false;
> > +       bool dropped_lock;
> > +       int level;
> > +       gfn_t gfn;
> > +       int r;
> > +
> > +restart:
> > +       for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
> > +               if (!should_split_huge_page(huge_sptep))
> > +                       continue;
> > +
> > +               spte = *huge_sptep;
> > +               level = sptep_to_sp(huge_sptep)->role.level;
> > +               gfn = sptep_to_gfn(huge_sptep);
> > +
> > +               r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &flush, &dropped_lock);
> > +               if (r) {
> > +                       trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
> > +                       break;
> > +               }
> > +
> > +               if (dropped_lock)
> > +                       goto restart;
> > +
> > +               r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp, &flush);
> > +
> > +               trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
> > +
> > +               /*
> > +                * If splitting is successful we must restart the iterator
> > +                * because huge_sptep has just been removed from it.
> > +                */
> > +               if (!r)
> > +                       goto restart;
> > +       }
> > +
> > +       if (sp)
> > +               kvm_mmu_free_sp(sp);
> > +
> > +       return flush;
> > +}
> > +
> > +static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
> > +                                         const struct kvm_memory_slot *slot,
> > +                                         gfn_t start, gfn_t end,
> > +                                         int target_level)
> > +{
> > +       bool flush;
> > +       int level;
> > +
> > +       /*
> > +        * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
> > +        * down to the target level. This ensures pages are recursively split
> > +        * all the way to the target level. There's no need to split pages
> > +        * already at the target level.
> > +        *
> > +        * Note that TLB flushes must be done before dropping the MMU lock since
> > +        * rmap_try_split_huge_pages() may partially split any given huge page,
> > +        * i.e. it may effectively unmap (make non-present) a portion of the
> > +        * huge page.
> > +        */
> > +       for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
> > +               flush = slot_handle_level_range(kvm, slot,
> > +                                               rmap_try_split_huge_pages,
> > +                                               level, level, start, end - 1,
> > +                                               true, flush);
> > +       }
> > +
> > +       if (flush)
> > +               kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > +}
> > +
> >  /* Must be called with the mmu_lock held in write-mode. */
> >  void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
> >                                    const struct kvm_memory_slot *memslot,
> >                                    u64 start, u64 end,
> >                                    int target_level)
> >  {
> > +       if (kvm_memslots_have_rmaps(kvm))
> > +               kvm_rmap_try_split_huge_pages(kvm, memslot, start, end,
> > +                                             target_level);
> > +
> >         if (is_tdp_mmu_enabled(kvm))
> >                 kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end,
> >                                                  target_level, false);
> > @@ -5978,6 +6321,12 @@ void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
> >         u64 start = memslot->base_gfn;
> >         u64 end = start + memslot->npages;
> >
> > +       if (kvm_memslots_have_rmaps(kvm)) {
> > +               write_lock(&kvm->mmu_lock);
> > +               kvm_rmap_try_split_huge_pages(kvm, memslot, start, end, target_level);
> > +               write_unlock(&kvm->mmu_lock);
> > +       }
> > +
> >         if (is_tdp_mmu_enabled(kvm)) {
> >                 read_lock(&kvm->mmu_lock);
> >                 kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
> > --
> > 2.35.0.rc2.247.g8bbb082509-goog
> >

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 21/23] KVM: x86/mmu: Fully split huge pages that require extra pte_list_desc structs
  2022-02-28 21:22   ` Ben Gardon
@ 2022-02-28 23:41     ` David Matlack
  2022-03-01  0:37       ` Ben Gardon
  0 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-02-28 23:41 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Mon, Feb 28, 2022 at 1:22 PM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Feb 2, 2022 at 5:03 PM David Matlack <dmatlack@google.com> wrote:
> >
> > When splitting a huge page we need to add all of the lower level SPTEs
> > to the memslot rmap. The current implementation of eager page splitting
> > bails if adding an SPTE would require allocating an extra pte_list_desc
> > struct. Fix this limitation by allocating enough pte_list_desc structs
> > before splitting the huge page.
> >
> > This eliminates the need for TLB flushing under the MMU lock because the
> > huge page is always entirely split (no subregion of the huge page is
> > unmapped).
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  10 ++++
> >  arch/x86/kvm/mmu/mmu.c          | 101 ++++++++++++++++++--------------
> >  2 files changed, 67 insertions(+), 44 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index d0b12bfe5818..a0f7578f7a26 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1232,6 +1232,16 @@ struct kvm_arch {
> >         hpa_t   hv_root_tdp;
> >         spinlock_t hv_root_tdp_lock;
> >  #endif
> > +
> > +       /*
> > +        * Memory cache used to allocate pte_list_desc structs while splitting
> > +        * huge pages. In the worst case, to split one huge page we need 512
> > +        * pte_list_desc structs to add each new lower level leaf sptep to the
> > +        * memslot rmap.
> > +        */
> > +#define HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY 512
> > +       __DEFINE_KVM_MMU_MEMORY_CACHE(huge_page_split_desc_cache,
> > +                                     HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY);
> >  };
> >
> >  struct kvm_vm_stat {
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 825cfdec589b..c7981a934237 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5905,6 +5905,11 @@ void kvm_mmu_init_vm(struct kvm *kvm)
> >         node->track_write = kvm_mmu_pte_write;
> >         node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> >         kvm_page_track_register_notifier(kvm, node);
> > +
> > +       kvm->arch.huge_page_split_desc_cache.capacity =
> > +               HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY;
> > +       kvm->arch.huge_page_split_desc_cache.kmem_cache = pte_list_desc_cache;
> > +       kvm->arch.huge_page_split_desc_cache.gfp_zero = __GFP_ZERO;
> >  }
> >
> >  void kvm_mmu_uninit_vm(struct kvm *kvm)
> > @@ -6035,9 +6040,42 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >                 kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> >  }
> >
> > +static int min_descs_for_split(const struct kvm_memory_slot *slot, u64 *huge_sptep)
> > +{
> > +       struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> > +       int split_level = huge_sp->role.level - 1;
> > +       int i, min = 0;
> > +       gfn_t gfn;
> > +
> > +       gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
> >
> > -static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp, gfp_t gfp)
> > +       for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> > +               if (rmap_need_new_pte_list_desc(slot, gfn, split_level))
> > +                       min++;
> > +
> > +               gfn += KVM_PAGES_PER_HPAGE(split_level);
> > +       }
> > +
> > +       return min;
> > +}
>
> Is this calculation worth doing? It seems like we're doing a lot of
> work here to calculate exactly how many pages we need to allocate, but
> if eager splitting we'll be doing this over and over again. It seems
> like it would be more efficient to just always fill the cache since
> any extra pages allocated to split one page can be used to split the
> next one.

topup_huge_page_split_desc_cache() does fill the cache. This
calculation is just to determine the minimum number of objects needed
to split the next huge page, so that we can skip refilling the cache
when it's unnecessary.

I think you are suggesting we unconditionally top up the cache and
hard-code the min to 513 (the capacity of the cache)? That would
certainly allow us to drop this function (less code complexity) but
would result in extra unnecessary allocations. If the cost of those
allocations is negligible then I can see an argument for going with
your approach.
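
For concreteness, I think the variant you're suggesting would look roughly
like this (a hypothetical sketch, not what this series does:
min_descs_for_split() goes away and we always top the cache up to its full
capacity, letting any leftover objects carry over to the next huge page):

        static int topup_huge_page_split_desc_cache(struct kvm *kvm, gfp_t gfp)
        {
                struct kvm_mmu_memory_cache *cache =
                        &kvm->arch.huge_page_split_desc_cache;

                /*
                 * Always fill the cache to capacity. Objects left over after
                 * splitting one huge page are simply reused for the next one.
                 */
                return __kvm_mmu_topup_memory_cache(cache, cache->capacity, gfp);
        }

i.e. the per-huge-page minimum is effectively hard-coded to the cache
capacity.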

>
> > +
> > +static int topup_huge_page_split_desc_cache(struct kvm *kvm, int min, gfp_t gfp)
> > +{
> > +       struct kvm_mmu_memory_cache *cache =
> > +               &kvm->arch.huge_page_split_desc_cache;
> > +
> > +       return __kvm_mmu_topup_memory_cache(cache, min, gfp);
> > +}
> > +
> > +static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp,
> > +                                 int min_descs, gfp_t gfp)
> >  {
> > +       int r;
> > +
> > +       r = topup_huge_page_split_desc_cache(kvm, min_descs, gfp);
> > +       if (r)
> > +               return r;
> > +
> >         if (*spp)
> >                 return 0;
> >
> > @@ -6050,9 +6088,9 @@ static int prepare_to_split_huge_page(struct kvm *kvm,
> >                                       const struct kvm_memory_slot *slot,
> >                                       u64 *huge_sptep,
> >                                       struct kvm_mmu_page **spp,
> > -                                     bool *flush,
> >                                       bool *dropped_lock)
> >  {
> > +       int min_descs = min_descs_for_split(slot, huge_sptep);
> >         int r = 0;
> >
> >         *dropped_lock = false;
> > @@ -6063,22 +6101,18 @@ static int prepare_to_split_huge_page(struct kvm *kvm,
> >         if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> >                 goto drop_lock;
> >
> > -       r = alloc_memory_for_split(kvm, spp, GFP_NOWAIT | __GFP_ACCOUNT);
> > +       r = alloc_memory_for_split(kvm, spp, min_descs, GFP_NOWAIT | __GFP_ACCOUNT);
> >         if (r)
> >                 goto drop_lock;
> >
> >         return 0;
> >
> >  drop_lock:
> > -       if (*flush)
> > -               kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > -
> > -       *flush = false;
> >         *dropped_lock = true;
> >
> >         write_unlock(&kvm->mmu_lock);
> >         cond_resched();
> > -       r = alloc_memory_for_split(kvm, spp, GFP_KERNEL_ACCOUNT);
> > +       r = alloc_memory_for_split(kvm, spp, min_descs, GFP_KERNEL_ACCOUNT);
> >         write_lock(&kvm->mmu_lock);
> >
> >         return r;
> > @@ -6122,10 +6156,10 @@ static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
> >
> >  static int kvm_mmu_split_huge_page(struct kvm *kvm,
> >                                    const struct kvm_memory_slot *slot,
> > -                                  u64 *huge_sptep, struct kvm_mmu_page **spp,
> > -                                  bool *flush)
> > +                                  u64 *huge_sptep, struct kvm_mmu_page **spp)
> >
> >  {
> > +       struct kvm_mmu_memory_cache *cache;
> >         struct kvm_mmu_page *split_sp;
> >         u64 huge_spte, split_spte;
> >         int split_level, index;
> > @@ -6138,9 +6172,9 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
> >                 return -EOPNOTSUPP;
> >
> >         /*
> > -        * Since we did not allocate pte_list_desc structs for the split, we
> > -        * cannot add a new parent SPTE to parent_ptes. This should never happen
> > -        * in practice though since this is a fresh SP.
> > +        * We did not allocate an extra pte_list_desc struct to add huge_sptep
> > +        * to split_sp->parent_ptes. An extra pte_list_desc struct should never
> > +        * be necessary in practice though since split_sp is brand new.
> >          *
> >          * Note, this makes it safe to pass NULL to __link_shadow_page() below.
> >          */
> > @@ -6151,6 +6185,7 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
> >
> >         split_level = split_sp->role.level;
> >         access = split_sp->role.access;
> > +       cache = &kvm->arch.huge_page_split_desc_cache;
> >
> >         for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> >                 split_sptep = &split_sp->spt[index];
> > @@ -6158,25 +6193,11 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
> >
> >                 BUG_ON(is_shadow_present_pte(*split_sptep));
> >
> > -               /*
> > -                * Since we did not allocate pte_list_desc structs for the
> > -                * split, we can't add a new SPTE that maps this GFN.
> > -                * Skipping this SPTE means we're only partially mapping the
> > -                * huge page, which means we'll need to flush TLBs before
> > -                * dropping the MMU lock.
> > -                *
> > -                * Note, this makes it safe to pass NULL to __rmap_add() below.
> > -                */
> > -               if (rmap_need_new_pte_list_desc(slot, split_gfn, split_level)) {
> > -                       *flush = true;
> > -                       continue;
> > -               }
> > -
> >                 split_spte = make_huge_page_split_spte(
> >                                 huge_spte, split_level + 1, index, access);
> >
> >                 mmu_spte_set(split_sptep, split_spte);
> > -               __rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);
> > +               __rmap_add(kvm, cache, slot, split_sptep, split_gfn, access);
> >         }
> >
> >         /*
> > @@ -6222,7 +6243,6 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
> >         struct kvm_mmu_page *sp = NULL;
> >         struct rmap_iterator iter;
> >         u64 *huge_sptep, spte;
> > -       bool flush = false;
> >         bool dropped_lock;
> >         int level;
> >         gfn_t gfn;
> > @@ -6237,7 +6257,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
> >                 level = sptep_to_sp(huge_sptep)->role.level;
> >                 gfn = sptep_to_gfn(huge_sptep);
> >
> > -               r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &flush, &dropped_lock);
> > +               r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &dropped_lock);
> >                 if (r) {
> >                         trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
> >                         break;
> > @@ -6246,7 +6266,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
> >                 if (dropped_lock)
> >                         goto restart;
> >
> > -               r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp, &flush);
> > +               r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp);
> >
> >                 trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
> >
> > @@ -6261,7 +6281,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
> >         if (sp)
> >                 kvm_mmu_free_sp(sp);
> >
> > -       return flush;
> > +       return false;
> >  }
> >
> >  static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
> > @@ -6269,7 +6289,6 @@ static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
> >                                           gfn_t start, gfn_t end,
> >                                           int target_level)
> >  {
> > -       bool flush;
> >         int level;
> >
> >         /*
> > @@ -6277,21 +6296,15 @@ static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
> >          * down to the target level. This ensures pages are recursively split
> >          * all the way to the target level. There's no need to split pages
> >          * already at the target level.
> > -        *
> > -        * Note that TLB flushes must be done before dropping the MMU lock since
> > -        * rmap_try_split_huge_pages() may partially split any given huge page,
> > -        * i.e. it may effectively unmap (make non-present) a portion of the
> > -        * huge page.
> >          */
> >         for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
> > -               flush = slot_handle_level_range(kvm, slot,
> > -                                               rmap_try_split_huge_pages,
> > -                                               level, level, start, end - 1,
> > -                                               true, flush);
> > +               slot_handle_level_range(kvm, slot,
> > +                                       rmap_try_split_huge_pages,
> > +                                       level, level, start, end - 1,
> > +                                       true, false);
> >         }
> >
> > -       if (flush)
> > -               kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > +       kvm_mmu_free_memory_cache(&kvm->arch.huge_page_split_desc_cache);
> >  }
> >
> >  /* Must be called with the mmu_lock held in write-mode. */
> > --
> > 2.35.0.rc2.247.g8bbb082509-goog
> >

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 21/23] KVM: x86/mmu: Fully split huge pages that require extra pte_list_desc structs
  2022-02-28 23:41     ` David Matlack
@ 2022-03-01  0:37       ` Ben Gardon
  2022-03-03 19:59         ` David Matlack
  0 siblings, 1 reply; 65+ messages in thread
From: Ben Gardon @ 2022-03-01  0:37 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Mon, Feb 28, 2022 at 3:41 PM David Matlack <dmatlack@google.com> wrote:
>
> On Mon, Feb 28, 2022 at 1:22 PM Ben Gardon <bgardon@google.com> wrote:
> >
> > On Wed, Feb 2, 2022 at 5:03 PM David Matlack <dmatlack@google.com> wrote:
> > >
> > > When splitting a huge page we need to add all of the lower level SPTEs
> > > to the memslot rmap. The current implementation of eager page splitting
> > > bails if adding an SPTE would require allocating an extra pte_list_desc
> > > struct. Fix this limitation by allocating enough pte_list_desc structs
> > > before splitting the huge page.
> > >
> > > This eliminates the need for TLB flushing under the MMU lock because the
> > > huge page is always entirely split (no subregion of the huge page is
> > > unmapped).
> > >
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > ---
> > >  arch/x86/include/asm/kvm_host.h |  10 ++++
> > >  arch/x86/kvm/mmu/mmu.c          | 101 ++++++++++++++++++--------------
> > >  2 files changed, 67 insertions(+), 44 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index d0b12bfe5818..a0f7578f7a26 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -1232,6 +1232,16 @@ struct kvm_arch {
> > >         hpa_t   hv_root_tdp;
> > >         spinlock_t hv_root_tdp_lock;
> > >  #endif
> > > +
> > > +       /*
> > > +        * Memory cache used to allocate pte_list_desc structs while splitting
> > > +        * huge pages. In the worst case, to split one huge page we need 512
> > > +        * pte_list_desc structs to add each new lower level leaf sptep to the
> > > +        * memslot rmap.
> > > +        */
> > > +#define HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY 512
> > > +       __DEFINE_KVM_MMU_MEMORY_CACHE(huge_page_split_desc_cache,
> > > +                                     HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY);
> > >  };
> > >
> > >  struct kvm_vm_stat {
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 825cfdec589b..c7981a934237 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -5905,6 +5905,11 @@ void kvm_mmu_init_vm(struct kvm *kvm)
> > >         node->track_write = kvm_mmu_pte_write;
> > >         node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> > >         kvm_page_track_register_notifier(kvm, node);
> > > +
> > > +       kvm->arch.huge_page_split_desc_cache.capacity =
> > > +               HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY;
> > > +       kvm->arch.huge_page_split_desc_cache.kmem_cache = pte_list_desc_cache;
> > > +       kvm->arch.huge_page_split_desc_cache.gfp_zero = __GFP_ZERO;
> > >  }
> > >
> > >  void kvm_mmu_uninit_vm(struct kvm *kvm)
> > > @@ -6035,9 +6040,42 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> > >                 kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> > >  }
> > >
> > > +static int min_descs_for_split(const struct kvm_memory_slot *slot, u64 *huge_sptep)
> > > +{
> > > +       struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> > > +       int split_level = huge_sp->role.level - 1;
> > > +       int i, min = 0;
> > > +       gfn_t gfn;
> > > +
> > > +       gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
> > >
> > > -static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp, gfp_t gfp)
> > > +       for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> > > +               if (rmap_need_new_pte_list_desc(slot, gfn, split_level))
> > > +                       min++;
> > > +
> > > +               gfn += KVM_PAGES_PER_HPAGE(split_level);
> > > +       }
> > > +
> > > +       return min;
> > > +}
> >
> > Is this calculation worth doing? It seems like we're doing a lot of
> > work here to calculate exactly how many pages we need to allocate, but
> > if eager splitting we'll be doing this over and over again. It seems
> > like it would be more efficient to just always fill the cache since
> > any extra pages allocated to split one page can be used to split the
> > next one.
>
> topup_huge_page_split_desc_cache() does fill the cache. This
> calculation is just to determine the minimum number of objects needed
> to split the next huge page, so that we can skip refilling the cache
> when it's unnecessary.
>
> I think you are suggesting we unconditionally topup the cache and
> hard-code the min to 513 (the capacity of the cache)? That would
> certainly allow us to drop this function (less code complexity) but
> would result in extra unnecessary allocations. If the cost of those
> allocations is negligible then I can see an argument for going with
> your approach.

Right, exactly.
If you're eagerly splitting the entire EPT for a VM, then the number
of extra allocations is bounded at 513 because memory allocated for
> one page can be used for the next one if not needed, right?
If you check how many you need on each pass, you'll be doing
potentially O(pages split) extra work, so I suspect that
unconditionally filling the cache will scale better.
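
To make this concrete, here's a rough (untested) sketch of what I'm
suggesting, assuming __kvm_mmu_topup_memory_cache() keeps the
(cache, min, gfp) signature from this patch, so min_descs_for_split()
goes away entirely:

        /*
         * Always top the cache up to its full capacity. Once the cache is
         * full this is a no-op, and objects left over from splitting one
         * huge page get reused for the next one, so the extra allocations
         * are bounded by the cache capacity rather than growing with the
         * number of pages split.
         */
        static int topup_huge_page_split_desc_cache(struct kvm *kvm, gfp_t gfp)
        {
                struct kvm_mmu_memory_cache *cache =
                        &kvm->arch.huge_page_split_desc_cache;

                return __kvm_mmu_topup_memory_cache(cache, cache->capacity, gfp);
        }

The one wrinkle I can see is that a full top-up is more likely to fail
with GFP_NOWAIT than a minimal one, so the drop-lock-and-retry path may
get taken a bit more often.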

>
> >
> > > +
> > > +static int topup_huge_page_split_desc_cache(struct kvm *kvm, int min, gfp_t gfp)
> > > +{
> > > +       struct kvm_mmu_memory_cache *cache =
> > > +               &kvm->arch.huge_page_split_desc_cache;
> > > +
> > > +       return __kvm_mmu_topup_memory_cache(cache, min, gfp);
> > > +}
> > > +
> > > +static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp,
> > > +                                 int min_descs, gfp_t gfp)
> > >  {
> > > +       int r;
> > > +
> > > +       r = topup_huge_page_split_desc_cache(kvm, min_descs, gfp);
> > > +       if (r)
> > > +               return r;
> > > +
> > >         if (*spp)
> > >                 return 0;
> > >
> > > @@ -6050,9 +6088,9 @@ static int prepare_to_split_huge_page(struct kvm *kvm,
> > >                                       const struct kvm_memory_slot *slot,
> > >                                       u64 *huge_sptep,
> > >                                       struct kvm_mmu_page **spp,
> > > -                                     bool *flush,
> > >                                       bool *dropped_lock)
> > >  {
> > > +       int min_descs = min_descs_for_split(slot, huge_sptep);
> > >         int r = 0;
> > >
> > >         *dropped_lock = false;
> > > @@ -6063,22 +6101,18 @@ static int prepare_to_split_huge_page(struct kvm *kvm,
> > >         if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > >                 goto drop_lock;
> > >
> > > -       r = alloc_memory_for_split(kvm, spp, GFP_NOWAIT | __GFP_ACCOUNT);
> > > +       r = alloc_memory_for_split(kvm, spp, min_descs, GFP_NOWAIT | __GFP_ACCOUNT);
> > >         if (r)
> > >                 goto drop_lock;
> > >
> > >         return 0;
> > >
> > >  drop_lock:
> > > -       if (*flush)
> > > -               kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > > -
> > > -       *flush = false;
> > >         *dropped_lock = true;
> > >
> > >         write_unlock(&kvm->mmu_lock);
> > >         cond_resched();
> > > -       r = alloc_memory_for_split(kvm, spp, GFP_KERNEL_ACCOUNT);
> > > +       r = alloc_memory_for_split(kvm, spp, min_descs, GFP_KERNEL_ACCOUNT);
> > >         write_lock(&kvm->mmu_lock);
> > >
> > >         return r;
> > > @@ -6122,10 +6156,10 @@ static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
> > >
> > >  static int kvm_mmu_split_huge_page(struct kvm *kvm,
> > >                                    const struct kvm_memory_slot *slot,
> > > -                                  u64 *huge_sptep, struct kvm_mmu_page **spp,
> > > -                                  bool *flush)
> > > +                                  u64 *huge_sptep, struct kvm_mmu_page **spp)
> > >
> > >  {
> > > +       struct kvm_mmu_memory_cache *cache;
> > >         struct kvm_mmu_page *split_sp;
> > >         u64 huge_spte, split_spte;
> > >         int split_level, index;
> > > @@ -6138,9 +6172,9 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
> > >                 return -EOPNOTSUPP;
> > >
> > >         /*
> > > -        * Since we did not allocate pte_list_desc_structs for the split, we
> > > -        * cannot add a new parent SPTE to parent_ptes. This should never happen
> > > -        * in practice though since this is a fresh SP.
> > > +        * We did not allocate an extra pte_list_desc struct to add huge_sptep
> > > +        * to split_sp->parent_ptes. An extra pte_list_desc struct should never
> > > +        * be necessary in practice though since split_sp is brand new.
> > >          *
> > >          * Note, this makes it safe to pass NULL to __link_shadow_page() below.
> > >          */
> > > @@ -6151,6 +6185,7 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
> > >
> > >         split_level = split_sp->role.level;
> > >         access = split_sp->role.access;
> > > +       cache = &kvm->arch.huge_page_split_desc_cache;
> > >
> > >         for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> > >                 split_sptep = &split_sp->spt[index];
> > > @@ -6158,25 +6193,11 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
> > >
> > >                 BUG_ON(is_shadow_present_pte(*split_sptep));
> > >
> > > -               /*
> > > -                * Since we did not allocate pte_list_desc structs for the
> > > -                * split, we can't add a new SPTE that maps this GFN.
> > > -                * Skipping this SPTE means we're only partially mapping the
> > > -                * huge page, which means we'll need to flush TLBs before
> > > -                * dropping the MMU lock.
> > > -                *
> > > -                * Note, this make it safe to pass NULL to __rmap_add() below.
> > > -                */
> > > -               if (rmap_need_new_pte_list_desc(slot, split_gfn, split_level)) {
> > > -                       *flush = true;
> > > -                       continue;
> > > -               }
> > > -
> > >                 split_spte = make_huge_page_split_spte(
> > >                                 huge_spte, split_level + 1, index, access);
> > >
> > >                 mmu_spte_set(split_sptep, split_spte);
> > > -               __rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);
> > > +               __rmap_add(kvm, cache, slot, split_sptep, split_gfn, access);
> > >         }
> > >
> > >         /*
> > > @@ -6222,7 +6243,6 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
> > >         struct kvm_mmu_page *sp = NULL;
> > >         struct rmap_iterator iter;
> > >         u64 *huge_sptep, spte;
> > > -       bool flush = false;
> > >         bool dropped_lock;
> > >         int level;
> > >         gfn_t gfn;
> > > @@ -6237,7 +6257,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
> > >                 level = sptep_to_sp(huge_sptep)->role.level;
> > >                 gfn = sptep_to_gfn(huge_sptep);
> > >
> > > -               r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &flush, &dropped_lock);
> > > +               r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &dropped_lock);
> > >                 if (r) {
> > >                         trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
> > >                         break;
> > > @@ -6246,7 +6266,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
> > >                 if (dropped_lock)
> > >                         goto restart;
> > >
> > > -               r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp, &flush);
> > > +               r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp);
> > >
> > >                 trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
> > >
> > > @@ -6261,7 +6281,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
> > >         if (sp)
> > >                 kvm_mmu_free_sp(sp);
> > >
> > > -       return flush;
> > > +       return false;
> > >  }
> > >
> > >  static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
> > > @@ -6269,7 +6289,6 @@ static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
> > >                                           gfn_t start, gfn_t end,
> > >                                           int target_level)
> > >  {
> > > -       bool flush;
> > >         int level;
> > >
> > >         /*
> > > @@ -6277,21 +6296,15 @@ static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
> > >          * down to the target level. This ensures pages are recursively split
> > >          * all the way to the target level. There's no need to split pages
> > >          * already at the target level.
> > > -        *
> > > -        * Note that TLB flushes must be done before dropping the MMU lock since
> > > -        * rmap_try_split_huge_pages() may partially split any given huge page,
> > > -        * i.e. it may effectively unmap (make non-present) a portion of the
> > > -        * huge page.
> > >          */
> > >         for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
> > > -               flush = slot_handle_level_range(kvm, slot,
> > > -                                               rmap_try_split_huge_pages,
> > > -                                               level, level, start, end - 1,
> > > -                                               true, flush);
> > > +               slot_handle_level_range(kvm, slot,
> > > +                                       rmap_try_split_huge_pages,
> > > +                                       level, level, start, end - 1,
> > > +                                       true, false);
> > >         }
> > >
> > > -       if (flush)
> > > -               kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > > +       kvm_mmu_free_memory_cache(&kvm->arch.huge_page_split_desc_cache);
> > >  }
> > >
> > >  /* Must be called with the mmu_lock held in write-mode. */
> > > --
> > > 2.35.0.rc2.247.g8bbb082509-goog
> > >

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 16/23] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  2022-02-28 20:39   ` Ben Gardon
@ 2022-03-03 19:42     ` David Matlack
  0 siblings, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-03-03 19:42 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Mon, Feb 28, 2022 at 12:40 PM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Feb 2, 2022 at 5:02 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU (i.e.
> > in the rmap). This leads to correct behavior because KVM never creates
> > intermediate huge pages during dirty logging. For example, a 1GiB page
> > is never partially split to a 2MiB page.
> >
> > However this behavior will stop being correct once the shadow MMU
> > participates in eager page splitting, which can in fact leave behind
> > partially split huge pages. In preparation for that change, change the
> > shadow MMU to iterate over all levels when zapping collapsible SPTEs.
> >
> > No functional change intended.
> >
>
> Reviewed-by: Ben Gardon <bgardon@google.com>
>
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++-------
> >  1 file changed, 14 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e2306a39526a..99ad7cc8683f 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6038,18 +6038,25 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >         return need_tlb_flush;
> >  }
> >
> > +static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
> > +                                          const struct kvm_memory_slot *slot)
> > +{
> > +       bool flush;
> > +
> > +       flush = slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
> > +                                 PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL, true);
>
> The max level here only needs to be 2M since 1G page wouldn't be
> split. I think the upper limit can be lowered to
> KVM_MAX_HUGEPAGE_LEVEL - 1.
> Not a significant performance difference though.

Good point. There's no reason to look at huge pages that are already
mapped at the maximum possible level.
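
i.e. the call just becomes (untested, but it's a one-line change on top
of this patch):

        flush = slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
                                  PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true);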

>
> > +
> > +       if (flush)
> > +               kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > +
> > +}
> > +
> >  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
> >                                    const struct kvm_memory_slot *slot)
> >  {
> >         if (kvm_memslots_have_rmaps(kvm)) {
> >                 write_lock(&kvm->mmu_lock);
> > -               /*
> > -                * Zap only 4k SPTEs since the legacy MMU only supports dirty
> > -                * logging at a 4k granularity and never creates collapsible
> > -                * 2m SPTEs during dirty logging.
> > -                */
> > -               if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true))
> > -                       kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > +               kvm_rmap_zap_collapsible_sptes(kvm, slot);
> >                 write_unlock(&kvm->mmu_lock);
> >         }
> >
> > --
> > 2.35.0.rc2.247.g8bbb082509-goog
> >

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 17/23] KVM: x86/mmu: Pass bool flush parameter to drop_large_spte()
  2022-02-28 20:47   ` Ben Gardon
@ 2022-03-03 19:52     ` David Matlack
  0 siblings, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-03-03 19:52 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Mon, Feb 28, 2022 at 12:47 PM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Feb 2, 2022 at 5:02 PM David Matlack <dmatlack@google.com> wrote:
> >
> > drop_large_spte() drops a large SPTE if it exists and then flushes TLBs.
> > Its helper function, __drop_large_spte(), does the drop without the
> > flush. This difference is not obvious from the name.
> >
> > To make the code more readable, pass an explicit flush parameter. Also
> > replace the vCPU pointer with a KVM pointer so we can get rid of the
> > double-underscore helper function.
> >
> > This is also in preparation for a future commit that will conditionally
> > flush after dropping a large SPTE.
> >
> > No functional change intended.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c         | 25 +++++++++++--------------
> >  arch/x86/kvm/mmu/paging_tmpl.h |  4 ++--
> >  2 files changed, 13 insertions(+), 16 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 99ad7cc8683f..2d47a54e62a5 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1162,23 +1162,20 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
> >  }
> >
> >
> > -static bool __drop_large_spte(struct kvm *kvm, u64 *sptep)
> > +static void drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)
>
> Since there are no callers of __drop_large_spte, I'd be inclined to
> hold off on adding the flush parameter in this commit and just add it
> when it's needed,

The same argument about waiting until there's a user could be said
about "KVM: x86/mmu: Pass access information to
make_huge_page_split_spte()". I agree with this advice when the future
user is entirely theoretical or in some future series. But when the
future user is literally the next commit in the series, I think it's
ok to do things this way since it distributes the net diff more evenly
among patches, which eases reviewing.

But, you've got me thinking and I think I want to change this commit
slightly: I'll keep __drop_large_spte() but push all the implementation
into it and add a bool flush parameter there. That way we don't have
to change all the call sites of drop_large_spte() in this commit. The
implementation of drop_large_spte() will just be
__drop_large_spte(..., true). And the next commit can call
__drop_large_spte(..., false) with a comment.
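
Roughly something like this (just a sketch of the direction, not the
final patch):

        static void __drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)
        {
                struct kvm_mmu_page *sp;

                if (!is_large_pte(*sptep))
                        return;

                sp = sptep_to_sp(sptep);
                WARN_ON(sp->role.level == PG_LEVEL_4K);

                drop_spte(kvm, sptep);

                if (flush)
                        kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
                                KVM_PAGES_PER_HPAGE(sp->role.level));
        }

        static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
        {
                __drop_large_spte(vcpu->kvm, sptep, true);
        }

That keeps the existing drop_large_spte() call sites untouched, and the
eager page splitting code can call __drop_large_spte(..., false) with a
comment explaining why skipping the flush is safe.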

> or better yet after you add the new user with the
> conditional flush so that there's a commit explaining why it's safe to
> not always flush in that case.
>
> >  {
> > -       if (is_large_pte(*sptep)) {
> > -               WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K);
> > -               drop_spte(kvm, sptep);
> > -               return true;
> > -       }
> > +       struct kvm_mmu_page *sp;
> >
> > -       return false;
> > -}
> > +       if (!is_large_pte(*sptep))
> > +               return;
> >
> > -static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
> > -{
> > -       if (__drop_large_spte(vcpu->kvm, sptep)) {
> > -               struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> > +       sp = sptep_to_sp(sptep);
> > +       WARN_ON(sp->role.level == PG_LEVEL_4K);
> > +
> > +       drop_spte(kvm, sptep);
> >
> > -               kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
> > +       if (flush) {
> > +               kvm_flush_remote_tlbs_with_address(kvm, sp->gfn,
> >                         KVM_PAGES_PER_HPAGE(sp->role.level));
> >         }
> >  }
> > @@ -3051,7 +3048,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >                 if (it.level == fault->goal_level)
> >                         break;
> >
> > -               drop_large_spte(vcpu, it.sptep);
> > +               drop_large_spte(vcpu->kvm, it.sptep, true);
> >                 if (is_shadow_present_pte(*it.sptep))
> >                         continue;
> >
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index 703dfb062bf0..ba61de29f2e5 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -677,7 +677,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >                 gfn_t table_gfn;
> >
> >                 clear_sp_write_flooding_count(it.sptep);
> > -               drop_large_spte(vcpu, it.sptep);
> > +               drop_large_spte(vcpu->kvm, it.sptep, true);
> >
> >                 sp = NULL;
> >                 if (!is_shadow_present_pte(*it.sptep)) {
> > @@ -739,7 +739,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >
> >                 validate_direct_spte(vcpu, it.sptep, direct_access);
> >
> > -               drop_large_spte(vcpu, it.sptep);
> > +               drop_large_spte(vcpu->kvm, it.sptep, true);
> >
> >                 if (!is_shadow_present_pte(*it.sptep)) {
> >                         sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
> > --
> > 2.35.0.rc2.247.g8bbb082509-goog
> >

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 21/23] KVM: x86/mmu: Fully split huge pages that require extra pte_list_desc structs
  2022-03-01  0:37       ` Ben Gardon
@ 2022-03-03 19:59         ` David Matlack
  0 siblings, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-03-03 19:59 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S . Szmigiero, kvm

On Mon, Feb 28, 2022 at 4:37 PM Ben Gardon <bgardon@google.com> wrote:
>
> On Mon, Feb 28, 2022 at 3:41 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Mon, Feb 28, 2022 at 1:22 PM Ben Gardon <bgardon@google.com> wrote:
> > >
> > > On Wed, Feb 2, 2022 at 5:03 PM David Matlack <dmatlack@google.com> wrote:
> > > >
> > > > When splitting a huge page we need to add all of the lower level SPTEs
> > > > to the memslot rmap. The current implementation of eager page splitting
> > > > bails if adding an SPTE would require allocating an extra pte_list_desc
> > > > struct. Fix this limitation by allocating enough pte_list_desc structs
> > > > before splitting the huge page.
> > > >
> > > > This eliminates the need for TLB flushing under the MMU lock because the
> > > > huge page is always entirely split (no subregion of the huge page is
> > > > unmapped).
> > > >
> > > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > > ---
> > > >  arch/x86/include/asm/kvm_host.h |  10 ++++
> > > >  arch/x86/kvm/mmu/mmu.c          | 101 ++++++++++++++++++--------------
> > > >  2 files changed, 67 insertions(+), 44 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index d0b12bfe5818..a0f7578f7a26 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1232,6 +1232,16 @@ struct kvm_arch {
> > > >         hpa_t   hv_root_tdp;
> > > >         spinlock_t hv_root_tdp_lock;
> > > >  #endif
> > > > +
> > > > +       /*
> > > > +        * Memory cache used to allocate pte_list_desc structs while splitting
> > > > +        * huge pages. In the worst case, to split one huge page we need 512
> > > > +        * pte_list_desc structs to add each new lower level leaf sptep to the
> > > > +        * memslot rmap.
> > > > +        */
> > > > +#define HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY 512
> > > > +       __DEFINE_KVM_MMU_MEMORY_CACHE(huge_page_split_desc_cache,
> > > > +                                     HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY);
> > > >  };
> > > >
> > > >  struct kvm_vm_stat {
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 825cfdec589b..c7981a934237 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -5905,6 +5905,11 @@ void kvm_mmu_init_vm(struct kvm *kvm)
> > > >         node->track_write = kvm_mmu_pte_write;
> > > >         node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> > > >         kvm_page_track_register_notifier(kvm, node);
> > > > +
> > > > +       kvm->arch.huge_page_split_desc_cache.capacity =
> > > > +               HUGE_PAGE_SPLIT_DESC_CACHE_CAPACITY;
> > > > +       kvm->arch.huge_page_split_desc_cache.kmem_cache = pte_list_desc_cache;
> > > > +       kvm->arch.huge_page_split_desc_cache.gfp_zero = __GFP_ZERO;
> > > >  }
> > > >
> > > >  void kvm_mmu_uninit_vm(struct kvm *kvm)
> > > > @@ -6035,9 +6040,42 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> > > >                 kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
> > > >  }
> > > >
> > > > +static int min_descs_for_split(const struct kvm_memory_slot *slot, u64 *huge_sptep)
> > > > +{
> > > > +       struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> > > > +       int split_level = huge_sp->role.level - 1;
> > > > +       int i, min = 0;
> > > > +       gfn_t gfn;
> > > > +
> > > > +       gfn = kvm_mmu_page_get_gfn(huge_sp, huge_sptep - huge_sp->spt);
> > > >
> > > > -static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp, gfp_t gfp)
> > > > +       for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> > > > +               if (rmap_need_new_pte_list_desc(slot, gfn, split_level))
> > > > +                       min++;
> > > > +
> > > > +               gfn += KVM_PAGES_PER_HPAGE(split_level);
> > > > +       }
> > > > +
> > > > +       return min;
> > > > +}
> > >
> > > Is this calculation worth doing? It seems like we're doing a lot of
> > > work here to calculate exactly how many pages we need to allocate, but
> > > if eager splitting we'll be doing this over and over again. It seems
> > > like it would be more efficient to just always fill the cache since
> > > any extra pages allocated to split one page can be used to split the
> > > next one.
> >
> > topup_huge_page_split_desc_cache() does fill the cache. This
> > calculation is just to determine the minimum number of objects needed
> > to split the next huge page, so that we can skip refilling the cache
> > when it's unnecessary.
> >
> > I think you are suggesting we unconditionally topup the cache and
> > hard-code the min to 513 (the capacity of the cache)? That would
> > certainly allow us to drop this function (less code complexity) but
> > would result in extra unnecessary allocations. If the cost of those
> > allocations is negligible then I can see an argument for going with
> > your approach.
>
> Right, exactly.
> If you're eagerly splitting the entire EPT for a VM, then the number
> of extra allocations is bounded at 513 because memory allocated for
> one page can be used for the next one if not needed, right?
> If you check how many you need on each pass, you'll be doing
> potentially O(pages split) extra work, so I suspect that
> unconditionally filling the cache will scale better.

Makes sense. I'll do some testing and see if we can drop this code. Thanks!

>
> >
> > >
> > > > +
> > > > +static int topup_huge_page_split_desc_cache(struct kvm *kvm, int min, gfp_t gfp)
> > > > +{
> > > > +       struct kvm_mmu_memory_cache *cache =
> > > > +               &kvm->arch.huge_page_split_desc_cache;
> > > > +
> > > > +       return __kvm_mmu_topup_memory_cache(cache, min, gfp);
> > > > +}
> > > > +
> > > > +static int alloc_memory_for_split(struct kvm *kvm, struct kvm_mmu_page **spp,
> > > > +                                 int min_descs, gfp_t gfp)
> > > >  {
> > > > +       int r;
> > > > +
> > > > +       r = topup_huge_page_split_desc_cache(kvm, min_descs, gfp);
> > > > +       if (r)
> > > > +               return r;
> > > > +
> > > >         if (*spp)
> > > >                 return 0;
> > > >
> > > > @@ -6050,9 +6088,9 @@ static int prepare_to_split_huge_page(struct kvm *kvm,
> > > >                                       const struct kvm_memory_slot *slot,
> > > >                                       u64 *huge_sptep,
> > > >                                       struct kvm_mmu_page **spp,
> > > > -                                     bool *flush,
> > > >                                       bool *dropped_lock)
> > > >  {
> > > > +       int min_descs = min_descs_for_split(slot, huge_sptep);
> > > >         int r = 0;
> > > >
> > > >         *dropped_lock = false;
> > > > @@ -6063,22 +6101,18 @@ static int prepare_to_split_huge_page(struct kvm *kvm,
> > > >         if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > > >                 goto drop_lock;
> > > >
> > > > -       r = alloc_memory_for_split(kvm, spp, GFP_NOWAIT | __GFP_ACCOUNT);
> > > > +       r = alloc_memory_for_split(kvm, spp, min_descs, GFP_NOWAIT | __GFP_ACCOUNT);
> > > >         if (r)
> > > >                 goto drop_lock;
> > > >
> > > >         return 0;
> > > >
> > > >  drop_lock:
> > > > -       if (*flush)
> > > > -               kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > > > -
> > > > -       *flush = false;
> > > >         *dropped_lock = true;
> > > >
> > > >         write_unlock(&kvm->mmu_lock);
> > > >         cond_resched();
> > > > -       r = alloc_memory_for_split(kvm, spp, GFP_KERNEL_ACCOUNT);
> > > > +       r = alloc_memory_for_split(kvm, spp, min_descs, GFP_KERNEL_ACCOUNT);
> > > >         write_lock(&kvm->mmu_lock);
> > > >
> > > >         return r;
> > > > @@ -6122,10 +6156,10 @@ static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
> > > >
> > > >  static int kvm_mmu_split_huge_page(struct kvm *kvm,
> > > >                                    const struct kvm_memory_slot *slot,
> > > > -                                  u64 *huge_sptep, struct kvm_mmu_page **spp,
> > > > -                                  bool *flush)
> > > > +                                  u64 *huge_sptep, struct kvm_mmu_page **spp)
> > > >
> > > >  {
> > > > +       struct kvm_mmu_memory_cache *cache;
> > > >         struct kvm_mmu_page *split_sp;
> > > >         u64 huge_spte, split_spte;
> > > >         int split_level, index;
> > > > @@ -6138,9 +6172,9 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
> > > >                 return -EOPNOTSUPP;
> > > >
> > > >         /*
> > > > -        * Since we did not allocate pte_list_desc_structs for the split, we
> > > > -        * cannot add a new parent SPTE to parent_ptes. This should never happen
> > > > -        * in practice though since this is a fresh SP.
> > > > +        * We did not allocate an extra pte_list_desc struct to add huge_sptep
> > > > +        * to split_sp->parent_ptes. An extra pte_list_desc struct should never
> > > > +        * be necessary in practice though since split_sp is brand new.
> > > >          *
> > > >          * Note, this makes it safe to pass NULL to __link_shadow_page() below.
> > > >          */
> > > > @@ -6151,6 +6185,7 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
> > > >
> > > >         split_level = split_sp->role.level;
> > > >         access = split_sp->role.access;
> > > > +       cache = &kvm->arch.huge_page_split_desc_cache;
> > > >
> > > >         for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> > > >                 split_sptep = &split_sp->spt[index];
> > > > @@ -6158,25 +6193,11 @@ static int kvm_mmu_split_huge_page(struct kvm *kvm,
> > > >
> > > >                 BUG_ON(is_shadow_present_pte(*split_sptep));
> > > >
> > > > -               /*
> > > > -                * Since we did not allocate pte_list_desc structs for the
> > > > -                * split, we can't add a new SPTE that maps this GFN.
> > > > -                * Skipping this SPTE means we're only partially mapping the
> > > > -                * huge page, which means we'll need to flush TLBs before
> > > > -                * dropping the MMU lock.
> > > > -                *
> > > > -                * Note, this make it safe to pass NULL to __rmap_add() below.
> > > > -                */
> > > > -               if (rmap_need_new_pte_list_desc(slot, split_gfn, split_level)) {
> > > > -                       *flush = true;
> > > > -                       continue;
> > > > -               }
> > > > -
> > > >                 split_spte = make_huge_page_split_spte(
> > > >                                 huge_spte, split_level + 1, index, access);
> > > >
> > > >                 mmu_spte_set(split_sptep, split_spte);
> > > > -               __rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);
> > > > +               __rmap_add(kvm, cache, slot, split_sptep, split_gfn, access);
> > > >         }
> > > >
> > > >         /*
> > > > @@ -6222,7 +6243,6 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
> > > >         struct kvm_mmu_page *sp = NULL;
> > > >         struct rmap_iterator iter;
> > > >         u64 *huge_sptep, spte;
> > > > -       bool flush = false;
> > > >         bool dropped_lock;
> > > >         int level;
> > > >         gfn_t gfn;
> > > > @@ -6237,7 +6257,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
> > > >                 level = sptep_to_sp(huge_sptep)->role.level;
> > > >                 gfn = sptep_to_gfn(huge_sptep);
> > > >
> > > > -               r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &flush, &dropped_lock);
> > > > +               r = prepare_to_split_huge_page(kvm, slot, huge_sptep, &sp, &dropped_lock);
> > > >                 if (r) {
> > > >                         trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
> > > >                         break;
> > > > @@ -6246,7 +6266,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
> > > >                 if (dropped_lock)
> > > >                         goto restart;
> > > >
> > > > -               r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp, &flush);
> > > > +               r = kvm_mmu_split_huge_page(kvm, slot, huge_sptep, &sp);
> > > >
> > > >                 trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
> > > >
> > > > @@ -6261,7 +6281,7 @@ static bool rmap_try_split_huge_pages(struct kvm *kvm,
> > > >         if (sp)
> > > >                 kvm_mmu_free_sp(sp);
> > > >
> > > > -       return flush;
> > > > +       return false;
> > > >  }
> > > >
> > > >  static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
> > > > @@ -6269,7 +6289,6 @@ static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
> > > >                                           gfn_t start, gfn_t end,
> > > >                                           int target_level)
> > > >  {
> > > > -       bool flush;
> > > >         int level;
> > > >
> > > >         /*
> > > > @@ -6277,21 +6296,15 @@ static void kvm_rmap_try_split_huge_pages(struct kvm *kvm,
> > > >          * down to the target level. This ensures pages are recursively split
> > > >          * all the way to the target level. There's no need to split pages
> > > >          * already at the target level.
> > > > -        *
> > > > -        * Note that TLB flushes must be done before dropping the MMU lock since
> > > > -        * rmap_try_split_huge_pages() may partially split any given huge page,
> > > > -        * i.e. it may effectively unmap (make non-present) a portion of the
> > > > -        * huge page.
> > > >          */
> > > >         for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) {
> > > > -               flush = slot_handle_level_range(kvm, slot,
> > > > -                                               rmap_try_split_huge_pages,
> > > > -                                               level, level, start, end - 1,
> > > > -                                               true, flush);
> > > > +               slot_handle_level_range(kvm, slot,
> > > > +                                       rmap_try_split_huge_pages,
> > > > +                                       level, level, start, end - 1,
> > > > +                                       true, false);
> > > >         }
> > > >
> > > > -       if (flush)
> > > > -               kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> > > > +       kvm_mmu_free_memory_cache(&kvm->arch.huge_page_split_desc_cache);
> > > >  }
> > > >
> > > >  /* Must be called with the mmu_lock held in write-mode. */
> > > > --
> > > > 2.35.0.rc2.247.g8bbb082509-goog
> > > >

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 02/23] KVM: x86/mmu: Derive shadow MMU page role from parent
  2022-02-19  1:14   ` Sean Christopherson
  2022-02-24 18:45     ` David Matlack
@ 2022-03-04  0:22     ` David Matlack
  1 sibling, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-03-04  0:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Vitaly Kuznetsov, Peter Xu, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Peter Feiner, Andrew Jones, maciej.szmigiero, kvm

On Sat, Feb 19, 2022 at 01:14:16AM +0000, Sean Christopherson wrote:
> On Thu, Feb 03, 2022, David Matlack wrote:
> > Instead of computing the shadow page role from scratch for every new
> > page, we can derive most of the information from the parent shadow page.
> > This avoids redundant calculations such as the quadrant, and reduces the
> 
> Uh, calculating quadrant isn't redundant.  The quadrant forces KVM to use different
> (multiple) shadow pages to shadow a single guest PTE when the guest is using 32-bit
> paging (1024 PTEs per page table vs. 512 PTEs per page table).  The reason quadrant
> is "quad" and not more or less is because 32-bit paging has two levels.  First-level
> PTEs can have quadrant=0/1, and that gets doubled for second-level PTEs because we
> need to use four PTEs (two to handle 2x guest PTEs, and each of those needs to be
> unique for the first-level PTEs they point at).

One solution is to keep the quadrant calculation in kvm_mmu_get_page().
The obvious problem for eager page splitting is we need the faulting
address to use the existing calculation to get the quadrant, and there
is no faulting address when doing eager page splitting. This doesn't
really matter though because we really don't care about eagerly
splitting huge pages that are shadowing a 32-bit non-PAE guest, so we
can just skip huge pages mapped on shadow pages with has_4_byte_gpte and
hard-code the quadrant to 0.

Plumbing all that shouldn't be too hard. But it occurs to me it might
not be necessary. The quadrant cannot be literally copied from the
parent SP like this commit does, but I think it can still be derived
from the parent. The upside is we don't need any special casing of
has_4_byte_gpte or hard-coding the quadrant in the eager page splitting
code, and we can still get rid of passing in the faulting address to
kvm_mmu_get_page().

Here's what it would (roughly) look like, applied on top of this commit:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6941b9b99a90..4184662b42bf 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2110,9 +2110,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
        return sp;
 }

-static union kvm_mmu_page_role kvm_mmu_child_role(struct kvm_mmu_page *parent_sp,
-                                                 bool direct, u32 access)
+static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
 {
+       struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
        union kvm_mmu_page_role role;

        role = parent_sp->role;
@@ -2120,6 +2120,28 @@ static union kvm_mmu_page_role kvm_mmu_child_role(struct kvm_mmu_page *parent_sp
        role.access = access;
        role.direct = direct;

+       /*
+        * If the guest has 4-byte PTEs then that means it's using 32-bit,
+        * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
+        * directories, each mapping 1/4 of the guest's linear address space
+        * (1GiB). The shadow pages for those 4 page directories are
+        * pre-allocated and assigned a separate quadrant in their role.
+        *
+        * Since we are allocating a child shadow page and there are only 2
+        * levels, this must be a PG_LEVEL_4K shadow page. Here the quadrant
+        * will either be 0 or 1 because it maps 1/2 of the address space mapped
+        * by the guest's PG_LEVEL_4K page table (or 4MiB huge page) that it
+        * is shadowing. In this case, the quadrant can be derived by the index
+        * of the SPTE that points to the new child shadow page in the page
+        * directory (parent_sp). Specifically, every 2 SPTEs in parent_sp
+        * shadow one half of a guest's page table (or 4MiB huge page) so the
+        * quadrant is just the parity of the index of the SPTE.
+        */
+       if (role.has_4_byte_gpte) {
+               BUG_ON(role.level != PG_LEVEL_4K);
+               role.quadrant = (sptep - parent_sp->spt) % 2;
+       }
+
        return role;
 }

@@ -2127,11 +2149,9 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
                                                 u64 *sptep, gfn_t gfn,
                                                 bool direct, u32 access)
 {
-       struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
        union kvm_mmu_page_role role;

-       role = kvm_mmu_child_role(parent_sp, direct, access);
-
+       role = kvm_mmu_child_role(sptep, direct, access);
        return kvm_mmu_get_page(vcpu, gfn, role);
 }

> 
> Indeed, this fails spectacularly when attempting to boot a 32-bit non-PAE kernel
> with shadow paging enabled.
> 
>  BUG: unable to handle page fault for address: ff9fa81c
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  *pde = 00000000
>  Oops: 0000 [#1] SMP
>  CPU: 0 PID: 0 Comm: swapper Tainted: G        W         5.12.0 #10
>  EIP: memblock_add_range.isra.18.constprop.23
>  Code: <83> 79 04 00 75 2c 83 38 01 75 06 83 78 08 00 74 02 0f 0b 89 11 8b
>  EAX: c2af24bc EBX: fdffffff ECX: ff9fa818 EDX: 02000000
>  ESI: 02000000 EDI: 00000000 EBP: c2909f30 ESP: c2909f0c
>  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210006
>  CR0: 80050033 CR2: ff9fa81c CR3: 02b76000 CR4: 00040600
>  Call Trace:
>   ? printk
>   memblock_reserve
>   ? 0xc2000000
>   setup_arch
>   ? vprintk_default
>   ? vprintk
>   start_kernel
>   i386_start_kernel
>   startup_32_smp
> 
>  CR2: 00000000ff9fa81c
> 
>  EIP: memblock_add_range.isra.18.constprop.23
>  Code: <83> 79 04 00 75 2c 83 38 01 75 06 83 78 08 00 74 02 0f 0b 89 11 8b
>  EAX: c2af24bc EBX: fdffffff ECX: ff9fa818 EDX: 02000000
>  ESI: 02000000 EDI: 00000000 EBP: c2909f30 ESP: c2909f0c
>  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210006
>  CR0: 80050033 CR2: ff9fa81c CR3: 02b76000 CR4: 00040600
> 
> > number of parameters to kvm_mmu_get_page().
> > 
> > Preemptivel split out the role calculation to a separate function for
> 
> Preemptively.
> 
> > use in a following commit.
> > 
> > No functional change intended.
> > 
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH 19/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-02-24 19:20     ` David Matlack
@ 2022-03-04 21:59       ` David Matlack
  2022-03-04 22:24         ` David Matlack
  2022-03-05 16:55         ` Marc Zyngier
  0 siblings, 2 replies; 65+ messages in thread
From: David Matlack @ 2022-03-04 21:59 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Paolo Bonzini, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S. Szmigiero, kvm list

On Thu, Feb 24, 2022 at 11:20 AM David Matlack <dmatlack@google.com> wrote:
>
> On Thu, Feb 24, 2022 at 3:29 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Thu, 03 Feb 2022 01:00:47 +0000,
> > David Matlack <dmatlack@google.com> wrote:
> > >

[...]

> > >
> > >       /* Cache some mmu pages needed inside spinlock regions */
> > > -     struct kvm_mmu_memory_cache mmu_page_cache;
> > > +     DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> >
> > I must say I'm really not a fan of the anonymous structure trick. I
> > can see why you are doing it that way, but it feels pretty brittle.
>
> Yeah I don't love it. It's really optimizing for minimizing the patch diff.
>
> The alternative I considered was to dynamically allocate the
> kvm_mmu_memory_cache structs. This would get rid of the anonymous
> struct and the objects array, and also eliminate the rather gross
> capacity hack in kvm_mmu_topup_memory_cache().
>
> The downsides of this approach is more code and more failure paths if
> the allocation fails.

I tried changing all kvm_mmu_memory_cache structs to be dynamically
allocated, but it added a lot of complexity to the setup/teardown
code paths in x86, arm64, mips, and riscv (the arches that use the
caches). I don't think this route is worth it, especially since these
structs don't *need* to be dynamically allocated.

When you said the anonymous struct feels brittle, what did you have in
mind specifically?

>
> >
> > >
> > >       /* Target CPU and feature flags */
> > >       int target;
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index bc2aba953299..9c853c529b49 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -765,7 +765,8 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > >  {
> > >       phys_addr_t addr;
> > >       int ret = 0;
> > > -     struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> > > +     DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {};
> > > +     struct kvm_mmu_memory_cache *cache = &page_cache.cache;
> > >       struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> > >       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> > >                                    KVM_PGTABLE_PROT_R |
> > > @@ -774,18 +775,17 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > >       if (is_protected_kvm_enabled())
> > >               return -EPERM;
> > >
> > > +     cache->gfp_zero = __GFP_ZERO;
> >
> > nit: consider this instead, which preserves the existing flow:
>
> Will do.
>
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 26d6c53be083..86a7ebd03a44 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -764,7 +764,9 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >  {
> >         phys_addr_t addr;
> >         int ret = 0;
> > -       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {};
> > +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> > +               .cache = { .gfp_zero = __GFP_ZERO},
> > +       };
> >         struct kvm_mmu_memory_cache *cache = &page_cache.cache;
> >         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> >         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> > @@ -774,7 +776,6 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> >         if (is_protected_kvm_enabled())
> >                 return -EPERM;
> >
> > -       cache->gfp_zero = __GFP_ZERO;
> >         size += offset_in_page(guest_ipa);
> >         guest_ipa &= PAGE_MASK;
> >
> > but the whole "declare the outer structure and just use the inner one"
> > hack is... huh... :-/
>
> Yeah it's not great. Unfortunately (or maybe fortunately?) anonymous
> structs cannot be defined in functions. So naming the outer struct is
> necessary even though we only need to use the inner one.

I see two alternatives to make this cleaner:

1. Dynamically allocate just this cache. The caches defined in
vcpu_arch will continue to use DEFINE_KVM_MMU_MEMORY_CACHE(). This
would get rid of the outer struct but require an extra memory
allocation.
2. Move this cache to struct kvm_arch using
DEFINE_KVM_MMU_MEMORY_CACHE(). Then we don't need to stack allocate it
or dynamically allocate it.

Do either of these approaches appeal to you more than the current one?
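
For option 2, a not-even-compile-tested sketch, with a made-up field
name (mmu_ioremap_page_cache) purely for illustration:

        /* arch/arm64/include/asm/kvm_host.h */
        struct kvm_arch {
                ...
                /* Cache of stage-2 table pages for kvm_phys_addr_ioremap(). */
                DEFINE_KVM_MMU_MEMORY_CACHE(mmu_ioremap_page_cache);
        };

        /* arch/arm64/kvm/mmu.c */
        int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
                                  phys_addr_t pa, unsigned long size, bool writable)
        {
                struct kvm_mmu_memory_cache *cache =
                        &kvm->arch.mmu_ioremap_page_cache;
                ...
        }

Since the anonymous struct hoists its members,
kvm->arch.mmu_ioremap_page_cache is still a plain struct
kvm_mmu_memory_cache, so the rest of the function wouldn't change. The
gfp_zero setup would move to VM init, and we'd need to double-check
that nothing depends on the cache being local to each ioremap call.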

>
> >
> > This hunk also conflicts with what currently sits in -next. Not a big
> > deal, but just so you know.
>
> Ack.
>
> >
> > > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > > index dceac12c1ce5..9575fb8d333f 100644
> > > --- a/include/linux/kvm_types.h
> > > +++ b/include/linux/kvm_types.h
> > > @@ -78,14 +78,34 @@ struct gfn_to_pfn_cache {
> > >   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
> > >   * holding MMU locks.  Note, these caches act more like prefetch buffers than
> > >   * classical caches, i.e. objects are not returned to the cache on being freed.
> > > + *
> > > + * The storage for the cache objects is laid out after the struct to allow
> > > + * different declarations to choose different capacities. If the capacity field
> > > + * is 0, the capacity is assumed to be KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE.
> > >   */
> > >  struct kvm_mmu_memory_cache {
> > >       int nobjs;
> > > +     int capacity;
> > >       gfp_t gfp_zero;
> > >       struct kmem_cache *kmem_cache;
> > > -     void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> > > +     void *objects[0];
> >
> > The VLA police is going to track you down ([0] vs []).
>
> Thanks!
>
>
> >
> >         M.
> >
> > --
> > Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 19/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-03-04 21:59       ` David Matlack
@ 2022-03-04 22:24         ` David Matlack
  2022-03-05 16:55         ` Marc Zyngier
  1 sibling, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-03-04 22:24 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Paolo Bonzini, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S. Szmigiero, kvm list

On Fri, Mar 4, 2022 at 1:59 PM David Matlack <dmatlack@google.com> wrote:
>
> On Thu, Feb 24, 2022 at 11:20 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On Thu, Feb 24, 2022 at 3:29 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Thu, 03 Feb 2022 01:00:47 +0000,
> > > David Matlack <dmatlack@google.com> wrote:
> > > >
>
> [...]
>
> > > >
> > > >       /* Cache some mmu pages needed inside spinlock regions */
> > > > -     struct kvm_mmu_memory_cache mmu_page_cache;
> > > > +     DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> > >
> > > I must say I'm really not a fan of the anonymous structure trick. I
> > > can see why you are doing it that way, but it feels pretty brittle.
> >
> > Yeah I don't love it. It's really optimizing for minimizing the patch diff.
> >
> > The alternative I considered was to dynamically allocate the
> > kvm_mmu_memory_cache structs. This would get rid of the anonymous
> > struct and the objects array, and also eliminate the rather gross
> > capacity hack in kvm_mmu_topup_memory_cache().
> >
> > The downsides of this approach is more code and more failure paths if
> > the allocation fails.
>
> I tried changing all kvm_mmu_memory_cache structs to be dynamically
> allocated, but it added a lot of complexity to the setup/teardown
> code paths in x86, arm64, mips, and riscv (the arches that use the
> caches). I don't think this route is worth it, especially since these
> structs don't *need* to be dynamically allocated.
>
> When you said the anonymous struct feels brittle, what did you have in
> mind specifically?
>
> >
> > >
> > > >
> > > >       /* Target CPU and feature flags */
> > > >       int target;
> > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > index bc2aba953299..9c853c529b49 100644
> > > > --- a/arch/arm64/kvm/mmu.c
> > > > +++ b/arch/arm64/kvm/mmu.c
> > > > @@ -765,7 +765,8 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > > >  {
> > > >       phys_addr_t addr;
> > > >       int ret = 0;
> > > > -     struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
> > > > +     DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {};
> > > > +     struct kvm_mmu_memory_cache *cache = &page_cache.cache;
> > > >       struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> > > >       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> > > >                                    KVM_PGTABLE_PROT_R |
> > > > @@ -774,18 +775,17 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > > >       if (is_protected_kvm_enabled())
> > > >               return -EPERM;
> > > >
> > > > +     cache->gfp_zero = __GFP_ZERO;
> > >
> > > nit: consider this instead, which preserves the existing flow:
> >
> > Will do.
> >
> > >
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 26d6c53be083..86a7ebd03a44 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -764,7 +764,9 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > >  {
> > >         phys_addr_t addr;
> > >         int ret = 0;
> > > -       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {};
> > > +       DEFINE_KVM_MMU_MEMORY_CACHE(cache) page_cache = {
> > > +               .cache = { .gfp_zero = __GFP_ZERO},
> > > +       };
> > >         struct kvm_mmu_memory_cache *cache = &page_cache.cache;
> > >         struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> > >         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> > > @@ -774,7 +776,6 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > >         if (is_protected_kvm_enabled())
> > >                 return -EPERM;
> > >
> > > -       cache->gfp_zero = __GFP_ZERO;
> > >         size += offset_in_page(guest_ipa);
> > >         guest_ipa &= PAGE_MASK;
> > >
> > > but the whole "declare the outer structure and just use the inner one"
> > > hack is... huh... :-/
> >
> > Yeah it's not great. Unfortunately (or maybe fortunately?) anonymous
> > structs cannot be defined in functions. So naming the outer struct is
> > necessary even though we only need to use the inner one.
>
> I see two alternatives to make this cleaner:
>
> 1. Dynamically allocate just this cache. The caches defined in
> vcpu_arch will continue to use DEFINE_KVM_MMU_MEMORY_CACHE(). This
> would get rid of the outer struct but require an extra memory
> allocation.
> 2. Move this cache to struct kvm_arch using
> DEFINE_KVM_MMU_MEMORY_CACHE(). Then we don't need to stack allocate it
> or dynamically allocate it.
>
> Do either of these approaches appeal to you more than the current one?

(There are obvious performance and memory overhead trade-offs with these
different approaches, but I don't know enough about arm64 KVM to
assess which option might be best.)

>
> >
> > >
> > > This hunk also conflicts with what currently sits in -next. Not a big
> > > deal, but just so you know.
> >
> > Ack.
> >
> > >
> > > > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > > > index dceac12c1ce5..9575fb8d333f 100644
> > > > --- a/include/linux/kvm_types.h
> > > > +++ b/include/linux/kvm_types.h
> > > > @@ -78,14 +78,34 @@ struct gfn_to_pfn_cache {
> > > >   * MMU flows is problematic, as is triggering reclaim, I/O, etc... while
> > > >   * holding MMU locks.  Note, these caches act more like prefetch buffers than
> > > >   * classical caches, i.e. objects are not returned to the cache on being freed.
> > > > + *
> > > > + * The storage for the cache objects is laid out after the struct to allow
> > > > + * different declarations to choose different capacities. If the capacity field
> > > > + * is 0, the capacity is assumed to be KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE.
> > > >   */
> > > >  struct kvm_mmu_memory_cache {
> > > >       int nobjs;
> > > > +     int capacity;
> > > >       gfp_t gfp_zero;
> > > >       struct kmem_cache *kmem_cache;
> > > > -     void *objects[KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE];
> > > > +     void *objects[0];
> > >
> > > The VLA police is going to track you down ([0] vs []).
> >
> > Thanks!
> >
> >
> > >
> > >         M.
> > >
> > > --
> > > Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 19/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-03-04 21:59       ` David Matlack
  2022-03-04 22:24         ` David Matlack
@ 2022-03-05 16:55         ` Marc Zyngier
  2022-03-07 23:49           ` David Matlack
  1 sibling, 1 reply; 65+ messages in thread
From: Marc Zyngier @ 2022-03-05 16:55 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S. Szmigiero, kvm list

On Fri, 04 Mar 2022 21:59:12 +0000,
David Matlack <dmatlack@google.com> wrote:
> 
> On Thu, Feb 24, 2022 at 11:20 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On Thu, Feb 24, 2022 at 3:29 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Thu, 03 Feb 2022 01:00:47 +0000,
> > > David Matlack <dmatlack@google.com> wrote:
> > > >
> 
> [...]
> 
> > > >
> > > >       /* Cache some mmu pages needed inside spinlock regions */
> > > > -     struct kvm_mmu_memory_cache mmu_page_cache;
> > > > +     DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> > >
> > > I must say I'm really not a fan of the anonymous structure trick. I
> > > can see why you are doing it that way, but it feels pretty brittle.
> >
> > Yeah I don't love it. It's really optimizing for minimizing the patch diff.
> >
> > The alternative I considered was to dynamically allocate the
> > kvm_mmu_memory_cache structs. This would get rid of the anonymous
> > struct and the objects array, and also eliminate the rather gross
> > capacity hack in kvm_mmu_topup_memory_cache().
> >
> > The downsides of this approach is more code and more failure paths if
> > the allocation fails.
> 
> I tried changing all kvm_mmu_memory_cache structs to be dynamically
> allocated, but it created a lot of complexity to the setup/teardown
> code paths in x86, arm64, mips, and riscv (the arches that use the
> caches). I don't think this route is worth it, especially since these
> structs don't *need* to be dynamically allocated.
> 
> When you said the anonymous struct feels brittle, what did you have in
> mind specifically?

I can perfectly see someone using a kvm_mmu_memory_cache and searching
for a bit why they end up with memory corruption. Yes, this would be a
rookie mistake, but there are some expectations all over the kernel
that DEFINE_* and the corresponding structure are the same object.
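To make that failure mode concrete (an illustrative snippet, not code
from the series): a bare declaration of the inner struct still compiles,
but there is no storage behind objects[] any more, so topping the cache
up scribbles past the end of the object:

static int example(void)
{
	/* Compiles cleanly, but objects[] has no backing storage here. */
	struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };

	/*
	 * With the default-capacity fallback this tries to store dozens
	 * of pointers into objects[], corrupting whatever sits next to
	 * 'cache' on the stack.
	 */
	return kvm_mmu_topup_memory_cache(&cache, 4);
}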

[...]

> I see two alternatives to make this cleaner:
> 
> 1. Dynamically allocate just this cache. The caches defined in
> vcpu_arch will continue to use DEFINE_KVM_MMU_MEMORY_CACHE(). This
> would get rid of the outer struct but require an extra memory
> allocation.
> 2. Move this cache to struct kvm_arch using
> DEFINE_KVM_MMU_MEMORY_CACHE(). Then we don't need to stack allocate it
> or dynamically allocate it.
> 
> Do either of these approaches appeal to you more than the current one?

Certainly, #2 feels more solid. Dynamic allocations (and the resulting
pointer chasing) are usually costly in terms of performance, so I'd
avoid it if at all possible.

That being said, if it turns out that #2 isn't practical, I won't get
in the way of your current approach. Moving kvm_mmu_memory_cache to
core code was definitely a good cleanup, and I'm not overly excited
with the perspective of *more* arch-specific code.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU
  2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
                   ` (22 preceding siblings ...)
  2022-02-03  1:00 ` [PATCH 23/23] KVM: selftests: Map x86_64 guest virtual memory with huge pages David Matlack
@ 2022-03-07  5:21 ` Peter Xu
  2022-03-07 23:39   ` David Matlack
  23 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2022-03-07  5:21 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Peter Feiner, Andrew Jones, maciej.szmigiero, kvm

Hi, David,

Sorry for a very late comment.

On Thu, Feb 03, 2022 at 01:00:28AM +0000, David Matlack wrote:
> Performance
> -----------
> 
> Eager page splitting moves the cost of splitting huge pages off of the
> vCPU thread and onto the thread invoking VM-ioctls to configure dirty
> logging. This is useful because:
> 
>  - Splitting on the vCPU thread interrupts vCPUs execution and is
>    disruptive to customers whereas splitting on VM ioctl threads can
>    run in parallel with vCPU execution.
> 
>  - Splitting on the VM ioctl thread is more efficient because it does
>    not require performing VM-exit handling and page table walks for every
>    4K page.
> 
> To measure the performance impact of Eager Page Splitting I ran
> dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> vCPU, and backed by 1GiB HugeTLB memory.
> 
> To measure the impact on customer performance, we can look at the time
> it takes all vCPUs to dirty memory after dirty logging has been enabled.
> Without Eager Page Splitting enabled, such dirtying must take faults to
> split huge pages and bottleneck on the MMU lock.
> 
>              | "Iteration 1 dirty memory time"             |
>              | ------------------------------------------- |
> vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> ------------ | -------------------- | -------------------- |
> 2            | 0.310786549s         | 0.058731929s         |
> 4            | 0.419165587s         | 0.059615316s         |
> 8            | 1.061233860s         | 0.060945457s         |
> 16           | 2.852955595s         | 0.067069980s         |
> 32           | 7.032750509s         | 0.078623606s         |
> 64           | 16.501287504s        | 0.083914116s         |
> 
> Eager Page Splitting does increase the time it takes to enable dirty
> logging when not using initially-all-set, since that's when KVM splits
> huge pages. However, this runs in parallel with vCPU execution and does
> not bottleneck on the MMU lock.
> 
>              | "Enabling dirty logging time"               |
>              | ------------------------------------------- |
> vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> ------------ | -------------------- | -------------------- |
> 2            | 0.001581619s         |  0.025699730s        |
> 4            | 0.003138664s         |  0.051510208s        |
> 8            | 0.006247177s         |  0.102960379s        |
> 16           | 0.012603892s         |  0.206949435s        |
> 32           | 0.026428036s         |  0.435855597s        |
> 64           | 0.103826796s         |  1.199686530s        |
> 
> Similarly, Eager Page Splitting increases the time it takes to clear the
> dirty log for when using initially-all-set. The first time userspace
> clears the dirty log, KVM will split huge pages:
> 
>              | "Iteration 1 clear dirty log time"          |
>              | ------------------------------------------- |
> vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> ------------ | -------------------- | -------------------- |
> 2            | 0.001544730s         | 0.055327916s         |
> 4            | 0.003145920s         | 0.111887354s         |
> 8            | 0.006306964s         | 0.223920530s         |
> 16           | 0.012681628s         | 0.447849488s         |
> 32           | 0.026827560s         | 0.943874520s         |
> 64           | 0.090461490s         | 2.664388025s         |
> 
> Subsequent calls to clear the dirty log incur almost no additional cost
> since KVM can very quickly determine there are no more huge pages to
> split via the RMAP. This is unlike the TDP MMU which must re-traverse
> the entire page table to check for huge pages.
> 
>              | "Iteration 2 clear dirty log time"          |
>              | ------------------------------------------- |
> vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> ------------ | -------------------- | -------------------- |
> 2            | 0.015613726s         | 0.015771982s         |
> 4            | 0.031456620s         | 0.031911594s         |
> 8            | 0.063341572s         | 0.063837403s         |
> 16           | 0.128409332s         | 0.127484064s         |
> 32           | 0.255635696s         | 0.268837996s         |
> 64           | 0.695572818s         | 0.700420727s         |

Are all the tests above with ept=Y (except the one below)?

> 
> Eager Page Splitting also improves the performance for shadow paging
> configurations, as measured with ept=N. Although the absolute gains are
> less since ept=N requires taking the MMU lock to track writes to 4KiB
> pages (i.e. no fast_page_fault() or PML), which dominates the dirty
> memory time.
> 
>              | "Iteration 1 dirty memory time"             |
>              | ------------------------------------------- |
> vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> ------------ | -------------------- | -------------------- |
> 2            | 0.373022770s         | 0.348926043s         |
> 4            | 0.563697483s         | 0.453022037s         |
> 8            | 1.588492808s         | 1.524962010s         |
> 16           | 3.988934732s         | 3.369129917s         |
> 32           | 9.470333115s         | 8.292953856s         |
> 64           | 20.086419186s        | 18.531840021s        |

This one is definitely for ept=N because it's written there. That's ~10%
performance increase which looks still good, but IMHO that increase is
"debatable" since a normal guest may not simply write over the whole guest
mem.. So that 10% increase is based on some assumptions.

What if the guest writes 80% and reads 20%?  IIUC the split thread will
also start to block the readers too for shadow mmu while it was not blocked
previously?  From that pov, not sure whether the series needs some more
justification, as the changeset seems still large.

Is there other benefits besides the 10% increase on writes?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU
  2022-03-07  5:21 ` [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU Peter Xu
@ 2022-03-07 23:39   ` David Matlack
  2022-03-09  7:31     ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-03-07 23:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Peter Feiner, Andrew Jones, Maciej S. Szmigiero,
	kvm list

On Sun, Mar 6, 2022 at 9:22 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, David,
>
> Sorry for a very late comment.
>
> On Thu, Feb 03, 2022 at 01:00:28AM +0000, David Matlack wrote:
> > Performance
> > -----------
> >
> > Eager page splitting moves the cost of splitting huge pages off of the
> > vCPU thread and onto the thread invoking VM-ioctls to configure dirty
> > logging. This is useful because:
> >
> >  - Splitting on the vCPU thread interrupts vCPUs execution and is
> >    disruptive to customers whereas splitting on VM ioctl threads can
> >    run in parallel with vCPU execution.
> >
> >  - Splitting on the VM ioctl thread is more efficient because it does
> >    not require performing VM-exit handling and page table walks for every
> >    4K page.
> >
> > To measure the performance impact of Eager Page Splitting I ran
> > dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> > vCPU, and backed by 1GiB HugeTLB memory.
> >
> > To measure the impact on customer performance, we can look at the time
> > it takes all vCPUs to dirty memory after dirty logging has been enabled.
> > Without Eager Page Splitting enabled, such dirtying must take faults to
> > split huge pages and bottleneck on the MMU lock.
> >
> >              | "Iteration 1 dirty memory time"             |
> >              | ------------------------------------------- |
> > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > ------------ | -------------------- | -------------------- |
> > 2            | 0.310786549s         | 0.058731929s         |
> > 4            | 0.419165587s         | 0.059615316s         |
> > 8            | 1.061233860s         | 0.060945457s         |
> > 16           | 2.852955595s         | 0.067069980s         |
> > 32           | 7.032750509s         | 0.078623606s         |
> > 64           | 16.501287504s        | 0.083914116s         |
> >
> > Eager Page Splitting does increase the time it takes to enable dirty
> > logging when not using initially-all-set, since that's when KVM splits
> > huge pages. However, this runs in parallel with vCPU execution and does
> > not bottleneck on the MMU lock.
> >
> >              | "Enabling dirty logging time"               |
> >              | ------------------------------------------- |
> > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > ------------ | -------------------- | -------------------- |
> > 2            | 0.001581619s         |  0.025699730s        |
> > 4            | 0.003138664s         |  0.051510208s        |
> > 8            | 0.006247177s         |  0.102960379s        |
> > 16           | 0.012603892s         |  0.206949435s        |
> > 32           | 0.026428036s         |  0.435855597s        |
> > 64           | 0.103826796s         |  1.199686530s        |
> >
> > Similarly, Eager Page Splitting increases the time it takes to clear the
> > dirty log for when using initially-all-set. The first time userspace
> > clears the dirty log, KVM will split huge pages:
> >
> >              | "Iteration 1 clear dirty log time"          |
> >              | ------------------------------------------- |
> > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > ------------ | -------------------- | -------------------- |
> > 2            | 0.001544730s         | 0.055327916s         |
> > 4            | 0.003145920s         | 0.111887354s         |
> > 8            | 0.006306964s         | 0.223920530s         |
> > 16           | 0.012681628s         | 0.447849488s         |
> > 32           | 0.026827560s         | 0.943874520s         |
> > 64           | 0.090461490s         | 2.664388025s         |
> >
> > Subsequent calls to clear the dirty log incur almost no additional cost
> > since KVM can very quickly determine there are no more huge pages to
> > split via the RMAP. This is unlike the TDP MMU which must re-traverse
> > the entire page table to check for huge pages.
> >
> >              | "Iteration 2 clear dirty log time"          |
> >              | ------------------------------------------- |
> > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > ------------ | -------------------- | -------------------- |
> > 2            | 0.015613726s         | 0.015771982s         |
> > 4            | 0.031456620s         | 0.031911594s         |
> > 8            | 0.063341572s         | 0.063837403s         |
> > 16           | 0.128409332s         | 0.127484064s         |
> > 32           | 0.255635696s         | 0.268837996s         |
> > 64           | 0.695572818s         | 0.700420727s         |
>
> Are all the tests above with ept=Y (except the one below)?

Yes.

>
> >
> > Eager Page Splitting also improves the performance for shadow paging
> > configurations, as measured with ept=N. Although the absolute gains are
> > less since ept=N requires taking the MMU lock to track writes to 4KiB
> > pages (i.e. no fast_page_fault() or PML), which dominates the dirty
> > memory time.
> >
> >              | "Iteration 1 dirty memory time"             |
> >              | ------------------------------------------- |
> > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > ------------ | -------------------- | -------------------- |
> > 2            | 0.373022770s         | 0.348926043s         |
> > 4            | 0.563697483s         | 0.453022037s         |
> > 8            | 1.588492808s         | 1.524962010s         |
> > 16           | 3.988934732s         | 3.369129917s         |
> > 32           | 9.470333115s         | 8.292953856s         |
> > 64           | 20.086419186s        | 18.531840021s        |
>
> This one is definitely for ept=N because it's written there. That's ~10%
> performance increase which looks still good, but IMHO that increase is
> "debatable" since a normal guest may not simply write over the whole guest
> mem.. So that 10% increase is based on some assumptions.
>
> What if the guest writes 80% and reads 20%?  IIUC the split thread will
> also start to block the readers too for shadow mmu while it was not blocked
> previously?  From that pov, not sure whether the series needs some more
> justification, as the changeset seems still large.
>
> Is there other benefits besides the 10% increase on writes?

Yes, in fact workloads that perform some reads will benefit _more_
than workloads that perform only writes.

The reason is that the current lazy splitting approach unmaps the
entire huge page on write and then maps in just the faulting 4K
page. That means reads on the unmapped portion of the hugepage will
now take a fault and require the MMU lock. In contrast, Eager Page
Splitting fully splits each huge page so readers should never take
faults.

For example, here is the data with 20% writes and 80% reads (i.e. pass
`-f 5` to dirty_log_perf_test):

             | "Iteration 1 dirty memory time"             |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.403108098s         | 0.071808764s         |
4            | 0.562173582s         | 0.105272819s         |
8            | 1.382974557s         | 0.248713796s         |
16           | 3.608993666s         | 0.571990327s         |
32           | 9.100678321s         | 1.702453103s         |
64           | 19.784780903s        | 3.489443239s         |

>
> Thanks,

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 19/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-03-05 16:55         ` Marc Zyngier
@ 2022-03-07 23:49           ` David Matlack
  2022-03-08  7:42             ` Marc Zyngier
  2022-03-09 21:49             ` David Matlack
  0 siblings, 2 replies; 65+ messages in thread
From: David Matlack @ 2022-03-07 23:49 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Paolo Bonzini, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S. Szmigiero, kvm list

On Sat, Mar 5, 2022 at 8:55 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 04 Mar 2022 21:59:12 +0000,
> David Matlack <dmatlack@google.com> wrote:
> >
> > On Thu, Feb 24, 2022 at 11:20 AM David Matlack <dmatlack@google.com> wrote:
> > >
> > > On Thu, Feb 24, 2022 at 3:29 AM Marc Zyngier <maz@kernel.org> wrote:
> > > >
> > > > On Thu, 03 Feb 2022 01:00:47 +0000,
> > > > David Matlack <dmatlack@google.com> wrote:
> > > > >
> >
> > [...]
> >
> > > > >
> > > > >       /* Cache some mmu pages needed inside spinlock regions */
> > > > > -     struct kvm_mmu_memory_cache mmu_page_cache;
> > > > > +     DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> > > >
> > > > I must say I'm really not a fan of the anonymous structure trick. I
> > > > can see why you are doing it that way, but it feels pretty brittle.
> > >
> > > Yeah I don't love it. It's really optimizing for minimizing the patch diff.
> > >
> > > The alternative I considered was to dynamically allocate the
> > > kvm_mmu_memory_cache structs. This would get rid of the anonymous
> > > struct and the objects array, and also eliminate the rather gross
> > > capacity hack in kvm_mmu_topup_memory_cache().
> > >
> > > The downsides of this approach is more code and more failure paths if
> > > the allocation fails.
> >
> > I tried changing all kvm_mmu_memory_cache structs to be dynamically
> > allocated, but it created a lot of complexity to the setup/teardown
> > code paths in x86, arm64, mips, and riscv (the arches that use the
> > caches). I don't think this route is worth it, especially since these
> > structs don't *need* to be dynamically allocated.
> >
> > When you said the anonymous struct feels brittle, what did you have in
> > mind specifically?
>
> I can perfectly see someone using a kvm_mmu_memory_cache and searching
> for a bit why they end up with memory corruption. Yes, this would be a
> rookie mistake, but there are some expectations all over the kernel
> that DEFINE_* and the corresponding structure are the same object.

That is a good point. And that risk is very real given that
kvm_mmu_topup_memory_cache() assumes the capacity is
KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE if the capacity field is 0.
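i.e. the topup path effectively does something like this (sketch):

	int capacity = mc->capacity ? mc->capacity :
				      KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;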

One way to mitigate this would be to get rid of the capacity hack in
kvm_mmu_topup_memory_cache() and require the capacity field be
explicitly initialized. That will make it harder to trip over this
and/or easier to debug because kvm_mmu_topup_memory_cache() can issue
a WARN() if the capacity is 0. Once you see that warning and go to
initialize the capacity field you'll realize why it needs to be set in
the first place. The diff will just be slightly larger to set capacity
for each cache.
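
Roughly something like this, as a sketch (it reuses the existing
mmu_memory_cache_alloc_obj() helper; the error code and details are
illustrative, not the final patch):

int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min)
{
	void *obj;

	/* Catch caches whose capacity was never explicitly set up. */
	if (WARN_ON_ONCE(!mc->capacity))
		return -EIO;

	if (mc->nobjs >= min)
		return 0;

	while (mc->nobjs < mc->capacity) {
		obj = mmu_memory_cache_alloc_obj(mc, GFP_KERNEL_ACCOUNT);
		if (!obj)
			return mc->nobjs >= min ? 0 : -ENOMEM;
		mc->objects[mc->nobjs++] = obj;
	}

	return 0;
}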

>
> [...]
>
> > I see two alternatives to make this cleaner:
> >
> > 1. Dynamically allocate just this cache. The caches defined in
> > vcpu_arch will continue to use DEFINE_KVM_MMU_MEMORY_CACHE(). This
> > would get rid of the outer struct but require an extra memory
> > allocation.
> > 2. Move this cache to struct kvm_arch using
> > DEFINE_KVM_MMU_MEMORY_CACHE(). Then we don't need to stack allocate it
> > or dynamically allocate it.
> >
> > Do either of these approaches appeal to you more than the current one?
>
> Certainly, #2 feels more solid. Dynamic allocations (and the resulting
> pointer chasing) are usually costly in terms of performance, so I'd
> avoid it if at all possible.
>
> That being said, if it turns out that #2 isn't practical, I won't get
> in the way of your current approach. Moving kvm_mmu_memory_cache to
> core code was definitely a good cleanup, and I'm not overly excited
> with the perspective of *more* arch-specific code.

Ok I'll play with #2. Thanks for the feedback.

>
> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 19/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-03-07 23:49           ` David Matlack
@ 2022-03-08  7:42             ` Marc Zyngier
  2022-03-09 21:49             ` David Matlack
  1 sibling, 0 replies; 65+ messages in thread
From: Marc Zyngier @ 2022-03-08  7:42 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S. Szmigiero, kvm list

On Mon, 07 Mar 2022 23:49:06 +0000,
David Matlack <dmatlack@google.com> wrote:
> 
> On Sat, Mar 5, 2022 at 8:55 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Fri, 04 Mar 2022 21:59:12 +0000,
> > David Matlack <dmatlack@google.com> wrote:
> > >
> > > On Thu, Feb 24, 2022 at 11:20 AM David Matlack <dmatlack@google.com> wrote:
> > > >
> > > > On Thu, Feb 24, 2022 at 3:29 AM Marc Zyngier <maz@kernel.org> wrote:
> > > > >
> > > > > On Thu, 03 Feb 2022 01:00:47 +0000,
> > > > > David Matlack <dmatlack@google.com> wrote:
> > > > > >
> > >
> > > [...]
> > >
> > > > > >
> > > > > >       /* Cache some mmu pages needed inside spinlock regions */
> > > > > > -     struct kvm_mmu_memory_cache mmu_page_cache;
> > > > > > +     DEFINE_KVM_MMU_MEMORY_CACHE(mmu_page_cache);
> > > > >
> > > > > I must say I'm really not a fan of the anonymous structure trick. I
> > > > > can see why you are doing it that way, but it feels pretty brittle.
> > > >
> > > > Yeah I don't love it. It's really optimizing for minimizing the patch diff.
> > > >
> > > > The alternative I considered was to dynamically allocate the
> > > > kvm_mmu_memory_cache structs. This would get rid of the anonymous
> > > > struct and the objects array, and also eliminate the rather gross
> > > > capacity hack in kvm_mmu_topup_memory_cache().
> > > >
> > > > The downsides of this approach is more code and more failure paths if
> > > > the allocation fails.
> > >
> > > I tried changing all kvm_mmu_memory_cache structs to be dynamically
> > > allocated, but it created a lot of complexity to the setup/teardown
> > > code paths in x86, arm64, mips, and riscv (the arches that use the
> > > caches). I don't think this route is worth it, especially since these
> > > structs don't *need* to be dynamically allocated.
> > >
> > > When you said the anonymous struct feels brittle, what did you have in
> > > mind specifically?
> >
> > I can perfectly see someone using a kvm_mmu_memory_cache and searching
> > for a bit why they end up with memory corruption. Yes, this would be a
> > rookie mistake, but there are some expectations all over the kernel
> > that DEFINE_* and the corresponding structure are the same object.
> 
> That is a good point. And that risk is very real given that
> kvm_mmu_topup_memory_cache() assumes the capacity is
> KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE if the capacity field is 0.

Exactly. I like being surprised as much as the next one, but I'm not
sure about this particular instance ;-).

> One way to mitigate this would be to get rid of the capacity hack in
> kvm_mmu_topup_memory_cache() and require the capacity field be
> explicitly initialized. That will make it harder to trip over this
> and/or easier to debug because kvm_mmu_topup_memory_cache() can issue
> a WARN() if the capacity is 0. Once you see that warning and go to
> initialize the capacity field you'll realize why it needs to be set in
> the first place. The diff will just be slightly larger to set capacity
> for each cache.

That'd be fine. I don't mind the extra diff if this can be made more
or less foolproof.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU
  2022-03-07 23:39   ` David Matlack
@ 2022-03-09  7:31     ` Peter Xu
  2022-03-09 23:39       ` David Matlack
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2022-03-09  7:31 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Peter Feiner, Andrew Jones, Maciej S. Szmigiero,
	kvm list

On Mon, Mar 07, 2022 at 03:39:37PM -0800, David Matlack wrote:
> On Sun, Mar 6, 2022 at 9:22 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, David,
> >
> > Sorry for a very late comment.
> >
> > On Thu, Feb 03, 2022 at 01:00:28AM +0000, David Matlack wrote:
> > > Performance
> > > -----------
> > >
> > > Eager page splitting moves the cost of splitting huge pages off of the
> > > vCPU thread and onto the thread invoking VM-ioctls to configure dirty
> > > logging. This is useful because:
> > >
> > >  - Splitting on the vCPU thread interrupts vCPUs execution and is
> > >    disruptive to customers whereas splitting on VM ioctl threads can
> > >    run in parallel with vCPU execution.
> > >
> > >  - Splitting on the VM ioctl thread is more efficient because it does
> > >    not require performing VM-exit handling and page table walks for every
> > >    4K page.
> > >
> > > To measure the performance impact of Eager Page Splitting I ran
> > > dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> > > vCPU, and backed by 1GiB HugeTLB memory.
> > >
> > > To measure the impact on customer performance, we can look at the time
> > > it takes all vCPUs to dirty memory after dirty logging has been enabled.
> > > Without Eager Page Splitting enabled, such dirtying must take faults to
> > > split huge pages and bottleneck on the MMU lock.
> > >
> > >              | "Iteration 1 dirty memory time"             |
> > >              | ------------------------------------------- |
> > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > ------------ | -------------------- | -------------------- |
> > > 2            | 0.310786549s         | 0.058731929s         |
> > > 4            | 0.419165587s         | 0.059615316s         |
> > > 8            | 1.061233860s         | 0.060945457s         |
> > > 16           | 2.852955595s         | 0.067069980s         |
> > > 32           | 7.032750509s         | 0.078623606s         |
> > > 64           | 16.501287504s        | 0.083914116s         |
> > >
> > > Eager Page Splitting does increase the time it takes to enable dirty
> > > logging when not using initially-all-set, since that's when KVM splits
> > > huge pages. However, this runs in parallel with vCPU execution and does
> > > not bottleneck on the MMU lock.
> > >
> > >              | "Enabling dirty logging time"               |
> > >              | ------------------------------------------- |
> > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > ------------ | -------------------- | -------------------- |
> > > 2            | 0.001581619s         |  0.025699730s        |
> > > 4            | 0.003138664s         |  0.051510208s        |
> > > 8            | 0.006247177s         |  0.102960379s        |
> > > 16           | 0.012603892s         |  0.206949435s        |
> > > 32           | 0.026428036s         |  0.435855597s        |
> > > 64           | 0.103826796s         |  1.199686530s        |
> > >
> > > Similarly, Eager Page Splitting increases the time it takes to clear the
> > > dirty log for when using initially-all-set. The first time userspace
> > > clears the dirty log, KVM will split huge pages:
> > >
> > >              | "Iteration 1 clear dirty log time"          |
> > >              | ------------------------------------------- |
> > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > ------------ | -------------------- | -------------------- |
> > > 2            | 0.001544730s         | 0.055327916s         |
> > > 4            | 0.003145920s         | 0.111887354s         |
> > > 8            | 0.006306964s         | 0.223920530s         |
> > > 16           | 0.012681628s         | 0.447849488s         |
> > > 32           | 0.026827560s         | 0.943874520s         |
> > > 64           | 0.090461490s         | 2.664388025s         |
> > >
> > > Subsequent calls to clear the dirty log incur almost no additional cost
> > > since KVM can very quickly determine there are no more huge pages to
> > > split via the RMAP. This is unlike the TDP MMU which must re-traverse
> > > the entire page table to check for huge pages.
> > >
> > >              | "Iteration 2 clear dirty log time"          |
> > >              | ------------------------------------------- |
> > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > ------------ | -------------------- | -------------------- |
> > > 2            | 0.015613726s         | 0.015771982s         |
> > > 4            | 0.031456620s         | 0.031911594s         |
> > > 8            | 0.063341572s         | 0.063837403s         |
> > > 16           | 0.128409332s         | 0.127484064s         |
> > > 32           | 0.255635696s         | 0.268837996s         |
> > > 64           | 0.695572818s         | 0.700420727s         |
> >
> > Are all the tests above with ept=Y (except the one below)?
> 
> Yes.
> 
> >
> > >
> > > Eager Page Splitting also improves the performance for shadow paging
> > > configurations, as measured with ept=N. Although the absolute gains are
> > > less since ept=N requires taking the MMU lock to track writes to 4KiB
> > > pages (i.e. no fast_page_fault() or PML), which dominates the dirty
> > > memory time.
> > >
> > >              | "Iteration 1 dirty memory time"             |
> > >              | ------------------------------------------- |
> > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > ------------ | -------------------- | -------------------- |
> > > 2            | 0.373022770s         | 0.348926043s         |
> > > 4            | 0.563697483s         | 0.453022037s         |
> > > 8            | 1.588492808s         | 1.524962010s         |
> > > 16           | 3.988934732s         | 3.369129917s         |
> > > 32           | 9.470333115s         | 8.292953856s         |
> > > 64           | 20.086419186s        | 18.531840021s        |
> >
> > This one is definitely for ept=N because it's written there. That's ~10%
> > performance increase which looks still good, but IMHO that increase is
> > "debatable" since a normal guest may not simply write over the whole guest
> > mem.. So that 10% increase is based on some assumptions.
> >
> > What if the guest writes 80% and reads 20%?  IIUC the split thread will
> > also start to block the readers too for shadow mmu while it was not blocked
> > previously?  From that pov, not sure whether the series needs some more
> > justification, as the changeset seems still large.
> >
> > Is there other benefits besides the 10% increase on writes?
> 
> Yes, in fact workloads that perform some reads will benefit _more_
> than workloads that perform only writes.
> 
> The reason is that the current lazy splitting approach unmaps the
> entire huge page on write and then maps in just the faulting 4K
> page. That means reads on the unmapped portion of the hugepage will
> now take a fault and require the MMU lock. In contrast, Eager Page
> Splitting fully splits each huge page so readers should never take
> faults.
> 
> For example, here is the data with 20% writes and 80% reads (i.e. pass
> `-f 5` to dirty_log_perf_test):
> 
>              | "Iteration 1 dirty memory time"             |
>              | ------------------------------------------- |
> vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> ------------ | -------------------- | -------------------- |
> 2            | 0.403108098s         | 0.071808764s         |
> 4            | 0.562173582s         | 0.105272819s         |
> 8            | 1.382974557s         | 0.248713796s         |
> 16           | 3.608993666s         | 0.571990327s         |
> 32           | 9.100678321s         | 1.702453103s         |
> 64           | 19.784780903s        | 3.489443239s         |

It's very interesting to know these numbers, thanks for sharing that.

Above reminded me that eager page split actually does two things:

(1) When a page is mapped as huge, we "assume" this whole page will be
    accessed in the near future, so when split is needed we map all the
    small ptes, and,

(2) We move the split operation from page faults to when enable-dirty-track
    happens.

We could have done (1) already without the whole eager split patchsets: if
we see a read-only huge page on a page fault, we could populate the whole
range of ptes, only marking the current small pte writable but leaving the
rest of the small ptes wr-protected.  I have a feeling this would also speed
up the above 19.78 seconds (64 cores case) quite a bit, to some point.
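
A rough sketch of idea (1), with made-up helper names (this is not
existing KVM code, just to illustrate the shape):

/*
 * On a write-protection fault that has to break a huge page, install
 * every small pte that backed the huge mapping, but make only the
 * faulting one writable; the rest stay write-protected so later writes
 * are still trapped for dirty tracking.
 */
static void map_all_small_ptes_on_split(struct split_context *ctx,
					gfn_t fault_gfn)
{
	gfn_t gfn = ctx->huge_base_gfn;
	int i;

	for (i = 0; i < PTES_PER_HUGE_PAGE; i++, gfn++)
		install_small_pte(ctx, gfn, /* writable = */ gfn == fault_gfn);
}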

Entry (1) makes a lot of sense to me; OTOH I can understand entry (2) but
not strongly.

My previous concern was mainly about having readers being blocked during
splitting of huge pages (not after).  For the shadow mmu, IIUC the split
thread will start to take the write lock rather than the read lock
(compared to the tdp mmu), hence any vcpu page faults (hmm, not only
readers but writers too, I think, with non-present ptes..) will be blocked
longer than before, am I right?

Meanwhile for the shadow mmu I think there can be more page tables to walk
compared to the tdp mmu for a single huge page to split?  My understanding
is tdp mmu pgtables are mostly limited by the number of address spaces (?),
but shadow pgtables are per-task.  So I'm not sure whether, for a guest with
a lot of active tasks sharing pages, the split thread can spend quite some
time splitting, during which time the write lock is held without being
released.

These are kind of against the purpose of eager split on shadowing, which is
to reduce influence for guest vcpu threads?  But I can't tell, I could have
missed something else.  It's just that when applying the idea to shadow mmu
it sounds less attractive than the tdp mmu case.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 19/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-03-07 23:49           ` David Matlack
  2022-03-08  7:42             ` Marc Zyngier
@ 2022-03-09 21:49             ` David Matlack
  2022-03-10  8:30               ` Marc Zyngier
  1 sibling, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-03-09 21:49 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Paolo Bonzini, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S. Szmigiero, kvm list

On Mon, Mar 7, 2022 at 3:49 PM David Matlack <dmatlack@google.com> wrote:
>
> On Sat, Mar 5, 2022 at 8:55 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Fri, 04 Mar 2022 21:59:12 +0000,
> > David Matlack <dmatlack@google.com> wrote:
> > > I see two alternatives to make this cleaner:
> > >
> > > 1. Dynamically allocate just this cache. The caches defined in
> > > vcpu_arch will continue to use DEFINE_KVM_MMU_MEMORY_CACHE(). This
> > > would get rid of the outer struct but require an extra memory
> > > allocation.
> > > 2. Move this cache to struct kvm_arch using
> > > DEFINE_KVM_MMU_MEMORY_CACHE(). Then we don't need to stack allocate it
> > > or dynamically allocate it.
> > >
> > > Do either of these approaches appeal to you more than the current one?
> >
> > Certainly, #2 feels more solid. Dynamic allocations (and the resulting
> > pointer chasing) are usually costly in terms of performance, so I'd
> > avoid it if at all possible.
> >
> > That being said, if it turns out that #2 isn't practical, I won't get
> > in the way of your current approach. Moving kvm_mmu_memory_cache to
> > core code was definitely a good cleanup, and I'm not overly excited
> > with the perspective of *more* arch-specific code.
>
> Ok I'll play with #2. Thanks for the feedback.

#2 is very clean to implement but it ends up being a bit silly. It
increases the size of struct kvm_arch by 336 bytes for all VMs, but
the cache only ever gets used during kvm_vgic_map_resources(), which is only
called the first time a vCPU is run (according to the comment in
kvm_arch_vcpu_run_pid_change()). I think stack allocation makes the
most sense for this object; I don't think it's worth dancing around
that solely to avoid the inner struct grottiness.
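
(For reference, 336 bytes is consistent with a default-capacity cache:
a 40-entry objects[] array of 8-byte pointers plus a 16-byte header,
assuming arm64's KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE is still 40.)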

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU
  2022-03-09  7:31     ` Peter Xu
@ 2022-03-09 23:39       ` David Matlack
  2022-03-10  7:03         ` Peter Xu
  0 siblings, 1 reply; 65+ messages in thread
From: David Matlack @ 2022-03-09 23:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Peter Feiner, Andrew Jones, Maciej S. Szmigiero,
	kvm list

On Tue, Mar 8, 2022 at 11:31 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Mar 07, 2022 at 03:39:37PM -0800, David Matlack wrote:
> > On Sun, Mar 6, 2022 at 9:22 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, David,
> > >
> > > Sorry for a very late comment.
> > >
> > > On Thu, Feb 03, 2022 at 01:00:28AM +0000, David Matlack wrote:
> > > > Performance
> > > > -----------
> > > >
> > > > Eager page splitting moves the cost of splitting huge pages off of the
> > > > vCPU thread and onto the thread invoking VM-ioctls to configure dirty
> > > > logging. This is useful because:
> > > >
> > > >  - Splitting on the vCPU thread interrupts vCPUs execution and is
> > > >    disruptive to customers whereas splitting on VM ioctl threads can
> > > >    run in parallel with vCPU execution.
> > > >
> > > >  - Splitting on the VM ioctl thread is more efficient because it does
> > > >    not require performing VM-exit handling and page table walks for every
> > > >    4K page.
> > > >
> > > > To measure the performance impact of Eager Page Splitting I ran
> > > > dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> > > > vCPU, and backed by 1GiB HugeTLB memory.
> > > >
> > > > To measure the impact on customer performance, we can look at the time
> > > > it takes all vCPUs to dirty memory after dirty logging has been enabled.
> > > > Without Eager Page Splitting enabled, such dirtying must take faults to
> > > > split huge pages and bottleneck on the MMU lock.
> > > >
> > > >              | "Iteration 1 dirty memory time"             |
> > > >              | ------------------------------------------- |
> > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > ------------ | -------------------- | -------------------- |
> > > > 2            | 0.310786549s         | 0.058731929s         |
> > > > 4            | 0.419165587s         | 0.059615316s         |
> > > > 8            | 1.061233860s         | 0.060945457s         |
> > > > 16           | 2.852955595s         | 0.067069980s         |
> > > > 32           | 7.032750509s         | 0.078623606s         |
> > > > 64           | 16.501287504s        | 0.083914116s         |
> > > >
> > > > Eager Page Splitting does increase the time it takes to enable dirty
> > > > logging when not using initially-all-set, since that's when KVM splits
> > > > huge pages. However, this runs in parallel with vCPU execution and does
> > > > not bottleneck on the MMU lock.
> > > >
> > > >              | "Enabling dirty logging time"               |
> > > >              | ------------------------------------------- |
> > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > ------------ | -------------------- | -------------------- |
> > > > 2            | 0.001581619s         |  0.025699730s        |
> > > > 4            | 0.003138664s         |  0.051510208s        |
> > > > 8            | 0.006247177s         |  0.102960379s        |
> > > > 16           | 0.012603892s         |  0.206949435s        |
> > > > 32           | 0.026428036s         |  0.435855597s        |
> > > > 64           | 0.103826796s         |  1.199686530s        |
> > > >
> > > > Similarly, Eager Page Splitting increases the time it takes to clear the
> > > > dirty log for when using initially-all-set. The first time userspace
> > > > clears the dirty log, KVM will split huge pages:
> > > >
> > > >              | "Iteration 1 clear dirty log time"          |
> > > >              | ------------------------------------------- |
> > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > ------------ | -------------------- | -------------------- |
> > > > 2            | 0.001544730s         | 0.055327916s         |
> > > > 4            | 0.003145920s         | 0.111887354s         |
> > > > 8            | 0.006306964s         | 0.223920530s         |
> > > > 16           | 0.012681628s         | 0.447849488s         |
> > > > 32           | 0.026827560s         | 0.943874520s         |
> > > > 64           | 0.090461490s         | 2.664388025s         |
> > > >
> > > > Subsequent calls to clear the dirty log incur almost no additional cost
> > > > since KVM can very quickly determine there are no more huge pages to
> > > > split via the RMAP. This is unlike the TDP MMU which must re-traverse
> > > > the entire page table to check for huge pages.
> > > >
> > > >              | "Iteration 2 clear dirty log time"          |
> > > >              | ------------------------------------------- |
> > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > ------------ | -------------------- | -------------------- |
> > > > 2            | 0.015613726s         | 0.015771982s         |
> > > > 4            | 0.031456620s         | 0.031911594s         |
> > > > 8            | 0.063341572s         | 0.063837403s         |
> > > > 16           | 0.128409332s         | 0.127484064s         |
> > > > 32           | 0.255635696s         | 0.268837996s         |
> > > > 64           | 0.695572818s         | 0.700420727s         |
> > >
> > > Are all the tests above with ept=Y (except the one below)?
> >
> > Yes.
> >
> > >
> > > >
> > > > Eager Page Splitting also improves the performance for shadow paging
> > > > configurations, as measured with ept=N. Although the absolute gains are
> > > > less since ept=N requires taking the MMU lock to track writes to 4KiB
> > > > pages (i.e. no fast_page_fault() or PML), which dominates the dirty
> > > > memory time.
> > > >
> > > >              | "Iteration 1 dirty memory time"             |
> > > >              | ------------------------------------------- |
> > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > ------------ | -------------------- | -------------------- |
> > > > 2            | 0.373022770s         | 0.348926043s         |
> > > > 4            | 0.563697483s         | 0.453022037s         |
> > > > 8            | 1.588492808s         | 1.524962010s         |
> > > > 16           | 3.988934732s         | 3.369129917s         |
> > > > 32           | 9.470333115s         | 8.292953856s         |
> > > > 64           | 20.086419186s        | 18.531840021s        |
> > >
> > > This one is definitely for ept=N because it's written there. That's ~10%
> > > performance increase which looks still good, but IMHO that increase is
> > > "debatable" since a normal guest may not simply write over the whole guest
> > > mem.. So that 10% increase is based on some assumptions.
> > >
> > > What if the guest writes 80% and reads 20%?  IIUC the split thread will
> > > also start to block the readers too for shadow mmu while it was not blocked
> > > previusly?  From that pov, not sure whether the series needs some more
> > > justification, as the changeset seems still large.
> > >
> > > Is there other benefits besides the 10% increase on writes?
> >
> > Yes, in fact workloads that perform some reads will benefit _more_
> > than workloads that perform only writes.
> >
> > The reason is that the current lazy splitting approach unmaps the
> > entire huge page on write and then maps in just the faulting 4K
> > page. That means reads on the unmapped portion of the hugepage will
> > now take a fault and require the MMU lock. In contrast, Eager Page
> > Splitting fully splits each huge page so readers should never take
> > faults.
> >
> > For example, here is the data with 20% writes and 80% reads (i.e. pass
> > `-f 5` to dirty_log_perf_test):
> >
> >              | "Iteration 1 dirty memory time"             |
> >              | ------------------------------------------- |
> > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > ------------ | -------------------- | -------------------- |
> > 2            | 0.403108098s         | 0.071808764s         |
> > 4            | 0.562173582s         | 0.105272819s         |
> > 8            | 1.382974557s         | 0.248713796s         |
> > 16           | 3.608993666s         | 0.571990327s         |
> > 32           | 9.100678321s         | 1.702453103s         |
> > 64           | 19.784780903s        | 3.489443239s         |
>
> It's very interesting to know these numbers, thanks for sharing that.
>
> Above reminded me that eager page split actually does two things:
>
> (1) When a page is mapped as huge, we "assume" this whole page will be
>     accessed in the near future, so when split is needed we map all the
>     small ptes, and,

Note, this series does not add this behavior to the fault path.

>
> (2) We move the split operation from page faults to when enable-dirty-track
>     happens.
>
> We could have done (1) already without the whole eager split patchsets: if
> we see a read-only huge page on a page fault, we could populate the whole
> range of ptes, only marking the current small pte writable but leaving the
> rest of the small ptes wr-protected.  I have a feeling this would also speed
> up the above 19.78 seconds (64 cores case) quite a bit, to some point.

The problem with (1) is that it still requires faults to split the
huge pages. Those faults will need to contend for the MMU lock, and
will hold the lock for longer than they do today since they are doing
extra work.

I agree there might be some benefit for certain workloads, but for
write-heavy workloads there will still be a "thundering herd" problem
when dirty logging is first enabled. I'll admit though that I have not
tested this approach.

An alternative approach we're looking at for handling read-heavy
workloads is to perform dirty logging at 2M granularity.

>
> Entry (1) makes a lot of sense to me; OTOH I can understand entry (2) but
> not strongly.
>
> My previous concern was mainly about having readers being blocked during
> splitting of huge pages (not after).  For the shadow mmu, IIUC the split
> thread will start to take the write lock rather than the read lock
> (compared to the tdp mmu), hence any vcpu page faults (hmm, not only
> readers but writers too, I think, with non-present ptes..) will be blocked
> longer than before, am I right?
>
> Meanwhile for the shadow mmu I think there can be more page tables to walk
> compared to the tdp mmu for a single huge page to split?  My understanding
> is tdp mmu pgtables are mostly limited by the number of address spaces (?),
> but shadow pgtables are per-task.

Or per-L2 VM, in the case of nested virtualization.

> So I'm not sure whether for a guest with
> a lot of active tasks sharing pages, the split thread can spend quite some
> time splitting, during which time with write lock held without releasing.

The eager page splitting code does check for contention and drop the
MMU lock in between every SPTE it tries to split. But there still
might be some increase in contention due to eager page splitting.
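
The loop is roughly this shape, as a sketch (the iteration helpers here
are made up for illustration; only the locking and yield primitives are
the real ones):

static void split_huge_pages_in_memslot(struct kvm *kvm,
					const struct kvm_memory_slot *slot)
{
	write_lock(&kvm->mmu_lock);

	while (memslot_has_huge_sptes(kvm, slot)) {
		split_next_huge_spte(kvm, slot);

		/* Yield between SPTEs if vCPUs are waiting on the lock. */
		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
			cond_resched_rwlock_write(&kvm->mmu_lock);
	}

	write_unlock(&kvm->mmu_lock);
}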

>
> These are kind of against the purpose of eager split on shadowing, which is
> to reduce influence for guest vcpu threads?  But I can't tell, I could have
> missed something else.  It's just that when applying the idea to shadow mmu
> it sounds less attractive than the tdp mmu case.

The shadow MMU is also used for Nested Virtualization, which is a bit
different from "typical" shadow paging (ept/npt=N) because VMs tend
not to share pages, their page tables are fairly static (compared to
process page tables), and they tend to be longer lived. So there will
not be as much steady-state MMU lock contention that would be
negatively impacted by eager page splitting.

You might be right though that ept/npt=N has enough steady-state MMU
lock contention that it will notice eager page splitting. But then
again, it would be even more affected by lazy splitting unless the
guest is doing very few writes.

>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU
  2022-03-09 23:39       ` David Matlack
@ 2022-03-10  7:03         ` Peter Xu
  2022-03-10 19:26           ` David Matlack
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Xu @ 2022-03-10  7:03 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Peter Feiner, Andrew Jones, Maciej S. Szmigiero,
	kvm list

On Wed, Mar 09, 2022 at 03:39:44PM -0800, David Matlack wrote:
> On Tue, Mar 8, 2022 at 11:31 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Mar 07, 2022 at 03:39:37PM -0800, David Matlack wrote:
> > > On Sun, Mar 6, 2022 at 9:22 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Hi, David,
> > > >
> > > > Sorry for a very late comment.
> > > >
> > > > On Thu, Feb 03, 2022 at 01:00:28AM +0000, David Matlack wrote:
> > > > > Performance
> > > > > -----------
> > > > >
> > > > > Eager page splitting moves the cost of splitting huge pages off of the
> > > > > vCPU thread and onto the thread invoking VM-ioctls to configure dirty
> > > > > logging. This is useful because:
> > > > >
> > > > >  - Splitting on the vCPU thread interrupts vCPUs execution and is
> > > > >    disruptive to customers whereas splitting on VM ioctl threads can
> > > > >    run in parallel with vCPU execution.
> > > > >
> > > > >  - Splitting on the VM ioctl thread is more efficient because it does
> > > > >    not require performing VM-exit handling and page table walks for every
> > > > >    4K page.
> > > > >
> > > > > To measure the performance impact of Eager Page Splitting I ran
> > > > > dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> > > > > vCPU, and backed by 1GiB HugeTLB memory.
> > > > >
> > > > > To measure the impact on customer performance, we can look at the time
> > > > > it takes all vCPUs to dirty memory after dirty logging has been enabled.
> > > > > Without Eager Page Splitting enabled, such dirtying must take faults to
> > > > > split huge pages and bottleneck on the MMU lock.
> > > > >
> > > > >              | "Iteration 1 dirty memory time"             |
> > > > >              | ------------------------------------------- |
> > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > ------------ | -------------------- | -------------------- |
> > > > > 2            | 0.310786549s         | 0.058731929s         |
> > > > > 4            | 0.419165587s         | 0.059615316s         |
> > > > > 8            | 1.061233860s         | 0.060945457s         |
> > > > > 16           | 2.852955595s         | 0.067069980s         |
> > > > > 32           | 7.032750509s         | 0.078623606s         |
> > > > > 64           | 16.501287504s        | 0.083914116s         |
> > > > >
> > > > > Eager Page Splitting does increase the time it takes to enable dirty
> > > > > logging when not using initially-all-set, since that's when KVM splits
> > > > > huge pages. However, this runs in parallel with vCPU execution and does
> > > > > not bottleneck on the MMU lock.
> > > > >
> > > > >              | "Enabling dirty logging time"               |
> > > > >              | ------------------------------------------- |
> > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > ------------ | -------------------- | -------------------- |
> > > > > 2            | 0.001581619s         |  0.025699730s        |
> > > > > 4            | 0.003138664s         |  0.051510208s        |
> > > > > 8            | 0.006247177s         |  0.102960379s        |
> > > > > 16           | 0.012603892s         |  0.206949435s        |
> > > > > 32           | 0.026428036s         |  0.435855597s        |
> > > > > 64           | 0.103826796s         |  1.199686530s        |
> > > > >
> > > > > Similarly, Eager Page Splitting increases the time it takes to clear the
> > > > > dirty log for when using initially-all-set. The first time userspace
> > > > > clears the dirty log, KVM will split huge pages:
> > > > >
> > > > >              | "Iteration 1 clear dirty log time"          |
> > > > >              | ------------------------------------------- |
> > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > ------------ | -------------------- | -------------------- |
> > > > > 2            | 0.001544730s         | 0.055327916s         |
> > > > > 4            | 0.003145920s         | 0.111887354s         |
> > > > > 8            | 0.006306964s         | 0.223920530s         |
> > > > > 16           | 0.012681628s         | 0.447849488s         |
> > > > > 32           | 0.026827560s         | 0.943874520s         |
> > > > > 64           | 0.090461490s         | 2.664388025s         |
> > > > >
> > > > > Subsequent calls to clear the dirty log incur almost no additional cost
> > > > > since KVM can very quickly determine there are no more huge pages to
> > > > > split via the RMAP. This is unlike the TDP MMU which must re-traverse
> > > > > the entire page table to check for huge pages.
> > > > >
> > > > >              | "Iteration 2 clear dirty log time"          |
> > > > >              | ------------------------------------------- |
> > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > ------------ | -------------------- | -------------------- |
> > > > > 2            | 0.015613726s         | 0.015771982s         |
> > > > > 4            | 0.031456620s         | 0.031911594s         |
> > > > > 8            | 0.063341572s         | 0.063837403s         |
> > > > > 16           | 0.128409332s         | 0.127484064s         |
> > > > > 32           | 0.255635696s         | 0.268837996s         |
> > > > > 64           | 0.695572818s         | 0.700420727s         |
> > > >
> > > > Are all the tests above with ept=Y (except the one below)?
> > >
> > > Yes.
> > >
> > > >
> > > > >
> > > > > Eager Page Splitting also improves the performance of shadow paging
> > > > > configurations, as measured with ept=N, although the absolute gains are
> > > > > smaller since ept=N requires taking the MMU lock to track writes to 4KiB
> > > > > pages (i.e. no fast_page_fault() or PML), which dominates the dirty
> > > > > memory time.
> > > > >
> > > > >              | "Iteration 1 dirty memory time"             |
> > > > >              | ------------------------------------------- |
> > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > ------------ | -------------------- | -------------------- |
> > > > > 2            | 0.373022770s         | 0.348926043s         |
> > > > > 4            | 0.563697483s         | 0.453022037s         |
> > > > > 8            | 1.588492808s         | 1.524962010s         |
> > > > > 16           | 3.988934732s         | 3.369129917s         |
> > > > > 32           | 9.470333115s         | 8.292953856s         |
> > > > > 64           | 20.086419186s        | 18.531840021s        |
> > > >
> > > > This one is definitely for ept=N because it's written there. That's a ~10%
> > > > performance increase, which still looks good, but IMHO that increase is
> > > > "debatable" since a normal guest may not simply write over the whole guest
> > > > mem. So that 10% increase is based on some assumptions.
> > > >
> > > > What if the guest writes 80% and reads 20%?  IIUC the split thread will
> > > > also start to block the readers for the shadow mmu, while they were not
> > > > blocked previously?  From that pov, I'm not sure whether the series needs
> > > > some more justification, as the changeset still seems large.
> > > >
> > > > Is there other benefits besides the 10% increase on writes?
> > >
> > > Yes, in fact workloads that perform some reads will benefit _more_
> > > than workloads that perform only writes.
> > >
> > > The reason is that the current lazy splitting approach unmaps the
> > > entire huge page on write and then maps in just the faulting 4K
> > > page. That means reads on the unmapped portion of the hugepage will
> > > now take a fault and require the MMU lock. In contrast, Eager Page
> > > Splitting fully splits each huge page so readers should never take
> > > faults.
> > >
> > > For example, here is the data with 20% writes and 80% reads (i.e. pass
> > > `-f 5` to dirty_log_perf_test):
> > >
> > >              | "Iteration 1 dirty memory time"             |
> > >              | ------------------------------------------- |
> > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > ------------ | -------------------- | -------------------- |
> > > 2            | 0.403108098s         | 0.071808764s         |
> > > 4            | 0.562173582s         | 0.105272819s         |
> > > 8            | 1.382974557s         | 0.248713796s         |
> > > 16           | 3.608993666s         | 0.571990327s         |
> > > 32           | 9.100678321s         | 1.702453103s         |
> > > 64           | 19.784780903s        | 3.489443239s         |
> >
> > It's very interesting to know these numbers, thanks for sharing that.
> >
> > Above reminded me that eager page split actually does two things:
> >
> > (1) When a page is mapped as huge, we "assume" this whole page will be
> >     accessed in the near future, so when split is needed we map all the
> >     small ptes, and,
> 
> Note, this series does not add this behavior to the fault path.
> 
> >
> > (2) We move the split operation from page faults to when enable-dirty-track
> >     happens.
> >
> > We could have done (1) already without the whole eager split patchset: if
> > we see a read-only huge page on a page fault, we could populate the whole
> > range of ptes, only marking the current small pte writable but leaving the
> > rest of the small ptes wr-protected.  I have a feeling this would speed up
> > the above 19.78 seconds (64 cores case) quite a bit too, to some point.
> 
> The problem with (1) is that it still requires faults to split the
> huge pages. Those faults will need to contend for the MMU lock, and
> will hold the lock for longer than they do today since they are doing
> extra work.

Right.  But that overhead is very limited, IMHO; per the numbers, it's the
difference between 20sec and 18sec for full write faults.

The thing is either the split thread or the vcpu will take the write lock
anyway.  So it either contends during the split, or later.  Without tdp (so
never PML) it'll need a slow page fault anyway even if the split is done
beforehand.

> 
> I agree there might be some benefit for some workloads, but for write-heavy
> workloads there will still be a "thundering herd" problem when dirty
> logging is first enabled. I'll admit though that I have not tested this
> approach.

Indeed that's really the core of my question: why does this series care
more about write workloads than read workloads?  To me they are all
possible workloads, but maybe I'm wrong?  This series benefits heavy
writes, but it may not benefit heavy reads (or may even make them slower).

The tdp mmu case is more persuasive in that:

  (a) Splitting runs concurrently with vcpu faults, and

  (b) With PML the tdp mmu case can completely avoid the small write
      page faults.

Neither of these benefits exists for the shadow mmu.

I don't think I'm against this series..  I think at least with the series
we can have a matching feature on tdp and !tdp, and meanwhile it still
benefits mixed read+write workloads a lot, as you proved in the follow-up
tests (PS: do you think that should be mentioned in the cover letter too?).

IMHO when a performance feature is merged it'll be hard to remove, because
once merged it'll be hard to prove wrong.  I hope it'll be worth the cost
of merging and maintaining it in upstream kvm, so I raised these questions
in the hope that we at least thoroughly discuss the pros and cons.

> 
> An alternative approach we're looking at for handling read-heavy workloads
> is to perform dirty logging at 2M granularity.

I agree that's still something worth exploring.

> 
> >
> > Entry (1) makes a lot of sense to me; OTOH I can understand entry (2) but
> > not strongly.
> >
> > My previous concern was mainly about readers being blocked during the
> > splitting of huge pages (not after).  For the shadow mmu, IIUC the split
> > thread will start to take the write lock rather than the read lock
> > (compared to the tdp mmu), hence any vcpu page faults (hmm, not only
> > readers but writers too I think, with a non-present pte..) will be blocked
> > longer than before, am I right?
> >
> > Meanwhile for the shadow mmu I think there can be more page tables to walk
> > compared to the tdp mmu for a single huge page to split?  My understanding
> > is tdp mmu pgtables are mostly limited by the number of address spaces (?),
> > but shadow pgtables are per-task.
> 
> Or per-L2 VM, in the case of nested virtualization.
> 
> > So I'm not sure whether, for a guest with
> > a lot of active tasks sharing pages, the split thread could spend quite some
> > time splitting, holding the write lock the whole time without releasing it.
> 
> The eager page splitting code does check for contention and drop the
> MMU lock in between every SPTE it tries to split. But there still
> might be some increase in contention due to eager page splitting.
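> 
> Roughly, the split path is structured like this (a paraphrased sketch, not
> the actual patch code; for_each_rmapped_huge_spte() and split_huge_spte()
> are stand-ins for the real rmap iterator and split helper):
> 
> static void split_huge_pages(struct kvm *kvm,
> 			     const struct kvm_memory_slot *slot)
> {
> 	u64 *sptep;
> 
> 	write_lock(&kvm->mmu_lock);
> 
> 	for_each_rmapped_huge_spte(kvm, slot, sptep) {
> 		/*
> 		 * Drop the MMU lock if the scheduler or another MMU lock
> 		 * waiter needs it, then reacquire it and continue.
> 		 */
> 		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> 			cond_resched_rwlock_write(&kvm->mmu_lock);
> 
> 		split_huge_spte(kvm, slot, sptep);
> 	}
> 
> 	write_unlock(&kvm->mmu_lock);
> }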

Ah right..

> 
> >
> > These are kind of against the purpose of eager split on shadow paging,
> > which is to reduce the impact on guest vcpu threads?  But I can't tell, I
> > could have missed something else.  It's just that when applying the idea to
> > the shadow mmu it sounds less attractive than the tdp mmu case.
> 
> The shadow MMU is also used for Nested Virtualization, which is a bit
> different from "typical" shadow paging (ept/npt=N) because VMs tend
> not to share pages, their page tables are fairly static (compared to
> process page tables), and they tend to be longer lived. So there will
> not be as much steady-state MMU lock contention that would be
> negatively impacted by eager page splitting.
> 
> You might be right though that ept/npt=N has enough steady-state MMU
> lock contention that it will notice eager page splitting. But then
> again, it would be even more affected by lazy splitting unless the
> guest is doing very few writes.

Yes, indeed I see no easy solution to this due to the same lock contention.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 19/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  2022-03-09 21:49             ` David Matlack
@ 2022-03-10  8:30               ` Marc Zyngier
  0 siblings, 0 replies; 65+ messages in thread
From: Marc Zyngier @ 2022-03-10  8:30 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Peter Feiner, Andrew Jones,
	Maciej S. Szmigiero, kvm list

On Wed, 09 Mar 2022 21:49:01 +0000,
David Matlack <dmatlack@google.com> wrote:
> 
> On Mon, Mar 7, 2022 at 3:49 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Sat, Mar 5, 2022 at 8:55 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Fri, 04 Mar 2022 21:59:12 +0000,
> > > David Matlack <dmatlack@google.com> wrote:
> > > > I see two alternatives to make this cleaner:
> > > >
> > > > 1. Dynamically allocate just this cache. The caches defined in
> > > > vcpu_arch will continue to use DEFINE_KVM_MMU_MEMORY_CACHE(). This
> > > > would get rid of the outer struct but require an extra memory
> > > > allocation.
> > > > 2. Move this cache to struct kvm_arch using
> > > > DEFINE_KVM_MMU_MEMORY_CACHE(). Then we don't need to stack allocate it
> > > > or dynamically allocate it.
> > > >
> > > > Do either of these approaches appeal to you more than the current one?
> > >
> > > Certainly, #2 feels more solid. Dynamic allocations (and the resulting
> > > pointer chasing) are usually costly in terms of performance, so I'd
> > > avoid it if at all possible.
> > >
> > > That being said, if it turns out that #2 isn't practical, I won't get
> > > in the way of your current approach. Moving kvm_mmu_memory_cache to
> > > core code was definitely a good cleanup, and I'm not overly excited
> > > at the prospect of *more* arch-specific code.
> >
> > Ok I'll play with #2. Thanks for the feedback.
> 
> #2 is very clean to implement but it ends up being a bit silly. It
> increases the size of struct kvm_arch by 336 bytes for all VMs but
> only ever gets used during kvm_vgic_map_resources(), which is only
> called the first time a vCPU is run (according to the comment in
> kvm_arch_vcpu_run_pid_change()). I think stack allocation makes the
> most sense for this object; I don't think it's worth dancing around
> that solely to avoid the inner struct grottiness.

Fair enough, and thanks for having had a look. I'll look at the next
version once you post it.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU
  2022-03-10  7:03         ` Peter Xu
@ 2022-03-10 19:26           ` David Matlack
  0 siblings, 0 replies; 65+ messages in thread
From: David Matlack @ 2022-03-10 19:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Marc Zyngier, Huacai Chen, leksandar Markovic,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Peter Feiner, Andrew Jones, Maciej S. Szmigiero,
	kvm list

On Wed, Mar 9, 2022 at 11:03 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Mar 09, 2022 at 03:39:44PM -0800, David Matlack wrote:
> > On Tue, Mar 8, 2022 at 11:31 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Mon, Mar 07, 2022 at 03:39:37PM -0800, David Matlack wrote:
> > > > On Sun, Mar 6, 2022 at 9:22 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > Hi, David,
> > > > >
> > > > > Sorry for a very late comment.
> > > > >
> > > > > On Thu, Feb 03, 2022 at 01:00:28AM +0000, David Matlack wrote:
> > > > > > Performance
> > > > > > -----------
> > > > > >
> > > > > > Eager page splitting moves the cost of splitting huge pages off of the
> > > > > > vCPU thread and onto the thread invoking VM-ioctls to configure dirty
> > > > > > logging. This is useful because:
> > > > > >
> > > > > >  - Splitting on the vCPU thread interrupts vCPUs execution and is
> > > > > >    disruptive to customers whereas splitting on VM ioctl threads can
> > > > > >    run in parallel with vCPU execution.
> > > > > >
> > > > > >  - Splitting on the VM ioctl thread is more efficient because it does
> > > > > >    not require performing VM-exit handling and page table walks for every
> > > > > >    4K page.
> > > > > >
> > > > > > To measure the performance impact of Eager Page Splitting I ran
> > > > > > dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> > > > > > vCPU, and backed by 1GiB HugeTLB memory.
> > > > > >
> > > > > > To measure the impact on customer performance, we can look at the time
> > > > > > it takes all vCPUs to dirty memory after dirty logging has been enabled.
> > > > > > Without Eager Page Splitting enabled, such dirtying must take faults to
> > > > > > split huge pages and bottleneck on the MMU lock.
> > > > > >
> > > > > >              | "Iteration 1 dirty memory time"             |
> > > > > >              | ------------------------------------------- |
> > > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > > ------------ | -------------------- | -------------------- |
> > > > > > 2            | 0.310786549s         | 0.058731929s         |
> > > > > > 4            | 0.419165587s         | 0.059615316s         |
> > > > > > 8            | 1.061233860s         | 0.060945457s         |
> > > > > > 16           | 2.852955595s         | 0.067069980s         |
> > > > > > 32           | 7.032750509s         | 0.078623606s         |
> > > > > > 64           | 16.501287504s        | 0.083914116s         |
> > > > > >
> > > > > > Eager Page Splitting does increase the time it takes to enable dirty
> > > > > > logging when not using initially-all-set, since that's when KVM splits
> > > > > > huge pages. However, this runs in parallel with vCPU execution and does
> > > > > > not bottleneck on the MMU lock.
> > > > > >
> > > > > >              | "Enabling dirty logging time"               |
> > > > > >              | ------------------------------------------- |
> > > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > > ------------ | -------------------- | -------------------- |
> > > > > > 2            | 0.001581619s         |  0.025699730s        |
> > > > > > 4            | 0.003138664s         |  0.051510208s        |
> > > > > > 8            | 0.006247177s         |  0.102960379s        |
> > > > > > 16           | 0.012603892s         |  0.206949435s        |
> > > > > > 32           | 0.026428036s         |  0.435855597s        |
> > > > > > 64           | 0.103826796s         |  1.199686530s        |
> > > > > >
> > > > > > Similarly, Eager Page Splitting increases the time it takes to clear the
> > > > > > dirty log when using initially-all-set. The first time userspace
> > > > > > clears the dirty log, KVM will split huge pages:
> > > > > >
> > > > > >              | "Iteration 1 clear dirty log time"          |
> > > > > >              | ------------------------------------------- |
> > > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > > ------------ | -------------------- | -------------------- |
> > > > > > 2            | 0.001544730s         | 0.055327916s         |
> > > > > > 4            | 0.003145920s         | 0.111887354s         |
> > > > > > 8            | 0.006306964s         | 0.223920530s         |
> > > > > > 16           | 0.012681628s         | 0.447849488s         |
> > > > > > 32           | 0.026827560s         | 0.943874520s         |
> > > > > > 64           | 0.090461490s         | 2.664388025s         |
> > > > > >
> > > > > > Subsequent calls to clear the dirty log incur almost no additional cost
> > > > > > since KVM can very quickly determine there are no more huge pages to
> > > > > > split via the RMAP. This is unlike the TDP MMU which must re-traverse
> > > > > > the entire page table to check for huge pages.
> > > > > >
> > > > > >              | "Iteration 2 clear dirty log time"          |
> > > > > >              | ------------------------------------------- |
> > > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > > ------------ | -------------------- | -------------------- |
> > > > > > 2            | 0.015613726s         | 0.015771982s         |
> > > > > > 4            | 0.031456620s         | 0.031911594s         |
> > > > > > 8            | 0.063341572s         | 0.063837403s         |
> > > > > > 16           | 0.128409332s         | 0.127484064s         |
> > > > > > 32           | 0.255635696s         | 0.268837996s         |
> > > > > > 64           | 0.695572818s         | 0.700420727s         |
> > > > >
> > > > > Are all the tests above with ept=Y (except the one below)?
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > >
> > > > > > Eager Page Splitting also improves the performance of shadow paging
> > > > > > configurations, as measured with ept=N, although the absolute gains are
> > > > > > smaller since ept=N requires taking the MMU lock to track writes to 4KiB
> > > > > > pages (i.e. no fast_page_fault() or PML), which dominates the dirty
> > > > > > memory time.
> > > > > >
> > > > > >              | "Iteration 1 dirty memory time"             |
> > > > > >              | ------------------------------------------- |
> > > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > > ------------ | -------------------- | -------------------- |
> > > > > > 2            | 0.373022770s         | 0.348926043s         |
> > > > > > 4            | 0.563697483s         | 0.453022037s         |
> > > > > > 8            | 1.588492808s         | 1.524962010s         |
> > > > > > 16           | 3.988934732s         | 3.369129917s         |
> > > > > > 32           | 9.470333115s         | 8.292953856s         |
> > > > > > 64           | 20.086419186s        | 18.531840021s        |
> > > > >
> > > > > This one is definitely for ept=N because it's written there. That's a ~10%
> > > > > performance increase, which still looks good, but IMHO that increase is
> > > > > "debatable" since a normal guest may not simply write over the whole guest
> > > > > mem. So that 10% increase is based on some assumptions.
> > > > >
> > > > > What if the guest writes 80% and reads 20%?  IIUC the split thread will
> > > > > also start to block the readers for the shadow mmu, while they were not
> > > > > blocked previously?  From that pov, I'm not sure whether the series needs
> > > > > some more justification, as the changeset still seems large.
> > > > >
> > > > > Is there other benefits besides the 10% increase on writes?
> > > >
> > > > Yes, in fact workloads that perform some reads will benefit _more_
> > > > than workloads that perform only writes.
> > > >
> > > > The reason is that the current lazy splitting approach unmaps the
> > > > entire huge page on write and then maps in just the faulting 4K
> > > > page. That means reads on the unmapped portion of the hugepage will
> > > > now take a fault and require the MMU lock. In contrast, Eager Page
> > > > Splitting fully splits each huge page so readers should never take
> > > > faults.
> > > >
> > > > For example, here is the data with 20% writes and 80% reads (i.e. pass
> > > > `-f 5` to dirty_log_perf_test):
> > > >
> > > >              | "Iteration 1 dirty memory time"             |
> > > >              | ------------------------------------------- |
> > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > ------------ | -------------------- | -------------------- |
> > > > 2            | 0.403108098s         | 0.071808764s         |
> > > > 4            | 0.562173582s         | 0.105272819s         |
> > > > 8            | 1.382974557s         | 0.248713796s         |
> > > > 16           | 3.608993666s         | 0.571990327s         |
> > > > 32           | 9.100678321s         | 1.702453103s         |
> > > > 64           | 19.784780903s        | 3.489443239s         |
> > >
> > > It's very interesting to know these numbers, thanks for sharing that.
> > >
> > > Above reminded me that eager page split actually does two things:
> > >
> > > (1) When a page is mapped as huge, we "assume" this whole page will be
> > >     accessed in the near future, so when split is needed we map all the
> > >     small ptes, and,
> >
> > Note, this series does not add this behavior to the fault path.
> >
> > >
> > > (2) We move the split operation from page faults to when enable-dirty-track
> > >     happens.
> > >
> > > We could have done (1) already without the whole eager split patchset: if
> > > we see a read-only huge page on a page fault, we could populate the whole
> > > range of ptes, only marking the current small pte writable but leaving the
> > > rest of the small ptes wr-protected.  I have a feeling this would speed up
> > > the above 19.78 seconds (64 cores case) quite a bit too, to some point.
> >
> > The problem with (1) is that it still requires faults to split the
> > huge pages. Those faults will need to contend for the MMU lock, and
> > will hold the lock for longer than they do today since they are doing
> > extra work.
>
> Right.  But that overhead is very limited, IMHO; per the numbers, it's the
> difference between 20sec and 18sec for full write faults.
>
> The thing is either the split thread or the vcpu will take the write lock
> anyway.  So it either contends during the split, or later.  Without tdp (so
> never PML) it'll need a slow page fault anyway even if the split is done
> beforehand.
>
> >
> > I agree there might be some benefit for some workloads, but for write-heavy
> > workloads there will still be a "thundering herd" problem when dirty
> > logging is first enabled. I'll admit though that I have not tested this
> > approach.
>
> Indeed that's really the core of my question: why does this series care
> more about write workloads than read workloads?  To me they are all
> possible workloads, but maybe I'm wrong?  This series benefits heavy
> writes, but it may not benefit heavy reads (or may even make them slower).

It's not that either workload is more important than the other, or
that we care about one more than the other. It's about the effects of
dirty logging on each workload.

Eager page splitting is all about avoiding the large (like 99%
degradation), abrupt, scales-with-the-number-of-vcpus drop in
performance when dirty logging is enabled. This drop can be
catastrophic to customer workloads, causing application failure. Eager
page splitting may introduce higher TLB miss costs for read-heavy
workloads, making them worse than without Eager page splitting, but
that is not something that causes application failure. Maybe this is
bias from working for a cloud provider, but it's much better to have
predictable performance for all workloads (even if it's slightly worse
for some workloads) than a system that causes catastrophic failure for
some workloads.

Now that being said, KVM's shadow paging can still cause "catastrophic
failure" since it requires the write lock to handle 4KiB
write-protection faults. That's something that would be worth
addressing as well, but separately.

>
> The tdp mmu case is more persuasive in that:
>
>   (a) Splitting runs concurrently with vcpu faults, and
>
>   (b) With PML the tdp mmu case can completely avoid the small write
>       page faults.
>
> Neither of these benefits exists for the shadow mmu.

Here's how I reason about the benefits of eager page splitting for the
shadow MMU. During dirty logging the shadow MMU suffers from:

(1) Write-protection faults on huge pages that take the MMU lock to
unmap the huge page, map a 4KiB page, and update the dirty log.
(2) Non-present faults caused by (1) that take the MMU lock to map in
the missing page.
(3) Write-protection faults on 4KiB pages that take the MMU lock to
make the page writable and update the dirty log.

The benefit of eager page splitting is to eliminate (1) and (2).

(BTW, maybe to address (3) we could try to handle these
write-protection faults under the MMU read lock.)
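
To make that last idea a bit more concrete, here is a very rough sketch,
assuming we can resolve the fault with an atomic update of the SPTE and
fall back to the slow path on any race. It glosses over all the checks for
whether the SPTE is actually allowed to be made writable (unsync pages,
write tracking, etc.), and fast_wrprot_fault() itself is a made-up name:

/*
 * Hypothetical: try to fix a dirty-logging write-protection fault on a
 * 4KiB SPTE while holding only the MMU read lock.  Returns true if the
 * fault was handled, false to fall back to the slow path.
 */
static bool fast_wrprot_fault(struct kvm_vcpu *vcpu, gfn_t gfn, u64 *sptep)
{
	struct kvm *kvm = vcpu->kvm;
	u64 old_spte, new_spte;
	bool handled = false;

	read_lock(&kvm->mmu_lock);

	old_spte = READ_ONCE(*sptep);
	if (!is_shadow_present_pte(old_spte))
		goto out;

	new_spte = old_spte | PT_WRITABLE_MASK;

	/* Lost the race with a concurrent zap/update?  Use the slow path. */
	if (cmpxchg64(sptep, old_spte, new_spte) == old_spte) {
		kvm_vcpu_mark_page_dirty(vcpu, gfn);
		handled = true;
	}
out:
	read_unlock(&kvm->mmu_lock);
	return handled;
}

Whether that is actually viable for all the shadow paging cases is an open
question, of course.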

>
> I don't think I'm against this series..  I think at least with the series
> we can have a matching feature on tdp and !tdp, and meanwhile it still
> benefits mixed read+write workloads a lot, as you proved in the follow-up
> tests (PS: do you think that should be mentioned in the cover letter too?).

Yes, will do!

>
> IMHO when a performance feature is merged it'll be hard to remove, because
> once merged it'll be hard to prove wrong.  I hope it'll be worth the cost
> of merging and maintaining it in upstream kvm, so I raised these questions
> in the hope that we at least thoroughly discuss the pros and cons.
>
> >
> > An alternative approach we're looking at for handling read-heavy workloads
> > is to perform dirty logging at 2M granularity.
>
> I agree that's still something worth exploring.
>
> >
> > >
> > > Entry (1) makes a lot of sense to me; OTOH I can understand entry (2) but
> > > not strongly.
> > >
> > > My previous concern was mainly about readers being blocked during the
> > > splitting of huge pages (not after).  For the shadow mmu, IIUC the split
> > > thread will start to take the write lock rather than the read lock
> > > (compared to the tdp mmu), hence any vcpu page faults (hmm, not only
> > > readers but writers too I think, with a non-present pte..) will be blocked
> > > longer than before, am I right?
> > >
> > > Meanwhile for the shadow mmu I think there can be more page tables to walk
> > > compared to the tdp mmu for a single huge page to split?  My understanding
> > > is tdp mmu pgtables are mostly limited by the number of address spaces (?),
> > > but shadow pgtables are per-task.
> >
> > Or per-L2 VM, in the case of nested virtualization.
> >
> > > So I'm not sure whether, for a guest with
> > > a lot of active tasks sharing pages, the split thread could spend quite some
> > > time splitting, holding the write lock the whole time without releasing it.
> >
> > The eager page splitting code does check for contention and drop the
> > MMU lock in between every SPTE it tries to split. But there still
> > might be some increase in contention due to eager page splitting.
>
> Ah right..
>
> >
> > >
> > > These are kind of against the purpose of eager split on shadow paging,
> > > which is to reduce the impact on guest vcpu threads?  But I can't tell, I
> > > could have missed something else.  It's just that when applying the idea to
> > > the shadow mmu it sounds less attractive than the tdp mmu case.
> >
> > The shadow MMU is also used for Nested Virtualization, which is a bit
> > different from "typical" shadow paging (ept/npt=N) because VMs tend
> > not to share pages, their page tables are fairly static (compared to
> > process page tables), and they tend to be longer lived. So there will
> > not be as much steady-state MMU lock contention that would be
> > negatively impacted by eager page splitting.
> >
> > You might be right though that ept/npt=N has enough steady-state MMU
> > lock contention that it will notice eager page splitting. But then
> > again, it would be even more affected by lazy splitting unless the
> > guest is doing very few writes.
>
> Yes, indeed I see no easy solution to this due to the same lock contention.
>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2022-03-10 19:27 UTC | newest]

Thread overview: 65+ messages
2022-02-03  1:00 [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU David Matlack
2022-02-03  1:00 ` [PATCH 01/23] KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs David Matlack
2022-02-19  0:57   ` Sean Christopherson
2022-02-03  1:00 ` [PATCH 02/23] KVM: x86/mmu: Derive shadow MMU page role from parent David Matlack
2022-02-19  1:14   ` Sean Christopherson
2022-02-24 18:45     ` David Matlack
2022-03-04  0:22     ` David Matlack
2022-02-03  1:00 ` [PATCH 03/23] KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions David Matlack
2022-02-19  1:25   ` Sean Christopherson
2022-02-24 18:54     ` David Matlack
2022-02-03  1:00 ` [PATCH 04/23] KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages David Matlack
2022-02-03  1:00 ` [PATCH 05/23] KVM: x86/mmu: Pass memslot to kvm_mmu_create_sp() David Matlack
2022-02-03  1:00 ` [PATCH 06/23] KVM: x86/mmu: Separate shadow MMU sp allocation from initialization David Matlack
2022-02-16 19:37   ` Ben Gardon
2022-02-16 21:42     ` David Matlack
2022-02-03  1:00 ` [PATCH 07/23] KVM: x86/mmu: Move huge page split sp allocation code to mmu.c David Matlack
2022-02-03  1:00 ` [PATCH 08/23] KVM: x86/mmu: Use common code to free kvm_mmu_page structs David Matlack
2022-02-03  1:00 ` [PATCH 09/23] KVM: x86/mmu: Use common code to allocate kvm_mmu_page structs from vCPU caches David Matlack
2022-02-03  1:00 ` [PATCH 10/23] KVM: x86/mmu: Pass const memslot to rmap_add() David Matlack
2022-02-23 23:25   ` Ben Gardon
2022-02-03  1:00 ` [PATCH 11/23] KVM: x86/mmu: Pass const memslot to kvm_mmu_init_sp() and descendants David Matlack
2022-02-23 23:27   ` Ben Gardon
2022-02-03  1:00 ` [PATCH 12/23] KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu David Matlack
2022-02-23 23:30   ` Ben Gardon
2022-02-03  1:00 ` [PATCH 13/23] KVM: x86/mmu: Update page stats in __rmap_add() David Matlack
2022-02-23 23:32   ` Ben Gardon
2022-02-23 23:35     ` Ben Gardon
2022-02-03  1:00 ` [PATCH 14/23] KVM: x86/mmu: Cache the access bits of shadowed translations David Matlack
2022-02-28 20:30   ` Ben Gardon
2022-02-03  1:00 ` [PATCH 15/23] KVM: x86/mmu: Pass access information to make_huge_page_split_spte() David Matlack
2022-02-28 20:32   ` Ben Gardon
2022-02-03  1:00 ` [PATCH 16/23] KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU David Matlack
2022-02-28 20:39   ` Ben Gardon
2022-03-03 19:42     ` David Matlack
2022-02-03  1:00 ` [PATCH 17/23] KVM: x86/mmu: Pass bool flush parameter to drop_large_spte() David Matlack
2022-02-28 20:47   ` Ben Gardon
2022-03-03 19:52     ` David Matlack
2022-02-03  1:00 ` [PATCH 18/23] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU David Matlack
2022-02-28 21:09   ` Ben Gardon
2022-02-28 23:29     ` David Matlack
2022-02-03  1:00 ` [PATCH 19/23] KVM: Allow for different capacities in kvm_mmu_memory_cache structs David Matlack
2022-02-24 11:28   ` Marc Zyngier
2022-02-24 19:20     ` David Matlack
2022-03-04 21:59       ` David Matlack
2022-03-04 22:24         ` David Matlack
2022-03-05 16:55         ` Marc Zyngier
2022-03-07 23:49           ` David Matlack
2022-03-08  7:42             ` Marc Zyngier
2022-03-09 21:49             ` David Matlack
2022-03-10  8:30               ` Marc Zyngier
2022-02-03  1:00 ` [PATCH 20/23] KVM: Allow GFP flags to be passed when topping up MMU caches David Matlack
2022-02-28 21:12   ` Ben Gardon
2022-02-03  1:00 ` [PATCH 21/23] KVM: x86/mmu: Fully split huge pages that require extra pte_list_desc structs David Matlack
2022-02-28 21:22   ` Ben Gardon
2022-02-28 23:41     ` David Matlack
2022-03-01  0:37       ` Ben Gardon
2022-03-03 19:59         ` David Matlack
2022-02-03  1:00 ` [PATCH 22/23] KVM: x86/mmu: Split huge pages aliased by multiple SPTEs David Matlack
2022-02-03  1:00 ` [PATCH 23/23] KVM: selftests: Map x86_64 guest virtual memory with huge pages David Matlack
2022-03-07  5:21 ` [PATCH 00/23] Extend Eager Page Splitting to the shadow MMU Peter Xu
2022-03-07 23:39   ` David Matlack
2022-03-09  7:31     ` Peter Xu
2022-03-09 23:39       ` David Matlack
2022-03-10  7:03         ` Peter Xu
2022-03-10 19:26           ` David Matlack
