linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] KVM: x86/mmu: pte_list_desc fix and cleanups
@ 2022-06-24 23:27 Sean Christopherson
  2022-06-24 23:27 ` [PATCH 1/4] KVM: x86/mmu: Track the number of entries in a pte_list_desc with a ulong Sean Christopherson
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Sean Christopherson @ 2022-06-24 23:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Xu

Reviewing the eager page splitting code made me realize that burning 14
rmap entries per pte_list_desc for nested TDP MMUs is extremely wasteful,
as the per-vCPU caches allocate 40 such objects by default.  For nested
TDP, aliasing L2 gfns to L1 gfns is quite rare and is not performance
critical (it's exclusively pre-boot behavior for sane setups).
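
(Rough arithmetic, assuming 64-bit pointers: a 14-entry pte_list_desc is
8 + 8 + 14 * 8 = 128 bytes, so a fully topped-up 40-object cache burns
roughly 5 KiB per vCPU even if nested TDP never aliases a single gfn.)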

Patch 1 fixes a bug where pte_list_desc is neither correctly aligned nor
correctly sized on 32-bit kernels.  The primary motivation for the fix is to
be able to add a compile-time assertion on the size being a multiple of the
cache line size; I doubt anyone cares about the performance/memory impact.

Patch 2 tweaks MMU setup to support a dynamic pte_list_desc size.

Patch 3 reduces the number of sptes per pte_list_desc to 2 for nested TDP
MMUs, i.e. allocates the bare minimum to prioritize the memory footprint
over performance for sane setups.

Patch 4 fills the pte_list_desc cache if and only if rmaps are in use,
i.e. doesn't allocate pte_list_desc when using the TDP MMU until nested
TDP is used.

Sean Christopherson (4):
  KVM: x86/mmu: Track the number of entries in a pte_list_desc with a ulong
  KVM: x86/mmu: Defer "full" MMU setup until after vendor
    hardware_setup()
  KVM: x86/mmu: Shrink pte_list_desc size when KVM is using TDP
  KVM: x86/mmu: Topup pte_list_desc cache iff VM is using rmaps

 arch/x86/include/asm/kvm_host.h |  5 ++-
 arch/x86/kvm/mmu/mmu.c          | 78 +++++++++++++++++++++++----------
 arch/x86/kvm/x86.c              | 17 ++++---
 3 files changed, 70 insertions(+), 30 deletions(-)


base-commit: 4b88b1a518b337de1252b8180519ca4c00015c9e
-- 
2.37.0.rc0.161.g10f37bed90-goog



* [PATCH 1/4] KVM: x86/mmu: Track the number of entries in a pte_list_desc with a ulong
  2022-06-24 23:27 [PATCH 0/4] KVM: x86/mmu: pte_list_desc fix and cleanups Sean Christopherson
@ 2022-06-24 23:27 ` Sean Christopherson
  2022-06-24 23:27 ` [PATCH 2/4] KVM: x86/mmu: Defer "full" MMU setup until after vendor hardware_setup() Sean Christopherson
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2022-06-24 23:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Xu

Use an "unsigned long" instead of a "u64" to track the number of entries
in a pte_list_desc's sptes array.  Both sizes are overkill as the number
of entries would easily fit into a u8, the goal is purely to get sptes[]
aligned and to size the struct as a whole to be a multiple of a cache
line (64 bytes).

Using a u64 on 32-bit kernels fails on both accounts as "more" is only
4 bytes.  Dropping "spte_count" to 4 bytes on 32-bit kernels fixes the
alignment issue and the overall size.

Add a compile-time assert to ensure the size of pte_list_desc stays a
multiple of the cache line size on modern CPUs (hardcoded because
L1_CACHE_BYTES is configurable via CONFIG_X86_L1_CACHE_SHIFT).
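
For illustration only (not part of the patch): a standalone userspace sketch
of the sizing arithmetic the assert checks, assuming 64-byte cache lines and
the usual LP64/ILP32 pointer sizes.

/* Mirrors the fixed layout; build and run to see the size on this ABI. */
#include <stdio.h>
#include <stdint.h>

#define PTE_LIST_EXT 14

struct pte_list_desc_sketch {
	struct pte_list_desc_sketch *more;	/* 8 bytes (4 on 32-bit) */
	unsigned long spte_count;		/* 8 bytes (4 on 32-bit) */
	uint64_t *sptes[PTE_LIST_EXT];		/* 14 * 8 (14 * 4 on 32-bit) */
};

/*
 * 64-bit: 8 + 8 + 112 = 128 bytes, i.e. two 64-byte cache lines.
 * 32-bit: 4 + 4 + 56  = 64 bytes, i.e. exactly one cache line.
 * With a u64 spte_count, the 32-bit layout becomes 4 + 8 (+ any ABI
 * padding) + 56, which is neither cache-line sized nor naturally
 * aligned for sptes[].
 */
_Static_assert(sizeof(struct pte_list_desc_sketch) % 64 == 0,
	       "pte_list_desc is not a multiple of a 64-byte cache line");

int main(void)
{
	printf("sizeof(pte_list_desc_sketch) = %zu\n",
	       sizeof(struct pte_list_desc_sketch));
	return 0;
}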

Fixes: 13236e25ebab ("KVM: X86: Optimize pte_list_desc with per-array counter")
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd74a287b54a..17ac30b9e22c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -117,15 +117,17 @@ module_param(dbg, bool, 0644);
 /*
  * Slight optimization of cacheline layout, by putting `more' and `spte_count'
  * at the start; then accessing it will only use one single cacheline for
- * either full (entries==PTE_LIST_EXT) case or entries<=6.
+ * either full (entries==PTE_LIST_EXT) case or entries<=6.  On 32-bit kernels,
+ * the entire struct fits in a single cacheline.
  */
 struct pte_list_desc {
 	struct pte_list_desc *more;
 	/*
-	 * Stores number of entries stored in the pte_list_desc.  No need to be
-	 * u64 but just for easier alignment.  When PTE_LIST_EXT, means full.
+	 * The number of valid entries in sptes[].  Use an unsigned long to
+	 * naturally align sptes[] (a u8 for the count would suffice).  When
+	 * equal to PTE_LIST_EXT, this particular list is full.
 	 */
-	u64 spte_count;
+	unsigned long spte_count;
 	u64 *sptes[PTE_LIST_EXT];
 };
 
@@ -5640,6 +5642,9 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 	tdp_root_level = tdp_forced_root_level;
 	max_tdp_level = tdp_max_root_level;
 
+	BUILD_BUG_ON_MSG((sizeof(struct pte_list_desc) % 64),
+			 "pte_list_desc is not a multiple of cache line size (on modern CPUs)");
+
 	/*
 	 * max_huge_page_level reflects KVM's MMU capabilities irrespective
 	 * of kernel support, e.g. KVM may be capable of using 1GB pages when
-- 
2.37.0.rc0.161.g10f37bed90-goog



* [PATCH 2/4] KVM: x86/mmu: Defer "full" MMU setup until after vendor hardware_setup()
  2022-06-24 23:27 [PATCH 0/4] KVM: x86/mmu: pte_list_desc fix and cleanups Sean Christopherson
  2022-06-24 23:27 ` [PATCH 1/4] KVM: x86/mmu: Track the number of entries in a pte_list_desc with a ulong Sean Christopherson
@ 2022-06-24 23:27 ` Sean Christopherson
  2022-06-25  0:16   ` David Matlack
  2022-07-12 21:56   ` Peter Xu
  2022-06-24 23:27 ` [PATCH 3/4] KVM: x86/mmu: Shrink pte_list_desc size when KVM is using TDP Sean Christopherson
  2022-06-24 23:27 ` [PATCH 4/4] KVM: x86/mmu: Topup pte_list_desc cache iff VM is using rmaps Sean Christopherson
  3 siblings, 2 replies; 14+ messages in thread
From: Sean Christopherson @ 2022-06-24 23:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Xu

Defer MMU setup, and in particular allocation of pte_list_desc_cache,
until after the vendor's hardware_setup() has run, i.e. until after the
MMU has been configured by vendor code.  This will allow a future commit
to dynamically size pte_list_desc's array of sptes based on whether or
not KVM is using TDP.

Alternatively, the setup could be done in kvm_configure_mmu(), but that
would require vendor code to call e.g. kvm_unconfigure_mmu() in teardown
and error paths, i.e. doesn't actually save code and is arguably uglier.

Note, keep the reset of PTE masks where it is to ensure that the masks
are reset before the vendor's hardware_setup() runs, i.e. before the
vendor code has a chance to manipulate the masks, e.g. VMX modifies masks
even before calling kvm_configure_mmu().
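
In other words, after this patch the ordering at module load is:
kvm_mmu_vendor_module_init() (PTE mask reset only) from kvm_arch_init(),
then the vendor's hardware_setup() (which calls kvm_configure_mmu()), and
only then kvm_mmu_hardware_setup() from kvm_arch_hardware_setup() to create
the MMU's caches.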

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  5 +++--
 arch/x86/kvm/mmu/mmu.c          | 12 ++++++++----
 arch/x86/kvm/x86.c              | 17 +++++++++++------
 3 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 88a3026ee163..c670a9656257 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1711,8 +1711,9 @@ static inline int kvm_arch_flush_remote_tlb(struct kvm *kvm)
 	((vcpu) && (vcpu)->arch.handling_intr_from_guest)
 
 void kvm_mmu_x86_module_init(void);
-int kvm_mmu_vendor_module_init(void);
-void kvm_mmu_vendor_module_exit(void);
+void kvm_mmu_vendor_module_init(void);
+int kvm_mmu_hardware_setup(void);
+void kvm_mmu_hardware_unsetup(void);
 
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_mmu_create(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 17ac30b9e22c..ceb81e04aea3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6673,10 +6673,8 @@ void kvm_mmu_x86_module_init(void)
  * loaded as many of the masks/values may be modified by VMX or SVM, i.e. need
  * to be reset when a potentially different vendor module is loaded.
  */
-int kvm_mmu_vendor_module_init(void)
+void kvm_mmu_vendor_module_init(void)
 {
-	int ret = -ENOMEM;
-
 	/*
 	 * MMU roles use union aliasing which is, generally speaking, an
 	 * undefined behavior. However, we supposedly know how compilers behave
@@ -6687,7 +6685,13 @@ int kvm_mmu_vendor_module_init(void)
 	BUILD_BUG_ON(sizeof(union kvm_mmu_extended_role) != sizeof(u32));
 	BUILD_BUG_ON(sizeof(union kvm_cpu_role) != sizeof(u64));
 
+	/* Reset the PTE masks before the vendor module's hardware setup. */
 	kvm_mmu_reset_all_pte_masks();
+}
+
+int kvm_mmu_hardware_setup(void)
+{
+	int ret = -ENOMEM;
 
 	pte_list_desc_cache = kmem_cache_create("pte_list_desc",
 					    sizeof(struct pte_list_desc),
@@ -6723,7 +6727,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 	mmu_free_memory_caches(vcpu);
 }
 
-void kvm_mmu_vendor_module_exit(void)
+void kvm_mmu_hardware_unsetup(void)
 {
 	mmu_destroy_caches();
 	percpu_counter_destroy(&kvm_total_used_mmu_pages);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 031678eff28e..735543df829a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9204,9 +9204,7 @@ int kvm_arch_init(void *opaque)
 	}
 	kvm_nr_uret_msrs = 0;
 
-	r = kvm_mmu_vendor_module_init();
-	if (r)
-		goto out_free_percpu;
+	kvm_mmu_vendor_module_init();
 
 	kvm_timer_init();
 
@@ -9226,8 +9224,6 @@ int kvm_arch_init(void *opaque)
 
 	return 0;
 
-out_free_percpu:
-	free_percpu(user_return_msrs);
 out_free_x86_emulator_cache:
 	kmem_cache_destroy(x86_emulator_cache);
 out:
@@ -9252,7 +9248,6 @@ void kvm_arch_exit(void)
 	cancel_work_sync(&pvclock_gtod_work);
 #endif
 	kvm_x86_ops.hardware_enable = NULL;
-	kvm_mmu_vendor_module_exit();
 	free_percpu(user_return_msrs);
 	kmem_cache_destroy(x86_emulator_cache);
 #ifdef CONFIG_KVM_XEN
@@ -11937,6 +11932,10 @@ int kvm_arch_hardware_setup(void *opaque)
 
 	kvm_ops_update(ops);
 
+	r = kvm_mmu_hardware_setup();
+	if (r)
+		goto out_unsetup;
+
 	kvm_register_perf_callbacks(ops->handle_intel_pt_intr);
 
 	if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
@@ -11960,12 +11959,18 @@ int kvm_arch_hardware_setup(void *opaque)
 	kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
 	kvm_init_msr_list();
 	return 0;
+
+out_unsetup:
+	static_call(kvm_x86_hardware_unsetup)();
+	return r;
 }
 
 void kvm_arch_hardware_unsetup(void)
 {
 	kvm_unregister_perf_callbacks();
 
+	kvm_mmu_hardware_unsetup();
+
 	static_call(kvm_x86_hardware_unsetup)();
 }
 
-- 
2.37.0.rc0.161.g10f37bed90-goog



* [PATCH 3/4] KVM: x86/mmu: Shrink pte_list_desc size when KVM is using TDP
  2022-06-24 23:27 [PATCH 0/4] KVM: x86/mmu: pte_list_desc fix and cleanups Sean Christopherson
  2022-06-24 23:27 ` [PATCH 1/4] KVM: x86/mmu: Track the number of entries in a pte_list_desc with a ulong Sean Christopherson
  2022-06-24 23:27 ` [PATCH 2/4] KVM: x86/mmu: Defer "full" MMU setup until after vendor hardware_setup() Sean Christopherson
@ 2022-06-24 23:27 ` Sean Christopherson
  2022-07-12 22:35   ` Peter Xu
  2022-06-24 23:27 ` [PATCH 4/4] KVM: x86/mmu: Topup pte_list_desc cache iff VM is using rmaps Sean Christopherson
  3 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2022-06-24 23:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Xu

Dynamically size struct pte_list_desc's array of sptes based on whether
or not KVM is using TDP.  Commit dc1cff969101 ("KVM: X86: MMU: Tune
PTE_LIST_EXT to be bigger") bumped the number of entries in order to
improve performance when using shadow paging, but its analysis that the
larger size would not affect TDP was wrong.  Consuming pte_list_desc
objects for nested TDP is indeed rare, but _allocating_ objects is not,
as KVM allocates 40 objects for each per-vCPU cache.  Reducing the size
from 128 bytes to 32 bytes reduces that per-vCPU cost from 5120 bytes to
1280, and also provides similar savings when eager page splitting for
nested MMUs kicks in.

The per-vCPU overhead could be further reduced by using a custom, smaller
capacity for the per-vCPU caches, but that's more of an "and" than
an "or" change, e.g. it wouldn't help the eager page split use case.

Set the list size to the bare minimum without completely defeating the
purpose of an array (and because pte_list_add() assumes the array is at
least two entries deep).  A larger size, e.g. 4, would reduce the number
of "allocations", but those "allocations" only become allocations in
truth if a single vCPU depletes its cache to where a topup is needed,
i.e. if a single vCPU "allocates" 30+ lists.  Conversely, those 2 extra
entries consume 16 bytes * 40 * nr_vcpus in the caches the instant nested
TDP is used.

In the unlikely event that performance of aliased gfns for nested TDP
really is (or becomes) a priority for oddball workloads, KVM could add a
knob to let the admin tune the array size for their environment.

Note, KVM also unnecessarily tops up the per-vCPU caches even when not
using rmaps; this can also be addressed separately.
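
As a rough userspace sketch (not the kernel change itself) of how a
flexible-array pte_list_desc is sized at runtime, using the same arithmetic
the patch feeds to kmem_cache_create(); the byte counts in the comments
assume 64-bit pointers:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct pte_list_desc_sketch {
	struct pte_list_desc_sketch *more;
	unsigned long spte_count;
	uint64_t *sptes[];			/* flexible array member */
};

static struct pte_list_desc_sketch *alloc_desc(int nr_sptes)
{
	/* 8 + 8 + nr_sptes * 8 bytes: 32 for nr_sptes=2, 128 for 14. */
	return calloc(1, sizeof(struct pte_list_desc_sketch) +
			 nr_sptes * sizeof(uint64_t *));
}

int main(void)
{
	size_t tdp = sizeof(struct pte_list_desc_sketch) + 2 * sizeof(uint64_t *);
	size_t shadow = sizeof(struct pte_list_desc_sketch) + 14 * sizeof(uint64_t *);

	printf("2 sptes:  %zu bytes/object, %zu bytes per 40-object cache\n",
	       tdp, 40 * tdp);
	printf("14 sptes: %zu bytes/object, %zu bytes per 40-object cache\n",
	       shadow, 40 * shadow);
	free(alloc_desc(2));
	return 0;
}

On a 64-bit build this prints 32 vs 128 bytes per object, i.e. 1280 vs 5120
bytes for a fully topped-up 40-object per-vCPU cache.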

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 49 +++++++++++++++++++++++++++++++-----------
 1 file changed, 36 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ceb81e04aea3..2db328d28b7b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -101,6 +101,7 @@ bool tdp_enabled = false;
 static int max_huge_page_level __read_mostly;
 static int tdp_root_level __read_mostly;
 static int max_tdp_level __read_mostly;
+static int nr_sptes_per_pte_list __read_mostly;
 
 #ifdef MMU_DEBUG
 bool dbg = 0;
@@ -111,24 +112,21 @@ module_param(dbg, bool, 0644);
 
 #include <trace/events/kvm.h>
 
-/* make pte_list_desc fit well in cache lines */
-#define PTE_LIST_EXT 14
-
 /*
  * Slight optimization of cacheline layout, by putting `more' and `spte_count'
  * at the start; then accessing it will only use one single cacheline for
- * either full (entries==PTE_LIST_EXT) case or entries<=6.  On 32-bit kernels,
- * the entire struct fits in a single cacheline.
+ * either full (entries==nr_sptes_per_pte_list) case or entries<=6.  On 32-bit
+ * kernels, the entire struct fits in a single cacheline.
  */
 struct pte_list_desc {
 	struct pte_list_desc *more;
 	/*
 	 * The number of valid entries in sptes[].  Use an unsigned long to
 	 * naturally align sptes[] (a u8 for the count would suffice).  When
-	 * equal to PTE_LIST_EXT, this particular list is full.
+	 * equal to nr_sptes_per_pte_list, this particular list is full.
 	 */
 	unsigned long spte_count;
-	u64 *sptes[PTE_LIST_EXT];
+	u64 *sptes[];
 };
 
 struct kvm_shadow_walk_iterator {
@@ -883,8 +881,8 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 	} else {
 		rmap_printk("%p %llx many->many\n", spte, *spte);
 		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
-		while (desc->spte_count == PTE_LIST_EXT) {
-			count += PTE_LIST_EXT;
+		while (desc->spte_count == nr_sptes_per_pte_list) {
+			count += nr_sptes_per_pte_list;
 			if (!desc->more) {
 				desc->more = kvm_mmu_memory_cache_alloc(cache);
 				desc = desc->more;
@@ -1102,7 +1100,7 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
 	u64 *sptep;
 
 	if (iter->desc) {
-		if (iter->pos < PTE_LIST_EXT - 1) {
+		if (iter->pos < nr_sptes_per_pte_list - 1) {
 			++iter->pos;
 			sptep = iter->desc->sptes[iter->pos];
 			if (sptep)
@@ -5642,8 +5640,27 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 	tdp_root_level = tdp_forced_root_level;
 	max_tdp_level = tdp_max_root_level;
 
-	BUILD_BUG_ON_MSG((sizeof(struct pte_list_desc) % 64),
+	/*
+	 * Size the array of sptes in pte_list_desc based on whether or not KVM
+	 * is using TDP.  When using TDP, the shadow MMU is used only to shadow
+	 * L1's TDP entries for L2.  For TDP, prioritize the per-vCPU memory
+	 * footprint (due to using per-vCPU caches) as aliasing L2 gfns to L1
+	 * gfns is rare.  When using shadow paging, prioritize performance as
+	 * aliasing gfns with multiple gvas is very common, e.g. L1 will have
+	 * kernel mappings and multiple userspace mappings for a given gfn.
+	 *
+	 * For TDP, size the array for the bare minimum of two entries (without
+	 * requiring a "list" for every single entry).
+	 *
+	 * For !TDP, size the array so that the overall size of pte_list_desc
+	 * is a multiple of the cache line size (assert this as well).
+	 */
+	BUILD_BUG_ON_MSG((sizeof(struct pte_list_desc) + 14 * sizeof(u64 *)) % 64,
 			 "pte_list_desc is not a multiple of cache line size (on modern CPUs)");
+	if (tdp_enabled)
+		nr_sptes_per_pte_list = 2;
+	else
+		nr_sptes_per_pte_list = 14;
 
 	/*
 	 * max_huge_page_level reflects KVM's MMU capabilities irrespective
@@ -6691,11 +6708,17 @@ void kvm_mmu_vendor_module_init(void)
 
 int kvm_mmu_hardware_setup(void)
 {
+	int pte_list_desc_size;
 	int ret = -ENOMEM;
 
+	if (WARN_ON_ONCE(!nr_sptes_per_pte_list))
+		return -EIO;
+
+	pte_list_desc_size = sizeof(struct pte_list_desc) +
+			     nr_sptes_per_pte_list * sizeof(u64 *);
 	pte_list_desc_cache = kmem_cache_create("pte_list_desc",
-					    sizeof(struct pte_list_desc),
-					    0, SLAB_ACCOUNT, NULL);
+						pte_list_desc_size, 0,
+						SLAB_ACCOUNT, NULL);
 	if (!pte_list_desc_cache)
 		goto out;
 
-- 
2.37.0.rc0.161.g10f37bed90-goog



* [PATCH 4/4] KVM: x86/mmu: Topup pte_list_desc cache iff VM is using rmaps
  2022-06-24 23:27 [PATCH 0/4] KVM: x86/mmu: pte_list_desc fix and cleanups Sean Christopherson
                   ` (2 preceding siblings ...)
  2022-06-24 23:27 ` [PATCH 3/4] KVM: x86/mmu: Shrink pte_list_desc size when KVM is using TDP Sean Christopherson
@ 2022-06-24 23:27 ` Sean Christopherson
  3 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2022-06-24 23:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Xu

Topup the per-vCPU pte_list_desc caches if and only if the VM is using
rmaps, i.e. KVM is not using the TDP MMU or KVM is shadowing a nested TDP
MMU.  This avoids wasting 1280 bytes per vCPU when KVM is using the TDP
MMU and L1 is not utilizing nested TDP.
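
(That 1280 bytes is the 40-object cache capacity multiplied by the 32-byte,
TDP-sized pte_list_desc from the previous patch.)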

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2db328d28b7b..fcbdd780075f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -646,11 +646,13 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 {
 	int r;
 
-	/* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
-	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
-				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
-	if (r)
-		return r;
+	if (kvm_memslots_have_rmaps(vcpu->kvm)) {
+		/* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
+					       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
+		if (r)
+			return r;
+	}
 	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
 				       PT64_ROOT_MAX_LEVEL);
 	if (r)
-- 
2.37.0.rc0.161.g10f37bed90-goog



* Re: [PATCH 2/4] KVM: x86/mmu: Defer "full" MMU setup until after vendor hardware_setup()
  2022-06-24 23:27 ` [PATCH 2/4] KVM: x86/mmu: Defer "full" MMU setup until after vendor hardware_setup() Sean Christopherson
@ 2022-06-25  0:16   ` David Matlack
  2022-06-27 15:40     ` Sean Christopherson
  2022-07-12 21:56   ` Peter Xu
  1 sibling, 1 reply; 14+ messages in thread
From: David Matlack @ 2022-06-25  0:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Xu

On Fri, Jun 24, 2022 at 11:27:33PM +0000, Sean Christopherson wrote:
> Defer MMU setup, and in particular allocation of pte_list_desc_cache,
> until after the vendor's hardware_setup() has run, i.e. until after the
> MMU has been configured by vendor code.  This will allow a future commit
> to dynamically size pte_list_desc's array of sptes based on whether or
> not KVM is using TDP.
> 
> Alternatively, the setup could be done in kvm_configure_mmu(), but that
> would require vendor code to call e.g. kvm_unconfigure_mmu() in teardown
> and error paths, i.e. doesn't actually save code and is arguably uglier.
> 
> Note, keep the reset of PTE masks where it is to ensure that the masks
> are reset before the vendor's hardware_setup() runs, i.e. before the
> vendor code has a chance to manipulate the masks, e.g. VMX modifies masks
> even before calling kvm_configure_mmu().
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
[...]
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 17ac30b9e22c..ceb81e04aea3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6673,10 +6673,8 @@ void kvm_mmu_x86_module_init(void)
>   * loaded as many of the masks/values may be modified by VMX or SVM, i.e. need
>   * to be reset when a potentially different vendor module is loaded.
>   */
> -int kvm_mmu_vendor_module_init(void)
> +void kvm_mmu_vendor_module_init(void)
>  {
> -	int ret = -ENOMEM;
> -
>  	/*
>  	 * MMU roles use union aliasing which is, generally speaking, an
>  	 * undefined behavior. However, we supposedly know how compilers behave
> @@ -6687,7 +6685,13 @@ int kvm_mmu_vendor_module_init(void)
>  	BUILD_BUG_ON(sizeof(union kvm_mmu_extended_role) != sizeof(u32));
>  	BUILD_BUG_ON(sizeof(union kvm_cpu_role) != sizeof(u64));
>  
> +	/* Reset the PTE masks before the vendor module's hardware setup. */
>  	kvm_mmu_reset_all_pte_masks();
> +}
> +
> +int kvm_mmu_hardware_setup(void)
> +{

Instead of putting this code in a new function and calling it after
hardware_setup(), we could put it in kvm_configure_mmu().

This will result in a larger patch diff, but it eliminates a subtle
and non-trivial-to-verify dependency ordering between
kvm_configure_mmu() and kvm_mmu_hardware_setup() and it will co-locate
the initialization of nr_sptes_per_pte_list and the code that uses it to
create pte_list_desc_cache in a single function.


* Re: [PATCH 2/4] KVM: x86/mmu: Defer "full" MMU setup until after vendor hardware_setup()
  2022-06-25  0:16   ` David Matlack
@ 2022-06-27 15:40     ` Sean Christopherson
  2022-06-27 22:50       ` David Matlack
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2022-06-27 15:40 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Xu

On Sat, Jun 25, 2022, David Matlack wrote:
> On Fri, Jun 24, 2022 at 11:27:33PM +0000, Sean Christopherson wrote:
> > Alternatively, the setup could be done in kvm_configure_mmu(), but that
> > would require vendor code to call e.g. kvm_unconfigure_mmu() in teardown
> > and error paths, i.e. doesn't actually save code and is arguably uglier.
> [...]
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 17ac30b9e22c..ceb81e04aea3 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6673,10 +6673,8 @@ void kvm_mmu_x86_module_init(void)
> >   * loaded as many of the masks/values may be modified by VMX or SVM, i.e. need
> >   * to be reset when a potentially different vendor module is loaded.
> >   */
> > -int kvm_mmu_vendor_module_init(void)
> > +void kvm_mmu_vendor_module_init(void)
> >  {
> > -	int ret = -ENOMEM;
> > -
> >  	/*
> >  	 * MMU roles use union aliasing which is, generally speaking, an
> >  	 * undefined behavior. However, we supposedly know how compilers behave
> > @@ -6687,7 +6685,13 @@ int kvm_mmu_vendor_module_init(void)
> >  	BUILD_BUG_ON(sizeof(union kvm_mmu_extended_role) != sizeof(u32));
> >  	BUILD_BUG_ON(sizeof(union kvm_cpu_role) != sizeof(u64));
> >  
> > +	/* Reset the PTE masks before the vendor module's hardware setup. */
> >  	kvm_mmu_reset_all_pte_masks();
> > +}
> > +
> > +int kvm_mmu_hardware_setup(void)
> > +{
> 
> Instead of putting this code in a new function and calling it after
> hardware_setup(), we could put it in kvm_configure_mmu().

Ya, I noted that as an alternative in the changelog but obviously opted to not
do the allocation in kvm_configure_mmu().  I view kvm_configure_mmu() as a necessary
evil.  Ideally vendor code wouldn't call into the MMU during initialization, and
common x86 would fully dictate the order of calls during MMU setup.  We could force
that, but it'd require something gross like filling a struct passed into
ops->hardware_setup(), and probably would be less robust (more likely to omit a
"required" field).

In other words, I like the explicit kvm_mmu_hardware_setup() call from common x86,
e.g. to show that vendor code needs to do setup before the MMU, and so that MMU
setup isn't buried in a somewhat arbitrary location in vendor hardware setup. 

I'm not dead set against handling this in kvm_configure_mmu() (though I'd probably
vote to rename it to kvm_mmu_hardware_setup()) if anyone has a super strong opinion.
 
> This will result in a larger patch diff, but it eliminates a subtle
> and non-trivial-to-verify dependency ordering between

Verification is "trivial" in that this WARN will fire if the order is swapped:

	if (WARN_ON_ONCE(!nr_sptes_per_pte_list))
		return -EIO;

> kvm_configure_mmu() and kvm_mmu_hardware_setup() and it will co-locate
> the initialization of nr_sptes_per_pte_list and the code that uses it to
> create pte_list_desc_cache in a single function.


* Re: [PATCH 2/4] KVM: x86/mmu: Defer "full" MMU setup until after vendor hardware_setup()
  2022-06-27 15:40     ` Sean Christopherson
@ 2022-06-27 22:50       ` David Matlack
  0 siblings, 0 replies; 14+ messages in thread
From: David Matlack @ 2022-06-27 22:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Peter Xu

On Mon, Jun 27, 2022 at 03:40:49PM +0000, Sean Christopherson wrote:
> On Sat, Jun 25, 2022, David Matlack wrote:
> > On Fri, Jun 24, 2022 at 11:27:33PM +0000, Sean Christopherson wrote:
> > > Alternatively, the setup could be done in kvm_configure_mmu(), but that
> > > would require vendor code to call e.g. kvm_unconfigure_mmu() in teardown
> > > and error paths, i.e. doesn't actually save code and is arguably uglier.
> > [...]
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 17ac30b9e22c..ceb81e04aea3 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -6673,10 +6673,8 @@ void kvm_mmu_x86_module_init(void)
> > >   * loaded as many of the masks/values may be modified by VMX or SVM, i.e. need
> > >   * to be reset when a potentially different vendor module is loaded.
> > >   */
> > > -int kvm_mmu_vendor_module_init(void)
> > > +void kvm_mmu_vendor_module_init(void)
> > >  {
> > > -	int ret = -ENOMEM;
> > > -
> > >  	/*
> > >  	 * MMU roles use union aliasing which is, generally speaking, an
> > >  	 * undefined behavior. However, we supposedly know how compilers behave
> > > @@ -6687,7 +6685,13 @@ int kvm_mmu_vendor_module_init(void)
> > >  	BUILD_BUG_ON(sizeof(union kvm_mmu_extended_role) != sizeof(u32));
> > >  	BUILD_BUG_ON(sizeof(union kvm_cpu_role) != sizeof(u64));
> > >  
> > > +	/* Reset the PTE masks before the vendor module's hardware setup. */
> > >  	kvm_mmu_reset_all_pte_masks();
> > > +}
> > > +
> > > +int kvm_mmu_hardware_setup(void)
> > > +{
> > 
> > Instead of putting this code in a new function and calling it after
> > hardware_setup(), we could put it in kvm_configure_mmu().
> 
> Ya, I noted that as an alternative in the changelog but obviously opted to not
> do the allocation in kvm_configure_mmu(). 

Doh! My mistake. The idea to use kvm_configure_mmu() came to me while
reviewing patch 3 and I totally forgot about that blurb in the commit
message when I came back here to leave the suggestion.

> I view kvm_configure_mmu() as a necessary
> evil.  Ideally vendor code wouldn't call into the MMU during initialization, and
> common x86 would fully dictate the order of calls during MMU setup.  We could force
> that, but it'd require something gross like filling a struct passed into
> ops->hardware_setup(), and probably would be less robust (more likely to omit a
> "required" field).
> 
> In other words, I like the explicit kvm_mmu_hardware_setup() call from common x86,
> e.g. to show that vendor code needs to do setup before the MMU, and so that MMU
> setup isn't buried in a somewhat arbitrary location in vendor hardware setup. 

Agreed, but if we're not going to get rid of kvm_configure_mmu(), we're
stuck with vendor-specific code calling into the MMU code during
hardware setup either way.

> 
> I'm not dead set against handling this in kvm_configure_mmu() (though I'd probably
> vote to rename it to kvm_mmu_hardware_setup()) if anyone has a super strong opinion.

Your call. I'll put in a vote for using kvm_configure_mmu() and renaming
to kvm_mmu_hardware_setup().

>  
> > This will result in a larger patch diff, but it eliminates a subtle
> > and non-trivial-to-verify dependency ordering between
> 
> Verification is "trivial" in that this WARN will fire if the order is swapped:
> 
> 	if (WARN_ON_ONCE(!nr_sptes_per_pte_list))
> 		return -EIO;

Ah I missed that, that's good. Although I was thinking more from a code
readability standpoint.

> 
> > kvm_configure_mmu() and kvm_mmu_hardware_setup() and it will co-locate
> > the initialization of nr_sptes_per_pte_list and the code that uses it to
> > create pte_list_desc_cache in a single function.


* Re: [PATCH 2/4] KVM: x86/mmu: Defer "full" MMU setup until after vendor hardware_setup()
  2022-06-24 23:27 ` [PATCH 2/4] KVM: x86/mmu: Defer "full" MMU setup until after vendor hardware_setup() Sean Christopherson
  2022-06-25  0:16   ` David Matlack
@ 2022-07-12 21:56   ` Peter Xu
  2022-07-14 18:23     ` Sean Christopherson
  1 sibling, 1 reply; 14+ messages in thread
From: Peter Xu @ 2022-07-12 21:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

On Fri, Jun 24, 2022 at 11:27:33PM +0000, Sean Christopherson wrote:
> @@ -11937,6 +11932,10 @@ int kvm_arch_hardware_setup(void *opaque)
>  
>  	kvm_ops_update(ops);
>  
> +	r = kvm_mmu_hardware_setup();
> +	if (r)
> +		goto out_unsetup;
> +
>  	kvm_register_perf_callbacks(ops->handle_intel_pt_intr);
>  
>  	if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
> @@ -11960,12 +11959,18 @@ int kvm_arch_hardware_setup(void *opaque)
>  	kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
>  	kvm_init_msr_list();
>  	return 0;
> +
> +out_unsetup:
> +	static_call(kvm_x86_hardware_unsetup)();

Should this be kvm_mmu_hardware_unsetup()?  Or did I miss something?..

-- 
Peter Xu



* Re: [PATCH 3/4] KVM: x86/mmu: Shrink pte_list_desc size when KVM is using TDP
  2022-06-24 23:27 ` [PATCH 3/4] KVM: x86/mmu: Shrink pte_list_desc size when KVM is using TDP Sean Christopherson
@ 2022-07-12 22:35   ` Peter Xu
  2022-07-12 22:53     ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2022-07-12 22:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

On Fri, Jun 24, 2022 at 11:27:34PM +0000, Sean Christopherson wrote:
> Dynamically size struct pte_list_desc's array of sptes based on whether
> or not KVM is using TDP.  Commit dc1cff969101 ("KVM: X86: MMU: Tune
> PTE_LIST_EXT to be bigger") bumped the number of entries in order to
> improve performance when using shadow paging, but its analysis that the
> larger size would not affect TDP was wrong.  Consuming pte_list_desc
> objects for nested TDP is indeed rare, but _allocating_ objects is not,
> as KVM allocates 40 objects for each per-vCPU cache.  Reducing the size
> from 128 bytes to 32 bytes reduces that per-vCPU cost from 5120 bytes to
> 1280, and also provides similar savings when eager page splitting for
> nested MMUs kicks in.
> 
> The per-vCPU overhead could be further reduced by using a custom, smaller
> capacity for the per-vCPU caches, but that's more of an "and" than
> an "or" change, e.g. it wouldn't help the eager page split use case.
> 
> Set the list size to the bare minimum without completely defeating the
> purpose of an array (and because pte_list_add() assumes the array is at
> least two entries deep).  A larger size, e.g. 4, would reduce the number
> of "allocations", but those "allocations" only become allocations in
> truth if a single vCPU depletes its cache to where a topup is needed,
> i.e. if a single vCPU "allocates" 30+ lists.  Conversely, those 2 extra
> entries consume 16 bytes * 40 * nr_vcpus in the caches the instant nested
> TDP is used.
> 
> In the unlikely event that performance of aliased gfns for nested TDP
> really is (or becomes) a priority for oddball workloads, KVM could add a
> knob to let the admin tune the array size for their environment.
> 
> Note, KVM also unnecessarily tops up the per-vCPU caches even when not
> using rmaps; this can also be addressed separately.

The only possible way of using pte_list_desc when tdp=1 is when the
hypervisor tries to map the same host pages with different GPAs?

And we don't really have a real use case of that, or.. do we?

Sorry to start with asking questions, it's just that if we know that
pte_list_desc is probably not gonna be used then could we simply skip the
cache layer as a whole?  IOW, we don't make the "array size of pte list
desc" dynamic, instead we make the whole "pte list desc cache layer"
dynamic.  Is it possible?

-- 
Peter Xu



* Re: [PATCH 3/4] KVM: x86/mmu: Shrink pte_list_desc size when KVM is using TDP
  2022-07-12 22:35   ` Peter Xu
@ 2022-07-12 22:53     ` Sean Christopherson
  2022-07-13  0:24       ` Peter Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2022-07-12 22:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

On Tue, Jul 12, 2022, Peter Xu wrote:
> On Fri, Jun 24, 2022 at 11:27:34PM +0000, Sean Christopherson wrote:
> > Dynamically size struct pte_list_desc's array of sptes based on whether
> > or not KVM is using TDP.  Commit dc1cff969101 ("KVM: X86: MMU: Tune
> > PTE_LIST_EXT to be bigger") bumped the number of entries in order to
> > improve performance when using shadow paging, but its analysis that the
> > larger size would not affect TDP was wrong.  Consuming pte_list_desc
> > objects for nested TDP is indeed rare, but _allocating_ objects is not,
> > as KVM allocates 40 objects for each per-vCPU cache.  Reducing the size
> > from 128 bytes to 32 bytes reduces that per-vCPU cost from 5120 bytes to
> > 1280, and also provides similar savings when eager page splitting for
> > nested MMUs kicks in.
> > 
> > The per-vCPU overhead could be further reduced by using a custom, smaller
> > capacity for the per-vCPU caches, but that's more of an "and" than
> > an "or" change, e.g. it wouldn't help the eager page split use case.
> > 
> > Set the list size to the bare minimum without completely defeating the
> > purpose of an array (and because pte_list_add() assumes the array is at
> > least two entries deep).  A larger size, e.g. 4, would reduce the number
> > of "allocations", but those "allocations" only become allocations in
> > truth if a single vCPU depletes its cache to where a topup is needed,
> > i.e. if a single vCPU "allocates" 30+ lists.  Conversely, those 2 extra
> > entries consume 16 bytes * 40 * nr_vcpus in the caches the instant nested
> > TDP is used.
> > 
> > In the unlikely event that performance of aliased gfns for nested TDP
> > really is (or becomes) a priority for oddball workloads, KVM could add a
> > knob to let the admin tune the array size for their environment.
> > 
> > Note, KVM also unnecessarily tops up the per-vCPU caches even when not
> > using rmaps; this can also be addressed separately.
> 
> The only possible way of using pte_list_desc when tdp=1 is when the
> hypervisor tries to map the same host pages with different GPAs?

Yes, if by "host pages" you mean L1 GPAs.  It happens if the L1 VMM maps multiple
L2 GFNs to a single L1 GFN, in which case KVM's nTDP shadow MMU needs to rmap
that single L1 GFN to multiple L2 GFNs.

> And we don't really have a real use case of that, or.. do we?

QEMU does it during boot/pre-boot when BIOS remaps the flash region into the lower
1mb, i.e. aliases high GPAs to low GPAs.
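
To make that concrete, here's a toy userspace model (not the KVM code) of
the inline-vs-list rmap encoding that pte_list_desc backs; the tagged
pointer mirrors the rmap_head->val & ~1ul trick visible in the patch 3
diff, and the helper names are made up for the sketch:

/* Toy model: a head tracks a single spte inline and only switches to an
 * allocated list (the pte_list_desc case) when a second mapping aliases
 * the same gfn. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

struct desc {
	struct desc *more;
	unsigned long count;
	uint64_t *sptes[2];
};

struct rmap_head {
	uintptr_t val;	/* 0: empty; low bit clear: one spte; set: desc list */
};

static void rmap_add(struct rmap_head *head, uint64_t *spte)
{
	struct desc *d;

	if (!head->val) {			/* common case, no aliasing */
		head->val = (uintptr_t)spte;
	} else if (!(head->val & 1)) {		/* first alias: switch to a list */
		d = calloc(1, sizeof(*d));
		if (!d)
			return;
		d->sptes[0] = (uint64_t *)head->val;
		d->sptes[1] = spte;
		d->count = 2;
		head->val = (uintptr_t)d | 1;
	}
	/* chaining additional descs via ->more is omitted */
}

int main(void)
{
	uint64_t spte_a, spte_b;
	struct rmap_head head = { 0 };

	rmap_add(&head, &spte_a);
	printf("after 1 mapping:  list? %s\n", (head.val & 1) ? "yes" : "no");
	rmap_add(&head, &spte_b);		/* aliasing kicks in here */
	printf("after 2 mappings: list? %s\n", (head.val & 1) ? "yes" : "no");

	if (head.val & 1)
		free((void *)(head.val & ~(uintptr_t)1));
	return 0;
}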

> Sorry to start with asking questions, it's just that if we know that
> pte_list_desc is probably not gonna be used then could we simply skip the
> cache layer as a whole?  IOW, we don't make the "array size of pte list
> desc" dynamic, instead we make the whole "pte list desc cache layer"
> dynamic.  Is it possible?

Not really?  It's theoretically possible, but it'd require pre-checking that there
are no aliases, and to do that race-free we'd have to do it under mmu_lock, which
means having to support bailing from the page fault to top up the cache.  The memory
overhead for the cache isn't so significant that it's worth that level of complexity.


* Re: [PATCH 3/4] KVM: x86/mmu: Shrink pte_list_desc size when KVM is using TDP
  2022-07-12 22:53     ` Sean Christopherson
@ 2022-07-13  0:24       ` Peter Xu
  2022-07-14 18:43         ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2022-07-13  0:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

On Tue, Jul 12, 2022 at 10:53:48PM +0000, Sean Christopherson wrote:
> On Tue, Jul 12, 2022, Peter Xu wrote:
> > On Fri, Jun 24, 2022 at 11:27:34PM +0000, Sean Christopherson wrote:
> > > Dynamically size struct pte_list_desc's array of sptes based on whether
> > > or not KVM is using TDP.  Commit dc1cff969101 ("KVM: X86: MMU: Tune
> > > PTE_LIST_EXT to be bigger") bumped the number of entries in order to
> > > improve performance when using shadow paging, but its analysis that the
> > > larger size would not affect TDP was wrong.  Consuming pte_list_desc
> > > objects for nested TDP is indeed rare, but _allocating_ objects is not,
> > > as KVM allocates 40 objects for each per-vCPU cache.  Reducing the size
> > > from 128 bytes to 32 bytes reduces that per-vCPU cost from 5120 bytes to
> > > 1280, and also provides similar savings when eager page splitting for
> > > nested MMUs kicks in.
> > > 
> > > The per-vCPU overhead could be further reduced by using a custom, smaller
> > > capacity for the per-vCPU caches, but that's more of an "and" than
> > > an "or" change, e.g. it wouldn't help the eager page split use case.
> > > 
> > > Set the list size to the bare minimum without completely defeating the
> > > purpose of an array (and because pte_list_add() assumes the array is at
> > > least two entries deep).  A larger size, e.g. 4, would reduce the number
> > > of "allocations", but those "allocations" only become allocations in
> > > truth if a single vCPU depletes its cache to where a topup is needed,
> > > i.e. if a single vCPU "allocates" 30+ lists.  Conversely, those 2 extra
> > > entries consume 16 bytes * 40 * nr_vcpus in the caches the instant nested
> > > TDP is used.
> > > 
> > > In the unlikely event that performance of aliased gfns for nested TDP
> > > really is (or becomes) a priority for oddball workloads, KVM could add a
> > > knob to let the admin tune the array size for their environment.
> > > 
> > > Note, KVM also unnecessarily tops up the per-vCPU caches even when not
> > > using rmaps; this can also be addressed separately.
> > 
> > The only possible way of using pte_list_desc when tdp=1 is when the
> > hypervisor tries to map the same host pages with different GPAs?
> 
> Yes, if by "host pages" you mean L1 GPAs.  It happens if the L1 VMM maps multiple
> L2 GFNs to a single L1 GFN, in which case KVM's nTDP shadow MMU needs to rmap
> that single L1 GFN to multiple L2 GFNs.
> 
> > And we don't really have a real use case of that, or.. do we?
> 
> QEMU does it during boot/pre-boot when BIOS remaps the flash region into the lower
> 1mb, i.e. aliases high GPAs to low GPAs.
> 
> > Sorry to start with asking questions, it's just that if we know that
> > pte_list_desc is probably not gonna be used then could we simply skip the
> > cache layer as a whole?  IOW, we don't make the "array size of pte list
> > desc" dynamic, instead we make the whole "pte list desc cache layer"
> > dynamic.  Is it possible?
> 
> Not really?  It's theoretically possible, but it'd require pre-checking that there
> are no aliases, and to do that race-free we'd have to do it under mmu_lock, which
> means having to support bailing from the page fault to top up the cache.  The memory
> overhead for the cache isn't so significant that it's worth that level of complexity.

Ah, okay..

So the other question: I'm curious how much this extra complexity
fundamentally helps us save space.

The thing is, IIUC slub works in page-sized chunks, so a slub cache eats at
least one 4096-byte page anyway.  In our case, if 40 objects were allocated
for the 14-entry array, are you sure it'll still be 40 objects, just
smaller?  I'd have thought each object is smaller after the change, but slub
could cache more objects since the minimum slub size is 4k on x86.

I don't remember the details of the eager split work on having per-vcpu
caches, but I'm also wondering whether, if we can't drop the whole cache
layer, we could selectively use slub in this case; then we could cache much
less, assuming we'll also use much less.

Currently:

	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);

We could have the pte list desc cache layer be managed manually
(e.g. using kmalloc()?) for tdp=1; then we'd at least be in control of how
many objects we cache.  And with a limited number of objects, the wasted
memory would be much reduced too.

I think I'm fine with the current approach too, but only if it really helps
reduce the memory footprint as we expect.

Thanks,

-- 
Peter Xu



* Re: [PATCH 2/4] KVM: x86/mmu: Defer "full" MMU setup until after vendor hardware_setup()
  2022-07-12 21:56   ` Peter Xu
@ 2022-07-14 18:23     ` Sean Christopherson
  0 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2022-07-14 18:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

On Tue, Jul 12, 2022, Peter Xu wrote:
> On Fri, Jun 24, 2022 at 11:27:33PM +0000, Sean Christopherson wrote:
> > @@ -11937,6 +11932,10 @@ int kvm_arch_hardware_setup(void *opaque)
> >  
> >  	kvm_ops_update(ops);
> >  
> > +	r = kvm_mmu_hardware_setup();
> > +	if (r)
> > +		goto out_unsetup;
> > +
> >  	kvm_register_perf_callbacks(ops->handle_intel_pt_intr);
> >  
> >  	if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
> > @@ -11960,12 +11959,18 @@ int kvm_arch_hardware_setup(void *opaque)
> >  	kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
> >  	kvm_init_msr_list();
> >  	return 0;
> > +
> > +out_unsetup:
> > +	static_call(kvm_x86_hardware_unsetup)();
> 
> Should this be kvm_mmu_hardware_unsetup()?  Or did I miss something?..

There's no need for kvm_mmu_hardware_unsetup() there.  This path is taken only if
kvm_mmu_hardware_setup() fails, i.e. the common code doesn't need to unwind anything.

The vendor call is not shown in the patch diff, but it's before this as:

	r = ops->hardware_setup();
	if (r != 0)
		return r;

there are no existing error paths after that runs, which is why the vendor unsetup
call is new.


* Re: [PATCH 3/4] KVM: x86/mmu: Shrink pte_list_desc size when KVM is using TDP
  2022-07-13  0:24       ` Peter Xu
@ 2022-07-14 18:43         ` Sean Christopherson
  0 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2022-07-14 18:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

On Tue, Jul 12, 2022, Peter Xu wrote:
> On Tue, Jul 12, 2022 at 10:53:48PM +0000, Sean Christopherson wrote:
> > On Tue, Jul 12, 2022, Peter Xu wrote:
> > > On Fri, Jun 24, 2022 at 11:27:34PM +0000, Sean Christopherson wrote:
> > > Sorry to start with asking questions, it's just that if we know that
> > > pte_list_desc is probably not gonna be used then could we simply skip the
> > > cache layer as a whole?  IOW, we don't make the "array size of pte list
> > > desc" dynamic, instead we make the whole "pte list desc cache layer"
> > > dynamic.  Is it possible?
> > 
> > Not really?  It's theoretically possible, but it'd require pre-checking that there
> > are no aliases, and to do that race-free we'd have to do it under mmu_lock, which
> > means having to support bailing from the page fault to top up the cache.  The memory
> > overhead for the cache isn't so significant that it's worth that level of complexity.
> 
> Ah, okay..
> 
> So the other question: I'm curious how much this extra complexity
> fundamentally helps us save space.
> 
> The thing is, IIUC slub works in page-sized chunks, so a slub cache eats at
> least one 4096-byte page anyway.  In our case, if 40 objects were allocated
> for the 14-entry array, are you sure it'll still be 40 objects, just
> smaller?

Definitely not 100% positive.
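
For rough context (back-of-the-envelope, not a measurement): the per-vCPU
cache is capped at 40 objects regardless of their size, so the cached
objects themselves shrink from roughly 40 * 128 = 5120 bytes to
40 * 32 = 1280 bytes per vCPU; how densely SLUB packs those objects into
4k slabs is a separate question.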

> I'd have thought each object is smaller after the change, but slub could
> cache more objects since the minimum slub size is 4k on x86.


> I don't remember the details of the eager split work on having per-vcpu

The eager split logic uses a single per-VM cache, but it's large (513 entries).

> caches, but I'm also wondering whether, if we can't drop the whole cache
> layer, we could selectively use slub in this case; then we could cache much
> less, assuming we'll also use much less.
> 
> Currently:
> 
> 	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> 				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> 
> We could have the pte list desc cache layer be managed manually
> (e.g. using kmalloc()?) for tdp=1; then we'd at least be in control of how
> many objects we cache.  And with a limited number of objects, the wasted
> memory would be much reduced too.

I suspect that, without implementing something that looks an awful lot like the
kmem caches, manually handling allocations would degrade performance for shadow
paging and nested MMUs.

> I think I'm fine with the current approach too, but only if it really helps
> reduce the memory footprint as we expect.

Yeah, I'll get numbers before sending v2 (which will be quite some time at this
point).


end of thread

Thread overview: 14+ messages
2022-06-24 23:27 [PATCH 0/4] KVM: x86/mmu: pte_list_desc fix and cleanups Sean Christopherson
2022-06-24 23:27 ` [PATCH 1/4] KVM: x86/mmu: Track the number of entries in a pte_list_desc with a ulong Sean Christopherson
2022-06-24 23:27 ` [PATCH 2/4] KVM: x86/mmu: Defer "full" MMU setup until after vendor hardware_setup() Sean Christopherson
2022-06-25  0:16   ` David Matlack
2022-06-27 15:40     ` Sean Christopherson
2022-06-27 22:50       ` David Matlack
2022-07-12 21:56   ` Peter Xu
2022-07-14 18:23     ` Sean Christopherson
2022-06-24 23:27 ` [PATCH 3/4] KVM: x86/mmu: Shrink pte_list_desc size when KVM is using TDP Sean Christopherson
2022-07-12 22:35   ` Peter Xu
2022-07-12 22:53     ` Sean Christopherson
2022-07-13  0:24       ` Peter Xu
2022-07-14 18:43         ` Sean Christopherson
2022-06-24 23:27 ` [PATCH 4/4] KVM: x86/mmu: Topup pte_list_desc cache iff VM is using rmaps Sean Christopherson
