* [PATCH 0/3] KVM: MMU: performance tweaks for heavy CR0.WP users
@ 2023-01-17 20:45 Mathias Krause
  2023-01-17 20:45 ` [PATCH 1/3] KVM: x86/mmu: avoid indirect call for get_cr3 Mathias Krause
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Mathias Krause @ 2023-01-17 20:45 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Mathias Krause

This series resurrects the missing pieces of Paolo's previous attempt[1] to
avoid needless MMU root unloading. The performance gap between the TDP and
legacy MMU still exists and is especially noticeable under grsecurity, which
implements kernel W^X by toggling CR0.WP -- an operation that happens very
frequently.
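
For context, the guest-side pattern that causes all these CR0 exits looks
roughly like this (a simplified sketch, not grsecurity's actual code; ro_ptr
and val are made-up placeholders):

	static void wx_write(unsigned long *ro_ptr, unsigned long val)
	{
		/* preemption/IRQs must be off so the open window stays CPU-local */
		unsigned long cr0 = native_read_cr0();

		native_write_cr0(cr0 & ~X86_CR0_WP);	/* CR0 write #1 => VM-exit */
		*ro_ptr = val;				/* patch read-only kernel data */
		native_write_cr0(cr0);			/* CR0 write #2 => VM-exit */
	}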

Patches 1-13 and 17 of the old series were merged, but the remaining parts
unfortunately never saw a v3. I therefore picked them up, incorporated Sean's
feedback[2], and simplified the approach to explicitly handle only the case we
care about most.

Patch 1 is a v3 of [3], addressing Sean's feedback.

Patch 2 is specifically useful for grsecurity, as handle_cr() is by far
*the* top vmexit reason.

Patch 3 is the most important one, as it skips unloading the MMU roots for
CR0.WP toggling.

While patches 1 and 2 already bring small performance improvements, the big
gain comes from patch 3.

However, as the performance impact is huge (and my knowledge of KVM internals
is limited), it may well be that I missed an important aspect. But the KVM
tests ran fine, as did manual ones that explicitly poke at CR0.WP toggling
corner cases.

Please give it a look!

This series builds on top of kvm.git/queue, namely commit de60733246ff
("Merge branch 'kvm-hw-enable-refactor' into HEAD").

Thanks,
Mathias

[1] https://lore.kernel.org/kvm/20220217210340.312449-1-pbonzini@redhat.com/
[2] https://lore.kernel.org/kvm/YhATewkkO%2Fl4P9UN@google.com/
[3] https://lore.kernel.org/kvm/YhAB1d1%2FnQbx6yvk@google.com/

Mathias Krause (2):
  KVM: VMX: avoid retpoline call for control register caused exits
  KVM: x86: do not unload MMU roots when only toggling CR0.WP

Paolo Bonzini (1):
  KVM: x86/mmu: avoid indirect call for get_cr3

 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          | 38 +++++++++++++++++++++------------
 arch/x86/kvm/mmu/paging_tmpl.h  |  2 +-
 arch/x86/kvm/smm.c              |  4 ++--
 arch/x86/kvm/vmx/nested.c       |  2 +-
 arch/x86/kvm/vmx/vmx.c          |  2 ++
 arch/x86/kvm/x86.c              | 28 ++++++++++++++++--------
 7 files changed, 50 insertions(+), 28 deletions(-)

-- 
2.39.0



* [PATCH 1/3] KVM: x86/mmu: avoid indirect call for get_cr3
  2023-01-17 20:45 [PATCH 0/3] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
@ 2023-01-17 20:45 ` Mathias Krause
  2023-01-17 20:45 ` [PATCH 2/3] KVM: VMX: avoid retpoline call for control register caused exits Mathias Krause
  2023-01-17 20:45 ` [PATCH 3/3] KVM: x86: do not unload MMU roots when only toggling CR0.WP Mathias Krause
  2 siblings, 0 replies; 7+ messages in thread
From: Mathias Krause @ 2023-01-17 20:45 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Mathias Krause

From: Paolo Bonzini <pbonzini@redhat.com>

Most of the time, calls to get_guest_pgd result in calling kvm_read_cr3 (the
only exception is nested TDP).  Hardcode the default instead of going
through the get_cr3 function pointer, avoiding a retpoline call if
retpolines are enabled.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
 arch/x86/kvm/mmu/mmu.c         | 31 ++++++++++++++++++++-----------
 arch/x86/kvm/mmu/paging_tmpl.h |  2 +-
 2 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index aeb240b339f5..505768631614 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -241,6 +241,20 @@ static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
 	return regs;
 }
 
+static unsigned long get_guest_cr3(struct kvm_vcpu *vcpu)
+{
+	return kvm_read_cr3(vcpu);
+}
+
+static inline unsigned long kvm_mmu_get_guest_pgd(struct kvm_vcpu *vcpu,
+						  struct kvm_mmu *mmu)
+{
+	if (IS_ENABLED(CONFIG_RETPOLINE) && mmu->get_guest_pgd == get_guest_cr3)
+		return kvm_read_cr3(vcpu);
+
+	return mmu->get_guest_pgd(vcpu);
+}
+
 static inline bool kvm_available_flush_tlb_with_range(void)
 {
 	return kvm_x86_ops.tlb_remote_flush_with_range;
@@ -3722,7 +3736,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	int quadrant, i, r;
 	hpa_t root;
 
-	root_pgd = mmu->get_guest_pgd(vcpu);
+	root_pgd = kvm_mmu_get_guest_pgd(vcpu, mmu);
 	root_gfn = root_pgd >> PAGE_SHIFT;
 
 	if (mmu_check_root(vcpu, root_gfn))
@@ -4172,7 +4186,7 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	arch.token = alloc_apf_token(vcpu);
 	arch.gfn = gfn;
 	arch.direct_map = vcpu->arch.mmu->root_role.direct;
-	arch.cr3 = vcpu->arch.mmu->get_guest_pgd(vcpu);
+	arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);
 
 	return kvm_setup_async_pf(vcpu, cr2_or_gpa,
 				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
@@ -4191,7 +4205,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 		return;
 
 	if (!vcpu->arch.mmu->root_role.direct &&
-	      work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
+	      work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
 		return;
 
 	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
@@ -4592,11 +4606,6 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd);
 
-static unsigned long get_cr3(struct kvm_vcpu *vcpu)
-{
-	return kvm_read_cr3(vcpu);
-}
-
 static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
 			   unsigned int access)
 {
@@ -5147,7 +5156,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
 	context->page_fault = kvm_tdp_page_fault;
 	context->sync_page = nonpaging_sync_page;
 	context->invlpg = NULL;
-	context->get_guest_pgd = get_cr3;
+	context->get_guest_pgd = get_guest_cr3;
 	context->get_pdptr = kvm_pdptr_read;
 	context->inject_page_fault = kvm_inject_page_fault;
 
@@ -5297,7 +5306,7 @@ static void init_kvm_softmmu(struct kvm_vcpu *vcpu,
 
 	kvm_init_shadow_mmu(vcpu, cpu_role);
 
-	context->get_guest_pgd     = get_cr3;
+	context->get_guest_pgd     = get_guest_cr3;
 	context->get_pdptr         = kvm_pdptr_read;
 	context->inject_page_fault = kvm_inject_page_fault;
 }
@@ -5311,7 +5320,7 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu,
 		return;
 
 	g_context->cpu_role.as_u64   = new_mode.as_u64;
-	g_context->get_guest_pgd     = get_cr3;
+	g_context->get_guest_pgd     = get_guest_cr3;
 	g_context->get_pdptr         = kvm_pdptr_read;
 	g_context->inject_page_fault = kvm_inject_page_fault;
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index e5662dbd519c..78448fb84bd6 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -324,7 +324,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	trace_kvm_mmu_pagetable_walk(addr, access);
 retry_walk:
 	walker->level = mmu->cpu_role.base.level;
-	pte           = mmu->get_guest_pgd(vcpu);
+	pte           = kvm_mmu_get_guest_pgd(vcpu, mmu);
 	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
 
 #if PTTYPE == 64
-- 
2.39.0



* [PATCH 2/3] KVM: VMX: avoid retpoline call for control register caused exits
  2023-01-17 20:45 [PATCH 0/3] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
  2023-01-17 20:45 ` [PATCH 1/3] KVM: x86/mmu: avoid indirect call for get_cr3 Mathias Krause
@ 2023-01-17 20:45 ` Mathias Krause
  2023-01-17 20:45 ` [PATCH 3/3] KVM: x86: do not unload MMU roots when only toggling CR0.WP Mathias Krause
  2 siblings, 0 replies; 7+ messages in thread
From: Mathias Krause @ 2023-01-17 20:45 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Mathias Krause

Complement commit 4289d2728664 ("KVM: retpolines: x86: eliminate
retpoline from vmx.c exit handlers") and avoid a retpoline call for
control register accesses as well.

This speeds up guests that make heavy use of such accesses, like
grsecurity kernels toggling CR0.WP to implement kernel W^X.

Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
SVM may gain from a similar change as well; however, I have no AMD box to
test it on.
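
A rough idea of what the SVM-side counterpart might look like (untested;
SVM_EXIT_WRITE_CR0 and cr_interception() are my assumptions about the exit
code and handler involved, and the surrounding dispatch may differ):

	/*
	 * Sketch only: extend the CONFIG_RETPOLINE fast-path chain in
	 * svm.c's exit dispatch with the CR0 write intercept, mirroring
	 * what this patch does for VMX.
	 */
	else if (exit_code == SVM_EXIT_WRITE_CR0)
		return cr_interception(vcpu);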

 arch/x86/kvm/vmx/vmx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c788aa382611..c8198c8a9b55 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6538,6 +6538,8 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 		return handle_external_interrupt(vcpu);
 	else if (exit_reason.basic == EXIT_REASON_HLT)
 		return kvm_emulate_halt(vcpu);
+	else if (exit_reason.basic == EXIT_REASON_CR_ACCESS)
+		return handle_cr(vcpu);
 	else if (exit_reason.basic == EXIT_REASON_EPT_MISCONFIG)
 		return handle_ept_misconfig(vcpu);
 #endif
-- 
2.39.0



* [PATCH 3/3] KVM: x86: do not unload MMU roots when only toggling CR0.WP
  2023-01-17 20:45 [PATCH 0/3] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
  2023-01-17 20:45 ` [PATCH 1/3] KVM: x86/mmu: avoid indirect call for get_cr3 Mathias Krause
  2023-01-17 20:45 ` [PATCH 2/3] KVM: VMX: avoid retpoline call for control register caused exits Mathias Krause
@ 2023-01-17 20:45 ` Mathias Krause
  2023-01-17 21:29   ` Sean Christopherson
  2 siblings, 1 reply; 7+ messages in thread
From: Mathias Krause @ 2023-01-17 20:45 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Mathias Krause

There is no need to unload the MMU roots when only CR0.WP has changed --
the paging structures are still valid, only the permission bitmap needs
to be updated.

Change kvm_mmu_reset_context() to take a flag indicating whether the MMU
roots need unloading and explicitly avoid the unload if only CR0.WP was
toggled on a VMEXIT caused by a CR0 write.

This change brings a huge performance gain as the following micro-
benchmark running 'ssdd 10 50000' from rt-tests[1] on a grsecurity L1 VM
shows (runtime in seconds, lower is better):

                      legacy MMU   TDP MMU
kvm.git/queue             11.55s    13.91s
kvm.git/queue+patch        7.44s     7.94s

For legacy MMU this is ~35% faster, for TDP MMU ~43% faster.

[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git

Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          |  7 ++++---
 arch/x86/kvm/smm.c              |  4 ++--
 arch/x86/kvm/vmx/nested.c       |  2 +-
 arch/x86/kvm/x86.c              | 28 +++++++++++++++++++---------
 5 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4d2bc08794e4..e7851315ffa6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1812,7 +1812,7 @@ int kvm_mmu_init_vm(struct kvm *kvm);
 void kvm_mmu_uninit_vm(struct kvm *kvm);
 
 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu);
-void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
+void kvm_mmu_reset_context(struct kvm_vcpu *vcpu, bool unload_mmu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 				      const struct kvm_memory_slot *memslot,
 				      int start_level);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 505768631614..4022394d3a25 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5384,7 +5384,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.root_mmu.cpu_role.ext.valid = 0;
 	vcpu->arch.guest_mmu.cpu_role.ext.valid = 0;
 	vcpu->arch.nested_mmu.cpu_role.ext.valid = 0;
-	kvm_mmu_reset_context(vcpu);
+	kvm_mmu_reset_context(vcpu, true);
 
 	/*
 	 * Changing guest CPUID after KVM_RUN is forbidden, see the comment in
@@ -5393,9 +5393,10 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	KVM_BUG_ON(vcpu->arch.last_vmentry_cpu != -1, vcpu->kvm);
 }
 
-void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
+void kvm_mmu_reset_context(struct kvm_vcpu *vcpu, bool unload_mmu)
 {
-	kvm_mmu_unload(vcpu);
+	if (unload_mmu)
+		kvm_mmu_unload(vcpu);
 	kvm_init_mmu(vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
index cc43638d48a3..09f47048eb1b 100644
--- a/arch/x86/kvm/smm.c
+++ b/arch/x86/kvm/smm.c
@@ -131,7 +131,7 @@ void kvm_smm_changed(struct kvm_vcpu *vcpu, bool entering_smm)
 		vcpu->arch.pdptrs_from_userspace = false;
 	}
 
-	kvm_mmu_reset_context(vcpu);
+	kvm_mmu_reset_context(vcpu, true);
 }
 
 void process_smi(struct kvm_vcpu *vcpu)
@@ -369,7 +369,7 @@ void enter_smm(struct kvm_vcpu *vcpu)
 #endif
 
 	kvm_update_cpuid_runtime(vcpu);
-	kvm_mmu_reset_context(vcpu);
+	kvm_mmu_reset_context(vcpu, true);
 	return;
 error:
 	kvm_vm_dead(vcpu->kvm);
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 557b9c468734..14815fd6dcb1 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4648,7 +4648,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
 	if (enable_ept && is_pae_paging(vcpu))
 		ept_save_pdptrs(vcpu);
 
-	kvm_mmu_reset_context(vcpu);
+	kvm_mmu_reset_context(vcpu, true);
 
 	/*
 	 * This nasty bit of open coding is a compromise between blindly
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 508074e47bc0..d7c326ab94de 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -902,7 +902,9 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
 
 void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
 {
-	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
+	unsigned long cr0_change = cr0 ^ old_cr0;
+
+	if (cr0_change & X86_CR0_PG) {
 		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_async_pf_hash_reset(vcpu);
 
@@ -914,10 +916,18 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 	}
 
-	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
-		kvm_mmu_reset_context(vcpu);
+	if (cr0_change & KVM_MMU_CR0_ROLE_BITS) {
+		bool unload_mmu =
+			cr0_change & (KVM_MMU_CR0_ROLE_BITS & ~X86_CR0_WP);
 
-	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
+		/*
+		 * Toggling just CR0.WP doesn't invalidate page tables per se,
+		 * only the permission bits.
+		 */
+		kvm_mmu_reset_context(vcpu, unload_mmu);
+	}
+
+	if ((cr0_change & X86_CR0_CD) &&
 	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
 	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
 		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
@@ -1117,7 +1127,7 @@ static bool kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
 {
 	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
-		kvm_mmu_reset_context(vcpu);
+		kvm_mmu_reset_context(vcpu, true);
 
 	/*
 	 * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
@@ -1740,7 +1750,7 @@ static int set_efer(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	}
 
 	if ((efer ^ old_efer) & KVM_MMU_EFER_ROLE_BITS)
-		kvm_mmu_reset_context(vcpu);
+		kvm_mmu_reset_context(vcpu, true);
 
 	return 0;
 }
@@ -11410,7 +11420,7 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 		return ret;
 
 	if (mmu_reset_needed)
-		kvm_mmu_reset_context(vcpu);
+		kvm_mmu_reset_context(vcpu, true);
 
 	max_bits = KVM_NR_INTERRUPTS;
 	pending_vec = find_first_bit(
@@ -11452,7 +11462,7 @@ static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2)
 		vcpu->arch.pdptrs_from_userspace = true;
 	}
 	if (mmu_reset_needed)
-		kvm_mmu_reset_context(vcpu);
+		kvm_mmu_reset_context(vcpu, true);
 	return 0;
 }
 
@@ -11970,7 +11980,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	 */
 	if (old_cr0 & X86_CR0_PG) {
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
-		kvm_mmu_reset_context(vcpu);
+		kvm_mmu_reset_context(vcpu, true);
 	}
 
 	/*
-- 
2.39.0



* Re: [PATCH 3/3] KVM: x86: do not unload MMU roots when only toggling CR0.WP
  2023-01-17 20:45 ` [PATCH 3/3] KVM: x86: do not unload MMU roots when only toggling CR0.WP Mathias Krause
@ 2023-01-17 21:29   ` Sean Christopherson
  2023-01-18 10:17     ` Mathias Krause
  0 siblings, 1 reply; 7+ messages in thread
From: Sean Christopherson @ 2023-01-17 21:29 UTC (permalink / raw)
  To: Mathias Krause; +Cc: kvm, linux-kernel, Paolo Bonzini

On Tue, Jan 17, 2023, Mathias Krause wrote:
> There is no need to unload the MMU roots when only CR0.WP has changed --
> the paging structures are still valid, only the permission bitmap needs
> to be updated.

This doesn't hold true when KVM is using shadow paging, in which case CR0.WP
affects the shadow page tables.  I believe that also holds true for nNPT :-(

nEPT doesn't consume CR0.WP so we could expedite that case as well, though
identifying that case might be annoying.

> Change kvm_mmu_reset_context() to get passed the need for unloading MMU
> roots and explicitly avoid it if only CR0.WP was toggled on a CR0 write
> caused VMEXIT.

One thing we should explore on top of this is not intercepting CR0.WP (on Intel)
when TDP is enabled.  It could even trigger after toggling CR0.WP N times, e.g.
to optimize the grsecurity use case without negatively impacting workloads with
a static CR0.WP, as walking guest memory would require an "extra" VMREAD to get
CR0.WP in that case.

Unfortunately, AMD doesn't provide per-bit controls.

> This change brings a huge performance gain as the following micro-
> benchmark running 'ssdd 10 50000' from rt-tests[1] on a grsecurity L1 VM
> shows (runtime in seconds, lower is better):
> 
>                       legacy MMU   TDP MMU
> kvm.git/queue             11.55s    13.91s
> kvm.git/queue+patch        7.44s     7.94s
> 
> For legacy MMU this is ~35% faster, for TDP MMU ~43% faster.
> 
> [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
> 
> Signed-off-by: Mathias Krause <minipli@grsecurity.net>
> ---

...

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 508074e47bc0..d7c326ab94de 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -902,7 +902,9 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>  
>  void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>  {
> -	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
> +	unsigned long cr0_change = cr0 ^ old_cr0;
> +
> +	if (cr0_change & X86_CR0_PG) {
>  		kvm_clear_async_pf_completion_queue(vcpu);
>  		kvm_async_pf_hash_reset(vcpu);
>  
> @@ -914,10 +916,18 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>  			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>  	}
>  
> -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> -		kvm_mmu_reset_context(vcpu);
> +	if (cr0_change & KVM_MMU_CR0_ROLE_BITS) {
> +		bool unload_mmu =
> +			cr0_change & (KVM_MMU_CR0_ROLE_BITS & ~X86_CR0_WP);

As above, this needs to be guarded with a check that the MMU is direct.  And rather
than add a flag to kvm_mmu_reset_context(), just call kvm_init_mmu() directly.
E.g. I think this would work?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d07563d0e204..8f9fac6d81d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -927,6 +927,11 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
 
 void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
 {
+       if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {
+               kvm_init_mmu(vcpu);
+               return;
+       }
+
        if ((cr0 ^ old_cr0) & X86_CR0_PG) {
                kvm_clear_async_pf_completion_queue(vcpu);
                kvm_async_pf_hash_reset(vcpu);


* Re: [PATCH 3/3] KVM: x86: do not unload MMU roots when only toggling CR0.WP
  2023-01-17 21:29   ` Sean Christopherson
@ 2023-01-18 10:17     ` Mathias Krause
  2023-01-27 16:15       ` Mathias Krause
  0 siblings, 1 reply; 7+ messages in thread
From: Mathias Krause @ 2023-01-18 10:17 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, linux-kernel, Paolo Bonzini

On 17.01.23 22:29, Sean Christopherson wrote:
> On Tue, Jan 17, 2023, Mathias Krause wrote:
>> There is no need to unload the MMU roots when only CR0.WP has changed --
>> the paging structures are still valid, only the permission bitmap needs
>> to be updated.
> 
> This doesn't hold true when KVM is using shadow paging, in which case CR0.WP
> affects the shadow page tables.  I believe that also holds true for nNPT :-(

Oh, I knew there would be a case I missed. Thank you for pointing it out!

> nEPT doesn't consume CR0.WP so we could expedite that case as well, though
> identifying that case might be annoying.

I'm fine with starting by optimizing L1 only, as the performance gain for
this common case is already huge. But sure, if more is possible, I'm all
for it. It's just that I lack the knowledge about KVM internals to figure
it out all by myself.

>> Change kvm_mmu_reset_context() to get passed the need for unloading MMU
>> roots and explicitly avoid it if only CR0.WP was toggled on a CR0 write
>> caused VMEXIT.
> 
> One thing we should explore on top of this is not intercepting CR0.WP (on Intel)
> when TDP is enabled.  It could even trigger after toggling CR0.WP N times, e.g.
> to optimize the grsecurity use case without negatively impacting workloads with
> a static CR0.WP, as walking guest memory would require an "extra" VMREAD to get
> CR0.WP in that case.

That would be even better, agreed. I'll look into it and will try to
come up with something.

> Unfortunately, AMD doesn't provide per-bit controls.
> 
>> This change brings a huge performance gain as the following micro-
>> benchmark running 'ssdd 10 50000' from rt-tests[1] on a grsecurity L1 VM
>> shows (runtime in seconds, lower is better):
>>
>>                       legacy MMU   TDP MMU
>> kvm.git/queue             11.55s    13.91s
>> kvm.git/queue+patch        7.44s     7.94s
>>
>> For legacy MMU this is ~35% faster, for TDP MMU ~43% faster.
>>
>> [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
>>
>> Signed-off-by: Mathias Krause <minipli@grsecurity.net>
>> ---
> 
> ...
> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 508074e47bc0..d7c326ab94de 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -902,7 +902,9 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>>  
>>  void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>>  {
>> -	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
>> +	unsigned long cr0_change = cr0 ^ old_cr0;
>> +
>> +	if (cr0_change & X86_CR0_PG) {
>>  		kvm_clear_async_pf_completion_queue(vcpu);
>>  		kvm_async_pf_hash_reset(vcpu);
>>  
>> @@ -914,10 +916,18 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>>  			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>>  	}
>>  
>> -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
>> -		kvm_mmu_reset_context(vcpu);
>> +	if (cr0_change & KVM_MMU_CR0_ROLE_BITS) {
>> +		bool unload_mmu =
>> +			cr0_change & (KVM_MMU_CR0_ROLE_BITS & ~X86_CR0_WP);
> 
> As above, this needs to be guarded with a check that the MMU is direct.  And rather
> than add a flag to kvm_mmu_reset_context(), just call kvm_init_mmu() directly.
> E.g. I think this would work?
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d07563d0e204..8f9fac6d81d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -927,6 +927,11 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>  
>  void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>  {
> +       if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {
> +               kvm_init_mmu(vcpu);
> +               return;
> +       }
> +
>         if ((cr0 ^ old_cr0) & X86_CR0_PG) {
>                 kvm_clear_async_pf_completion_queue(vcpu);
>                 kvm_async_pf_hash_reset(vcpu);

Looks much simpler and more direct. Nice. :)

I'll re-test and send a v2 later today.

Thanks,
Mathias


* Re: [PATCH 3/3] KVM: x86: do not unload MMU roots when only toggling CR0.WP
  2023-01-18 10:17     ` Mathias Krause
@ 2023-01-27 16:15       ` Mathias Krause
  0 siblings, 0 replies; 7+ messages in thread
From: Mathias Krause @ 2023-01-27 16:15 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, linux-kernel, Paolo Bonzini

On 18.01.23 11:17, Mathias Krause wrote:
> On 17.01.23 22:29, Sean Christopherson wrote:
>> On Tue, Jan 17, 2023, Mathias Krause wrote:
>>> [...] 
>>> Change kvm_mmu_reset_context() to get passed the need for unloading MMU
>>> roots and explicitly avoid it if only CR0.WP was toggled on a CR0 write
>>> caused VMEXIT.
>>
>> One thing we should explore on top of this is not intercepting CR0.WP (on Intel)
>> when TDP is enabled.  It could even trigger after toggling CR0.WP N times, e.g.
>> to optimize the grsecurity use case without negatively impacting workloads with
>> a static CR0.WP, as walking guest memory would require an "extra" VMREAD to get
>> CR0.WP in that case.
> 
> That would be even better, agreed. I'll look into it and will try to
> come up with something.

I looked into it and we can gain quite a few more cycles from this, e.g. the
runtime of the 'ssdd 10 50000' test with the TDP MMU drops further, from
7.31s down to 4.89s. That's overall 2.8 times faster than the 13.91s we
started with. :)

I'll cook up a patch next week and send a v3 series with some more
cleanups I collected in the meantime.
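
The rough direction I have in mind looks about like this (sketch only,
untested; assuming cr0_guest_owned_bits / CR0_GUEST_HOST_MASK are still the
mechanism for per-bit CR0 ownership on VMX):

	/* vmx.c, vCPU/VMCS setup -- sketch, untested */
	vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
	if (enable_ept)		/* EPT walks don't consume CR0.WP */
		vmx->vcpu.arch.cr0_guest_owned_bits |= X86_CR0_WP;
	vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);

KVM-internal reads of guest CR0.WP would then have to go through
kvm_read_cr0_bits() to pick up the live value from the VMCS, which is the
"extra" VMREAD mentioned above.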

>> Unfortunately, AMD doesn't provide per-bit controls.

Meanwhile I got my hands on an AMD system and it benefits from this series
as well, though not as much as my Intel system does. The 'ssdd 10 50000'
test with the TDP MMU enabled goes down from 5.8s to 4.12s -- a nearly 30%
runtime reduction.

>>> This change brings a huge performance gain as the following micro-
>>> benchmark running 'ssdd 10 50000' from rt-tests[1] on a grsecurity L1 VM
>>> shows (runtime in seconds, lower is better):
>>>
>>>                       legacy MMU   TDP MMU
>>> kvm.git/queue             11.55s    13.91s
>>> kvm.git/queue+patch        7.44s     7.94s
>>>
>>> For legacy MMU this is ~35% faster, for TTP MMU ~43% faster.
>>>
>>> [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
>>>
>>> Signed-off-by: Mathias Krause <minipli@grsecurity.net>
>>> ---
>>

