* [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users
@ 2023-03-22  1:37 Mathias Krause
  2023-03-22  1:37 ` [PATCH v4 1/6] KVM: x86/mmu: Avoid indirect call for get_cr3 Mathias Krause
                   ` (7 more replies)
  0 siblings, 8 replies; 34+ messages in thread
From: Mathias Krause @ 2023-03-22  1:37 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Mathias Krause

v3: https://lore.kernel.org/kvm/20230201194604.11135-1-minipli@grsecurity.net/

This series is the fourth iteration of resurrecting the missing pieces of
Paolo's previous attempt[1] to avoid needless MMU roots unloading.

It's incorporating Sean's feedback to v3 and rebased on top of
kvm-x86/next, namely commit d8708b80fa0e ("KVM: Change return type of
kvm_arch_vm_ioctl() to "int"").

The performance gap between the TDP and legacy MMU is still present,
especially noticeable under grsecurity, which implements kernel W^X by
toggling CR0.WP very frequently. This series tries to fix this needless
performance loss.
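
For context, the CR0.WP toggling boils down to a pattern like the below
(a simplified sketch only, not actual grsecurity code), opening a
transient write window to otherwise read-only kernel memory:

    static inline unsigned long open_kernel_write_window(void)
    {
            unsigned long cr0 = read_cr0();

            /* Without this series, each write_cr0() causes a VMEXIT. */
            write_cr0(cr0 & ~X86_CR0_WP);
            return cr0;
    }

    static inline void close_kernel_write_window(unsigned long cr0)
    {
            write_cr0(cr0);
    }

As this pattern sits on hot paths, the implied VMEXITs dominate the
runtime, see the numbers below.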

Patch 1 is a v3 of [3], addressing Sean's feedback.

Patch 2 implements Sean's feedback[2] to Paolo's original approach and
skips unloading the MMU roots for CR0.WP toggling under TDP.

Patch 3 further micro-optimizes this for non-paging guests -- anyone still
running MS Singularity? ;)

Sean suggested another change on top of v2 of this series (and then a
more elaborate version on top of v3): skipping the interception of CR0.WP
writes completely for VMX[4]. That turned out to be yet another
performance boost and is implemented in patch 6.

Patches 2 and 6 are the most important ones, as they bring the big
performance gains.

I used 'ssdd 10 50000' from rt-tests[5] as a micro-benchmark, running on a
grsecurity L1 VM. Below table shows the results (runtime in seconds, lower
is better):

                              legacy     TDP    shadow
    kvm-x86/next@d8708b        8.43s    9.45s    70.3s
    + patches 1-3              5.39s    5.63s    70.2s
    + patches 4-6              3.51s    3.47s    67.8s

I re-ran the benchmarks (again) and the numbers changed a little from the
ones in v3. We got a better baseline, which is likely caused by the rebase
to a v6.3-rc2 based tree (it was v6.2-rc3 based before).

Patches 1, 4 and 5 didn't change from v3, aside from minor changelog tweaks.

Patches 2 and 6 have been rewritten.

Patch 3 is new to this series.

Bonus rant^Wbug report:

Actually re-running the benchmarks took me a while because my VM was
constantly crashing on me with a #GP during scheduling. Looking a little
closer, I noticed it was for an IA32_PRED_CMD MSR write, which was puzzling,
as the VM's kernel didn't change for my tests (I built it more than a month
ago!), so the old test runs should have triggered that code path (and #GP)
too! Digging through the kernel code showed that it's all tied to the x86
feature flag X86_FEATURE_USE_IBPB, which gets set when X86_FEATURE_IBPB is,
i.e. when the CPU supports IBPB.

*head-scratching pause* 

Foolish me probably did a system update of the host and got a microcode
update that added IBPB support to my CPU. Yayyy... NOT! As this implies
announcing IBPB support to the VM as well and, in turn, makes the guest
kernel try to use it, I'm doomed to hit that bug. *bummer*

Something must not be right with KVM / QEMU / the kernel, as this guest
behaviour clearly shouldn't cause KVM to inject a #GP into the guest.

The bugging call chain in the guest kernel is:
  -> switch_mm_irqs_off()
     -> cond_mitigation()
        -> indirect_branch_prediction_barrier()
           -> alternative_msr_write(MSR_IA32_PRED_CMD, val, X86_FEATURE_USE_IBPB))
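
The WRMSR in alternative_msr_write() only gets patched in via the
alternatives mechanism once X86_FEATURE_USE_IBPB is set, which is why the
old test runs never executed it. Paraphrasing the helpers from
arch/x86/include/asm/nospec-branch.h (from memory, so details may be
slightly off):

    static inline void
    alternative_msr_write(unsigned int msr, u64 val, unsigned int feature)
    {
            /* The "wrmsr" below is a NOP unless 'feature' is set. */
            asm volatile(ALTERNATIVE("", "wrmsr", %c[feature])
                    : : "c" (msr),
                        "a" ((u32)val),
                        "d" ((u32)val >> 32),
                        [feature] "i" (feature)
                    : "memory");
    }

    static inline void indirect_branch_prediction_barrier(void)
    {
            u64 val = PRED_CMD_IBPB;

            alternative_msr_write(MSR_IA32_PRED_CMD, val,
                                  X86_FEATURE_USE_IBPB);
    }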

So far I'm working around this by passing 'clearcpuid=ibpb' on the kernel
command line. But this should probably be fixed, which is why I'm mentioning
it. It's just too late here already to debug this further today. Well, for
some definition of "today."


Thanks,
Mathias

[1] https://lore.kernel.org/kvm/20220217210340.312449-1-pbonzini@redhat.com/
[2] https://lore.kernel.org/kvm/YhATewkkO%2Fl4P9UN@google.com/
[3] https://lore.kernel.org/kvm/YhAB1d1%2FnQbx6yvk@google.com/
[4] https://lore.kernel.org/kvm/Y8cTMnyBzNdO5dY3@google.com/
[5] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git


Mathias Krause (5):
  KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP
    enabled
  KVM: x86: Ignore CR0.WP toggles in non-paging mode
  KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
  KVM: x86/mmu: Fix comment typo
  KVM: VMX: Make CR0.WP a guest owned bit

Paolo Bonzini (1):
  KVM: x86/mmu: Avoid indirect call for get_cr3

 arch/x86/kvm/kvm_cache_regs.h  |  2 +-
 arch/x86/kvm/mmu/mmu.c         | 31 ++++++++++++++++++++-----------
 arch/x86/kvm/mmu/paging_tmpl.h |  2 +-
 arch/x86/kvm/mmu/spte.c        |  2 +-
 arch/x86/kvm/pmu.c             |  4 ++--
 arch/x86/kvm/vmx/nested.c      |  4 ++--
 arch/x86/kvm/vmx/vmx.c         |  6 +++---
 arch/x86/kvm/vmx/vmx.h         | 18 ++++++++++++++++++
 arch/x86/kvm/x86.c             | 18 ++++++++++++++++++
 9 files changed, 66 insertions(+), 21 deletions(-)

-- 
2.39.2



* [PATCH v4 1/6] KVM: x86/mmu: Avoid indirect call for get_cr3
  2023-03-22  1:37 [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
@ 2023-03-22  1:37 ` Mathias Krause
  2023-03-22  1:37 ` [PATCH v4 2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled Mathias Krause
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Mathias Krause @ 2023-03-22  1:37 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Mathias Krause

From: Paolo Bonzini <pbonzini@redhat.com>

Most of the time, calls to get_guest_pgd result in calling
kvm_read_cr3 (the only exception is nested TDP).  Hardcode
the default instead of using the get_cr3 function, avoiding
a retpoline if retpolines are enabled.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
 arch/x86/kvm/mmu/mmu.c         | 31 ++++++++++++++++++++-----------
 arch/x86/kvm/mmu/paging_tmpl.h |  2 +-
 2 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 144c5a01cd77..9046a892998e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -242,6 +242,20 @@ static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
 	return regs;
 }
 
+static unsigned long get_guest_cr3(struct kvm_vcpu *vcpu)
+{
+	return kvm_read_cr3(vcpu);
+}
+
+static inline unsigned long kvm_mmu_get_guest_pgd(struct kvm_vcpu *vcpu,
+						  struct kvm_mmu *mmu)
+{
+	if (IS_ENABLED(CONFIG_RETPOLINE) && mmu->get_guest_pgd == get_guest_cr3)
+		return kvm_read_cr3(vcpu);
+
+	return mmu->get_guest_pgd(vcpu);
+}
+
 static inline bool kvm_available_flush_tlb_with_range(void)
 {
 	return kvm_x86_ops.tlb_remote_flush_with_range;
@@ -3731,7 +3745,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	int quadrant, i, r;
 	hpa_t root;
 
-	root_pgd = mmu->get_guest_pgd(vcpu);
+	root_pgd = kvm_mmu_get_guest_pgd(vcpu, mmu);
 	root_gfn = root_pgd >> PAGE_SHIFT;
 
 	if (mmu_check_root(vcpu, root_gfn))
@@ -4181,7 +4195,7 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	arch.token = alloc_apf_token(vcpu);
 	arch.gfn = gfn;
 	arch.direct_map = vcpu->arch.mmu->root_role.direct;
-	arch.cr3 = vcpu->arch.mmu->get_guest_pgd(vcpu);
+	arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);
 
 	return kvm_setup_async_pf(vcpu, cr2_or_gpa,
 				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
@@ -4200,7 +4214,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 		return;
 
 	if (!vcpu->arch.mmu->root_role.direct &&
-	      work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
+	      work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
 		return;
 
 	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
@@ -4604,11 +4618,6 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd);
 
-static unsigned long get_cr3(struct kvm_vcpu *vcpu)
-{
-	return kvm_read_cr3(vcpu);
-}
-
 static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
 			   unsigned int access)
 {
@@ -5159,7 +5168,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
 	context->page_fault = kvm_tdp_page_fault;
 	context->sync_page = nonpaging_sync_page;
 	context->invlpg = NULL;
-	context->get_guest_pgd = get_cr3;
+	context->get_guest_pgd = get_guest_cr3;
 	context->get_pdptr = kvm_pdptr_read;
 	context->inject_page_fault = kvm_inject_page_fault;
 
@@ -5309,7 +5318,7 @@ static void init_kvm_softmmu(struct kvm_vcpu *vcpu,
 
 	kvm_init_shadow_mmu(vcpu, cpu_role);
 
-	context->get_guest_pgd     = get_cr3;
+	context->get_guest_pgd     = get_guest_cr3;
 	context->get_pdptr         = kvm_pdptr_read;
 	context->inject_page_fault = kvm_inject_page_fault;
 }
@@ -5323,7 +5332,7 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu,
 		return;
 
 	g_context->cpu_role.as_u64   = new_mode.as_u64;
-	g_context->get_guest_pgd     = get_cr3;
+	g_context->get_guest_pgd     = get_guest_cr3;
 	g_context->get_pdptr         = kvm_pdptr_read;
 	g_context->inject_page_fault = kvm_inject_page_fault;
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index a056f2773dd9..8417ecbc3887 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -324,7 +324,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	trace_kvm_mmu_pagetable_walk(addr, access);
 retry_walk:
 	walker->level = mmu->cpu_role.base.level;
-	pte           = mmu->get_guest_pgd(vcpu);
+	pte           = kvm_mmu_get_guest_pgd(vcpu, mmu);
 	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
 
 #if PTTYPE == 64
-- 
2.39.2



* [PATCH v4 2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
  2023-03-22  1:37 [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
  2023-03-22  1:37 ` [PATCH v4 1/6] KVM: x86/mmu: Avoid indirect call for get_cr3 Mathias Krause
@ 2023-03-22  1:37 ` Mathias Krause
  2023-05-07  7:32   ` Robert Hoo
  2023-03-22  1:37 ` [PATCH v4 3/6] KVM: x86: Ignore CR0.WP toggles in non-paging mode Mathias Krause
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Mathias Krause @ 2023-03-22  1:37 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Mathias Krause

There is no need to unload the MMU roots with TDP enabled when only
CR0.WP has changed -- the paging structures are still valid, only the
permission bitmap needs to be updated.

One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
implement kernel W^X.

The optimization brings a huge performance gain for this case as the
following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
grsecurity L1 VM shows (runtime in seconds, lower is better):

                       legacy     TDP    shadow
kvm-x86/next@d8708b     8.43s    9.45s    70.3s
             +patch     5.39s    5.63s    70.2s

For the legacy MMU this is ~36% faster, for the TDP MMU even ~40% faster.
TDP and legacy MMU now also have a similar runtime, which removes the
need to disable the TDP MMU for grsecurity.

Shadow MMU sees no measurable difference and is still slow, as expected.

[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
 arch/x86/kvm/x86.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 237c483b1230..c6d909778b2c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -906,6 +906,18 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
 
 void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
 {
+	/*
+	 * CR0.WP is incorporated into the MMU role, but only for non-nested,
+	 * indirect shadow MMUs.  If TDP is enabled, the MMU's metadata needs
+	 * to be updated, e.g. so that emulating guest translations does the
+	 * right thing, but there's no need to unload the root as CR0.WP
+	 * doesn't affect SPTEs.
+	 */
+	if (tdp_enabled && (cr0 ^ old_cr0) == X86_CR0_WP) {
+		kvm_init_mmu(vcpu);
+		return;
+	}
+
 	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
 		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_async_pf_hash_reset(vcpu);
-- 
2.39.2



* [PATCH v4 3/6] KVM: x86: Ignore CR0.WP toggles in non-paging mode
  2023-03-22  1:37 [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
  2023-03-22  1:37 ` [PATCH v4 1/6] KVM: x86/mmu: Avoid indirect call for get_cr3 Mathias Krause
  2023-03-22  1:37 ` [PATCH v4 2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled Mathias Krause
@ 2023-03-22  1:37 ` Mathias Krause
  2023-03-22  1:37 ` [PATCH v4 4/6] KVM: x86: Make use of kvm_read_cr*_bits() when testing bits Mathias Krause
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Mathias Krause @ 2023-03-22  1:37 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Mathias Krause

If paging is disabled, there are no permission bits to emulate.
Micro-optimize this case to avoid unnecessary work.

Suggested-and-co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
 arch/x86/kvm/x86.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c6d909778b2c..8a66ac7a4878 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -908,14 +908,20 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 {
 	/*
 	 * CR0.WP is incorporated into the MMU role, but only for non-nested,
-	 * indirect shadow MMUs.  If TDP is enabled, the MMU's metadata needs
-	 * to be updated, e.g. so that emulating guest translations does the
-	 * right thing, but there's no need to unload the root as CR0.WP
-	 * doesn't affect SPTEs.
+	 * indirect shadow MMUs.  If paging is disabled, no updates are needed
+	 * as there are no permission bits to emulate.  If TDP is enabled, the
+	 * MMU's metadata needs to be updated, e.g. so that emulating guest
+	 * translations does the right thing, but there's no need to unload the
+	 * root as CR0.WP doesn't affect SPTEs.
 	 */
-	if (tdp_enabled && (cr0 ^ old_cr0) == X86_CR0_WP) {
-		kvm_init_mmu(vcpu);
-		return;
+	if ((cr0 ^ old_cr0) == X86_CR0_WP) {
+		if (!(cr0 & X86_CR0_PG))
+			return;
+
+		if (tdp_enabled) {
+			kvm_init_mmu(vcpu);
+			return;
+		}
 	}
 
 	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
-- 
2.39.2



* [PATCH v4 4/6] KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
  2023-03-22  1:37 [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
                   ` (2 preceding siblings ...)
  2023-03-22  1:37 ` [PATCH v4 3/6] KVM: x86: Ignore CR0.WP toggles in non-paging mode Mathias Krause
@ 2023-03-22  1:37 ` Mathias Krause
  2023-03-22  1:37 ` [PATCH v4 5/6] KVM: x86/mmu: Fix comment typo Mathias Krause
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Mathias Krause @ 2023-03-22  1:37 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Mathias Krause

Make use of the kvm_read_cr{0,4}_bits() helper functions when we only
want to know the state of certain bits instead of the whole register.

This not only makes the intent clearer, it also avoids a potential
VMREAD in case the tested bits aren't guest-owned.
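
For reference, the CR0 variant reads roughly like this (paraphrased from
arch/x86/kvm/kvm_cache_regs.h; the CR4 one is analogous). The VMREAD via
the cache_reg() hook only happens if one of the tested bits is actually
guest owned and not cached yet:

    static inline ulong kvm_read_cr0_bits(struct kvm_vcpu *vcpu, ulong mask)
    {
            ulong tmask = mask & KVM_POSSIBLE_CR0_GUEST_BITS;

            if ((tmask & vcpu->arch.cr0_guest_owned_bits) &&
                !kvm_register_is_available(vcpu, VCPU_EXREG_CR0))
                    static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_CR0);

            return vcpu->arch.cr0 & mask;
    }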

Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
 arch/x86/kvm/pmu.c     | 4 ++--
 arch/x86/kvm/vmx/vmx.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 612e6c70ce2e..f4aa170b5b97 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -540,9 +540,9 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
 	if (!pmc)
 		return 1;
 
-	if (!(kvm_read_cr4(vcpu) & X86_CR4_PCE) &&
+	if (!(kvm_read_cr4_bits(vcpu, X86_CR4_PCE)) &&
 	    (static_call(kvm_x86_get_cpl)(vcpu) != 0) &&
-	    (kvm_read_cr0(vcpu) & X86_CR0_PE))
+	    (kvm_read_cr0_bits(vcpu, X86_CR0_PE)))
 		return 1;
 
 	*data = pmc_read_counter(pmc) & mask;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d7bf14abdba1..8fc1a0c7856f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5517,7 +5517,7 @@ static int handle_cr(struct kvm_vcpu *vcpu)
 		break;
 	case 3: /* lmsw */
 		val = (exit_qualification >> LMSW_SOURCE_DATA_SHIFT) & 0x0f;
-		trace_kvm_cr_write(0, (kvm_read_cr0(vcpu) & ~0xful) | val);
+		trace_kvm_cr_write(0, (kvm_read_cr0_bits(vcpu, ~0xful) | val));
 		kvm_lmsw(vcpu, val);
 
 		return kvm_skip_emulated_instruction(vcpu);
@@ -7575,7 +7575,7 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 
-	if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
+	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
 		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
 			cache = MTRR_TYPE_WRBACK;
 		else
-- 
2.39.2



* [PATCH v4 5/6] KVM: x86/mmu: Fix comment typo
  2023-03-22  1:37 [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
                   ` (3 preceding siblings ...)
  2023-03-22  1:37 ` [PATCH v4 4/6] KVM: x86: Make use of kvm_read_cr*_bits() when testing bits Mathias Krause
@ 2023-03-22  1:37 ` Mathias Krause
  2023-03-22  1:37 ` [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit Mathias Krause
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Mathias Krause @ 2023-03-22  1:37 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Mathias Krause

Fix a small comment typo in make_spte().

Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
 arch/x86/kvm/mmu/spte.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index c15bfca3ed15..cf2c6426a6fc 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -164,7 +164,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	/*
 	 * For simplicity, enforce the NX huge page mitigation even if not
 	 * strictly necessary.  KVM could ignore the mitigation if paging is
-	 * disabled in the guest, as the guest doesn't have an page tables to
+	 * disabled in the guest, as the guest doesn't have any page tables to
 	 * abuse.  But to safely ignore the mitigation, KVM would have to
 	 * ensure a new MMU is loaded (or all shadow pages zapped) when CR0.PG
 	 * is toggled on, and that's a net negative for performance when TDP is
-- 
2.39.2



* [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit
  2023-03-22  1:37 [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
                   ` (4 preceding siblings ...)
  2023-03-22  1:37 ` [PATCH v4 5/6] KVM: x86/mmu: Fix comment typo Mathias Krause
@ 2023-03-22  1:37 ` Mathias Krause
  2023-03-27  8:33   ` Xiaoyao Li
  2023-03-30  8:45   ` Mathias Krause
  2023-03-22  7:41 ` [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
  2023-03-23 22:50 ` Sean Christopherson
  7 siblings, 2 replies; 34+ messages in thread
From: Mathias Krause @ 2023-03-22  1:37 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini, Mathias Krause

Guests like grsecurity that make heavy use of CR0.WP to implement kernel
level W^X will suffer from the implied VMEXITs.

With EPT there is no need to intercept a guest change of CR0.WP, so
simply make it a guest owned bit if we can do so.

This implies that a read of a guest's CR0.WP bit might need a VMREAD.
However, the only potentially affected user seems to be kvm_init_mmu()
which is a heavy operation to begin with. But also most callers already
cache the full value of CR0 anyway, so no additional VMREAD is needed.
The only exception is nested_vmx_load_cr3().

This change is VMX-specific, as SVM has no such fine grained control
register intercept control.

Suggested-and-co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
 arch/x86/kvm/kvm_cache_regs.h |  2 +-
 arch/x86/kvm/vmx/nested.c     |  4 ++--
 arch/x86/kvm/vmx/vmx.c        |  2 +-
 arch/x86/kvm/vmx/vmx.h        | 18 ++++++++++++++++++
 4 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 4c91f626c058..e50d353b5c1c 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -4,7 +4,7 @@
 
 #include <linux/kvm_host.h>
 
-#define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS
+#define KVM_POSSIBLE_CR0_GUEST_BITS	(X86_CR0_TS | X86_CR0_WP)
 #define KVM_POSSIBLE_CR4_GUEST_BITS				  \
 	(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR  \
 	 | X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index f63b28f46a71..61d940fc91ba 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4481,7 +4481,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 	 * CR0_GUEST_HOST_MASK is already set in the original vmcs01
 	 * (KVM doesn't change it);
 	 */
-	vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+	vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
 	vmx_set_cr0(vcpu, vmcs12->host_cr0);
 
 	/* Same as above - no reason to call set_cr4_guest_host_mask().  */
@@ -4632,7 +4632,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
 	 */
 	vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
 
-	vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+	vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
 	vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
 
 	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8fc1a0c7856f..e501f6864a72 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4790,7 +4790,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 	/* 22.2.1, 20.8.1 */
 	vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
 
-	vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+	vmx->vcpu.arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);
 
 	set_cr4_guest_host_mask(vmx);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 2acdc54bc34b..423e9d3c9c40 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -640,6 +640,24 @@ BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
 				(1 << VCPU_EXREG_EXIT_INFO_1) | \
 				(1 << VCPU_EXREG_EXIT_INFO_2))
 
+static inline unsigned long vmx_l1_guest_owned_cr0_bits(void)
+{
+	unsigned long bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+
+	/*
+	 * CR0.WP needs to be intercepted when KVM is shadowing legacy paging
+	 * in order to construct shadow PTEs with the correct protections.
+	 * Note!  CR0.WP technically can be passed through to the guest if
+	 * paging is disabled, but checking CR0.PG would generate a cyclical
+	 * dependency of sorts due to forcing the caller to ensure CR0 holds
+	 * the correct value prior to determining which CR0 bits can be owned
+	 * by L1.  Keep it simple and limit the optimization to EPT.
+	 */
+	if (!enable_ept)
+		bits &= ~X86_CR0_WP;
+	return bits;
+}
+
 static __always_inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
 {
 	return container_of(kvm, struct kvm_vmx, kvm);
-- 
2.39.2



* Re: [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users
  2023-03-22  1:37 [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
                   ` (5 preceding siblings ...)
  2023-03-22  1:37 ` [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit Mathias Krause
@ 2023-03-22  7:41 ` Mathias Krause
  2023-03-23 22:50 ` Sean Christopherson
  7 siblings, 0 replies; 34+ messages in thread
From: Mathias Krause @ 2023-03-22  7:41 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini

On 22.03.23 02:37, Mathias Krause wrote:
> Bonus rant^Wbug report:
> [VMs #GP'ing on WRMSR(IA32_PRED_CMD,...)]

I see that this is already a well-known bug, fully analyzed and taken
care of:
https://lore.kernel.org/kvm/20230322011440.2195485-1-seanjc@google.com/

Thanks,
Mathias


* Re: [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users
  2023-03-22  1:37 [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
                   ` (6 preceding siblings ...)
  2023-03-22  7:41 ` [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
@ 2023-03-23 22:50 ` Sean Christopherson
  2023-03-25 11:39   ` Mathias Krause
  7 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2023-03-23 22:50 UTC (permalink / raw)
  To: Sean Christopherson, kvm, Mathias Krause; +Cc: linux-kernel, Paolo Bonzini

On Wed, 22 Mar 2023 02:37:25 +0100, Mathias Krause wrote:
> v3: https://lore.kernel.org/kvm/20230201194604.11135-1-minipli@grsecurity.net/
> 
> This series is the fourth iteration of resurrecting the missing pieces of
> Paolo's previous attempt[1] to avoid needless MMU roots unloading.
> 
> It's incorporating Sean's feedback to v3 and rebased on top of
> kvm-x86/next, namely commit d8708b80fa0e ("KVM: Change return type of
> kvm_arch_vm_ioctl() to "int"").
> 
> [...]

Applied 1 and 5 to kvm-x86 mmu, and the rest to kvm-x86 misc, thanks!

[1/6] KVM: x86/mmu: Avoid indirect call for get_cr3
      https://github.com/kvm-x86/linux/commit/2fdcc1b32418
[2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
      https://github.com/kvm-x86/linux/commit/01b31714bd90
[3/6] KVM: x86: Ignore CR0.WP toggles in non-paging mode
      https://github.com/kvm-x86/linux/commit/e40bcf9f3a18
[4/6] KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
      https://github.com/kvm-x86/linux/commit/74cdc836919b
[5/6] KVM: x86/mmu: Fix comment typo
      https://github.com/kvm-x86/linux/commit/50f13998451e
[6/6] KVM: VMX: Make CR0.WP a guest owned bit
      https://github.com/kvm-x86/linux/commit/fb509f76acc8

--
https://github.com/kvm-x86/linux/tree/next
https://github.com/kvm-x86/linux/tree/fixes


* Re: [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users
  2023-03-23 22:50 ` Sean Christopherson
@ 2023-03-25 11:39   ` Mathias Krause
  2023-03-25 12:25     ` Greg KH
  0 siblings, 1 reply; 34+ messages in thread
From: Mathias Krause @ 2023-03-25 11:39 UTC (permalink / raw)
  To: Sean Christopherson, Greg KH; +Cc: kvm, Paolo Bonzini, stable

On 23.03.23 23:50, Sean Christopherson wrote:
> On Wed, 22 Mar 2023 02:37:25 +0100, Mathias Krause wrote:
>> v3: https://lore.kernel.org/kvm/20230201194604.11135-1-minipli@grsecurity.net/
>>
>> This series is the fourth iteration of resurrecting the missing pieces of
>> Paolo's previous attempt[1] to avoid needless MMU roots unloading.
>>
>> It's incorporating Sean's feedback to v3 and rebased on top of
>> kvm-x86/next, namely commit d8708b80fa0e ("KVM: Change return type of
>> kvm_arch_vm_ioctl() to "int"").
>>
>> [...]
> 
> Applied 1 and 5 to kvm-x86 mmu, and the rest to kvm-x86 misc, thanks!
> 
> [1/6] KVM: x86/mmu: Avoid indirect call for get_cr3
>       https://github.com/kvm-x86/linux/commit/2fdcc1b32418
> [2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
>       https://github.com/kvm-x86/linux/commit/01b31714bd90
> [3/6] KVM: x86: Ignore CR0.WP toggles in non-paging mode
>       https://github.com/kvm-x86/linux/commit/e40bcf9f3a18
> [4/6] KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
>       https://github.com/kvm-x86/linux/commit/74cdc836919b
> [5/6] KVM: x86/mmu: Fix comment typo
>       https://github.com/kvm-x86/linux/commit/50f13998451e
> [6/6] KVM: VMX: Make CR0.WP a guest owned bit
>       https://github.com/kvm-x86/linux/commit/fb509f76acc8

Thanks a lot, Sean!

As this is a huge performance fix for us, we'd like to get it integrated
into current stable kernels as well -- not without having the changes
get some wider testing first, of course, i.e. not before they end up in a
non-rc version released by Linus. But I already did a backport to 5.4 to
get a feeling for how hard it would be and for the impact it has on older
kernels.

Using the 'ssdd 10 50000' test I used before, I get promising results
there as well. Without the patches it takes 9.31s, while with them we're
down to 4.64s. Taking into account that this cuts the runtime of a
workload in a VM in half, I hope this qualifies as stable material.

Greg, what's your opinion on it? Original series here:
https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/

Thanks,
Mathias


* Re: [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users
  2023-03-25 11:39   ` Mathias Krause
@ 2023-03-25 12:25     ` Greg KH
  2023-04-06  2:25       ` Sean Christopherson
  0 siblings, 1 reply; 34+ messages in thread
From: Greg KH @ 2023-03-25 12:25 UTC (permalink / raw)
  To: Mathias Krause; +Cc: Sean Christopherson, kvm, Paolo Bonzini, stable

On Sat, Mar 25, 2023 at 12:39:59PM +0100, Mathias Krause wrote:
> On 23.03.23 23:50, Sean Christopherson wrote:
> > On Wed, 22 Mar 2023 02:37:25 +0100, Mathias Krause wrote:
> >> v3: https://lore.kernel.org/kvm/20230201194604.11135-1-minipli@grsecurity.net/
> >>
> >> This series is the fourth iteration of resurrecting the missing pieces of
> >> Paolo's previous attempt[1] to avoid needless MMU roots unloading.
> >>
> >> It's incorporating Sean's feedback to v3 and rebased on top of
> >> kvm-x86/next, namely commit d8708b80fa0e ("KVM: Change return type of
> >> kvm_arch_vm_ioctl() to "int"").
> >>
> >> [...]
> > 
> > Applied 1 and 5 to kvm-x86 mmu, and the rest to kvm-x86 misc, thanks!
> > 
> > [1/6] KVM: x86/mmu: Avoid indirect call for get_cr3
> >       https://github.com/kvm-x86/linux/commit/2fdcc1b32418
> > [2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
> >       https://github.com/kvm-x86/linux/commit/01b31714bd90
> > [3/6] KVM: x86: Ignore CR0.WP toggles in non-paging mode
> >       https://github.com/kvm-x86/linux/commit/e40bcf9f3a18
> > [4/6] KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
> >       https://github.com/kvm-x86/linux/commit/74cdc836919b
> > [5/6] KVM: x86/mmu: Fix comment typo
> >       https://github.com/kvm-x86/linux/commit/50f13998451e
> > [6/6] KVM: VMX: Make CR0.WP a guest owned bit
> >       https://github.com/kvm-x86/linux/commit/fb509f76acc8
> 
> Thanks a lot, Sean!
> 
> As this is a huge performance fix for us, we'd like to get it integrated
> into current stable kernels as well -- not without having the changes
> get some wider testing first, of course, i.e. not before they end up in a
> non-rc version released by Linus. But I already did a backport to 5.4 to
> get a feeling for how hard it would be and for the impact it has on older
> kernels.
> 
> Using the 'ssdd 10 50000' test I used before, I get promising results
> there as well. Without the patches it takes 9.31s, while with them we're
> down to 4.64s. Taking into account that this cuts the runtime of a
> workload in a VM in half, I hope this qualifies as stable material.
> 
> Greg, what's your opinion on it? Original series here:
> https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/

I'll leave the judgement call up to the KVM maintainers, as they are the
ones that need to ack any KVM patch added to stable trees.

thanks,

greg k-h


* Re: [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit
  2023-03-22  1:37 ` [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit Mathias Krause
@ 2023-03-27  8:33   ` Xiaoyao Li
  2023-03-27  8:37     ` Mathias Krause
  2023-03-30  8:45   ` Mathias Krause
  1 sibling, 1 reply; 34+ messages in thread
From: Xiaoyao Li @ 2023-03-27  8:33 UTC (permalink / raw)
  To: Mathias Krause, kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini

On 3/22/2023 9:37 AM, Mathias Krause wrote:
> Guests like grsecurity that make heavy use of CR0.WP to implement kernel
> level W^X will suffer from the implied VMEXITs.
> 
> With EPT there is no need to intercept a guest change of CR0.WP, so
> simply make it a guest owned bit if we can do so.

I'm interested in the performance gain. Do you have data like Patch 2?

> This implies that a read of a guest's CR0.WP bit might need a VMREAD.
> However, the only potentially affected user seems to be kvm_init_mmu()
> which is a heavy operation to begin with. But also most callers already
> cache the full value of CR0 anyway, so no additional VMREAD is needed.
> The only exception is nested_vmx_load_cr3().
> 
> This change is VMX-specific, as SVM has no such fine grained control
> register intercept control.
> 
> Suggested-and-co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Mathias Krause <minipli@grsecurity.net>
> ---
>   arch/x86/kvm/kvm_cache_regs.h |  2 +-
>   arch/x86/kvm/vmx/nested.c     |  4 ++--
>   arch/x86/kvm/vmx/vmx.c        |  2 +-
>   arch/x86/kvm/vmx/vmx.h        | 18 ++++++++++++++++++
>   4 files changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> index 4c91f626c058..e50d353b5c1c 100644
> --- a/arch/x86/kvm/kvm_cache_regs.h
> +++ b/arch/x86/kvm/kvm_cache_regs.h
> @@ -4,7 +4,7 @@
>   
>   #include <linux/kvm_host.h>
>   
> -#define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS
> +#define KVM_POSSIBLE_CR0_GUEST_BITS	(X86_CR0_TS | X86_CR0_WP)
>   #define KVM_POSSIBLE_CR4_GUEST_BITS				  \
>   	(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR  \
>   	 | X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE)
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index f63b28f46a71..61d940fc91ba 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4481,7 +4481,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>   	 * CR0_GUEST_HOST_MASK is already set in the original vmcs01
>   	 * (KVM doesn't change it);
>   	 */
> -	vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> +	vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
>   	vmx_set_cr0(vcpu, vmcs12->host_cr0);
>   
>   	/* Same as above - no reason to call set_cr4_guest_host_mask().  */
> @@ -4632,7 +4632,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
>   	 */
>   	vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
>   
> -	vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> +	vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
>   	vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
>   
>   	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 8fc1a0c7856f..e501f6864a72 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4790,7 +4790,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>   	/* 22.2.1, 20.8.1 */
>   	vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
>   
> -	vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> +	vmx->vcpu.arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
>   	vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);
>   
>   	set_cr4_guest_host_mask(vmx);
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 2acdc54bc34b..423e9d3c9c40 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -640,6 +640,24 @@ BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
>   				(1 << VCPU_EXREG_EXIT_INFO_1) | \
>   				(1 << VCPU_EXREG_EXIT_INFO_2))
>   
> +static inline unsigned long vmx_l1_guest_owned_cr0_bits(void)
> +{
> +	unsigned long bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> +
> +	/*
> +	 * CR0.WP needs to be intercepted when KVM is shadowing legacy paging
> +	 * in order to construct shadow PTEs with the correct protections.
> +	 * Note!  CR0.WP technically can be passed through to the guest if
> +	 * paging is disabled, but checking CR0.PG would generate a cyclical
> +	 * dependency of sorts due to forcing the caller to ensure CR0 holds
> +	 * the correct value prior to determining which CR0 bits can be owned
> +	 * by L1.  Keep it simple and limit the optimization to EPT.
> +	 */
> +	if (!enable_ept)
> +		bits &= ~X86_CR0_WP;
> +	return bits;
> +}
> +
>   static __always_inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
>   {
>   	return container_of(kvm, struct kvm_vmx, kvm);



* Re: [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit
  2023-03-27  8:33   ` Xiaoyao Li
@ 2023-03-27  8:37     ` Mathias Krause
  2023-03-27 13:48       ` Xiaoyao Li
  0 siblings, 1 reply; 34+ messages in thread
From: Mathias Krause @ 2023-03-27  8:37 UTC (permalink / raw)
  To: Xiaoyao Li, kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini

On 27.03.23 10:33, Xiaoyao Li wrote:
> On 3/22/2023 9:37 AM, Mathias Krause wrote:
>> Guests like grsecurity that make heavy use of CR0.WP to implement kernel
>> level W^X will suffer from the implied VMEXITs.
>>
>> With EPT there is no need to intercept a guest change of CR0.WP, so
>> simply make it a guest owned bit if we can do so.
> 
> I'm interested in the performance gain. Do you have data like Patch 2?

It's mentioned in the cover letter[1], quoted below:

[1]
https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/

: I used 'ssdd 10 50000' from rt-tests[5] as a micro-benchmark, running on a
: grsecurity L1 VM. Below table shows the results (runtime in seconds, lower
: is better):
:
:                               legacy     TDP    shadow
:     kvm-x86/next@d8708b        8.43s    9.45s    70.3s
:     + patches 1-3              5.39s    5.63s    70.2s
:     + patches 4-6              3.51s    3.47s    67.8s


Thanks,
Mathias


* Re: [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit
  2023-03-27  8:37     ` Mathias Krause
@ 2023-03-27 13:48       ` Xiaoyao Li
  0 siblings, 0 replies; 34+ messages in thread
From: Xiaoyao Li @ 2023-03-27 13:48 UTC (permalink / raw)
  To: Mathias Krause, kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini

On 3/27/2023 4:37 PM, Mathias Krause wrote:
> On 27.03.23 10:33, Xiaoyao Li wrote:
>> On 3/22/2023 9:37 AM, Mathias Krause wrote:
>>> Guests like grsecurity that make heavy use of CR0.WP to implement kernel
>>> level W^X will suffer from the implied VMEXITs.
>>>
>>> With EPT there is no need to intercept a guest change of CR0.WP, so
>>> simply make it a guest owned bit if we can do so.
>>
>> I'm interested in the performance gain. Do you have data like Patch 2?
> 
> It's mentioned in the cover letter[1], quoted below:

Sorry I missed it. The data of not intercepting CR0.WP looks great as well.

> [1]
> https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/
> 
> : I used 'ssdd 10 50000' from rt-tests[5] as a micro-benchmark, running on a
> : grsecurity L1 VM. Below table shows the results (runtime in seconds, lower
> : is better):
> :
> :                               legacy     TDP    shadow
> :     kvm-x86/next@d8708b        8.43s    9.45s    70.3s
> :     + patches 1-3              5.39s    5.63s    70.2s
> :     + patches 4-6              3.51s    3.47s    67.8s
> 
> 
> Thanks,
> Mathias



* Re: [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit
  2023-03-22  1:37 ` [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit Mathias Krause
  2023-03-27  8:33   ` Xiaoyao Li
@ 2023-03-30  8:45   ` Mathias Krause
  2023-03-30 17:12     ` Sean Christopherson
  1 sibling, 1 reply; 34+ messages in thread
From: Mathias Krause @ 2023-03-30  8:45 UTC (permalink / raw)
  To: kvm, Sean Christopherson, Paolo Bonzini; +Cc: linux-kernel

On 22.03.23 02:37, Mathias Krause wrote:
> Guests like grsecurity that make heavy use of CR0.WP to implement kernel
> level W^X will suffer from the implied VMEXITs.
> 
> With EPT there is no need to intercept a guest change of CR0.WP, so
> simply make it a guest owned bit if we can do so.
> 
> This implies that a read of a guest's CR0.WP bit might need a VMREAD.
> However, the only potentially affected user seems to be kvm_init_mmu()
> which is a heavy operation to begin with. But also most callers already
> cache the full value of CR0 anyway, so no additional VMREAD is needed.
> The only exception is nested_vmx_load_cr3().
> 
> This change is VMX-specific, as SVM has no such fine grained control
> register intercept control.

Just a heads up! We did more tests, especially with the backports we did
internally already, and ran into a bug when running a nested guest on an
ESXi host.

Setup is like: ESXi (L0) -> Linux (L1) -> Linux (L2)

The Linux system, especially the kernel, is the same for L1 and L2. It's
a grsecurity kernel, so makes use of toggling CR0.WP at runtime.

The bug we see is that when L2 disables CR0.WP and tries to write to an
r/o memory region (implicitly to the r/o GDT via LTR in our use case),
this triggers a fault (EPT violation?) that gets ignored by L1, as,
apparently, everything is fine from its point of view.

I suspect the L0 VMM to be at fault here, as the VMCS structures look
good, IMO. Here is a dump of vmx->loaded_vmcs in handle_triple_fault():

[…] VMX: TRIPLE FAULT!
[…] VMCS ffff8883a9f18000, last attempted VM-entry on CPU 8
[…] *** Guest State ***
[…] CR0: actual=0x0000000080040033, shadow=0x0000000080050033, gh_mask=fffffffffffefff7

CR0 in the L2 VM has CR0.WP disabled. However, it is set in the shadow
CR0 but masked out via CR0_MASK, so should be read as the guest's value,
according to the SDM.

I also tried masking the shadow CR0 value in vmx_set_cr0(), but that
makes no difference.

[…] CR4: actual=0x00000000003220f0, shadow=0x00000000003208b0, gh_mask=fffffffffffff871
[…] CR3 = 0x0000000002684000
[…] PDPTR0 = 0x0000000007d39001  PDPTR1 = 0x00000000033b5001
[…] PDPTR2 = 0x000000000238c001  PDPTR3 = 0x0000000001c54001
[…] RSP = 0xfffffe8040087f50  RIP = 0xffffffff8105d435
[…] RFLAGS=0x00010006         DR7 = 0x0000000000000400
[…] Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
[…] CS:   sel=0x0010, attr=0x0a09b, limit=0xffffffff, base=0x0000000000000000
[…] DS:   sel=0x0000, attr=0x1c001, limit=0xffffffff, base=0x0000000000000000
[…] SS:   sel=0x0000, attr=0x1c001, limit=0xffffffff, base=0x0000000000000000
[…] ES:   sel=0x0000, attr=0x1c001, limit=0xffffffff, base=0x0000000000000000
[…] FS:   sel=0x0000, attr=0x1c001, limit=0xffffffff, base=0x0000000000000000
[…] GS:   sel=0x0000, attr=0x1c001, limit=0xffffffff, base=0xffff888232e00000
[…] GDTR:                           limit=0x00000097, base=0xfffffe0000201000
[…] LDTR: sel=0x0000, attr=0x00082, limit=0x0000ffff, base=0x0000000000000000
[…] IDTR:                           limit=0x000001ff, base=0xffffffff84004000
[…] TR:   sel=0x0000, attr=0x0008b, limit=0x0000ffff, base=0x0000000000000000
[…] EFER= 0x0000000000000d01 (effective)
[…] PAT = 0x0007040600070406
[…] DebugCtl = 0x0000000000000000  DebugExceptions = 0x0000000000000000
[…] Interruptibility = 00000000  ActivityState = 00000000
[…] *** Host State ***
[…] RIP = 0xffffffff86d28db6  RSP = 0xfffffe8040927d50
[…] CS=0010 SS=0018 DS=0000 ES=0000 FS=0000 GS=0000 TR=0040
[…] FSBase=0000639563fce700 GSBase=ffff88881a800000 TRBase=fffffe0001003000
[…] GDTBase=fffffe0001001000 IDTBase=fffffe0000000000
[…] CR0=0000000080050033 CR3=00000000026a0004 CR4=00000000007626f0

The "host" (which is our L1 VMM, I guess) has CR0.WP enabled and that is
what I think confuses ESXi to enforce the read-only property to the L2
guest as well -- for unknown reasons so far.

[…] Sysenter RSP=fffffe0001003000 CS:RIP=0010:ffffffff810031a0
[…] PAT = 0x0407050600070106
[…] *** Control State ***
[…] PinBased=0000003f CPUBased=b5a06dfa SecondaryExec=001034ee
[…] EntryControls=000053ff ExitControls=000befff
[…] ExceptionBitmap=00060042 PFECmask=00000000 PFECmatch=00000000
[…] VMEntry: intr_info=00000000 errcode=00000003 ilen=00000000
[…] VMExit: intr_info=00000000 errcode=00000000 ilen=00000000
[…]         reason=00000002 qualification=0000000000000000
[…] IDTVectoring: info=00000000 errcode=00000000
[…] TSC Offset = 0xffffad7a1057f4cc
[…] TPR Threshold = 0x00
[…] virt-APIC addr = 0x000000015716b000
[…] EPT pointer = 0x0000000583c1e05e
[…] PLE Gap=00000080 Window=00001000
[…] Virtual processor ID = 0x0002

I tried to reproduce the bug with different KVM-based L0 VMMs (with and
without this series; vanilla and grsecurity kernels) but had no luck. That's
why I'm suspecting an ESXi bug.

I'm leaning towards making CR0.WP guest owned only if we're running on bare
metal or the VMM is KVM, to avoid playing whack-a-mole for all the VMMs that
might have similar bugs. (Will try to test a few others here as well.)
However, that would prevent them from getting the performance gain, so
I'd rather have this fixed / worked around in KVM instead.
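
Such a restriction could reuse the existing hypervisor detection, e.g.
along these lines (a rough, untested sketch only, using
hypervisor_is_type() from <asm/hypervisor.h>):

    static inline unsigned long vmx_l1_guest_owned_cr0_bits(void)
    {
            unsigned long bits = KVM_POSSIBLE_CR0_GUEST_BITS;

            /*
             * Beside the !enable_ept case, drop CR0.WP from the guest
             * owned bits when running nested on anything but KVM, to
             * dodge suspected L0 bugs like the ESXi one above.
             */
            if (!enable_ept ||
                !(hypervisor_is_type(X86_HYPER_NATIVE) ||
                  hypervisor_is_type(X86_HYPER_KVM)))
                    bits &= ~X86_CR0_WP;

            return bits;
    }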

Any ideas how to investigate this further?

Thanks,
Mathias

PS: ...should have left the chicken bit of v3 to be able to disable the
feature by a module parameter ;)

> 
> Suggested-and-co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Mathias Krause <minipli@grsecurity.net>
> ---
>  arch/x86/kvm/kvm_cache_regs.h |  2 +-
>  arch/x86/kvm/vmx/nested.c     |  4 ++--
>  arch/x86/kvm/vmx/vmx.c        |  2 +-
>  arch/x86/kvm/vmx/vmx.h        | 18 ++++++++++++++++++
>  4 files changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> index 4c91f626c058..e50d353b5c1c 100644
> --- a/arch/x86/kvm/kvm_cache_regs.h
> +++ b/arch/x86/kvm/kvm_cache_regs.h
> @@ -4,7 +4,7 @@
>  
>  #include <linux/kvm_host.h>
>  
> -#define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS
> +#define KVM_POSSIBLE_CR0_GUEST_BITS	(X86_CR0_TS | X86_CR0_WP)
>  #define KVM_POSSIBLE_CR4_GUEST_BITS				  \
>  	(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR  \
>  	 | X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE)
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index f63b28f46a71..61d940fc91ba 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4481,7 +4481,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>  	 * CR0_GUEST_HOST_MASK is already set in the original vmcs01
>  	 * (KVM doesn't change it);
>  	 */
> -	vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> +	vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
>  	vmx_set_cr0(vcpu, vmcs12->host_cr0);
>  
>  	/* Same as above - no reason to call set_cr4_guest_host_mask().  */
> @@ -4632,7 +4632,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
>  	 */
>  	vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
>  
> -	vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> +	vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
>  	vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
>  
>  	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 8fc1a0c7856f..e501f6864a72 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4790,7 +4790,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>  	/* 22.2.1, 20.8.1 */
>  	vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
>  
> -	vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> +	vmx->vcpu.arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
>  	vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);
>  
>  	set_cr4_guest_host_mask(vmx);
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 2acdc54bc34b..423e9d3c9c40 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -640,6 +640,24 @@ BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
>  				(1 << VCPU_EXREG_EXIT_INFO_1) | \
>  				(1 << VCPU_EXREG_EXIT_INFO_2))
>  
> +static inline unsigned long vmx_l1_guest_owned_cr0_bits(void)
> +{
> +	unsigned long bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> +
> +	/*
> +	 * CR0.WP needs to be intercepted when KVM is shadowing legacy paging
> +	 * in order to construct shadow PTEs with the correct protections.
> +	 * Note!  CR0.WP technically can be passed through to the guest if
> +	 * paging is disabled, but checking CR0.PG would generate a cyclical
> +	 * dependency of sorts due to forcing the caller to ensure CR0 holds
> +	 * the correct value prior to determining which CR0 bits can be owned
> +	 * by L1.  Keep it simple and limit the optimization to EPT.
> +	 */
> +	if (!enable_ept)
> +		bits &= ~X86_CR0_WP;
> +	return bits;
> +}
> +
>  static __always_inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
>  {
>  	return container_of(kvm, struct kvm_vmx, kvm);


* Re: [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit
  2023-03-30  8:45   ` Mathias Krause
@ 2023-03-30 17:12     ` Sean Christopherson
  2023-03-30 20:15       ` Mathias Krause
  2023-03-30 20:33       ` Sean Christopherson
  0 siblings, 2 replies; 34+ messages in thread
From: Sean Christopherson @ 2023-03-30 17:12 UTC (permalink / raw)
  To: Mathias Krause; +Cc: kvm, Paolo Bonzini, linux-kernel

On Thu, Mar 30, 2023, Mathias Krause wrote:
> On 22.03.23 02:37, Mathias Krause wrote:
> > Guests like grsecurity that make heavy use of CR0.WP to implement kernel
> > level W^X will suffer from the implied VMEXITs.
> > 
> > With EPT there is no need to intercept a guest change of CR0.WP, so
> > simply make it a guest owned bit if we can do so.
> > 
> > This implies that a read of a guest's CR0.WP bit might need a VMREAD.
> > However, the only potentially affected user seems to be kvm_init_mmu()
> > which is a heavy operation to begin with. But also most callers already
> > cache the full value of CR0 anyway, so no additional VMREAD is needed.
> > The only exception is nested_vmx_load_cr3().
> > 
> > This change is VMX-specific, as SVM has no such fine grained control
> > register intercept control.
> 
> Just a heads up! We did more tests, especially with the backports we did
> internally already, and ran into a bug when running a nested guest on an
> ESXi host.
> 
> Setup is like: ESXi (L0) -> Linux (L1) -> Linux (L2)
> 
> The Linux system, especially the kernel, is the same for L1 and L2. It's
> a grsecurity kernel, so makes use of toggling CR0.WP at runtime.
> 
> The bug we see is that when L2 disables CR0.WP and tries to write to an
> r/o memory region (implicitly to the r/o GDT via LTR in our use case),
> this triggers a fault (EPT violation?) that gets ignored by L1, as,
> apparently, everything is fine from its point of view.

It's not an EPT violation if there's a triple fault.  Given that you're dumping
the VMCS from handle_triple_fault(), I assume that L2 gets an unexpected #PF that
leads to a triple fault in L2.

Just to make sure we're on the same page: L1 is still alive after this, correct?

> I suspect the L0 VMM to be at fault here, as the VMCS structures look
> good, IMO. Here is a dump of vmx->loaded_vmcs in handle_triple_fault():

...

> The "host" (which is our L1 VMM, I guess) has CR0.WP enabled and that is
> what I think confuses ESXi to enforce the read-only property to the L2
> guest as well -- for unknown reasons so far.

...

> I tried to reproduce the bug with different KVM-based L0 VMMs (with and
> without this series; vanilla and grsecurity kernels) but had no luck. That's
> why I'm suspecting an ESXi bug.
   
...

> I'm leaning towards making CR0.WP guest owned only if we're running on bare
> metal or the VMM is KVM, to avoid playing whack-a-mole for all the VMMs that
> might have similar bugs. (Will try to test a few others here as well.)
> However, that would prevent them from getting the performance gain, so
> I'd rather have this fixed / worked around in KVM instead.

Before we start putting bandaids on this, let's (a) confirm this appears to be
an issue with ESXi and (b) pull in VMware folks to get their input.

> Any ideas how to investigate this further?

Does the host in question support UMIP?

Can you capture a tracepoint log from L1 to verify that L1 KVM is _not_ injecting
any kind of exception?  E.g. to get the KVM kitchen sink:

  echo 1 > /sys/kernel/debug/tracing/tracing_on
  echo 1 > /sys/kernel/debug/tracing/events/kvm/enable

  cat /sys/kernel/debug/tracing/trace > log

Or if that's too noisy, a more targeted trace (exception injection + emulation)
would be:

  echo 1 > /sys/kernel/debug/tracing/tracing_on

  echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_emulate_insn/enable
  echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_inj_exception/enable
  echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_entry/enable
  echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_exit/enable

> PS: ...should have left the chicken bit of v3 to be able to disable the
> feature by a module parameter ;)

A chicken bit isn't a good solution for this sort of thing.  Toggling a KVM module
param requires (a) knowing that it exists and (b) knowing the conditions under which
it is/isn't safe to toggle the bit.

E.g. if this ends up being an ESXi L0 bug, then an option might be to add something
in vmware_platform_setup() to communicate the bug to KVM so that KVM can precisely
disable the optimization on affected platforms.
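
E.g. something along these lines (entirely hypothetical, neither the flag
nor the version check exists today):

    /* arch/x86/kernel/cpu/vmware.c */
    bool vmware_broken_cr0_wp_passthrough __ro_after_init;
    EXPORT_SYMBOL_GPL(vmware_broken_cr0_wp_passthrough);

    static void __init vmware_platform_setup(void)
    {
            /* ... existing setup code ... */

            /* hypothetical check for affected ESXi versions */
            if (vmware_affected_esxi_version())
                    vmware_broken_cr0_wp_passthrough = true;
    }

vmx_l1_guest_owned_cr0_bits() could then clear X86_CR0_WP whenever the
flag is set, keeping the optimization for unaffected hosts.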


* Re: [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit
  2023-03-30 17:12     ` Sean Christopherson
@ 2023-03-30 20:15       ` Mathias Krause
  2023-03-30 20:30         ` Mathias Krause
  2023-03-30 20:33       ` Sean Christopherson
  1 sibling, 1 reply; 34+ messages in thread
From: Mathias Krause @ 2023-03-30 20:15 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, Paolo Bonzini, linux-kernel

On 30.03.23 19:12, Sean Christopherson wrote:
> On Thu, Mar 30, 2023, Mathias Krause wrote:
>> Just a heads up! We did more tests, especially with the backports we did
>> internally already, and ran into a bug when running a nested guest on an
>> ESXi host.
>>
>> Setup is like: ESXi (L0) -> Linux (L1) -> Linux (L2)
>>
>> The Linux system, especially the kernel, is the same for L1 and L2. It's
>> a grsecurity kernel, so makes use of toggling CR0.WP at runtime.
>>
>> The bug we see is that when L2 disables CR0.WP and tries to write to an
>> r/o memory region (implicitly to the r/o GDT via LTR in our use case),
>> this triggers a fault (EPT violation?) that gets ignored by L1, as,
>> apparently, everything is fine from its point of view.
> 
> It's not an EPT violation if there's a triple fault.  Given that you're dumping
> the VMCS from handle_triple_fault(), I assume that L2 gets an unexpected #PF that
> leads to a triple fault in L2.

Indeed it does, more on this below.

> 
> Just to make sure we're on the same page: L1 is still alive after this, correct?

Yes, L1 is still alive and doing fine.

>> I suspect the L0 VMM to be at fault here, as the VMCS structures look
>> good, IMO. Here is a dump of vmx->loaded_vmcs in handle_triple_fault():
> 
> ...
> 
>> The "host" (which is our L1 VMM, I guess) has CR0.WP enabled and that is
>> what I think confuses ESXi to enforce the read-only property to the L2
>> guest as well -- for unknown reasons so far.
> 
> ...
> 
>> I tried to reproduce the bug with different KVM based L0 VMMs (with and
>> without this series; vanilla and grsecurity kernels) but no luck. That's
>> why I'm suspecting an ESXi bug.
>    
> ...
> 
>> I'm leaning to make CR0.WP guest owned only iff we're running on bare
>> metal or the VMM is KVM to not play whack-a-mole for all the VMMs that
>> might have similar bugs. (Will try to test a few others here as well.)
>> However, that would prevent them from getting the performance gain, so
>> I'd rather have this fixed / worked around in KVM instead.
> 
> Before we start putting bandaids on this, let's (a) confirm this appears to be
> an issue with ESXi and (b) pull in VMware folks to get their input.
> 
>> Any ideas how to investigate this further?
> 
> Does the host in question support UMIP?

It should, but I've no access to the host. Within L1 I don't see umip
in /proc/cpuinfo. It's an "Intel(R) Xeon(R) Gold 6258R", though, so
Cascade Lake.

> 
> Can you capture a tracepoint log from L1 to verify that L1 KVM is _not_ injecting
> any kind of exception?  E.g. to get the KVM kitchen sink:
> 
>   echo 1 > /sys/kernel/debug/tracing/tracing_on
>   echo 1 > /sys/kernel/debug/tracing/events/kvm/enable
> 
>   cat /sys/kernel/debug/tracing/trace > log
> 
> Or if that's too noisy, a more targeted trace (exception injection + emulation)
> would be:

I've even added 'echo 1 > /sys/kernel/tracing/events/kvmmmu/enable' to
the mix to get some more info and that leads to the following (trimmed to the
part covering the error; L2 has 2 vCPUs bound to L1's vCPUs 1 and 2):

 qemu-system-x86-8397    [002] d....  1404.389366: kvm_exit: vcpu 1 reason CR_ACCESS rip 0xffffffff8100007a info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
 qemu-system-x86-8397    [002] .....  1404.389373: kvm_cr: cr_write 0 = 0x80050033

That's the initialization of L2's CR0 in secondary_startup_64() and
also the last CR0 write we'll see, as toggling CR0.WP won't generate a
VMEXIT because the bit is guest owned. So this is the value KVM will
cache as vcpu->arch.cr0.
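
For completeness: KVM's accessors are supposed to cope with that by
re-reading guest owned bits from the VMCS on demand, roughly like this
(simplified from arch/x86/kvm/kvm_cache_regs.h, from memory, so details may
differ per version):

  static inline ulong kvm_read_cr0_bits(struct kvm_vcpu *vcpu, ulong mask)
  {
          ulong tmask = mask & KVM_POSSIBLE_CR0_GUEST_BITS;

          /* Decache guest owned bits from the VMCS if they may be stale. */
          if ((tmask & vcpu->arch.cr0_guest_owned_bits) &&
              !kvm_register_is_available(vcpu, VCPU_EXREG_CR0))
                  static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_CR0);

          return vcpu->arch.cr0 & mask;
  }

So anything going through kvm_read_cr0() should see the live value; code
using a snapshot taken elsewhere won't.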

 qemu-system-x86-8397    [002] .....  1404.389384: kvm_tdp_mmu_spte_changed: as id 0 gfn 0 level 4 old_spte 2c583e907 new_spte 0
 qemu-system-x86-8397    [002] .....  1404.389384: kvm_mmu_prepare_zap_page: sp gen 0 gfn 0 l3 8-byte q0 direct wux nxe ad root 0 sync
 qemu-system-x86-8397    [002] .....  1404.389386: kvm_tdp_mmu_spte_changed: as id 0 gfn 0 level 3 old_spte 2c583d907 new_spte 5a0
 qemu-system-x86-8397    [002] .....  1404.389386: kvm_mmu_prepare_zap_page: sp gen 0 gfn 0 l2 8-byte q0 direct wux nxe ad root 0 sync
 qemu-system-x86-8397    [002] .....  1404.389388: kvm_tdp_mmu_spte_changed: as id 0 gfn 0 level 2 old_spte 2c583c907 new_spte 5a0
 qemu-system-x86-8397    [002] .....  1404.389388: kvm_mmu_prepare_zap_page: sp gen 0 gfn 0 l1 8-byte q0 direct wux nxe ad root 0 sync
 qemu-system-x86-8397    [002] .....  1404.389391: kvm_tdp_mmu_spte_changed: as id 0 gfn 99 level 1 old_spte 600000455299b77 new_spte 5a0
 qemu-system-x86-8397    [002] .....  1404.389393: kvm_tdp_mmu_spte_changed: as id 0 gfn 9a level 1 old_spte 60000045529ab77 new_spte 5a0
 qemu-system-x86-8397    [002] .....  1404.389395: kvm_tdp_mmu_spte_changed: as id 0 gfn 9c level 1 old_spte 60000045529cb77 new_spte 5a0
 qemu-system-x86-8397    [002] .....  1404.389408: kvm_tdp_mmu_spte_changed: as id 0 gfn 9d level 1 old_spte 60000045529db77 new_spte 5a0
 qemu-system-x86-8397    [002] .....  1404.389411: kvm_tdp_mmu_spte_changed: as id 0 gfn 9e level 1 old_spte 60000045529eb77 new_spte 5a0
 qemu-system-x86-8396    [001] d....  1404.389421: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT rip 0xffffffff8108aca7 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x800000fb error_code 0x00000000
 qemu-system-x86-8397    [002] .....  1404.389425: kvm_tdp_mmu_spte_changed: as id 0 gfn 1000 level 2 old_spte 2c583b907 new_spte 5a0
 qemu-system-x86-8397    [002] .....  1404.389425: kvm_mmu_prepare_zap_page: sp gen 0 gfn 1000 l1 8-byte q0 direct wux nxe ad root 0 sync
 qemu-system-x86-8397    [002] .....  1404.389427: kvm_tdp_mmu_spte_changed: as id 0 gfn 1000 level 1 old_spte 6000002c7c00b77 new_spte 5a0
 qemu-system-x86-8397    [002] .....  1404.389430: kvm_tdp_mmu_spte_changed: as id 0 gfn 2400 level 2 old_spte 60000021be00bf3 new_spte 5a0
 qemu-system-x86-8397    [002] .....  1404.389434: kvm_tdp_mmu_spte_changed: as id 0 gfn 2800 level 2 old_spte 60000021ba00bf3 new_spte 5a0
 qemu-system-x86-8396    [001] d....  1404.389435: kvm_entry: vcpu 0, rip 0xffffffff8108aca7
 qemu-system-x86-8397    [002] .....  1404.389436: kvm_tdp_mmu_spte_changed: as id 0 gfn 2a00 level 2 old_spte 60000021d400bf3 new_spte 5a0
 qemu-system-x86-8397    [002] .....  1404.389439: kvm_tdp_mmu_spte_changed: as id 0 gfn 2c00 level 2 old_spte 60000021d600bf3 new_spte 5a0
 qemu-system-x86-8396    [001] d....  1404.389448: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT rip 0xffffffff8113b1f9 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x800000fb error_code 0x00000000
 qemu-system-x86-8396    [001] d....  1404.389456: kvm_entry: vcpu 0, rip 0xffffffff8113b1f9
 qemu-system-x86-8397    [002] d....  1404.389461: kvm_entry: vcpu 1, rip 0xffffffff8100007d
 qemu-system-x86-8396    [001] d....  1404.389462: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT rip 0xffffffff8113b1f9 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x800000f6 error_code 0x00000000
 qemu-system-x86-8396    [001] d....  1404.389462: kvm_entry: vcpu 0, rip 0xffffffff8113b1f9
 qemu-system-x86-8397    [002] d....  1404.389481: kvm_exit: vcpu 1 reason CR_ACCESS rip 0xffffffff810000a0 info1 0x0000000000000104 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
 qemu-system-x86-8397    [002] .....  1404.389488: kvm_cr: cr_write 4 = 0xb0

Above is L2's init of CR4, still in secondary_startup_64().

 qemu-system-x86-8397    [002] d....  1404.389491: kvm_entry: vcpu 1, rip 0xffffffff810000a3
 qemu-system-x86-8397    [002] d....  1404.389501: kvm_exit: vcpu 1 reason CPUID rip 0xffffffff81000127 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
 qemu-system-x86-8397    [002] .....  1404.389501: kvm_cpuid: func 80000001 idx d3f0eef rax 50657 rbx 0 rcx 121 rdx 2c100800, cpuid entry found
 qemu-system-x86-8397    [002] d....  1404.389501: kvm_entry: vcpu 1, rip 0xffffffff81000129
 qemu-system-x86-8397    [002] d....  1404.389509: kvm_exit: vcpu 1 reason MSR_READ rip 0xffffffff81000130 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
 qemu-system-x86-8397    [002] .....  1404.389509: kvm_msr: msr_read c0000080 = 0xd01
 qemu-system-x86-8397    [002] d....  1404.389510: kvm_entry: vcpu 1, rip 0xffffffff81000132
 qemu-system-x86-8397    [002] d....  1404.389517: kvm_exit: vcpu 1 reason MSR_WRITE rip 0xffffffff81000149 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
 qemu-system-x86-8397    [002] .....  1404.389519: kvm_msr: msr_write c0000080 = 0xd01

That was the EFER setup...

 qemu-system-x86-8397    [002] d....  1404.389520: kvm_entry: vcpu 1, rip 0xffffffff8100014b
 qemu-system-x86-8397    [002] d....  1404.389536: kvm_exit: vcpu 1 reason CR_ACCESS rip 0xffffffff81053aee info1 0x0000000000000004 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
 qemu-system-x86-8397    [002] .....  1404.389537: kvm_cr: cr_write 4 = 0x3208b0

...remaining CR4 bits in cr4_init()...

 qemu-system-x86-8397    [002] d....  1404.389562: kvm_entry: vcpu 1, rip 0xffffffff81053af1
 qemu-system-x86-8397    [002] d....  1404.389574: kvm_exit: vcpu 1 reason EXTERNAL_INTERRUPT rip 0xffffffff81053af1 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x800000f6 error_code 0x00000000
 qemu-system-x86-8397    [002] d....  1404.389583: kvm_entry: vcpu 1, rip 0xffffffff81053af1
 qemu-system-x86-8397    [002] d....  1404.389594: kvm_exit: vcpu 1 reason MSR_WRITE rip 0xffffffff81054faa info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
 qemu-system-x86-8397    [002] .....  1404.389595: kvm_msr: msr_write c0000103 = 0x1

...TSC_AUX in setup_getcpu()...

 qemu-system-x86-8397    [002] d....  1404.389595: kvm_entry: vcpu 1, rip 0xffffffff81054fac
 qemu-system-x86-8397    [002] d....  1404.389646: kvm_exit: vcpu 1 reason LDTR_TR rip 0xffffffff810551b8 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000

...and, finally, the trapping LTR in cpu_init_exception_handling()!

Now I get why you were asking for UMIP. No, that warning in
handle_desc() doesn't trigger for me. ;)

But now the interesting part: the emulation of LTR:

 qemu-system-x86-8397    [002] .....  1404.389656: kvm_mmu_pagetable_walk: addr ffffffff810551b8 pferr 10 F
                                                                                                 ~~~~~
Just a nit, but this 'pferr' (underlined above) should probably be 'access',
as it's not an error code (yet) but the access mode for this page table walk.

 qemu-system-x86-8397    [002] .....  1404.389661: kvm_mmu_paging_element: pte 2929067 level 4
 qemu-system-x86-8397    [002] .....  1404.389662: kvm_mmu_paging_element: pte 2b2c063 level 3
 qemu-system-x86-8397    [002] .....  1404.389662: kvm_mmu_paging_element: pte 10001e3 level 2
 qemu-system-x86-8397    [002] .....  1404.389664: kvm_emulate_insn: 0:ffffffff810551b8:0f 00 d8 (prot64)

Instruction fetch went fine.
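
(The insn bytes check out, too: 0f 00 is x86 group 6, and ModRM 0xd8 selects
the register form with /3, i.e. this is 'ltr %ax'.)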

 qemu-system-x86-8397    [002] .....  1404.389666: kvm_mmu_pagetable_walk: addr fffffe0000201040 pferr 0
 qemu-system-x86-8397    [002] .....  1404.389666: kvm_mmu_paging_element: pte 800000000292b063 level 4
 qemu-system-x86-8397    [002] .....  1404.389666: kvm_mmu_paging_element: pte 8000000002b2f063 level 3
 qemu-system-x86-8397    [002] .....  1404.389666: kvm_mmu_paging_element: pte 8000000002d31063 level 2
 qemu-system-x86-8397    [002] .....  1404.389666: kvm_mmu_paging_element: pte 8000000002f3a161 level 1
 qemu-system-x86-8397    [002] .....  1404.389667: kvm_mmu_pagetable_walk: addr fffffe0000201048 pferr 0
 qemu-system-x86-8397    [002] .....  1404.389667: kvm_mmu_paging_element: pte 800000000292b063 level 4
 qemu-system-x86-8397    [002] .....  1404.389667: kvm_mmu_paging_element: pte 8000000002b2f063 level 3
 qemu-system-x86-8397    [002] .....  1404.389667: kvm_mmu_paging_element: pte 8000000002d31063 level 2
 qemu-system-x86-8397    [002] .....  1404.389667: kvm_mmu_paging_element: pte 8000000002f3a161 level 1

Read of the descriptor as well, but...

 qemu-system-x86-8397    [002] .....  1404.389668: kvm_mmu_pagetable_walk: addr fffffe0000201040 pferr 2 W
 qemu-system-x86-8397    [002] .....  1404.389668: kvm_mmu_paging_element: pte 800000000292b063 level 4
 qemu-system-x86-8397    [002] .....  1404.389668: kvm_mmu_paging_element: pte 8000000002b2f063 level 3
 qemu-system-x86-8397    [002] .....  1404.389668: kvm_mmu_paging_element: pte 8000000002d31063 level 2
 qemu-system-x86-8397    [002] .....  1404.389669: kvm_mmu_paging_element: pte 8000000002f3a161 level 1
 qemu-system-x86-8397    [002] .....  1404.389669: kvm_mmu_walker_error: pferr 3 P|W

 qemu-system-x86-8397    [002] .....  1404.389675: kvm_mmu_pagetable_walk: addr fffffe0000201040 pferr 2 W
 qemu-system-x86-8397    [002] .....  1404.389677: kvm_mmu_paging_element: pte 800000000292b063 level 4
 qemu-system-x86-8397    [002] .....  1404.389677: kvm_mmu_paging_element: pte 8000000002b2f063 level 3
 qemu-system-x86-8397    [002] .....  1404.389677: kvm_mmu_paging_element: pte 8000000002d31063 level 2
 qemu-system-x86-8397    [002] .....  1404.389678: kvm_mmu_paging_element: pte 8000000002f3a161 level 1
 qemu-system-x86-8397    [002] .....  1404.389678: kvm_mmu_walker_error: pferr 3 P|W

 qemu-system-x86-8397    [002] .....  1404.389684: kvm_inj_exception: #PF (0x3)

...the write failed and that's the page fault you were looking for...

 qemu-system-x86-8397    [002] d....  1404.389711: kvm_entry: vcpu 1, rip 0xffffffff810551b8
 qemu-system-x86-8397    [002] d....  1404.389730: kvm_exit: vcpu 1 reason TRIPLE_FAULT rip 0xffffffff810551b8 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000

...and here the triple fault I was seeing.

The page faults look legitimate at first sight, as the write bit isn't
set in the PTE. However, CR0.WP is 0, as shown earlier in the VMCS dump,
so this shouldn't trigger a page fault.

Looks like I was too early to blame ESXi and this is more likely a bug in
KVM's emulation of guest writes. I just don't see where the kvm_read_cr0()
call is missing yet -- obviously somewhere before we do the page table walk,
but even after adding one to init_emulate_ctxt() I still get the fault, so
there must be more to it. :/

Maybe it's not a stale CR0 value but the page table walker not taking
the guest's CR0.WP into account? Maybe a missing role update?
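
To illustrate what I mean with the latter: the walker's fault decision comes
from a permission bitmap that's precomputed from the cached MMU role, not
from the live CR0 -- a toy model of that scheme (not KVM code, just the idea):

  #include <stdbool.h>
  #include <stdio.h>

  /*
   * Toy model: permissions get precomputed from a "role" snapshot (incl.
   * CR0.WP); the fault path only does a table lookup. A stale role thus
   * gives stale fault decisions, no matter what the live CR0.WP says.
   */
  struct toy_mmu {
          bool write_ok[2];       /* [pte_writable] -> supervisor write ok? */
  };

  static void toy_reset_metadata(struct toy_mmu *mmu, bool live_cr0_wp)
  {
          mmu->write_ok[0] = !live_cr0_wp; /* r/o PTE writable iff CR0.WP=0 */
          mmu->write_ok[1] = true;
  }

  int main(void)
  {
          struct toy_mmu mmu;

          toy_reset_metadata(&mmu, true); /* role computed with CR0.WP=1 */

          /* guest clears CR0.WP without a VM-exit, role stays stale... */
          bool live_cr0_wp = false;

          /* ...and an emulated write to a r/o PTE still "faults": */
          printf("write allowed? %d (live CR0.WP=%d)\n",
                 mmu.write_ok[0], live_cr0_wp);
          return 0;
  }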

> [...]
> 
>> PS: ...should have left the chicken bit of v3 to be able to disable the
>> feature by a module parameter ;)
> 
> A chicken bit isn't a good solution for this sort of thing.  Toggling a KVM module
> param requires (a) knowing that it exists and (b) knowing the conditions under which
> it is/isn't safe to toggle the bit.

Sure. But it would have allowed me to notice that's indeed the last
patch of the series that's the culprit.

> 
> E.g. if this ends up being an ESXi L0 bug, then an option might be to add something
> in vmware_platform_setup() to communicate the bug to KVM so that KVM can precisely
> disable the optimization on affected platforms.

Agreed. But, as noted above, the signs are pointing to a bug in KVM.

Thanks,
Mathias

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit
  2023-03-30 20:15       ` Mathias Krause
@ 2023-03-30 20:30         ` Mathias Krause
  2023-03-30 20:36           ` Sean Christopherson
  0 siblings, 1 reply; 34+ messages in thread
From: Mathias Krause @ 2023-03-30 20:30 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, Paolo Bonzini, linux-kernel

On 30.03.23 22:15, Mathias Krause wrote:
> [...]
> Maybe it's not a stale CR0 value but the page table walker not taking
> the guest's CR0.WP into account? Maybe a missing role update?

Indeed, it is. This does the trick for me:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 31be188aa842..6a9e90725c84 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8372,6 +8372,9 @@ static void init_emulate_ctxt(struct kvm_vcpu *vcpu)

        init_decode_cache(ctxt);
        vcpu->arch.emulate_regs_need_sync_from_vcpu = false;
+       /* Ensure we're doing page table walks with an up2date MMU role */
+       if ((vcpu->arch.cr0 ^ kvm_read_cr0(vcpu)) == X86_CR0_WP)
+               kvm_init_mmu(vcpu);
 }

 void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip)

Very heavyweight and misplaced, but a start :)

It should (1) be limited to VMX, as that's the only one that would make
CR0.WP a guest owned bit, and (2) be limited to emulated instructions that
actually do write operations, as reads are harmless, obviously.
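
For (1), something along these lines might work (untested sketch; taking the
old value explicitly avoids relying on the evaluation order of the XOR
operands, as kvm_read_cr0() may update vcpu->arch.cr0 as a side effect):

        unsigned long old_cr0 = vcpu->arch.cr0;

        /*
         * Untested: only rebuild the paging metadata when CR0.WP is
         * guest owned, i.e. can change without a VM-exit, and the
         * cached value actually went stale.
         */
        if ((vcpu->arch.cr0_guest_owned_bits & X86_CR0_WP) &&
            ((old_cr0 ^ kvm_read_cr0(vcpu)) & X86_CR0_WP))
                kvm_init_mmu(vcpu);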

Mathias

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit
  2023-03-30 17:12     ` Sean Christopherson
  2023-03-30 20:15       ` Mathias Krause
@ 2023-03-30 20:33       ` Sean Christopherson
  2023-03-30 20:55         ` Mathias Krause
  1 sibling, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2023-03-30 20:33 UTC (permalink / raw)
  To: Mathias Krause; +Cc: kvm, Paolo Bonzini, linux-kernel

On Thu, Mar 30, 2023, Sean Christopherson wrote:
> On Thu, Mar 30, 2023, Mathias Krause wrote:
> > On 22.03.23 02:37, Mathias Krause wrote:
> > I'm leaning to make CR0.WP guest owned only iff we're running on bare
> > metal or the VMM is KVM to not play whack-a-mole for all the VMMs that
> > might have similar bugs. (Will try to test a few others here as well.)
> > However, that would prevent them from getting the performance gain, so
> > I'd rather have this fixed / worked around in KVM instead.
> 
> Before we start putting bandaids on this, let's (a) confirm this appears to be
> an issue with ESXi and (b) pull in VMware folks to get their input.
> 
> > Any ideas how to investigate this further?
> 
> Does the host in question support UMIP?
> 
> Can you capture a tracepoint log from L1 to verify that L1 KVM is _not_ injecting
> any kind of exception?  E.g. to get the KVM kitchen sink:
> 
>   echo 1 > /sys/kernel/debug/tracing/tracing_on
>   echo 1 > /sys/kernel/debug/tracing/events/kvm/enable
> 
>   cat /sys/kernel/debug/tracing/trace > log
> 
> Or if that's too noisy, a more targeted trace (exception injection + emulation)
> would be:
> 
>   echo 1 > /sys/kernel/debug/tracing/tracing_on
> 
>   echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_emulate_insn/enable
>   echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_inj_exception/enable
>   echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_entry/enable
>   echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_exit/enable

Duh, this is likely a KVM bug.  I expect the issue will go away if you revert

  fb509f76acc8 ("KVM: VMX: Make CR0.WP a guest owned bit")

KVM doesn't consume CR0.WP for _its_ MMU, but it does consume CR0.WP for the
guest walker.  By passing through CR0.WP, toggling only CR0.WP will not trap
(obviously) and thus won't run through kvm_post_set_cr0(), resulting in stale
information due to not invoking kvm_init_mmu().

I'm pretty sure I even called that we needed to refresh the permissions, but then
obviously forgot to actually make that happen.

I believe this will remedy the issue.  If it does, I'll post a proper patch
(likely won't be until next week).  Compile tested only.

---
 arch/x86/kvm/mmu.h     |  8 +++++++-
 arch/x86/kvm/mmu/mmu.c | 14 ++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 89f532516a45..4a303aa735dd 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -113,6 +113,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 				u64 fault_address, char *insn, int insn_len);
+void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu);
 
 int kvm_mmu_load(struct kvm_vcpu *vcpu);
 void kvm_mmu_unload(struct kvm_vcpu *vcpu);
@@ -184,8 +185,13 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	u64 implicit_access = access & PFERR_IMPLICIT_ACCESS;
 	bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;
 	int index = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1;
-	bool fault = (mmu->permissions[index] >> pte_access) & 1;
 	u32 errcode = PFERR_PRESENT_MASK;
+	bool fault;
+
+	if (tdp_enabled)
+		kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
+
+	fault = (mmu->permissions[index] >> pte_access) & 1;
 
 	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
 	if (unlikely(mmu->pkru_mask)) {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4c874d4ec68f..2a63b5725f36 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5186,6 +5186,20 @@ static union kvm_cpu_role kvm_calc_cpu_role(struct kvm_vcpu *vcpu,
 	return role;
 }
 
+void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
+{
+	const bool cr0_wp = kvm_is_cr0_bit_set(vcpu, X86_CR0_WP);
+
+	BUILD_BUG_ON((KVM_MMU_CR0_ROLE_BITS & KVM_POSSIBLE_CR0_GUEST_BITS) != X86_CR0_WP);
+	BUILD_BUG_ON((KVM_MMU_CR4_ROLE_BITS & KVM_POSSIBLE_CR4_GUEST_BITS));
+
+	if (is_cr0_wp(mmu) == cr0_wp)
+		return;
+
+	mmu->cpu_role.base.cr0_wp = cr0_wp;
+	reset_guest_paging_metadata(vcpu, mmu);
+}
+
 static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
 {
 	/* tdp_root_level is architecture forced level, use it if nonzero */

base-commit: 27d6845d258b67f4eb3debe062b7dacc67e0c393
-- 



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit
  2023-03-30 20:30         ` Mathias Krause
@ 2023-03-30 20:36           ` Sean Christopherson
  0 siblings, 0 replies; 34+ messages in thread
From: Sean Christopherson @ 2023-03-30 20:36 UTC (permalink / raw)
  To: Mathias Krause; +Cc: kvm, Paolo Bonzini, linux-kernel

On Thu, Mar 30, 2023, Mathias Krause wrote:
> On 30.03.23 22:15, Mathias Krause wrote:
> > [...]
> > Maybe it's not a stale CR0 value but the page table walker not taking
> > the guest's CR0.WP into account? Maybe a missing role update?
> 
> Indeed, it is. This does the trick for me:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 31be188aa842..6a9e90725c84 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8372,6 +8372,9 @@ static void init_emulate_ctxt(struct kvm_vcpu *vcpu)
> 
>         init_decode_cache(ctxt);
>         vcpu->arch.emulate_regs_need_sync_from_vcpu = false;
> +       /* Ensure we're doing page table walks with an up2date MMU role */
> +       if ((vcpu->arch.cr0 ^ kvm_read_cr0(vcpu)) == X86_CR0_WP)
> +               kvm_init_mmu(vcpu);
>  }
> 
>  void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip)
> 
> Very heavyweight and misplaced, but a start :)
> 
> It should (1) be limited to VMX, as that's the only one that would make
> CR0.WP a guest owned bit, and (2) be limited to emulated instructions that
> actually do write operations, as reads are harmless, obviously.

For the record, I wrote my email before I saw this ;-)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit
  2023-03-30 20:33       ` Sean Christopherson
@ 2023-03-30 20:55         ` Mathias Krause
  2023-03-31 14:18           ` Mathias Krause
  0 siblings, 1 reply; 34+ messages in thread
From: Mathias Krause @ 2023-03-30 20:55 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, Paolo Bonzini, linux-kernel

On 30.03.23 22:33, Sean Christopherson wrote:
> Duh, this is likely a KVM bug.  I expect the issue will go away if you revert
> 
>   fb509f76acc8 ("KVM: VMX: Make CR0.WP a guest owned bit")
> 
> KVM doesn't consume CR0.WP for _its_ MMU, but it does consume CR0.WP for the
> guest walker.  By passing through CR0.WP, toggling only CR0.WP will not trap
> (obviously) and thus won't run through kvm_post_set_cr0(), resulting in stale
> information due to not invoking kvm_init_mmu().

That reasoning sounds familiar ;)

> 
> I'm pretty sure I even called that we needed to refresh the permissions, but then
> obviously forgot to actually make that happen.

:(

> 
> I believe this will remedy the issue.  If it does, I'll post a proper patch
> (likely won't be until next week).  Compile tested only.
> 
> ---
>  arch/x86/kvm/mmu.h     |  8 +++++++-
>  arch/x86/kvm/mmu/mmu.c | 14 ++++++++++++++
>  2 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 89f532516a45..4a303aa735dd 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -113,6 +113,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>  bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  				u64 fault_address, char *insn, int insn_len);
> +void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu);
>  
>  int kvm_mmu_load(struct kvm_vcpu *vcpu);
>  void kvm_mmu_unload(struct kvm_vcpu *vcpu);
> @@ -184,8 +185,13 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  	u64 implicit_access = access & PFERR_IMPLICIT_ACCESS;
>  	bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;
>  	int index = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1;
> -	bool fault = (mmu->permissions[index] >> pte_access) & 1;
>  	u32 errcode = PFERR_PRESENT_MASK;
> +	bool fault;
> +
> +	if (tdp_enabled)
> +		kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
> +
> +	fault = (mmu->permissions[index] >> pte_access) & 1;
>  
>  	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
>  	if (unlikely(mmu->pkru_mask)) {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4c874d4ec68f..2a63b5725f36 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5186,6 +5186,20 @@ static union kvm_cpu_role kvm_calc_cpu_role(struct kvm_vcpu *vcpu,
>  	return role;
>  }
>  
> +void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> +{
> +	const bool cr0_wp = kvm_is_cr0_bit_set(vcpu, X86_CR0_WP);
> +
> +	BUILD_BUG_ON((KVM_MMU_CR0_ROLE_BITS & KVM_POSSIBLE_CR0_GUEST_BITS) != X86_CR0_WP);
> +	BUILD_BUG_ON((KVM_MMU_CR4_ROLE_BITS & KVM_POSSIBLE_CR4_GUEST_BITS));
> +
> +	if (is_cr0_wp(mmu) == cr0_wp)
> +		return;
> +
> +	mmu->cpu_role.base.cr0_wp = cr0_wp;
> +	reset_guest_paging_metadata(vcpu, mmu);
> +}
> +
>  static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
>  {
>  	/* tdp_root_level is architecture forced level, use it if nonzero */
> 
> base-commit: 27d6845d258b67f4eb3debe062b7dacc67e0c393

I tested a backported version of this patch on v6.1 as that's what I was
testing with and it worked fine. :)

I'll do more thorough tests tomorrow and actually on kvm-x86/next's HEAD.

Thanks a lot, Sean!

Mathias

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit
  2023-03-30 20:55         ` Mathias Krause
@ 2023-03-31 14:18           ` Mathias Krause
  0 siblings, 0 replies; 34+ messages in thread
From: Mathias Krause @ 2023-03-31 14:18 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, Paolo Bonzini, linux-kernel

On 30.03.23 22:55, Mathias Krause wrote:
> On 30.03.23 22:33, Sean Christopherson wrote:
>> I believe this will remedy the issue.  If it does, I'll post a proper patch
>> (likely won't be until next week).  Compile tested only.
>>
>> ---
>>  arch/x86/kvm/mmu.h     |  8 +++++++-
>>  arch/x86/kvm/mmu/mmu.c | 14 ++++++++++++++
>>  2 files changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index 89f532516a45..4a303aa735dd 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -113,6 +113,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>>  bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
>>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>>  				u64 fault_address, char *insn, int insn_len);
>> +void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu);
>>  
>>  int kvm_mmu_load(struct kvm_vcpu *vcpu);
>>  void kvm_mmu_unload(struct kvm_vcpu *vcpu);
>> @@ -184,8 +185,13 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>>  	u64 implicit_access = access & PFERR_IMPLICIT_ACCESS;
>>  	bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;
>>  	int index = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1;
>> -	bool fault = (mmu->permissions[index] >> pte_access) & 1;
>>  	u32 errcode = PFERR_PRESENT_MASK;
>> +	bool fault;
>> +
>> +	if (tdp_enabled)
>> +		kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
>> +
>> +	fault = (mmu->permissions[index] >> pte_access) & 1;
>>  
>>  	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
>>  	if (unlikely(mmu->pkru_mask)) {
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 4c874d4ec68f..2a63b5725f36 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -5186,6 +5186,20 @@ static union kvm_cpu_role kvm_calc_cpu_role(struct kvm_vcpu *vcpu,
>>  	return role;
>>  }
>>  
>> +void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
>> +{
>> +	const bool cr0_wp = kvm_is_cr0_bit_set(vcpu, X86_CR0_WP);
>> +
>> +	BUILD_BUG_ON((KVM_MMU_CR0_ROLE_BITS & KVM_POSSIBLE_CR0_GUEST_BITS) != X86_CR0_WP);
>> +	BUILD_BUG_ON((KVM_MMU_CR4_ROLE_BITS & KVM_POSSIBLE_CR4_GUEST_BITS));
>> +
>> +	if (is_cr0_wp(mmu) == cr0_wp)
>> +		return;
>> +
>> +	mmu->cpu_role.base.cr0_wp = cr0_wp;
>> +	reset_guest_paging_metadata(vcpu, mmu);
>> +}
>> +
>>  static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
>>  {
>>  	/* tdp_root_level is architecture forced level, use it if nonzero */
>>
>> base-commit: 27d6845d258b67f4eb3debe062b7dacc67e0c393
> 
> I tested a backported version of this patch on v6.1 as that's what I was
> testing with and it worked fine. :)
> 
> I'll do more thorough tests tomorrow and actually on kvm-x86/next's HEAD.

I extended the KUT CR0.WP tests and could confirm that (a) commit
fb509f76acc8 ("KVM: VMX: Make CR0.WP a guest owned bit") without the
above change triggers errors both ways in the emulator (denies access
while it should be allowed when CR0.WP=0; allows access while it should
trigger a page fault when CR0.WP=1) and (b) the above patch fixes these.

The tests are available here:
https://lore.kernel.org/kvm/20230331135709.132713-1-minipli@grsecurity.net/

Thanks,
Mathias

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users
  2023-03-25 12:25     ` Greg KH
@ 2023-04-06  2:25       ` Sean Christopherson
  2023-04-06 13:22         ` Mathias Krause
  0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2023-04-06  2:25 UTC (permalink / raw)
  To: Greg KH; +Cc: Mathias Krause, kvm, Paolo Bonzini, stable

On Sat, Mar 25, 2023, Greg KH wrote:
> On Sat, Mar 25, 2023 at 12:39:59PM +0100, Mathias Krause wrote:
> > On 23.03.23 23:50, Sean Christopherson wrote:
> > > On Wed, 22 Mar 2023 02:37:25 +0100, Mathias Krause wrote:
> > >> v3: https://lore.kernel.org/kvm/20230201194604.11135-1-minipli@grsecurity.net/
> > >>
> > >> This series is the fourth iteration of resurrecting the missing pieces of
> > >> Paolo's previous attempt[1] to avoid needless MMU roots unloading.
> > >>
> > >> It's incorporating Sean's feedback to v3 and rebased on top of
> > >> kvm-x86/next, namely commit d8708b80fa0e ("KVM: Change return type of
> > >> kvm_arch_vm_ioctl() to "int"").
> > >>
> > >> [...]
> > > 
> > > Applied 1 and 5 to kvm-x86 mmu, and the rest to kvm-x86 misc, thanks!
> > > 
> > > [1/6] KVM: x86/mmu: Avoid indirect call for get_cr3
> > >       https://github.com/kvm-x86/linux/commit/2fdcc1b32418
> > > [2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
> > >       https://github.com/kvm-x86/linux/commit/01b31714bd90
> > > [3/6] KVM: x86: Ignore CR0.WP toggles in non-paging mode
> > >       https://github.com/kvm-x86/linux/commit/e40bcf9f3a18
> > > [4/6] KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
> > >       https://github.com/kvm-x86/linux/commit/74cdc836919b
> > > [5/6] KVM: x86/mmu: Fix comment typo
> > >       https://github.com/kvm-x86/linux/commit/50f13998451e
> > > [6/6] KVM: VMX: Make CR0.WP a guest owned bit
> > >       https://github.com/kvm-x86/linux/commit/fb509f76acc8
> > 
> > Thanks a lot, Sean!
> > 
> > As this is a huge performance fix for us, we'd like to get it integrated
> > into current stable kernels as well -- not without having the changes
> > get some wider testing, of course, i.e. not before they end up in a
> > non-rc version released by Linus. But I already did a backport to 5.4 to
> > get a feeling how hard it would be and for the impact it has on older
> > kernels.
> > 
> > Using the 'ssdd 10 50000' test I used before, I get promising results
> > there as well. Without the patches it takes 9.31s, while with them we're
> > down to 4.64s. Taking into account that this is the runtime of a
> > workload in a VM that gets cut in half, I hope this qualifies as stable
> > material, as it's a huge performance fix.
> > 
> > Greg, what's your opinion on it? Original series here:
> > https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/
> 
> I'll leave the judgement call up to the KVM maintainers, as they are the
> ones that need to ack any KVM patch added to stable trees.

These are quite risky to backport.  E.g. we botched patch 6[*], and my initial
fix also had a subtle bug.  There have also been quite a few KVM MMU changes since
5.4, so it's possible that an edge case may exist in 5.4 that doesn't exist in
mainline.

I'm not totally opposed to the idea since our tests _should_ provide solid
coverage, e.g. existing tests caught my subtle bug, but I don't think we should
backport these without a solid usecase, as there is a fairly high risk of breaking
random KVM users that wouldn't see any meaningful benefit.

In other words, who cares enough about the performance of running grsecurity kernels
in VMs to want these backported, but doesn't have the resources to maintain (or pay
someone to maintain) their own host kernel?

[*] https://lkml.kernel.org/r/20230405002608.418442-1-seanjc%40google.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users
  2023-04-06  2:25       ` Sean Christopherson
@ 2023-04-06 13:22         ` Mathias Krause
  2023-04-14  9:29           ` Mathias Krause
  0 siblings, 1 reply; 34+ messages in thread
From: Mathias Krause @ 2023-04-06 13:22 UTC (permalink / raw)
  To: Sean Christopherson, Greg KH; +Cc: kvm, Paolo Bonzini, stable

On 06.04.23 04:25, Sean Christopherson wrote:
> On Sat, Mar 25, 2023, Greg KH wrote:
>> On Sat, Mar 25, 2023 at 12:39:59PM +0100, Mathias Krause wrote:
>>> As this is a huge performance fix for us, we'd like to get it integrated
>>> into current stable kernels as well -- not without having the changes
>>> get some wider testing, of course, i.e. not before they end up in a
>>> non-rc version released by Linus. But I already did a backport to 5.4 to
>>> get a feeling how hard it would be and for the impact it has on older
>>> kernels.
>>>
>>> Using the 'ssdd 10 50000' test I used before, I get promising results
>>> there as well. Without the patches it takes 9.31s, while with them we're
>>> down to 4.64s. Taking into account that this is the runtime of a
>>> workload in a VM that gets cut in half, I hope this qualifies as stable
>>> material, as it's a huge performance fix.
>>>
>>> Greg, what's your opinion on it? Original series here:
>>> https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/
>>
>> I'll leave the judgement call up to the KVM maintainers, as they are the
>> ones that need to ack any KVM patch added to stable trees.
> 
> These are quite risky to backport.  E.g. we botched patch 6[*], and my initial
> fix also had a subtle bug.  There have also been quite a few KVM MMU changes since
> 5.4, so it's possible that an edge case may exist in 5.4 that doesn't exist in
> mainline.

I totally agree. Getting the changes to work with older kernels needs
more work. The MMU role handling was refactored in 5.14 and down to 5.4
it differs even more, so backports to earlier kernels definitely need
more care.

My plan would be to limit backporting of the whole series to kernels
down to 5.15 (maybe 5.10 if it turns out to be doable) and for kernels
before that only without patch 6. That would leave out the problematic
change but still give us the benefits of dropping the needless mmu
unloads for only toggling CR0.WP in the VM. This already helps us a lot!

> 
> I'm not totally opposed to the idea since our tests _should_ provide solid
> coverage, e.g. existing tests caught my subtle bug, but I don't think we should
> backport these without a solid usecase, as there is a fairly high risk of breaking
> random KVM users that wouldn't see any meaningful benefit.
> 
> In other words, who cares enough about the performance of running grsecurity kernels
> in VMs to want these backported, but doesn't have the resources to maintain (or pay
> someone to maintain) their own host kernel?

The ones who care are, obviously, our customers -- and we, of course!
Customers that can run their own infrastructure don't need these
backports in upstream LTS kernels, as we will provide them as well.
However, customers that rent VMs in the cloud have no control of what
runs as host kernel. It'll likely be some distribution kernel or some
tailored version of that, which is likely based on one of the LTS kernels.

Proxmox[1], for example, is a Debian based virtualization management
system. They do provide their own kernels, based on 5.15. However, the
official Debian stable kernel is based on 5.10. So it would be nice to
get backports down to this version at least.

[1] https://www.proxmox.com/en/proxmox-ve/features

> 
> [*] https://lkml.kernel.org/r/20230405002608.418442-1-seanjc%40google.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users
  2023-04-06 13:22         ` Mathias Krause
@ 2023-04-14  9:29           ` Mathias Krause
  2023-04-14 16:49             ` Sean Christopherson
  0 siblings, 1 reply; 34+ messages in thread
From: Mathias Krause @ 2023-04-14  9:29 UTC (permalink / raw)
  To: Sean Christopherson, Greg KH; +Cc: kvm, Paolo Bonzini, stable

On 06.04.23 15:22, Mathias Krause wrote:
> On 06.04.23 04:25, Sean Christopherson wrote:
>> On Sat, Mar 25, 2023, Greg KH wrote:
>>> On Sat, Mar 25, 2023 at 12:39:59PM +0100, Mathias Krause wrote:
>>>> As this is a huge performance fix for us, we'd like to get it integrated
>>>> into current stable kernels as well -- not without having the changes
>>>> get some wider testing, of course, i.e. not before they end up in a
>>>> non-rc version released by Linus. But I already did a backport to 5.4 to
>>>> get a feeling how hard it would be and for the impact it has on older
>>>> kernels.
>>>>
>>>> Using the 'ssdd 10 50000' test I used before, I get promising results
>>>> there as well. Without the patches it takes 9.31s, while with them we're
>>>> down to 4.64s. Taking into account that this is the runtime of a
>>>> workload in a VM that gets cut in half, I hope this qualifies as stable
>>>> material, as it's a huge performance fix.
>>>>
>>>> Greg, what's your opinion on it? Original series here:
>>>> https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/
>>>
>>> I'll leave the judgement call up to the KVM maintainers, as they are the
>>> ones that need to ack any KVM patch added to stable trees.
>>
>> These are quite risky to backport.  E.g. we botched patch 6[*], and my initial
>> fix also had a subtle bug.  There have also been quite a few KVM MMU changes since
>> 5.4, so it's possible that an edge case may exist in 5.4 that doesn't exist in
>> mainline.
> 
> I totally agree. Getting the changes to work with older kernels needs
> more work. The MMU role handling was refactored in 5.14 and down to 5.4
> it differs even more, so backports to earlier kernels definitely need
> more care.
> 
> My plan would be to limit backporting of the whole series to kernels
> down to 5.15 (maybe 5.10 if it turns out to be doable) and for kernels
> before that only without patch 6. That would leave out the problematic
> change but still give us the benefits of dropping the needless mmu
> unloads for only toggling CR0.WP in the VM. This already helps us a lot!

To back up the "helps us a lot" with some numbers, here are the results
I got from running the 'ssdd 10 50000' micro-benchmark on the backports
I did, running on a grsecurity L1 VM (host is a vanilla kernel, as
stated below; runtime in seconds, lower is better):

                          legacy     TDP    shadow
    Linux v5.4.240          -        8.87s   56.8s
    + patches               -        5.84s   55.4s

    Linux v5.10.177       10.37s    88.7s    69.7s
    + patches              4.88s     4.92s   70.1s

    Linux v5.15.106        9.94s    66.1s    64.9s
    + patches              4.81s     4.79s   64.6s

    Linux v6.1.23          7.65s    8.23s    68.7s
    + patches              3.36s    3.36s    69.1s

    Linux v6.2.10          7.61s    7.98s    68.6s
    + patches              3.37s    3.41s    70.2s

I guess we can largely ignore the shadow MMU numbers, besides noting that
they regress from v5.4 to v5.10 (something to investigate?). The backports
don't help (much) for shadow MMU setups and the flux in the measurements
is likely related to the slab allocations involved.

Another unrelated data point is that TDP MMU is really broken for our
use case on v5.10 and v5.15 -- it's even slower than shadow paging!

OTOH, the backports give nice speed-ups, ranging from ~2.2 times faster
for pure EPT (legacy) MMU setups up to 18(!!!) times faster for TDP MMU
on v5.10.

I backported the whole series down to v5.10 but left out the CR0.WP
guest owning patch+fix for v5.4 as the code base is too different to get
all the nuances right, as Sean already hinted. However, even this
limited backport provides a big performance fix for our use case!

Thanks,
Mathias

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users
  2023-04-14  9:29           ` Mathias Krause
@ 2023-04-14 16:49             ` Sean Christopherson
  2023-04-14 20:09               ` Jeremi Piotrowski
  2023-05-08  9:19               ` Mathias Krause
  0 siblings, 2 replies; 34+ messages in thread
From: Sean Christopherson @ 2023-04-14 16:49 UTC (permalink / raw)
  To: Mathias Krause; +Cc: Greg KH, kvm, Paolo Bonzini, stable

+Jeremi

On Fri, Apr 14, 2023, Mathias Krause wrote:
> On 06.04.23 15:22, Mathias Krause wrote:
> > On 06.04.23 04:25, Sean Christopherson wrote:
> >> These are quite risky to backport.  E.g. we botched patch 6[*], and my initial
> >> fix also had a subtle bug.  There have also been quite a few KVM MMU changes since
> >> 5.4, so it's possible that an edge case may exist in 5.4 that doesn't exist in
> >> mainline.
> > 
> > I totally agree. Getting the changes to work with older kernels needs
> > more work. The MMU role handling was refactored in 5.14 and down to 5.4
> > it differs even more, so backports to earlier kernels definitely need
> > more care.
> > 
> > My plan would be to limit backporting of the whole series to kernels
> > down to 5.15 (maybe 5.10 if it turns out to be doable) and for kernels
> > before that only without patch 6. That would leave out the problematic
> > change but still give us the benefits of dropping the needless mmu
> > unloads for only toggling CR0.WP in the VM. This already helps us a lot!
> 
> To back up the "helps us a lot" with some numbers, here are the results
> I got from running the 'ssdd 10 50000' micro-benchmark on the backports
> I did, running on a grsecurity L1 VM (host is a vanilla kernel, as
> stated below; runtime in seconds, lower is better):
> 
>                           legacy     TDP    shadow
>     Linux v5.4.240          -        8.87s   56.8s
>     + patches               -        5.84s   55.4s

I believe "legacy" and "TDP" are flip-flopped, the TDP MMU does't exist in v5.4.

>     Linux v5.10.177       10.37s    88.7s    69.7s
>     + patches              4.88s     4.92s   70.1s
> 
>     Linux v5.15.106        9.94s    66.1s    64.9s
>     + patches              4.81s     4.79s   64.6s
> 
>     Linux v6.1.23          7.65s    8.23s    68.7s
>     + patches              3.36s    3.36s    69.1s
> 
>     Linux v6.2.10          7.61s    7.98s    68.6s
>     + patches              3.37s    3.41s    70.2s
> 
> I guess we can largely ignore the shadow MMU numbers, besides noting that
> they regress from v5.4 to v5.10 (something to investigate?). The backports
> don't help (much) for shadow MMU setups and the flux in the measurements
> is likely related to the slab allocations involved.
> 
> Another unrelated data point is that TDP MMU is really broken for our
> use case on v5.10 and v5.15 -- it's even slower than shadow paging!
> 
> OTOH, the backports give nice speed-ups, ranging from ~2.2 times faster
> for pure EPT (legacy) MMU setups up to 18(!!!) times faster for TDP MMU
> on v5.10.

Anyone that's enabling the TDP MMU on v5.10 is on their own, we didn't enable the
TDP MMU by default until v5.14 for very good reasons.

> I backported the whole series down to v5.10 but left out the CR0.WP
> guest owning patch+fix for v5.4 as the code base is too different to get
> all the nuances right, as Sean already hinted. However, even this
> limited backport provides a big performance fix for our use case!

As a compromise of sorts, I propose that we disable the TDP MMU by default on v5.15,
and backport these fixes to v6.1.  v5.15 and earlier won't get "ludicrous speed", but
I think that's perfectly acceptable since KVM has had the suboptimal behavior
literally since EPT/NPT support was first added.

I'm comfortable backporting to v6.1 as that is recent enough, and there weren't
substantial MMU changes between v6.1 and v6.3 in this area.  I.e. I have a decent
level of confidence that we aren't overlooking some subtle dependency.

For v5.15, I am less confident in the safety of a backport, and more importantly,
I think we should disable the TDP MMU by default to mitigate the underlying flaw
that makes the 18x speedup possible.  That flaw is that KVM can end up freeing and
rebuilding TDP MMU roots every time CR0.WP is toggled or a vCPU transitions to/from
SMM.

We mitigated the CR0.WP case between v5.15 and v6.1[1], which is why v6.1 doesn't
exhibit the same pain as v5.10, but Jeremi discovered that the SMM case badly affects
KVM-on-HyperV[2], e.g. when launching KVM guests using WSL.  I posted a fix[3] to
finally resolve the underlying bug, but as Jeremi discovered[4], backporting the fix
to v5.15 is going to be gnarly, to say the least.  It'll be far worse than backporting
these CR0.WP patches, and maybe even infeasible without a large scale rework (no thanks).

Anyone that will realize meaningful benefits from the TDP MMU is all but guaranteed
to be rolling their own kernels, i.e. can do the backports themselves if they want
to use a v5.15 based kernel.  The big selling point of the TDP MMU is that it scales
better to hundreds of vCPUs, particularly when live migrating such VMs.  I highly
doubt that anyone running a stock kernel is running 100+ vCPU VMs, let alone trying
to live migrate them.

[1] https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
[2] https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
[3] https://lore.kernel.org/all/20230413231251.1481410-1-seanjc@google.com
[4] https://lore.kernel.org/all/7332d846-fada-eb5c-6068-18ff267bd37f@linux.microsoft.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users
  2023-04-14 16:49             ` Sean Christopherson
@ 2023-04-14 20:09               ` Jeremi Piotrowski
  2023-04-14 20:17                 ` Sean Christopherson
  2023-05-08  9:19               ` Mathias Krause
  1 sibling, 1 reply; 34+ messages in thread
From: Jeremi Piotrowski @ 2023-04-14 20:09 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Mathias Krause, Greg KH, kvm, Paolo Bonzini, stable

On Fri, Apr 14, 2023 at 09:49:28AM -0700, Sean Christopherson wrote:
> +Jeremi
> 

Adding myself :)

> On Fri, Apr 14, 2023, Mathias Krause wrote:

...

> > OTOH, the backports give nice speed-ups, ranging from ~2.2 times faster
> > for pure EPT (legacy) MMU setups up to 18(!!!) times faster for TDP MMU
> > on v5.10.
> 
> Anyone that's enabling the TDP MMU on v5.10 is on their own, we didn't enable the
> TDP MMU by default until v5.14 for very good reasons.
> 
> > I backported the whole series down to v5.10 but left out the CR0.WP
> > guest owning patch+fix for v5.4 as the code base is too different to get
> > all the nuances right, as Sean already hinted. However, even this
> > limited backport provides a big performance fix for our use case!
> 
> As a compromise of sorts, I propose that we disable the TDP MMU by default on v5.15,
> and backport these fixes to v6.1.  v5.15 and earlier won't get "ludicrous speed", but
> I think that's perfectly acceptable since KVM has had the suboptimal behavior
> literally since EPT/NPT support was first added.
> 

Disabling TDP MMU for v5.15, and backporting things to v6.1 works for me.

> I'm comfortable backporting to v6.1 as that is recent enough, and there weren't
> substantial MMU changes between v6.1 and v6.3 in this area.  I.e. I have a decent
> level of confidence that we aren't overlooking some subtle dependency.
> 
> For v5.15, I am less confident in the safety of a backport, and more importantly,
> I think we should disable the TDP MMU by default to mitigate the underlying flaw
> that makes the 18x speedup possible.  That flaw is that KVM can end up freeing and
> rebuilding TDP MMU roots every time CR0.WP is toggled or a vCPU transitions to/from
> SMM.
> 

The interesting thing here is that these CR0.WP fixes seem to improve things
with legacy MMU as well, and legacy MMU is not affected/touched by [3].

So I think you can consider Mathias' ask independent of disabling TDP MMU. On the one
hand: there is no regression here. On the other: the gain is big and seems important
to him.

I didn't have time to review these patches so I can't judge risk-benefit, or
whether any single patch might be a silver bullet on its own.

> We mitigated the CR0.WP case between v5.15 and v6.1[1], which is why v6.1 doesn't
> exhibit the same pain as v5.10, but Jeremi discovered that the SMM case badly affects
> KVM-on-HyperV[2], e.g. when launching KVM guests using WSL.  I posted a fix[3] to
> finally resolve the underlying bug, but as Jeremi discovered[4], backporting the fix
> to v5.15 is going to be gnarly, to say the least.  It'll be far worse than backporting
> these CR0.WP patches, and maybe even infeasible without a large scale rework (no thanks).
> 
> Anyone that will realize meaningful benefits from the TDP MMU is all but guaranteed
> to be rolling their own kernels, i.e. can do the backports themselves if they want
> to use a v5.15 based kernel.  The big selling point of the TDP MMU is that it scales
> better to hundreds of vCPUs, particularly when live migrating such VMs.  I highly
> doubt that anyone running a stock kernel is running 100+ vCPU VMs, let alone trying
> to live migrate them.
> 
> [1] https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
> [2] https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
> [3] https://lore.kernel.org/all/20230413231251.1481410-1-seanjc@google.com
> [4] https://lore.kernel.org/all/7332d846-fada-eb5c-6068-18ff267bd37f@linux.microsoft.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users
  2023-04-14 20:09               ` Jeremi Piotrowski
@ 2023-04-14 20:17                 ` Sean Christopherson
  2023-05-02 17:38                   ` Jeremi Piotrowski
  0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2023-04-14 20:17 UTC (permalink / raw)
  To: Jeremi Piotrowski; +Cc: Mathias Krause, Greg KH, kvm, Paolo Bonzini, stable

On Fri, Apr 14, 2023, Jeremi Piotrowski wrote:
> On Fri, Apr 14, 2023 at 09:49:28AM -0700, Sean Christopherson wrote:
> > +Jeremi
> > 
> 
> Adding myself :)

/facepalm

This isn't some mundane detail, Michael!!!

> > On Fri, Apr 14, 2023, Mathias Krause wrote:
> 
> ...
> 
> > > OTOH, the backports give nice speed-ups, ranging from ~2.2 times faster
> > > for pure EPT (legacy) MMU setups up to 18(!!!) times faster for TDP MMU
> > > on v5.10.
> > 
> > Anyone that's enabling the TDP MMU on v5.10 is on their own, we didn't enable the
> > TDP MMU by default until v5.14 for very good reasons.
> > 
> > > I backported the whole series down to v5.10 but left out the CR0.WP
> > > guest owning patch+fix for v5.4 as the code base is too different to get
> > > all the nuances right, as Sean already hinted. However, even this
> > > limited backport provides a big performance fix for our use case!
> > 
> > As a compromise of sorts, I propose that we disable the TDP MMU by default on v5.15,
> > and backport these fixes to v6.1.  v5.15 and earlier won't get "ludicrous speed", but
> > I think that's perfectly acceptable since KVM has had the suboptimal behavior
> > literally since EPT/NPT support was first added.
> > 
> 
> Disabling TDP MMU for v5.15, and backporting things to v6.1 works for me.
> 
> > I'm comfortable backporting to v6.1 as that is recent enough, and there weren't
> > substantial MMU changes between v6.1 and v6.3 in this area.  I.e. I have a decent
> > level of confidence that we aren't overlooking some subtle dependency.
> > 
> > For v5.15, I am less confident in the safety of a backport, and more importantly,
> > I think we should disable the TDP MMU by default to mitigate the underlying flaw
> > that makes the 18x speedup possible.  That flaw is that KVM can end up freeing and
> > rebuilding TDP MMU roots every time CR0.WP is toggled or a vCPU transitions to/from
> > SMM.
> > 
> 
> The interesting thing here is that these CR0.WP fixes seem to improve things
> with legacy MMU as well, and legacy MMU is not affected/touched by [3].

Yep, that's totally expected.  The final patch in this series allows KVM to elide
VM-Exits when the guest toggles CR0.WP (but only on Intel hardware).  Avoiding
VM-Exit entirely is a big performance win when the guest is constantly toggling
CR0.WP, e.g. each exit is roughly 1500 cycles, versus probably something like ~50
for a native write to CR0.WP.
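
Back-of-the-envelope with those (rough, assumed) numbers, for a guest doing
100,000 CR0.WP toggles:

  intercepted:  100,000 exits  * ~1,500 cycles ~= 150M cycles
  guest owned:  100,000 writes *    ~50 cycles ~=   5M cycles

i.e. about a 30x reduction on the writes themselves, before even counting
the MMU work the exits used to trigger.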

> So I think you can consider Mathias' ask independent of disabling TDP MMU. On the one
> hand: there is no regression here. On the other: the gain is big and seems important
> to him.

Ya, that's the compromise I am proposing.  Give v6.1 the full tune-up, but only
do the super safe change for v5.15.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users
  2023-04-14 20:17                 ` Sean Christopherson
@ 2023-05-02 17:38                   ` Jeremi Piotrowski
  0 siblings, 0 replies; 34+ messages in thread
From: Jeremi Piotrowski @ 2023-05-02 17:38 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Mathias Krause, Greg KH, kvm, Paolo Bonzini, stable

On 4/14/2023 10:17 PM, Sean Christopherson wrote:
> On Fri, Apr 14, 2023, Jeremi Piotrowski wrote:
>> On Fri, Apr 14, 2023 at 09:49:28AM -0700, Sean Christopherson wrote:
>>> +Jeremi
>>>
>>
>> Adding myself :)
> 
> /facepalm
> 
> This isn't some mundane detail, Michael!!!
> 
>>> On Fri, Apr 14, 2023, Mathias Krause wrote:
>>
>> ...
>>
>>>> OTOH, the backports give nice speed-ups, ranging from ~2.2 times faster
>>>> for pure EPT (legacy) MMU setups up to 18(!!!) times faster for TDP MMU
>>>> on v5.10.
>>>
>>> Anyone that's enabling the TDP MMU on v5.10 is on their own, we didn't enable the
>>> TDP MMU by default until v5.14 for very good reasons.
>>>
>>>> I backported the whole series down to v5.10 but left out the CR0.WP
>>>> guest owning patch+fix for v5.4 as the code base is too different to get
>>>> all the nuances right, as Sean already hinted. However, even this
>>>> limited backport provides a big performance fix for our use case!
>>>
>>> As a compromise of sorts, I propose that we disable the TDP MMU by default on v5.15,
>>> and backport these fixes to v6.1.  v5.15 and earlier won't get "ludicrous speed", but
>>> I think that's perfectly acceptable since KVM has had the suboptimal behavior
>>> literally since EPT/NPT support was first added.
>>>
>>
>> Disabling TDP MMU for v5.15, and backporting things to v6.1 works for me.
>>
>>> I'm comfortable backporting to v6.1 as that is recent enough, and there weren't
>>> substantial MMU changes between v6.1 and v6.3 in this area.  I.e. I have a decent
>>> level of confidence that we aren't overlooking some subtle dependency.
>>>
>>> For v5.15, I am less confident in the safety of a backport, and more importantly,
>>> I think we should disable the TDP MMU by default to mitigate the underlying flaw
>>> that makes the 18x speedup possible.  That flaw is that KVM can end up freeing and
>>> rebuilding TDP MMU roots every time CR0.WP is toggled or a vCPU transitions to/from
>>> SMM.
>>>
>>
So given that there hasn't been any further comms, I assume we stick to the plan outlined
above: disable tdp_mmu by default for 5.15.

Should that just be a revert of 71ba3f3189c78f756a659568fb473600fd78f207
("KVM: x86: enable TDP MMU by default") or a new patch? And, more
importantly: do you want to post the patch, Sean, or are you busy and would
prefer someone else did it?
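
(For anyone on v5.15 in the meantime, the existing module parameter should
be enough to flip it by hand at load time, e.g.:

  modprobe -r kvm_intel kvm
  modprobe kvm tdp_mmu=0
  modprobe kvm_intel

(kvm_amd accordingly), or the equivalent options line in /etc/modprobe.d/.)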

Jeremi

* Re: [PATCH v4 2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
  2023-03-22  1:37 ` [PATCH v4 2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled Mathias Krause
@ 2023-05-07  7:32   ` Robert Hoo
  2023-05-08  9:30     ` Mathias Krause
  0 siblings, 1 reply; 34+ messages in thread
From: Robert Hoo @ 2023-05-07  7:32 UTC (permalink / raw)
  To: Mathias Krause, kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini

On 3/22/2023 9:37 AM, Mathias Krause wrote:
> There is no need to unload the MMU roots with TDP enabled when only
> CR0.WP has changed -- the paging structures are still valid, only the
> permission bitmap needs to be updated.
> 
> One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
> implement kernel W^X.
> 
> The optimization brings a huge performance gain for this case as the
> following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
> grsecurity L1 VM shows (runtime in seconds, lower is better):
> 
>                         legacy     TDP    shadow
> kvm-x86/next@d8708b     8.43s    9.45s    70.3s
>               +patch     5.39s    5.63s    70.2s
> 
> For legacy MMU this is ~36% faster, for TTP MMU even ~40% faster. 

TTP --> TDP

>   void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>   {
> +	/*
> +	 * CR0.WP is incorporated into the MMU role, but only for non-nested,
> +	 * indirect shadow MMUs.  If TDP is enabled, the MMU's metadata needs
> +	 * to be updated, e.g. so that emulating guest translations does the
> +	 * right thing, but there's no need to unload the root as CR0.WP
> +	 * doesn't affect SPTEs.
> +	 */
> +	if (tdp_enabled && (cr0 ^ old_cr0) == X86_CR0_WP) {

Curiously, this patch only affects tdp_enabled, why does legacy MMU also 
see comparable performance gains?

* Re: [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users
  2023-04-14 16:49             ` Sean Christopherson
  2023-04-14 20:09               ` Jeremi Piotrowski
@ 2023-05-08  9:19               ` Mathias Krause
  2023-05-08 15:57                 ` Mathias Krause
  1 sibling, 1 reply; 34+ messages in thread
From: Mathias Krause @ 2023-05-08  9:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Greg KH, kvm, Paolo Bonzini, stable, Jeremi Piotrowski

Sorry for the late reply, I've been traveling the past three weeks.

On 14.04.23 18:49, Sean Christopherson wrote:
> +Jeremi
> 
> On Fri, Apr 14, 2023, Mathias Krause wrote:
>> On 06.04.23 15:22, Mathias Krause wrote:
>>> On 06.04.23 04:25, Sean Christopherson wrote:
>>>> These are quite risky to backport.  E.g. we botched patch 6[*], and my initial
>>>> fix also had a subtle bug.  There have also been quite a few KVM MMU changes since
>>>> 5.4, so it's possible that an edge case may exist in 5.4 that doesn't exist in
>>>> mainline.
>>>
>>> I totally agree. Getting the changes to work with older kernels needs
>>> more work. The MMU role handling was refactored in 5.14 and down to 5.4
>>> it differs even more, so backports to earlier kernels definitely need
>>> more care.
>>>
>>> My plan would be to limit backporting of the whole series to kernels
>>> down to 5.15 (maybe 5.10 if it turns out to be doable) and for kernels
>>> before that only without patch 6. That would leave out the problematic
>>> change but still give us the benefits of dropping the needless mmu
>>> unloads for only toggling CR0.WP in the VM. This already helps us a lot!
>>
>> To back up the "helps us a lot" with some numbers, here are the results
>> I got from running the 'ssdd 10 50000' micro-benchmark on the backports
>> I did, running on a grsecurity L1 VM (host is a vanilla kernel, as
>> stated below; runtime in seconds, lower is better):
>>
>>                           legacy     TDP    shadow
>>     Linux v5.4.240          -        8.87s   56.8s
>>     + patches               -        5.84s   55.4s
> 
> I believe "legacy" and "TDP" are flip-flopped, the TDP MMU doesn't exist in v5.4.

Well, whatever "TDP" means in v5.4 -- 'tdp_enabled' completely mirrors the
value of 'enable_ept' / 'npt_enabled'. But yeah, it probably corresponds to
what "legacy" means for newer kernels.

> 
>>     Linux v5.10.177       10.37s    88.7s    69.7s
>>     + patches              4.88s     4.92s   70.1s
>>
>>     Linux v5.15.106        9.94s    66.1s    64.9s
>>     + patches              4.81s     4.79s   64.6s
>>
>>     Linux v6.1.23          7.65s    8.23s    68.7s
>>     + patches              3.36s    3.36s    69.1s
>>
>>     Linux v6.2.10          7.61s    7.98s    68.6s
>>     + patches              3.37s    3.41s    70.2s
>>
>> I guess we can grossly ignore the shadow MMU numbers, besides noting
>> that they regress from v5.4 to v5.10 (something to investigate?). The
>> backports don't help (much) for shadow MMU setups and the flux in the
>> measurements is likely related to the slab allocations involved.
>>
>> Another unrelated data point is that TDP MMU is really broken for our
>> use case on v5.10 and v5.15 -- it's even slower than shadow paging!
>>
>> OTOH, the backports give nice speed-ups, ranging from ~2.2 times faster
>> for pure EPT (legacy) MMU setups up to 18(!!!) times faster for TDP MMU
>> on v5.10.
> 
> Anyone that's enabling the TDP MMU on v5.10 is on their own, we didn't enable the
> TDP MMU by default until v5.14 for very good reasons.

Fair enough. But the numbers don't look much better for v5.15, so we
still want to fix that performance degradation when using TDP MMU (we
used to have a patch that disables TDP MMU in grsec by default but this,
of course, has no impact on setups making use of a vanilla / distro host
kernel and using grsec in the guest VM).

> 
>> I backported the whole series down to v5.10 but left out the CR0.WP
>> guest owning patch+fix for v5.4 as the code base is too different to get
>> all the nuances right, as Sean already hinted. However, even this
>> limited backport provides a big performance fix for our use case!
> 
> As a compromise of sorts, I propose that we disable the TDP MMU by default on v5.15,
> and backport these fixes to v6.1.  v5.15 and earlier won't get "ludicrous speed", but
> I think that's perfectly acceptable since KVM has had the suboptimal behavior
> literally since EPT/NPT support was first added.

The issue only started to get really bad when TDP MMU was enabled by
default in 5.14. That's why we reverted that change in grsecurity right
away. Integrating that change upstream will get us back to the pre-5.14
performance numbers but why not do better and fix the underlying bug by
backporting this series?

> 
> I'm comfortable backporting to v6.1 as that is recent enough, and there weren't
> substantial MMU changes between v6.1 and v6.3 in this area.  I.e. I have a decent
> level of confidence that we aren't overlooking some subtle dependency.

Agreed, the backports down to v6.1 were trivial.

> 
> For v5.15, I am less confident in the safety of a backport, and more importantly,
> I think we should disable the TDP MMU by default to mitigate the underlying flaw
> that makes the 18x speedup possible.  That flaw is that KVM can end up freeing and
> rebuilding TDP MMU roots every time CR0.WP is toggled or a vCPU transitions to/from
> SMM.

For v5.15 a few more commits are needed to ensure all requirements are
met, e.g. that no guest owned CR4 bits overlap with KVM's MMU role. But
that's manageable, IMHO, as some parts already went into previous stable
updates, making the missing net diff for the backport +6 -5 lines.
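
A hypothetical compile-time assertion -- macro names as in mainline at the
time, purely illustrative -- of what the backport has to keep satisfied:

    /* no guest owned CR4 bit may feed into the MMU role */
    BUILD_BUG_ON(KVM_MMU_CR4_ROLE_BITS & KVM_POSSIBLE_CR4_GUEST_BITS);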

> 
> We mitigated the CR0.WP case between v5.15 and v6.1[1], which is why v6.1 doesn't
> exhibit the same pain as v5.10, but Jeremi discovered that the SMM case badly affects
> KVM-on-HyperV[2], e.g. when launching KVM guests using WSL.  I posted a fix[3] to
> finally resolve the underlying bug, but as Jeremi discovered[4], backporting the fix
> to v5.15 is going to be gnarly, to say the least.  It'll be far worse than backporting
> these CR0.WP patches, and maybe even infeasible without a large scale rework (no thanks).

So disabling TDP MMU for v5.15 seems to mitigate both performance
degradations, Jeremi's and ours. However, I'd still like to get the extra
speedup. As I wrote, I already did the backports and it wasn't all that
bad in the end. I just had to map the code of older kernels to what newer
kernels would do, and it's not that far away, actually. It's the cleanup
patches that make it look a lot different; for the MMU role it's actually
not all that different. It's a completely different story for v5.4, sure.
But I don't propose to backport the full series to that kernel either.

> 
> Anyone that will realize meaningful benefits from the TDP MMU is all but guaranteed
> to be rolling their own kernels, i.e. can do the backports themselves if they want
> to use a v5.15 based kernel.  The big selling point of the TDP MMU is that it scales
> better to hundreds of vCPUs, particularly when live migrating such VMs.  I highly
> doubt that anyone running a stock kernel is running 100+ vCPU VMs, let alone trying
> to live migrate them.

I'll post the backports I did and maybe I can convince you as well that
it's not all that bad ;) But I see your proposed patch [3] got merged in
the meantime and is cc:stable. Might make sense to re-do my benchmarks
after it got applied to the older kernels.

Thanks,
Mathias

> 
> [1] https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
> [2] https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
> [3] https://lore.kernel.org/all/20230413231251.1481410-1-seanjc@google.com
> [4] https://lore.kernel.org/all/7332d846-fada-eb5c-6068-18ff267bd37f@linux.microsoft.com

* Re: [PATCH v4 2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
  2023-05-07  7:32   ` Robert Hoo
@ 2023-05-08  9:30     ` Mathias Krause
  2023-05-09  1:04       ` Robert Hoo
  0 siblings, 1 reply; 34+ messages in thread
From: Mathias Krause @ 2023-05-08  9:30 UTC (permalink / raw)
  To: Robert Hoo, kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini

On 07.05.23 09:32, Robert Hoo wrote:
> On 3/22/2023 9:37 AM, Mathias Krause wrote:
>> There is no need to unload the MMU roots with TDP enabled when only
>> CR0.WP has changed -- the paging structures are still valid, only the
>> permission bitmap needs to be updated.
>>
>> One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
>> implement kernel W^X.
>>
>> The optimization brings a huge performance gain for this case as the
>> following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
>> grsecurity L1 VM shows (runtime in seconds, lower is better):
>>
>>                         legacy     TDP    shadow
>> kvm-x86/next@d8708b     8.43s    9.45s    70.3s
>>               +patch     5.39s    5.63s    70.2s
>>
>> For legacy MMU this is ~36% faster, for TTP MMU even ~40% faster. 
> 
> TTP --> TDP

Thanks, Sean fixed it up in the final commit:
https://git.kernel.org/linus/01b31714bd90

> 
>>   void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0,
>> unsigned long cr0)
>>   {
>> +    /*
>> +     * CR0.WP is incorporated into the MMU role, but only for
>> non-nested,
>> +     * indirect shadow MMUs.  If TDP is enabled, the MMU's metadata
>> needs
>> +     * to be updated, e.g. so that emulating guest translations does the
>> +     * right thing, but there's no need to unload the root as CR0.WP
>> +     * doesn't affect SPTEs.
>> +     */
>> +    if (tdp_enabled && (cr0 ^ old_cr0) == X86_CR0_WP) {
> 
> Curiously, this patch only affects tdp_enabled, why does legacy MMU also
> see comparable performance gains?

Because 'tdp_enabled' just implies EPT / NPT being enabled, while only
'tdp_mmu_enabled' decides which MMU mode to use -- either legacy or TDP MMU
(see kvm_configure_mmu(), which now gets invoked from vmx.c / svm.c).
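
Roughly, as a sketch with names as in mainline at the time, not the
literal code:

    void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
                           int tdp_max_root_level, int tdp_huge_page_level)
    {
            tdp_enabled = enable_tdp;  /* i.e. EPT / NPT is in use at all */
            /* ... */
            /* the TDP *MMU* additionally requires the module param */
            tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
    }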

Thanks,
Mathias

* Re: [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users
  2023-05-08  9:19               ` Mathias Krause
@ 2023-05-08 15:57                 ` Mathias Krause
  0 siblings, 0 replies; 34+ messages in thread
From: Mathias Krause @ 2023-05-08 15:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Greg KH, kvm, Paolo Bonzini, stable, Jeremi Piotrowski

On 08.05.23 11:19, Mathias Krause wrote:
> [...]
> 
> I'll post the backports I did and maybe I can convince you as well that
> it's not all that bad ;) But I see your proposed patch [3] got merged in
> the meantime and is cc:stable. Might make sense to re-do my benchmarks
> after it got applied to the older kernels.

I just sent out the backports to the mailing list, please have a look!

The benchmark numbers are still the old ones I did three weeks ago.
However, seeing what's needed for the backport might give you a better
feeling for the impact.

I'll refresh the series if there's demand and once edbdb43fc96b ("KVM:
x86: Preserve TDP MMU roots until they are explicitly invalidated") gets
merged into the relevant kernels.

v6.2:
https://lore.kernel.org/stable/20230508154457.29956-1-minipli@grsecurity.net/
v6.1:
https://lore.kernel.org/stable/20230508154602.30008-1-minipli@grsecurity.net/
v5.15:
https://lore.kernel.org/stable/20230508154709.30043-1-minipli@grsecurity.net/
v5.10:
https://lore.kernel.org/stable/20230508154804.30078-1-minipli@grsecurity.net/
v5.4:
https://lore.kernel.org/stable/20230508154943.30113-1-minipli@grsecurity.net/

Thanks,
Mathias

* Re: [PATCH v4 2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
  2023-05-08  9:30     ` Mathias Krause
@ 2023-05-09  1:04       ` Robert Hoo
  0 siblings, 0 replies; 34+ messages in thread
From: Robert Hoo @ 2023-05-09  1:04 UTC (permalink / raw)
  To: Mathias Krause, kvm; +Cc: linux-kernel, Sean Christopherson, Paolo Bonzini

On 5/8/2023 5:30 PM, Mathias Krause wrote:
>>>    void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0,
>>> unsigned long cr0)
>>>    {
>>> +    /*
>>> +     * CR0.WP is incorporated into the MMU role, but only for
>>> non-nested,
>>> +     * indirect shadow MMUs.  If TDP is enabled, the MMU's metadata
>>> needs
>>> +     * to be updated, e.g. so that emulating guest translations does the
>>> +     * right thing, but there's no need to unload the root as CR0.WP
>>> +     * doesn't affect SPTEs.
>>> +     */
>>> +    if (tdp_enabled && (cr0 ^ old_cr0) == X86_CR0_WP) {
>>
>> Curiously, this patch only affects tdp_enabled, why does legacy MMU also
>> see comparable performance gains?
> 
> Because 'tdp_enabled' just implies EPT / NPT being enabled, while only
> 'tdp_mmu_enabled' decides which MMU mode to use -- either legacy or TDP MMU
> (see kvm_configure_mmu(), which now gets invoked from vmx.c / svm.c).
> 
Ah, got it, thanks. The name indeed confused me (and perhaps others).
After digging into it:
1. The kvm module has a param "tdp_mmu_enabled", which (in the first place)
indicates KVM's desire to enable two dimensional paging. Whether that
actually happens, however, ultimately depends on EPT / NPT being enabled at
the vendor layer. So the kvm module init phase uses "tdp_mmu_allowed" to
record this desire in the interim:
	/*
	 * Snapshot userspace's desire to enable the TDP MMU. Whether or not the
	 * TDP MMU is actually enabled is determined in kvm_configure_mmu()
	 * when the vendor module is loaded.
	 */
	tdp_mmu_allowed = tdp_mmu_enabled;
2. When the vendor module initializes --> kvm_configure_mmu():
	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;

    tdp_mmu_enabled's semantics thus become, as its name indicates, the
eventual TDP MMU enablement status.

    And tdp_enabled is the general (ept_enabled | npt_enabled).
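
(So the effective setting can be inspected at runtime via
/sys/module/kvm/parameters/tdp_mmu, assuming the running kernel exposes
the param.)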


Thread overview: 34+ messages
2023-03-22  1:37 [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
2023-03-22  1:37 ` [PATCH v4 1/6] KVM: x86/mmu: Avoid indirect call for get_cr3 Mathias Krause
2023-03-22  1:37 ` [PATCH v4 2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled Mathias Krause
2023-05-07  7:32   ` Robert Hoo
2023-05-08  9:30     ` Mathias Krause
2023-05-09  1:04       ` Robert Hoo
2023-03-22  1:37 ` [PATCH v4 3/6] KVM: x86: Ignore CR0.WP toggles in non-paging mode Mathias Krause
2023-03-22  1:37 ` [PATCH v4 4/6] KVM: x86: Make use of kvm_read_cr*_bits() when testing bits Mathias Krause
2023-03-22  1:37 ` [PATCH v4 5/6] KVM: x86/mmu: Fix comment typo Mathias Krause
2023-03-22  1:37 ` [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit Mathias Krause
2023-03-27  8:33   ` Xiaoyao Li
2023-03-27  8:37     ` Mathias Krause
2023-03-27 13:48       ` Xiaoyao Li
2023-03-30  8:45   ` Mathias Krause
2023-03-30 17:12     ` Sean Christopherson
2023-03-30 20:15       ` Mathias Krause
2023-03-30 20:30         ` Mathias Krause
2023-03-30 20:36           ` Sean Christopherson
2023-03-30 20:33       ` Sean Christopherson
2023-03-30 20:55         ` Mathias Krause
2023-03-31 14:18           ` Mathias Krause
2023-03-22  7:41 ` [PATCH v4 0/6] KVM: MMU: performance tweaks for heavy CR0.WP users Mathias Krause
2023-03-23 22:50 ` Sean Christopherson
2023-03-25 11:39   ` Mathias Krause
2023-03-25 12:25     ` Greg KH
2023-04-06  2:25       ` Sean Christopherson
2023-04-06 13:22         ` Mathias Krause
2023-04-14  9:29           ` Mathias Krause
2023-04-14 16:49             ` Sean Christopherson
2023-04-14 20:09               ` Jeremi Piotrowski
2023-04-14 20:17                 ` Sean Christopherson
2023-05-02 17:38                   ` Jeremi Piotrowski
2023-05-08  9:19               ` Mathias Krause
2023-05-08 15:57                 ` Mathias Krause
