* [PATCH 0/7] KVM: random nested fixes
From: Maxim Levitsky @ 2021-02-17 14:57 UTC
  To: kvm
  Cc: linux-kernel, Wanpeng Li, Borislav Petkov, Paolo Bonzini,
	Joerg Roedel, Jim Mattson, H. Peter Anvin, Sean Christopherson,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar, Maxim Levitsky

This is a set of mostly random fixes I have in my patch queue.

- Patches 1 and 2 are minor tracing fixes from a patch series I sent
  some time ago that I don't want to get lost in the noise.

- Patches 3 and 4 fix a theoretical bug in VMX with ept=0, and also
  allow the nested_vmx_load_cr3 call to be moved a bit, to make sure
  that the update to .inject_page_fault is not lost while entering a
  nested guest.

- Patch 5 fixes running nested guests with npt=0 on the host, which is
  sometimes useful for debugging and such (especially of nesting).

- Patch 6 fixes a (mostly theoretical) issue with PDPTR loading on VMX
  after nested migration.

- Patch 7 is hopefully the correct fix to eliminate an L0 crash seen
  in some rare cases when a HyperV guest is migrated.

This was tested with kvm-unit-tests on both VMX and SVM,
both natively and in a VM.
Some tests fail on VMX, but I haven't observed any new test failures
caused by these changes.

This patch series was also tested by running my nested migration test with:
    1. npt/ept disabled on the host
    2. npt/ept enabled on the host and disabled in the L1
    3. npt/ept enabled on both.

With npt/ept=0 on the host (both on Intel and AMD), the L2 guest
eventually crashed, but I strongly suspect a bug in the shadow MMU,
which I am tracking separately
(see below for the full explanation).

This patch series is based on the kvm/queue branch.

Best regards,
	Maxim Levitsky

PS: The shadow MMU bug which I spent most of this week on:

In my testing I am not able to boot win10 (without nesting, HyperV or
anything special) on either Intel or AMD without two-dimensional
paging (ept/npt) enabled.
It always crashes in various ways during boot.

I found out (accidentally) that if I make KVM's shadow MMU never
unsync last-level shadow pages, it starts working.
In addition, as I mentioned above, this bug can happen with a Linux
guest as well while stressing the shadow MMU with repeated migrations
(and again, with the same shadow-unsync hack it just works).
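
For reference, the hack is essentially the following (a debugging-only
sketch, assuming mmu_need_write_protect in arch/x86/kvm/mmu/mmu.c is
still the place where the unsync decision is made):

	/*
	 * Debugging hack, not for merging: claim that write protection is
	 * always needed, so that shadow pages are never allowed to go
	 * unsync.
	 */
	static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
					   bool can_unsync)
	{
		return true;
	}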

While running without two-dimensional paging is very much obsolete by
now, a bug in the shadow MMU is still relevant to nesting, since
nesting uses it as well.

Maxim Levitsky (7):
  KVM: VMX: read idt_vectoring_info a bit earlier
  KVM: nSVM: move nested vmrun tracepoint to enter_svm_guest_mode
  KVM: x86: add .complete_mmu_init arch callback
  KVM: nVMX: move inject_page_fault tweak to .complete_mmu_init
  KVM: nSVM: fix running nested guests when npt=0
  KVM: nVMX: don't load PDPTRS right after nested state set
  KVM: nSVM: call nested_svm_load_cr3 on nested state load

 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  2 +
 arch/x86/kvm/mmu/mmu.c             |  2 +
 arch/x86/kvm/svm/nested.c          | 84 +++++++++++++++++++-----------
 arch/x86/kvm/svm/svm.c             |  9 ++++
 arch/x86/kvm/svm/svm.h             |  1 +
 arch/x86/kvm/vmx/nested.c          | 22 ++++----
 arch/x86/kvm/vmx/nested.h          |  1 +
 arch/x86/kvm/vmx/vmx.c             | 13 ++++-
 9 files changed, 92 insertions(+), 43 deletions(-)

-- 
2.26.2




* [PATCH 1/7] KVM: VMX: read idt_vectoring_info a bit earlier
From: Maxim Levitsky @ 2021-02-17 14:57 UTC
  To: kvm
  Cc: linux-kernel, Wanpeng Li, Borislav Petkov, Paolo Bonzini,
	Joerg Roedel, Jim Mattson, H. Peter Anvin, Sean Christopherson,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar, Maxim Levitsky

trace_kvm_exit prints this value (via vmx_get_exit_info), so it makes
sense to read it before the tracepoint fires.

Fixes: dcf068da7eb2 ("KVM: VMX: Introduce generic fastpath handler")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b3e36dc3f164..e428d69e21c0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6921,13 +6921,15 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (unlikely((u16)vmx->exit_reason.basic == EXIT_REASON_MCE_DURING_VMENTRY))
 		kvm_machine_check();
 
+	if (likely(!vmx->exit_reason.failed_vmentry))
+		vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+
 	trace_kvm_exit(vmx->exit_reason.full, vcpu, KVM_ISA_VMX);
 
 	if (unlikely(vmx->exit_reason.failed_vmentry))
 		return EXIT_FASTPATH_NONE;
 
 	vmx->loaded_vmcs->launched = 1;
-	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
 	vmx_recover_nmi_blocking(vmx);
 	vmx_complete_interrupts(vmx);
-- 
2.26.2



* [PATCH 2/7] KVM: nSVM: move nested vmrun tracepoint to enter_svm_guest_mode
From: Maxim Levitsky @ 2021-02-17 14:57 UTC
  To: kvm
  Cc: linux-kernel, Wanpeng Li, Borislav Petkov, Paolo Bonzini,
	Joerg Roedel, Jim Mattson, H. Peter Anvin, Sean Christopherson,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar, Maxim Levitsky

This way the tracepoint will capture all nested mode entries
(including entries after migration, and entries from SMM).

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/nested.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 519fe84f2100..1bc31e2e8fe0 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -500,6 +500,20 @@ int enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb12_gpa,
 {
 	int ret;
 
+	trace_kvm_nested_vmrun(svm->vmcb->save.rip, vmcb12_gpa,
+			       vmcb12->save.rip,
+			       vmcb12->control.int_ctl,
+			       vmcb12->control.event_inj,
+			       vmcb12->control.nested_ctl);
+
+	trace_kvm_nested_intercepts(vmcb12->control.intercepts[INTERCEPT_CR] & 0xffff,
+				    vmcb12->control.intercepts[INTERCEPT_CR] >> 16,
+				    vmcb12->control.intercepts[INTERCEPT_EXCEPTION],
+				    vmcb12->control.intercepts[INTERCEPT_WORD3],
+				    vmcb12->control.intercepts[INTERCEPT_WORD4],
+				    vmcb12->control.intercepts[INTERCEPT_WORD5]);
+
+
 	svm->nested.vmcb12_gpa = vmcb12_gpa;
 
 	WARN_ON(svm->vmcb == svm->nested.vmcb02.ptr);
@@ -559,18 +573,6 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 		goto out;
 	}
 
-	trace_kvm_nested_vmrun(svm->vmcb->save.rip, vmcb12_gpa,
-			       vmcb12->save.rip,
-			       vmcb12->control.int_ctl,
-			       vmcb12->control.event_inj,
-			       vmcb12->control.nested_ctl);
-
-	trace_kvm_nested_intercepts(vmcb12->control.intercepts[INTERCEPT_CR] & 0xffff,
-				    vmcb12->control.intercepts[INTERCEPT_CR] >> 16,
-				    vmcb12->control.intercepts[INTERCEPT_EXCEPTION],
-				    vmcb12->control.intercepts[INTERCEPT_WORD3],
-				    vmcb12->control.intercepts[INTERCEPT_WORD4],
-				    vmcb12->control.intercepts[INTERCEPT_WORD5]);
 
 	/* Clear internal status */
 	kvm_clear_exception_queue(vcpu);
-- 
2.26.2



* [PATCH 3/7] KVM: x86: add .complete_mmu_init arch callback
From: Maxim Levitsky @ 2021-02-17 14:57 UTC
  To: kvm
  Cc: linux-kernel, Wanpeng Li, Borislav Petkov, Paolo Bonzini,
	Joerg Roedel, Jim Mattson, H. Peter Anvin, Sean Christopherson,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar, Maxim Levitsky

This callback will be used to tweak the MMU context in arch-specific
code after the context has been reset.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/include/asm/kvm-x86-ops.h | 1 +
 arch/x86/include/asm/kvm_host.h    | 2 ++
 arch/x86/kvm/mmu/mmu.c             | 2 ++
 arch/x86/kvm/svm/svm.c             | 6 ++++++
 arch/x86/kvm/vmx/vmx.c             | 6 ++++++
 5 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 355a2ab8fc09..041e5765dc67 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -86,6 +86,7 @@ KVM_X86_OP(set_tss_addr)
 KVM_X86_OP(set_identity_map_addr)
 KVM_X86_OP(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP(complete_mmu_init)
 KVM_X86_OP_NULL(has_wbinvd_exit)
 KVM_X86_OP(write_l1_tsc_offset)
 KVM_X86_OP(get_exit_info)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a8e1b57b1532..01a08f936781 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1251,6 +1251,8 @@ struct kvm_x86_ops {
 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, unsigned long pgd,
 			     int pgd_level);
 
+	void (*complete_mmu_init) (struct kvm_vcpu *vcpu);
+
 	bool (*has_wbinvd_exit)(void);
 
 	/* Returns actual tsc_offset set in active VMCS */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e507568cd55d..00bf9ff2e469 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4774,6 +4774,8 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots)
 		init_kvm_tdp_mmu(vcpu);
 	else
 		init_kvm_softmmu(vcpu);
+
+	static_call(kvm_x86_complete_mmu_init)(vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_init_mmu);
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 754e07538b4a..74a334c9902a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3913,6 +3913,11 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long root,
 	vmcb_mark_dirty(svm->vmcb, VMCB_CR);
 }
 
+static void svm_complete_mmu_init(struct kvm_vcpu *vcpu)
+{
+
+}
+
 static int is_disabled(void)
 {
 	u64 vm_cr;
@@ -4522,6 +4527,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.write_l1_tsc_offset = svm_write_l1_tsc_offset,
 
 	.load_mmu_pgd = svm_load_mmu_pgd,
+	.complete_mmu_init = svm_complete_mmu_init,
 
 	.check_intercept = svm_check_intercept,
 	.handle_exit_irqoff = svm_handle_exit_irqoff,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e428d69e21c0..bf6ef674d688 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3252,6 +3252,11 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long pgd,
 		vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
+static void vmx_complete_mmu_init(struct kvm_vcpu *vcpu)
+{
+
+}
+
 static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	/*
@@ -7849,6 +7854,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.write_l1_tsc_offset = vmx_write_l1_tsc_offset,
 
 	.load_mmu_pgd = vmx_load_mmu_pgd,
+	.complete_mmu_init = vmx_complete_mmu_init,
 
 	.check_intercept = vmx_check_intercept,
 	.handle_exit_irqoff = vmx_handle_exit_irqoff,
-- 
2.26.2



* [PATCH 4/7] KVM: nVMX: move inject_page_fault tweak to .complete_mmu_init
From: Maxim Levitsky @ 2021-02-17 14:57 UTC
  To: kvm
  Cc: linux-kernel, Wanpeng Li, Borislav Petkov, Paolo Bonzini,
	Joerg Roedel, Jim Mattson, H. Peter Anvin, Sean Christopherson,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar, Maxim Levitsky

This fixes a (mostly theoretical) bug which can happen if ept=0
on the host and we run a nested guest which triggers an MMU context
reset while running nested.
In this case the .inject_page_fault callback is lost.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 8 +-------
 arch/x86/kvm/vmx/nested.h | 1 +
 arch/x86/kvm/vmx/vmx.c    | 5 ++++-
 3 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 0b6dab6915a3..f9de729dbea6 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -419,7 +419,7 @@ static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit
 }
 
 
-static void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
+void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
 		struct x86_exception *fault)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
@@ -2620,9 +2620,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 		vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
 	}
 
-	if (!enable_ept)
-		vcpu->arch.walk_mmu->inject_page_fault = vmx_inject_page_fault_nested;
-
 	if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
 	    WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
 				     vmcs12->guest_ia32_perf_global_ctrl)))
@@ -4224,9 +4221,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &ignored))
 		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
 
-	if (!enable_ept)
-		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
-
 	nested_vmx_transition_tlb_flush(vcpu, vmcs12, false);
 
 	vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs);
diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
index 197148d76b8f..2ab279744d38 100644
--- a/arch/x86/kvm/vmx/nested.h
+++ b/arch/x86/kvm/vmx/nested.h
@@ -36,6 +36,7 @@ void nested_vmx_pmu_entry_exit_ctls_update(struct kvm_vcpu *vcpu);
 void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu);
 bool nested_vmx_check_io_bitmaps(struct kvm_vcpu *vcpu, unsigned int port,
 				 int size);
+void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu, struct x86_exception *fault);
 
 static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bf6ef674d688..c43324df4877 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3254,7 +3254,10 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long pgd,
 
 static void vmx_complete_mmu_init(struct kvm_vcpu *vcpu)
 {
-
+	if (!enable_ept && is_guest_mode(vcpu)) {
+		WARN_ON(mmu_is_nested(vcpu));
+		vcpu->arch.mmu->inject_page_fault = vmx_inject_page_fault_nested;
+	}
 }
 
 static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
-- 
2.26.2



* [PATCH 5/7] KVM: nSVM: fix running nested guests when npt=0
From: Maxim Levitsky @ 2021-02-17 14:57 UTC
  To: kvm
  Cc: linux-kernel, Wanpeng Li, Borislav Petkov, Paolo Bonzini,
	Joerg Roedel, Jim Mattson, H. Peter Anvin, Sean Christopherson,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar, Maxim Levitsky

In case of npt=0 on the host, nSVM needs the same .inject_page_fault
tweak as VMX has, to make sure that shadow MMU faults are injected
as vmexits.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/nested.c | 18 ++++++++++++++++++
 arch/x86/kvm/svm/svm.c    |  5 ++++-
 arch/x86/kvm/svm/svm.h    |  1 +
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 1bc31e2e8fe0..53b9037259b5 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -53,6 +53,23 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
 	nested_svm_vmexit(svm);
 }
 
+void svm_inject_page_fault_nested(struct kvm_vcpu *vcpu, struct x86_exception *fault)
+{
+       struct vcpu_svm *svm = to_svm(vcpu);
+       WARN_ON(!is_guest_mode(vcpu));
+
+       if (vmcb_is_intercept(&svm->nested.ctl, INTERCEPT_EXCEPTION_OFFSET + PF_VECTOR) &&
+	   !svm->nested.nested_run_pending) {
+               svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + PF_VECTOR;
+               svm->vmcb->control.exit_code_hi = 0;
+               svm->vmcb->control.exit_info_1 = fault->error_code;
+               svm->vmcb->control.exit_info_2 = fault->address;
+               nested_svm_vmexit(svm);
+       } else {
+               kvm_inject_page_fault(vcpu, fault);
+       }
+}
+
 static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -531,6 +548,7 @@ int enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb12_gpa,
 	if (ret)
 		return ret;
 
+
 	svm_set_gif(svm, true);
 
 	return 0;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 74a334c9902a..59e1767df030 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3915,7 +3915,10 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long root,
 
 static void svm_complete_mmu_init(struct kvm_vcpu *vcpu)
 {
-
+	if (!npt_enabled && is_guest_mode(vcpu)) {
+		WARN_ON(mmu_is_nested(vcpu));
+		vcpu->arch.mmu->inject_page_fault = svm_inject_page_fault_nested;
+	}
 }
 
 static int is_disabled(void)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 7b6ca0e49a14..fda80d56c6e3 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -437,6 +437,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
 	return vmcb_is_intercept(&svm->nested.ctl, INTERCEPT_NMI);
 }
 
+void svm_inject_page_fault_nested(struct kvm_vcpu *vcpu, struct x86_exception *fault);
 int enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, struct vmcb *vmcb12);
 void svm_leave_nested(struct vcpu_svm *svm);
 void svm_free_nested(struct vcpu_svm *svm);
-- 
2.26.2



* [PATCH 6/7] KVM: nVMX: don't load PDPTRS right after nested state set
From: Maxim Levitsky @ 2021-02-17 14:57 UTC
  To: kvm
  Cc: linux-kernel, Wanpeng Li, Borislav Petkov, Paolo Bonzini,
	Joerg Roedel, Jim Mattson, H. Peter Anvin, Sean Christopherson,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar, Maxim Levitsky

Just like all other nested memory accesses, loading of the PDPTRs
after a migration should be delayed until the first VM entry, to
ensure that guest memory is fully initialized.

Just move the call to nested_vmx_load_cr3 to nested_get_vmcs12_pages
to implement this.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index f9de729dbea6..26084f8eee82 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2596,11 +2596,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 		return -EINVAL;
 	}
 
-	/* Shadow page tables on either EPT or shadow page tables. */
-	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
-				entry_failure_code))
-		return -EINVAL;
-
 	/*
 	 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
 	 * on nested VM-Exit, which can occur without actually running L2 and
@@ -3138,11 +3133,16 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
 static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+	enum vm_entry_failure_code entry_failure_code;
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct kvm_host_map *map;
 	struct page *page;
 	u64 hpa;
 
+	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
+				&entry_failure_code))
+		return false;
+
 	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
 		/*
 		 * Translate L1 physical address to host physical
@@ -3386,6 +3386,10 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 	}
 
 	if (from_vmentry) {
+		if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3,
+		    nested_cpu_has_ept(vmcs12), &entry_failure_code))
+			goto vmentry_fail_vmexit_guest_mode;
+
 		failed_index = nested_vmx_load_msr(vcpu,
 						   vmcs12->vm_entry_msr_load_addr,
 						   vmcs12->vm_entry_msr_load_count);
-- 
2.26.2



* [PATCH 7/7] KVM: nSVM: call nested_svm_load_cr3 on nested state load
From: Maxim Levitsky @ 2021-02-17 14:57 UTC
  To: kvm
  Cc: linux-kernel, Wanpeng Li, Borislav Petkov, Paolo Bonzini,
	Joerg Roedel, Jim Mattson, H. Peter Anvin, Sean Christopherson,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar, Maxim Levitsky

While KVM's MMU should be fully reset by the loading of nested
CR0/CR3/CR4 via KVM_SET_SREGS, we are not in nested mode yet at that
point, and therefore only the root_mmu is reset.

On regular nested entries we call nested_svm_load_cr3, which updates
the guest's CR3 in the MMU when needed, and which also re-initializes
the MMU, initializing the walk_mmu as well when nested paging is
enabled in both host and guest.

Since we don't call nested_svm_load_cr3 on nested state load, the
walk_mmu can be left uninitialized. This can lead to a NULL pointer
dereference if we happen to get a nested page fault right after
entering the nested guest for the first time after the migration and
we decide to emulate it: the emulator then calls the NULL
walk_mmu->gva_to_gpa.

Therefore we should call this function on nested state load as well.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/nested.c | 40 +++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 53b9037259b5..ebc7dfaa9f13 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -215,24 +215,6 @@ static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
 	return true;
 }
 
-static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_svm *svm = to_svm(vcpu);
-
-	if (WARN_ON(!is_guest_mode(vcpu)))
-		return true;
-
-	if (!nested_svm_vmrun_msrpm(svm)) {
-		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
-		vcpu->run->internal.suberror =
-			KVM_INTERNAL_ERROR_EMULATION;
-		vcpu->run->internal.ndata = 0;
-		return false;
-	}
-
-	return true;
-}
-
 static bool nested_vmcb_check_controls(struct vmcb_control_area *control)
 {
 	if (CC(!vmcb_is_intercept(control, INTERCEPT_VMRUN)))
@@ -1311,6 +1293,28 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 	return ret;
 }
 
+static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	if (WARN_ON(!is_guest_mode(vcpu)))
+		return true;
+
+	if (nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
+				nested_npt_enabled(svm)))
+		return false;
+
+	if (!nested_svm_vmrun_msrpm(svm)) {
+		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+		vcpu->run->internal.suberror =
+			KVM_INTERNAL_ERROR_EMULATION;
+		vcpu->run->internal.ndata = 0;
+		return false;
+	}
+
+	return true;
+}
+
 struct kvm_x86_nested_ops svm_nested_ops = {
 	.check_events = svm_check_nested_events,
 	.get_nested_state_pages = svm_get_nested_state_pages,
-- 
2.26.2



* Re: [PATCH 5/7] KVM: nSVM: fix running nested guests when npt=0
From: Maxim Levitsky @ 2021-02-17 15:27 UTC
  To: kvm
  Cc: linux-kernel, Wanpeng Li, Borislav Petkov, Paolo Bonzini,
	Joerg Roedel, Jim Mattson, H. Peter Anvin, Sean Christopherson,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar

On Wed, 2021-02-17 at 16:57 +0200, Maxim Levitsky wrote:
> In case of npt=0 on the host, nSVM needs the same .inject_page_fault
> tweak as VMX has, to make sure that shadow MMU faults are injected
> as vmexits.
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/svm/nested.c | 18 ++++++++++++++++++
>  arch/x86/kvm/svm/svm.c    |  5 ++++-
>  arch/x86/kvm/svm/svm.h    |  1 +
>  3 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 1bc31e2e8fe0..53b9037259b5 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -53,6 +53,23 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
>  	nested_svm_vmexit(svm);
>  }
>  
> +void svm_inject_page_fault_nested(struct kvm_vcpu *vcpu, struct x86_exception *fault)
> +{
> +       struct vcpu_svm *svm = to_svm(vcpu);
> +       WARN_ON(!is_guest_mode(vcpu));
> +
> +       if (vmcb_is_intercept(&svm->nested.ctl, INTERCEPT_EXCEPTION_OFFSET + PF_VECTOR) &&
> +	   !svm->nested.nested_run_pending) {
> +               svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + PF_VECTOR;
> +               svm->vmcb->control.exit_code_hi = 0;
> +               svm->vmcb->control.exit_info_1 = fault->error_code;
> +               svm->vmcb->control.exit_info_2 = fault->address;
> +               nested_svm_vmexit(svm);
> +       } else {
> +               kvm_inject_page_fault(vcpu, fault);
> +       }
> +}
> +
>  static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -531,6 +548,7 @@ int enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb12_gpa,
>  	if (ret)
>  		return ret;
>  
> +
Sorry for this whitespace change.
Best regards,
	Maxim Levitsky
>  	svm_set_gif(svm, true);
>  
>  	return 0;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 74a334c9902a..59e1767df030 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3915,7 +3915,10 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long root,
>  
>  static void svm_complete_mmu_init(struct kvm_vcpu *vcpu)
>  {
> -
> +	if (!npt_enabled && is_guest_mode(vcpu)) {
> +		WARN_ON(mmu_is_nested(vcpu));
> +		vcpu->arch.mmu->inject_page_fault = svm_inject_page_fault_nested;
> +	}
>  }
>  
>  static int is_disabled(void)
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 7b6ca0e49a14..fda80d56c6e3 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -437,6 +437,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
>  	return vmcb_is_intercept(&svm->nested.ctl, INTERCEPT_NMI);
>  }
>  
> +void svm_inject_page_fault_nested(struct kvm_vcpu *vcpu, struct x86_exception *fault);
>  int enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, struct vmcb *vmcb12);
>  void svm_leave_nested(struct vcpu_svm *svm);
>  void svm_free_nested(struct vcpu_svm *svm);




* Re: [PATCH 1/7] KVM: VMX: read idt_vectoring_info a bit earlier
From: Paolo Bonzini @ 2021-02-17 16:06 UTC
  To: Maxim Levitsky, kvm
  Cc: linux-kernel, Wanpeng Li, Borislav Petkov, Joerg Roedel,
	Jim Mattson, H. Peter Anvin, Sean Christopherson,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar

On 17/02/21 15:57, Maxim Levitsky wrote:
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b3e36dc3f164..e428d69e21c0 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6921,13 +6921,15 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	if (unlikely((u16)vmx->exit_reason.basic == EXIT_REASON_MCE_DURING_VMENTRY))
>  		kvm_machine_check();
>  
> +	if (likely(!vmx->exit_reason.failed_vmentry))
> +		vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> +

Any reason for the if?

Paolo



* Re: [PATCH 1/7] KVM: VMX: read idt_vectoring_info a bit earlier
From: Maxim Levitsky @ 2021-02-17 16:18 UTC
  To: Paolo Bonzini, kvm
  Cc: linux-kernel, Wanpeng Li, Borislav Petkov, Joerg Roedel,
	Jim Mattson, H. Peter Anvin, Sean Christopherson,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar

On Wed, 2021-02-17 at 17:06 +0100, Paolo Bonzini wrote:
> On 17/02/21 15:57, Maxim Levitsky wrote:
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index b3e36dc3f164..e428d69e21c0 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6921,13 +6921,15 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >  	if (unlikely((u16)vmx->exit_reason.basic == EXIT_REASON_MCE_DURING_VMENTRY))
> >  		kvm_machine_check();
> >  
> > +	if (likely(!vmx->exit_reason.failed_vmentry))
> > +		vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> > +
> 
> Any reason for the if?

Sean Christopherson asked me to do this to avoid updating idt_vectoring_info on failed
VM entry, to keep things as they were logically before this patch.

Best regards,
	Maxim Levitsky

> 
> Paolo
> 




* Re: [PATCH 1/7] KVM: VMX: read idt_vectoring_info a bit earlier
From: Sean Christopherson @ 2021-02-17 16:21 UTC
  To: Maxim Levitsky
  Cc: Paolo Bonzini, kvm, linux-kernel, Wanpeng Li, Borislav Petkov,
	Joerg Roedel, Jim Mattson, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar

On Wed, Feb 17, 2021, Maxim Levitsky wrote:
> On Wed, 2021-02-17 at 17:06 +0100, Paolo Bonzini wrote:
> > On 17/02/21 15:57, Maxim Levitsky wrote:
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index b3e36dc3f164..e428d69e21c0 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -6921,13 +6921,15 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> > >  	if (unlikely((u16)vmx->exit_reason.basic == EXIT_REASON_MCE_DURING_VMENTRY))
> > >  		kvm_machine_check();
> > >  
> > > +	if (likely(!vmx->exit_reason.failed_vmentry))
> > > +		vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> > > +
> > 
> > Any reason for the if?
> 
> Sean Christopherson asked me to do this to avoid updating idt_vectoring_info on failed
> VM entry, to keep things as they were logically before this patch.

Ya, specifically because the field isn't valid if VM-Enter fails.  I'm also ok
with an unconditional VMREAD if we add a comment stating that it's unnecessary
if VM-Enter failed, but faster in the common case.
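
Something along these lines, for example (an illustrative sketch, not
a patch):

	/*
	 * IDT_VECTORING_INFO_FIELD is not valid if VM-Enter failed, but
	 * reading it unconditionally is harmless, and avoids a branch in
	 * the common (successful VM-Enter) case.
	 */
	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);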


* Re: [PATCH 1/7] KVM: VMX: read idt_vectoring_info a bit earlier
From: Paolo Bonzini @ 2021-02-17 16:29 UTC
  To: Sean Christopherson, Maxim Levitsky
  Cc: kvm, linux-kernel, Wanpeng Li, Borislav Petkov, Joerg Roedel,
	Jim Mattson, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar

On 17/02/21 17:21, Sean Christopherson wrote:
> On Wed, Feb 17, 2021, Maxim Levitsky wrote:
>> On Wed, 2021-02-17 at 17:06 +0100, Paolo Bonzini wrote:
>>> On 17/02/21 15:57, Maxim Levitsky wrote:
>>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>>> index b3e36dc3f164..e428d69e21c0 100644
>>>> --- a/arch/x86/kvm/vmx/vmx.c
>>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>>> @@ -6921,13 +6921,15 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>>>>   	if (unlikely((u16)vmx->exit_reason.basic == EXIT_REASON_MCE_DURING_VMENTRY))
>>>>   		kvm_machine_check();
>>>>   
>>>> +	if (likely(!vmx->exit_reason.failed_vmentry))
>>>> +		vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
>>>> +
>>>
>>> Any reason for the if?
>>
>> Sean Christopherson asked me to do this to avoid updating idt_vectoring_info on failed
>> VM entry, to keep things as they were logically before this patch.
> 
> Ya, specifically because the field isn't valid if VM-Enter fails.  I'm also ok
> with an unconditional VMREAD if we add a comment stating that it's unnecessary
> if VM-Enter failed, but faster in the common case.

It's okay, just good to know!

Thanks,

paolo



* Re: [PATCH 4/7] KVM: nVMX: move inject_page_fault tweak to .complete_mmu_init
From: Sean Christopherson @ 2021-02-17 17:29 UTC
  To: Maxim Levitsky
  Cc: kvm, linux-kernel, Wanpeng Li, Borislav Petkov, Paolo Bonzini,
	Joerg Roedel, Jim Mattson, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar

On Wed, Feb 17, 2021, Maxim Levitsky wrote:
> This fixes a (mostly theoretical) bug which can happen if ept=0
> on the host and we run a nested guest which triggers an MMU context
> reset while running nested.
> In this case the .inject_page_fault callback is lost.
>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 8 +-------
>  arch/x86/kvm/vmx/nested.h | 1 +
>  arch/x86/kvm/vmx/vmx.c    | 5 ++++-
>  3 files changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 0b6dab6915a3..f9de729dbea6 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -419,7 +419,7 @@ static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit
>  }
>  
>  
> -static void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
> +void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
>  		struct x86_exception *fault)
>  {
>  	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> @@ -2620,9 +2620,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  		vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
>  	}
>  
> -	if (!enable_ept)
> -		vcpu->arch.walk_mmu->inject_page_fault = vmx_inject_page_fault_nested;
> -
>  	if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
>  	    WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
>  				     vmcs12->guest_ia32_perf_global_ctrl)))
> @@ -4224,9 +4221,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>  	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &ignored))
>  		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
>  
> -	if (!enable_ept)
> -		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;

Oof, please explicitly call out these types of side effects in the changelog,
it took me a while to piece together that this can be dropped because a MMU
reset is guaranteed and is also guaranteed to restore inject_page_fault.

I would even go so far as to say this particular line of code should be removed
in a separate commit.  Unless I'm overlooking something, this code is
effectively a nop, which means it doesn't need to be removed to make the bug fix
functionally correct.

All that being said, I'm pretty sure we can eliminate setting inject_page_fault
dynamically.  I think that would yield more maintainable code.  Following these
flows is a nightmare.  The change itself will be scarier, but I'm pretty sure
the end result will be a lot cleaner.

And I believe there's also a second bug that would be fixed by such an approach.
Doesn't vmx_inject_page_fault_nested() need to be used for the nested_mmu when
ept=1?  E.g. if the emulator injects a #PF to L2, L1 should still be able to
intercept the #PF even if L1 is using EPT.  This likely hasn't been noticed
because hypervisors typically don't intercept #PF when EPT is enabled.

Something like this (very incomplete):

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 30e9b0cb9abd..f957514a4d65 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4497,7 +4497,6 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
        context->direct_map = true;
        context->get_guest_pgd = get_cr3;
        context->get_pdptr = kvm_pdptr_read;
-       context->inject_page_fault = kvm_inject_page_fault;

        if (!is_paging(vcpu)) {
                context->nx = false;
@@ -4687,7 +4686,6 @@ static void init_kvm_softmmu(struct kvm_vcpu *vcpu)

        context->get_guest_pgd     = get_cr3;
        context->get_pdptr         = kvm_pdptr_read;
-       context->inject_page_fault = kvm_inject_page_fault;
 }

 static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
@@ -4701,7 +4699,6 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
        g_context->mmu_role.as_u64 = new_role.as_u64;
        g_context->get_guest_pgd     = get_cr3;
        g_context->get_pdptr         = kvm_pdptr_read;
-       g_context->inject_page_fault = kvm_inject_page_fault;

        /*
         * L2 page tables are never shadowed, so there is no need to sync
@@ -5272,6 +5269,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
        if (ret)
                goto fail_allocate_root;

+       static_call(kvm_x86_mmu_create)(vcpu);
+
        return ret;
  fail_allocate_root:
        free_mmu_pages(&vcpu->arch.guest_mmu);
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a63da447ede9..aa6c48295117 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -425,15 +425,14 @@ static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit
 }


-static void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
+static void vmx_inject_page_fault(struct kvm_vcpu *vcpu,
                struct x86_exception *fault)
 {
        struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

-       WARN_ON(!is_guest_mode(vcpu));
-
-       if (nested_vmx_is_page_fault_vmexit(vmcs12, fault->error_code) &&
-               !to_vmx(vcpu)->nested.nested_run_pending) {
+       if (is_guest_mode(vcpu) &&
+           nested_vmx_is_page_fault_vmexit(vmcs12, fault->error_code) &&
+           !to_vmx(vcpu)->nested.nested_run_pending) {
                vmcs12->vm_exit_intr_error_code = fault->error_code;
                nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
                                  PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
@@ -2594,9 +2593,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
                vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
        }

-       if (!enable_ept)
-               vcpu->arch.walk_mmu->inject_page_fault = vmx_inject_page_fault_nested;
-
        if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
            WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
                                     vmcs12->guest_ia32_perf_global_ctrl)))
@@ -4198,9 +4194,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
        if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &ignored))
                nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);

-       if (!enable_ept)
-               vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
-
        nested_vmx_transition_tlb_flush(vcpu, vmcs12, false);

        vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1204e5f0fe67..0e5ee22eea77 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3081,6 +3081,13 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
        vmx->emulation_required = emulation_required(vcpu);
 }

+static void vmx_mmu_create(struct kvm_vcpu *vcpu)
+{
+       vcpu->arch.root_mmu.inject_page_fault = vmx_inject_page_fault;
+       vcpu->arch.guest_mmu.inject_page_fault = nested_ept_inject_page_fault;
+       vcpu->arch.nested_mmu.inject_page_fault = vmx_inject_page_fault;
+}
+
 static int vmx_get_max_tdp_level(void)
 {
        if (cpu_has_vmx_ept_5levels())
@@ -7721,6 +7728,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {

        .write_l1_tsc_offset = vmx_write_l1_tsc_offset,

+       .mmu_create = vmx_mmu_create,
        .load_mmu_pgd = vmx_load_mmu_pgd,

        .check_intercept = vmx_check_intercept,

> -
>  	nested_vmx_transition_tlb_flush(vcpu, vmcs12, false);
>  
>  	vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs);
> diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
> index 197148d76b8f..2ab279744d38 100644
> --- a/arch/x86/kvm/vmx/nested.h
> +++ b/arch/x86/kvm/vmx/nested.h
> @@ -36,6 +36,7 @@ void nested_vmx_pmu_entry_exit_ctls_update(struct kvm_vcpu *vcpu);
>  void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu);
>  bool nested_vmx_check_io_bitmaps(struct kvm_vcpu *vcpu, unsigned int port,
>  				 int size);
> +void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,struct x86_exception *fault);
>  
>  static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
>  {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index bf6ef674d688..c43324df4877 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3254,7 +3254,10 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long pgd,
>  
>  static void vmx_complete_mmu_init(struct kvm_vcpu *vcpu)
>  {
> -
> +	if (!enable_ept && is_guest_mode(vcpu)) {
> +		WARN_ON(mmu_is_nested(vcpu));
> +		vcpu->arch.mmu->inject_page_fault = vmx_inject_page_fault_nested;
> +	}
>  }
>  
>  static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> -- 
> 2.26.2
> 


* Re: [PATCH 4/7] KVM: nVMX: move inject_page_fault tweak to .complete_mmu_init
From: Paolo Bonzini @ 2021-02-17 17:37 UTC
  To: Sean Christopherson, Maxim Levitsky
  Cc: kvm, linux-kernel, Wanpeng Li, Borislav Petkov, Joerg Roedel,
	Jim Mattson, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar

On 17/02/21 18:29, Sean Christopherson wrote:
> All that being said, I'm pretty sure we can eliminate setting
> inject_page_fault dynamically. I think that would yield more 
> maintainable code. Following these flows is a nightmare. The change 
> itself will be scarier, but I'm pretty sure the end result will be a lot 
> cleaner.

I had a similar reaction, though my proposal was different.

The only thing we're changing in complete_mmu_init is the page fault 
callback for init_kvm_softmmu, so couldn't that be the callback directly 
(i.e. something like context->inject_page_fault = 
kvm_x86_ops.inject_softmmu_page_fault)?  And then adding is_guest_mode 
to the conditional that is already in vmx_inject_page_fault_nested and 
svm_inject_page_fault_nested.
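
As a rough sketch of that idea on the VMX side (inject_softmmu_page_fault
being a hypothetical new kvm_x86_ops member; the body is just today's
vmx_inject_page_fault_nested with the is_guest_mode check folded in):

	static void vmx_inject_softmmu_page_fault(struct kvm_vcpu *vcpu,
						  struct x86_exception *fault)
	{
		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

		/* Reflect the #PF as a VM exit if L1 intercepts it ... */
		if (is_guest_mode(vcpu) &&
		    nested_vmx_is_page_fault_vmexit(vmcs12, fault->error_code) &&
		    !to_vmx(vcpu)->nested.nested_run_pending) {
			vmcs12->vm_exit_intr_error_code = fault->error_code;
			nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
					  PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
					  INTR_INFO_DELIVER_CODE_MASK |
					  INTR_INFO_VALID_MASK,
					  fault->address);
		/* ... and inject it into the current guest otherwise. */
		} else {
			kvm_inject_page_fault(vcpu, fault);
		}
	}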

That said, I'm also rusty on _why_ this code is needed.  Why isn't it 
enough to inject the exception normally, and let 
nested_vmx_check_exception decide whether to inject a vmexit to L1 or an 
exception into L2?

Also, bonus question which should have been in the 5/7 changelog: are 
there kvm-unit-tests testcases that fail with npt=0, and if not could we 
write one?  [Answer: the mode_switch testcase fails, but I haven't 
checked why].


Paolo



* Re: [PATCH 6/7] KVM: nVMX: don't load PDPTRS right after nested state set
From: Sean Christopherson @ 2021-02-17 17:52 UTC
  To: Maxim Levitsky
  Cc: kvm, linux-kernel, Wanpeng Li, Borislav Petkov, Paolo Bonzini,
	Joerg Roedel, Jim Mattson, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar

On Wed, Feb 17, 2021, Maxim Levitsky wrote:
> Just like all other nested memory accesses, loading of the PDPTRs
> after a migration should be delayed until the first VM entry, to
> ensure that guest memory is fully initialized.
> 
> Just move the call to nested_vmx_load_cr3 to nested_get_vmcs12_pages
> to implement this.

I don't love this approach.  KVM_SET_NESTED_STATE will now succeed with a bad
vmcs12.GUEST_CR3.  At a minimum, GUEST_CR3 should be checked in
nested_vmx_check_guest_state().  It also feels like vcpu->arch.cr3 should be set
immediately, e.g. KVM_SET_NESTED_STATE -> KVM_GET_SREGS should reflect L2's CR3
even if KVM_RUN hasn't been invoked.
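
For example, nested_vmx_check_guest_state() could grow something like
this (a sketch reusing the existing nested_cr3_valid() helper):

	/* Reject a vmcs12.GUEST_CR3 with illegal (reserved) bits set. */
	if (CC(!nested_cr3_valid(vcpu, vmcs12->guest_cr3)))
		return -EINVAL;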

> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index f9de729dbea6..26084f8eee82 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2596,11 +2596,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  		return -EINVAL;
>  	}
>  
> -	/* Shadow page tables on either EPT or shadow page tables. */
> -	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
> -				entry_failure_code))
> -		return -EINVAL;
> -
>  	/*
>  	 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
>  	 * on nested VM-Exit, which can occur without actually running L2 and
> @@ -3138,11 +3133,16 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
>  static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
>  {
>  	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +	enum vm_entry_failure_code entry_failure_code;
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	struct kvm_host_map *map;
>  	struct page *page;
>  	u64 hpa;
>  
> +	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
> +				&entry_failure_code))
> +		return false;
> +
>  	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
>  		/*
>  		 * Translate L1 physical address to host physical
> @@ -3386,6 +3386,10 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>  	}
>  
>  	if (from_vmentry) {
> +		if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3,
> +		    nested_cpu_has_ept(vmcs12), &entry_failure_code))
> +			goto vmentry_fail_vmexit_guest_mode;

This neglects to set both exit_reason.basic and vmcs12->exit_qualification.

> +
>  		failed_index = nested_vmx_load_msr(vcpu,
>  						   vmcs12->vm_entry_msr_load_addr,
>  						   vmcs12->vm_entry_msr_load_count);
> -- 
> 2.26.2
> 


* Re: [PATCH 4/7] KVM: nVMX: move inject_page_fault tweak to .complete_mmu_init
From: Sean Christopherson @ 2021-02-17 17:57 UTC
  To: Paolo Bonzini
  Cc: Maxim Levitsky, kvm, linux-kernel, Wanpeng Li, Borislav Petkov,
	Joerg Roedel, Jim Mattson, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar

On Wed, Feb 17, 2021, Paolo Bonzini wrote:
> On 17/02/21 18:29, Sean Christopherson wrote:
> > All that being said, I'm pretty sure we can eliminate setting
> > inject_page_fault dynamically. I think that would yield more
> > maintainable code. Following these flows is a nightmare. The change
> > itself will be scarier, but I'm pretty sure the end result will be a lot
> > cleaner.
> 
> I had a similar reaction, though my proposal was different.
> 
> The only thing we're changing in complete_mmu_init is the page fault
> callback for init_kvm_softmmu, so couldn't that be the callback directly
> (i.e. something like context->inject_page_fault =
> kvm_x86_ops.inject_softmmu_page_fault)?  And then adding is_guest_mode to
> the conditional that is already in vmx_inject_page_fault_nested and
> svm_inject_page_fault_nested.

Heh, that exact code crossed my mind as well.

> That said, I'm also rusty on _why_ this code is needed.  Why isn't it enough
> to inject the exception normally, and let nested_vmx_check_exception decide
> whether to inject a vmexit to L1 or an exception into L2?

Hmm, I suspect it was required at one point due to deficiencies elsewhere.
Handling this in the common fault handler logic does seem like the right
approach.

> Also, bonus question which should have been in the 5/7 changelog: are there
> kvm-unit-tests testcases that fail with npt=0, and if not could we write
> one?  [Answer: the mode_switch testcase fails, but I haven't checked why].
> 
> 
> Paolo
> 


* Re: [PATCH 4/7] KVM: nVMX: move inject_page_fault tweak to .complete_mmu_init
From: Paolo Bonzini @ 2021-02-17 18:00 UTC
  To: Sean Christopherson
  Cc: Maxim Levitsky, kvm, linux-kernel, Wanpeng Li, Borislav Petkov,
	Joerg Roedel, Jim Mattson, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar

On 17/02/21 18:57, Sean Christopherson wrote:
>> That said, I'm also rusty on _why_ this code is needed.  Why isn't it enough
>> to inject the exception normally, and let nested_vmx_check_exception decide
>> whether to inject a vmexit to L1 or an exception into L2?
>
> Hmm, I suspect it was required at one point due to deficiencies elsewhere.
> Handling this in the common fault handler logic does seem like the right
> approach.

I think I'm going to merge a variant of patch 5 just to unbreak things. 
But we should get rid of all this because after the exception payload 
changes we shouldn't need it.

Paolo

>> Also, bonus question which should have been in the 5/7 changelog: are there
>> kvm-unit-tests testcases that fail with npt=0, and if not could we write
>> one?  [Answer: the mode_switch testcase fails, but I haven't checked why].



* Re: [PATCH 6/7] KVM: nVMX: don't load PDPTRS right after nested state set
From: Paolo Bonzini @ 2021-02-17 18:06 UTC
  To: Sean Christopherson, Maxim Levitsky
  Cc: kvm, linux-kernel, Wanpeng Li, Borislav Petkov, Joerg Roedel,
	Jim Mattson, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar

On 17/02/21 18:52, Sean Christopherson wrote:
>>
>> Just move the call to nested_vmx_load_cr3 to nested_get_vmcs12_pages
>> to implement this.
>
> I don't love this approach.  KVM_SET_NESTED_STATE will now succeed with a bad
> vmcs12.GUEST_CR3.  At a minimum, GUEST_CR3 should be checked in
> nested_vmx_check_guest_state().  It also feels like vcpu->arch.cr3 should be set
> immediately, e.g. KVM_SET_NESTED_STATE -> KVM_GET_SREGS should reflect L2's CR3
> even if KVM_RUN hasn't been invoked.

Note that KVM_SET_NESTED_STATE does not remove the need to invoke 
KVM_SET_SREGS.  Calling KVM_SET_NESTED_STATE does not necessarily saying 
anything about the value of KVM_GET_SREGS after it.

In particular on SVM it's a "feature" that KVM_SET_NESTED_STATE does not 
include any guest register state; the nested state only includes the 
VMCB12 control state and the L1 save state.  But thinking more about it, 
loading the PDPTRs for the guest CR3 might not be advisable even upon 
KVM_SET_SREGS, and we might want to extend KVM_REQ_GET_NESTED_PAGES to 
cover non-nested PDPTRs as well.

Paolo



* Re: [PATCH 4/7] KVM: nVMX: move inject_page_fault tweak to .complete_mmu_init
From: Maxim Levitsky @ 2021-02-17 18:43 UTC
  To: Sean Christopherson
  Cc: kvm, linux-kernel, Wanpeng Li, Borislav Petkov, Paolo Bonzini,
	Joerg Roedel, Jim Mattson, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar

On Wed, 2021-02-17 at 09:29 -0800, Sean Christopherson wrote:
> On Wed, Feb 17, 2021, Maxim Levitsky wrote:
> > This fixes a (mostly theoretical) bug which can happen if ept=0
> > on the host and we run a nested guest which triggers an MMU context
> > reset while running nested.
> > In this case the .inject_page_fault callback is lost.
> > 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  arch/x86/kvm/vmx/nested.c | 8 +-------
> >  arch/x86/kvm/vmx/nested.h | 1 +
> >  arch/x86/kvm/vmx/vmx.c    | 5 ++++-
> >  3 files changed, 6 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 0b6dab6915a3..f9de729dbea6 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -419,7 +419,7 @@ static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit
> >  }
> >  
> >  
> > -static void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
> > +void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
> >  		struct x86_exception *fault)
> >  {
> >  	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > @@ -2620,9 +2620,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> >  		vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
> >  	}
> >  
> > -	if (!enable_ept)
> > -		vcpu->arch.walk_mmu->inject_page_fault = vmx_inject_page_fault_nested;
> > -
> >  	if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
> >  	    WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
> >  				     vmcs12->guest_ia32_perf_global_ctrl)))
> > @@ -4224,9 +4221,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> >  	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &ignored))
> >  		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> >  
> > -	if (!enable_ept)
> > -		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> 
> Oof, please explicitly call out these types of side effects in the changelog,
> it took me a while to piece together that this can be dropped because a MMU
> reset is guaranteed and is also guaranteed to restore inject_page_fault.
> 
> I would even go so far as to say this particular line of code should be removed
> in a separate commit.  Unless I'm overlooking something, this code is
> effectively a nop, which means it doesn't need to be removed to make the bug fix
> functionally correct.
> 
> All that being said, I'm pretty sure we can eliminate setting inject_page_fault
> dynamically.  I think that would yield more maintainable code.  Following these
> flows is a nightmare.  The change itself will be scarier, but I'm pretty sure
> the end result will be a lot cleaner.
> 
> And I believe there's also a second bug that would be fixed by such an approach.
> Doesn't vmx_inject_page_fault_nested() need to be used for the nested_mmu when
> ept=1?  E.g. if the emulator injects a #PF to L2, L1 should still be able to
> intercept the #PF even if L1 is using EPT.  This likely hasn't been noticed
> because hypervisors typically don't intercept #PF when EPT is enabled.

Let me explain what I know about this:
 
There are basically 3 cases:
 
1. npt/ept disabled on the host. In this case we have a single level of
shadowing, and a nested hypervisor has to do its own shadowing on top of it.
Here the MMU itself has to generate page faults (they are triggered by
hardware page faults, but are completely different), and in the nested
case these page faults sometimes have to be injected as VM exits.

2. npt/ept enabled on the host and disabled in the guest.
In this case we don't need to shadow anything, while the nested hypervisor
does need to do shadowing to run its guest.
Here it is in fact likely that L1 intercepts the page faults, but they
are just reflected to it as is, as nested_svm_exit_special does (it has
a special case for async page faults which I need to investigate).

This is where the bug you mention can happen. I haven't checked how VMX
reflects page faults to the nested guest either.

3. (the common case) npt/ept enabled on both host and guest.
In this case walk_mmu is used for all the page faults, and it is tweaked
in a similar way (see nested_svm_init_mmu_context for example).


Also, if the emulator injects the page fault, then indeed I think the
bug will happen; see the sketch below.
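
To illustrate the emulator case (a simplified reconstruction of
kvm_inject_emulated_page_fault(), not the verbatim code): the emulator
delivers #PFs through the mmu callback, so whatever is installed in
walk_mmu->inject_page_fault decides whether L1's #PF intercept is
honored:

static bool inject_emulated_page_fault(struct kvm_vcpu *vcpu,
				       struct x86_exception *fault)
{
	struct kvm_mmu *fault_mmu;

	fault_mmu = fault->nested_page_fault ? vcpu->arch.mmu :
					       vcpu->arch.walk_mmu;

	/*
	 * With ept/npt=1 this callback is plain kvm_inject_page_fault,
	 * so the fault is always injected into L2, even if L1
	 * intercepts #PF.
	 */
	fault_mmu->inject_page_fault(vcpu, fault);

	return fault->nested_page_fault;
}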


Best regards and thanks for the feedback,
	Maxim Levitsky

> 
> Something like this (very incomplete):
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 30e9b0cb9abd..f957514a4d65 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4497,7 +4497,6 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
>         context->direct_map = true;
>         context->get_guest_pgd = get_cr3;
>         context->get_pdptr = kvm_pdptr_read;
> -       context->inject_page_fault = kvm_inject_page_fault;
> 
>         if (!is_paging(vcpu)) {
>                 context->nx = false;
> @@ -4687,7 +4686,6 @@ static void init_kvm_softmmu(struct kvm_vcpu *vcpu)
> 
>         context->get_guest_pgd     = get_cr3;
>         context->get_pdptr         = kvm_pdptr_read;
> -       context->inject_page_fault = kvm_inject_page_fault;
>  }
> 
>  static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
> @@ -4701,7 +4699,6 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
>         g_context->mmu_role.as_u64 = new_role.as_u64;
>         g_context->get_guest_pgd     = get_cr3;
>         g_context->get_pdptr         = kvm_pdptr_read;
> -       g_context->inject_page_fault = kvm_inject_page_fault;
> 
>         /*
>          * L2 page tables are never shadowed, so there is no need to sync
> @@ -5272,6 +5269,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
>         if (ret)
>                 goto fail_allocate_root;
> 
> +       static_call(kvm_x86_mmu_create)(vcpu);
> +
>         return ret;
>   fail_allocate_root:
>         free_mmu_pages(&vcpu->arch.guest_mmu);
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index a63da447ede9..aa6c48295117 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -425,15 +425,14 @@ static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit
>  }
> 
> 
> -static void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
> +static void vmx_inject_page_fault(struct kvm_vcpu *vcpu,
>                 struct x86_exception *fault)
>  {
>         struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> 
> -       WARN_ON(!is_guest_mode(vcpu));
> -
> -       if (nested_vmx_is_page_fault_vmexit(vmcs12, fault->error_code) &&
> -               !to_vmx(vcpu)->nested.nested_run_pending) {
> +       if (is_guest_mode(vcpu) &&
> +           nested_vmx_is_page_fault_vmexit(vmcs12, fault->error_code) &&
> +           !to_vmx(vcpu)->nested.nested_run_pending) {
>                 vmcs12->vm_exit_intr_error_code = fault->error_code;
>                 nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
>                                   PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
> @@ -2594,9 +2593,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>                 vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
>         }
> 
> -       if (!enable_ept)
> -               vcpu->arch.walk_mmu->inject_page_fault = vmx_inject_page_fault_nested;
> -
>         if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
>             WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
>                                      vmcs12->guest_ia32_perf_global_ctrl)))
> @@ -4198,9 +4194,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>         if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &ignored))
>                 nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> 
> -       if (!enable_ept)
> -               vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> -
>         nested_vmx_transition_tlb_flush(vcpu, vmcs12, false);
> 
>         vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 1204e5f0fe67..0e5ee22eea77 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3081,6 +3081,13 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
>         vmx->emulation_required = emulation_required(vcpu);
>  }
> 
> +static void vmx_mmu_create(struct kvm_vcpu *vcpu)
> +{
> +       vcpu->arch.root_mmu.inject_page_fault = vmx_inject_page_fault;
> +       vcpu->arch.guest_mmu.inject_page_fault = nested_ept_inject_page_fault;
> +       vcpu->arch.nested_mmu.inject_page_fault = vmx_inject_page_fault;
> +}
> +
>  static int vmx_get_max_tdp_level(void)
>  {
>         if (cpu_has_vmx_ept_5levels())
> @@ -7721,6 +7728,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
> 
>         .write_l1_tsc_offset = vmx_write_l1_tsc_offset,
> 
> +       .mmu_create = vmx_mmu_create,
>         .load_mmu_pgd = vmx_load_mmu_pgd,
> 
>         .check_intercept = vmx_check_intercept,
> 
> > -
> >  	nested_vmx_transition_tlb_flush(vcpu, vmcs12, false);
> >  
> >  	vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs);
> > diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
> > index 197148d76b8f..2ab279744d38 100644
> > --- a/arch/x86/kvm/vmx/nested.h
> > +++ b/arch/x86/kvm/vmx/nested.h
> > @@ -36,6 +36,7 @@ void nested_vmx_pmu_entry_exit_ctls_update(struct kvm_vcpu *vcpu);
> >  void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu);
> >  bool nested_vmx_check_io_bitmaps(struct kvm_vcpu *vcpu, unsigned int port,
> >  				 int size);
> > +void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu, struct x86_exception *fault);
> >  
> >  static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
> >  {
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index bf6ef674d688..c43324df4877 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -3254,7 +3254,10 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long pgd,
> >  
> >  static void vmx_complete_mmu_init(struct kvm_vcpu *vcpu)
> >  {
> > -
> > +	if (!enable_ept && is_guest_mode(vcpu)) {
> > +		WARN_ON(mmu_is_nested(vcpu));
> > +		vcpu->arch.mmu->inject_page_fault = vmx_inject_page_fault_nested;
> > +	}
> >  }
> >  
> >  static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> > -- 
> > 2.26.2
> > 




* Re: [PATCH 4/7] KVM: nVMX: move inject_page_fault tweak to .complete_mmu_init
  2021-02-17 17:37     ` Paolo Bonzini
  2021-02-17 17:57       ` Sean Christopherson
@ 2021-02-17 18:49       ` Maxim Levitsky
  1 sibling, 0 replies; 22+ messages in thread
From: Maxim Levitsky @ 2021-02-17 18:49 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: kvm, linux-kernel, Wanpeng Li, Borislav Petkov, Joerg Roedel,
	Jim Mattson, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar

On Wed, 2021-02-17 at 18:37 +0100, Paolo Bonzini wrote:
> On 17/02/21 18:29, Sean Christopherson wrote:
> > All that being said, I'm pretty sure we can eliminate setting 
> > inject_page_fault dynamically. I think that would yield more 
> > maintainable code. Following these flows is a nightmare. The change 
> > itself will be scarier, but I'm pretty sure the end result will be a lot 
> > cleaner.

I agree with that.

> 
> I had a similar reaction, though my proposal was different.
> 
> The only thing we're changing in complete_mmu_init is the page fault 
> callback for init_kvm_softmmu, so couldn't that be the callback directly 
> (i.e. something like context->inject_page_fault = 
> kvm_x86_ops.inject_softmmu_page_fault)?  And then adding is_guest_mode 
> to the conditional that is already in vmx_inject_page_fault_nested and 
> svm_inject_page_fault_nested.

I was thinking about this as well; I tried to make as simple a solution
as possible, one that doesn't make things worse.
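
Concretely, I read the proposal as something like this (a sketch;
inject_softmmu_page_fault would be a new kvm_x86_ops member, and the
body is the existing vmx_inject_page_fault_nested with the
is_guest_mode() check folded in):

/*
 * In init_kvm_softmmu(), install the vendor callback unconditionally:
 *	context->inject_page_fault =
 *		kvm_x86_ops.inject_softmmu_page_fault;
 */
static void vmx_inject_softmmu_page_fault(struct kvm_vcpu *vcpu,
					  struct x86_exception *fault)
{
	struct vmcs12 *vmcs12;

	if (is_guest_mode(vcpu) &&
	    !to_vmx(vcpu)->nested.nested_run_pending) {
		vmcs12 = get_vmcs12(vcpu);
		if (nested_vmx_is_page_fault_vmexit(vmcs12,
						    fault->error_code)) {
			vmcs12->vm_exit_intr_error_code = fault->error_code;
			nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
					  PF_VECTOR |
					  INTR_TYPE_HARD_EXCEPTION |
					  INTR_INFO_DELIVER_CODE_MASK |
					  INTR_INFO_VALID_MASK,
					  fault->address);
			return;
		}
	}

	kvm_inject_page_fault(vcpu, fault);
}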
> 
> That said, I'm also rusty on _why_ this code is needed.  Why isn't it 
> enough to inject the exception normally, and let 
> nested_vmx_check_exception decide whether to inject a vmexit to L1 or an 
> exception into L2?
> 
> Also, bonus question which should have been in the 5/7 changelog: are 
> there kvm-unit-tests testcases that fail with npt=0, and if not could we 
> write one?  [Answer: the mode_switch testcase fails, but I haven't 
> checked why].

I agree with all of this. I'll look into why this code is needed (it is
needed, since I once accidentally removed it on VMX, and that broke
nesting with ept=0 in exactly the same way as it was broken on AMD).

I'll debug this a bit to see if I can make it work as you suggest.


Best regards,
	Maxim Levitsky
> 
> 
> Paolo
> 




* Re: [PATCH 4/7] KVM: nVMX: move inject_page_fault tweak to .complete_mmu_init
  2021-02-17 18:43     ` Maxim Levitsky
@ 2021-02-18  9:45       ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2021-02-18  9:45 UTC (permalink / raw)
  To: Maxim Levitsky, Sean Christopherson
  Cc: kvm, linux-kernel, Wanpeng Li, Borislav Petkov, Joerg Roedel,
	Jim Mattson, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Vitaly Kuznetsov, Ingo Molnar

On 17/02/21 19:43, Maxim Levitsky wrote:
> 1. npt/ept disabled in the host. In this case we have a single shadowing
> and a nested hypervisor has to do its own shadowing on top of it.
> In this case the MMU itself has to generate page faults (they are a result
> of hardware page faults, but are completely different), and in case
> of nesting these page faults have to be sometimes injected as VM exits.
> 
> [...] Also if the emulator injects the page fault, then indeed I think the
> bug will happen.

But in both cases you (ought to) get an injected exception, which then 
becomes a page fault vmexit at the next check_nested_events.  That's the 
part that we are all collectively missing.
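
In other words (a condensed sketch, not the verbatim
vmx_check_nested_events() code): a #PF queued with
kvm_inject_page_fault() is re-checked before the next VM entry and
turned into a vmexit if L1 intercepts it:

static bool reflect_pending_page_fault(struct kvm_vcpu *vcpu)
{
	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

	if (!vcpu->arch.exception.pending ||
	    vcpu->arch.exception.nr != PF_VECTOR ||
	    !nested_vmx_is_page_fault_vmexit(vmcs12,
					     vcpu->arch.exception.error_code))
		return false;	/* injected into L2 as usual */

	vmcs12->vm_exit_intr_error_code = vcpu->arch.exception.error_code;
	nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
			  PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
			  INTR_INFO_DELIVER_CODE_MASK | INTR_INFO_VALID_MASK,
			  vcpu->arch.exception.payload);
	return true;
}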

Paolo



end of thread

Thread overview: 22+ messages
2021-02-17 14:57 [PATCH 0/7] KVM: random nested fixes Maxim Levitsky
2021-02-17 14:57 ` [PATCH 1/7] KVM: VMX: read idt_vectoring_info a bit earlier Maxim Levitsky
2021-02-17 16:06   ` Paolo Bonzini
2021-02-17 16:18     ` Maxim Levitsky
2021-02-17 16:21       ` Sean Christopherson
2021-02-17 16:29         ` Paolo Bonzini
2021-02-17 14:57 ` [PATCH 2/7] KVM: nSVM: move nested vmrun tracepoint to enter_svm_guest_mode Maxim Levitsky
2021-02-17 14:57 ` [PATCH 3/7] KVM: x86: add .complete_mmu_init arch callback Maxim Levitsky
2021-02-17 14:57 ` [PATCH 4/7] KVM: nVMX: move inject_page_fault tweak to .complete_mmu_init Maxim Levitsky
2021-02-17 17:29   ` Sean Christopherson
2021-02-17 17:37     ` Paolo Bonzini
2021-02-17 17:57       ` Sean Christopherson
2021-02-17 18:00         ` Paolo Bonzini
2021-02-17 18:49       ` Maxim Levitsky
2021-02-17 18:43     ` Maxim Levitsky
2021-02-18  9:45       ` Paolo Bonzini
2021-02-17 14:57 ` [PATCH 5/7] KVM: nSVM: fix running nested guests when npt=0 Maxim Levitsky
2021-02-17 15:27   ` Maxim Levitsky
2021-02-17 14:57 ` [PATCH 6/7] KVM: nVMX: don't load PDPTRS right after nested state set Maxim Levitsky
2021-02-17 17:52   ` Sean Christopherson
2021-02-17 18:06     ` Paolo Bonzini
2021-02-17 14:57 ` [PATCH 7/7] KVM: nSVM: call nested_svm_load_cr3 on nested state load Maxim Levitsky
