* [PATCH 0/8] KVM: SVM: Fix soft int/ex re-injection
@ 2022-04-02  1:08 Sean Christopherson
  2022-04-02  1:08 ` [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 Sean Christopherson
                   ` (7 more replies)
  0 siblings, 8 replies; 45+ messages in thread
From: Sean Christopherson @ 2022-04-02  1:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Maciej S . Szmigiero

This is a continuation/alternative of Maciej's series[*] to fix soft
interrupt/exception reinjection.  The core difference is that this version
fixes the underlying issue of not doing proper reinjection, which
manifests most readily as the nested virtualization bugs Maciej's series
addressed.

The underlying issue is that SVM simply retries INT* instructions instead
of reinjecting the soft interrupt/exception if an exception VM-Exit occurred
during vectoring.  Lack of reinjection breaks nested virtualization if
the injected event came from L1 and the VM-Exit is not forwarded to L1,
as there is no instruction to retry.  More fundamentally, retrying the
instruction is wrong as it can produce side effects that shouldn't occur,
e.g. spurious code breakpoint #DBs.
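
To make that concrete, here is a minimal userspace sketch (not KVM code;
all names are hypothetical) of why retrying the instruction misbehaves
when a code breakpoint is armed on the INT3 itself:

#include <stdbool.h>
#include <stdio.h>

struct vcpu_model {
	unsigned long rip;
	bool code_bp_at_rip;	/* guest code breakpoint on the INT3 */
};

/* Retry: re-execute the INT3, re-triggering its side effects. */
static void retry_instruction(const struct vcpu_model *v)
{
	if (v->code_bp_at_rip)
		printf("spurious #DB at rip=0x%lx\n", v->rip);
	else
		printf("#BP delivered at rip=0x%lx\n", v->rip);
}

/* Re-inject: deliver the original #BP directly, no re-execution. */
static void reinject_event(const struct vcpu_model *v)
{
	printf("#BP delivered at rip=0x%lx\n", v->rip);
}

int main(void)
{
	struct vcpu_model v = { .rip = 0x1000, .code_bp_at_rip = true };

	retry_instruction(&v);	/* guest sees a #DB that shouldn't exist */
	reinject_event(&v);	/* guest sees only the original #BP */
	return 0;
}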

VMX has been fixed since commit 66fd3f7f901f ("KVM: Do not re-execute
INTn instruction."), but SVM was left behind, probably because fixing
SVM is a mess: NRIPS isn't supported on all CPUs, and where it is
supported, it is poorly implemented with respect to soft events.

[*] https://lore.kernel.org/all/cover.1646944472.git.maciej.szmigiero@oracle.com

Maciej S. Szmigiero (3):
  KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  KVM: SVM: Downgrade BUG_ON() to WARN_ON() in svm_inject_irq()
  KVM: selftests: nSVM: Add svm_nested_soft_inject_test

Sean Christopherson (5):
  KVM: SVM: Unwind "speculative" RIP advancement if INTn injection
    "fails"
  KVM: SVM: Stuff next_rip on emulated INT3 injection if NRIPS is
    supported
  KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  KVM: SVM: Re-inject INTn instead of retrying the insn on "failure"
  KVM: x86: Trace re-injected exceptions

 arch/x86/kvm/svm/nested.c                     |  22 ++-
 arch/x86/kvm/svm/svm.c                        | 135 ++++++++++++----
 arch/x86/kvm/svm/svm.h                        |   5 +-
 arch/x86/kvm/x86.c                            |   8 +-
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/svm_util.h   |   2 +
 .../kvm/x86_64/svm_nested_soft_inject_test.c  | 147 ++++++++++++++++++
 8 files changed, 279 insertions(+), 42 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c


base-commit: 81d50efcff6cf4310aaf6a19806416ffeccf1cdb
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-04-02  1:08 [PATCH 0/8] KVM: SVM: Fix soft int/ex re-injection Sean Christopherson
@ 2022-04-02  1:08 ` Sean Christopherson
  2022-04-04  9:54   ` Maxim Levitsky
  2022-04-04 16:50   ` Maciej S. Szmigiero
  2022-04-02  1:08 ` [PATCH 2/8] KVM: SVM: Downgrade BUG_ON() to WARN_ON() in svm_inject_irq() Sean Christopherson
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 45+ messages in thread
From: Sean Christopherson @ 2022-04-02  1:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Maciej S . Szmigiero

From: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>

The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
This field value (instead of the saved guest RIP) is used by the CPU for
the return address pushed on the stack when injecting a software interrupt
or an INT3 or INTO exception.

Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
loading a nested state and NRIPS is exposed to L1.  If NRIPS is supported
in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff
vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which
saves RIP on the stack as-is).
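
As a minimal model of the hardware behavior described above (not code
from the patch; names are made up), the return address a CPU pushes for
an injected soft event is:

#include <assert.h>
#include <stdbool.h>

/*
 * With NRIPS the CPU pushes next_rip as the return address of an
 * injected soft interrupt/exception; without NRIPS it pushes RIP
 * as-is, so the hypervisor must advance RIP before injecting.
 */
static unsigned long pushed_return_addr(bool nrips, unsigned long rip,
					unsigned long next_rip)
{
	return nrips ? next_rip : rip;
}

int main(void)
{
	/* An INT3 at 0x1000 is one byte; the return address is 0x1001. */
	assert(pushed_return_addr(true, 0x1000, 0x1001) == 0x1001);

	/* !NRIPS: RIP must already point past the INT3 when injecting. */
	assert(pushed_return_addr(false, 0x1001, 0) == 0x1001);
	return 0;
}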

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/nested.c | 22 +++++++++++++++++++---
 arch/x86/kvm/svm/svm.h    |  1 +
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 73b545278f5f..9a6dc2b38fcf 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -369,6 +369,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
 	to->nested_ctl          = from->nested_ctl;
 	to->event_inj           = from->event_inj;
 	to->event_inj_err       = from->event_inj_err;
+	to->next_rip            = from->next_rip;
 	to->nested_cr3          = from->nested_cr3;
 	to->virt_ext            = from->virt_ext;
 	to->pause_filter_count  = from->pause_filter_count;
@@ -606,7 +607,8 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12
 	}
 }
 
-static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
+static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
+					  unsigned long vmcb12_rip)
 {
 	u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK;
 	u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK;
@@ -660,6 +662,19 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
 	vmcb02->control.event_inj           = svm->nested.ctl.event_inj;
 	vmcb02->control.event_inj_err       = svm->nested.ctl.event_inj_err;
 
+	/*
+	 * next_rip is consumed on VMRUN as the return address pushed on the
+	 * stack for injected soft exceptions/interrupts.  If nrips is exposed
+	 * to L1, take it verbatim from vmcb12.  If nrips is supported in
+	 * hardware but not exposed to L1, stuff the actual L2 RIP to emulate
+	 * what a nrips=0 CPU would do (L1 is responsible for advancing RIP
+	 * prior to injecting the event).
+	 */
+	if (svm->nrips_enabled)
+		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
+	else if (boot_cpu_has(X86_FEATURE_NRIPS))
+		vmcb02->control.next_rip    = vmcb12_rip;
+
 	vmcb02->control.virt_ext            = vmcb01->control.virt_ext &
 					      LBR_CTL_ENABLE_MASK;
 	if (svm->lbrv_enabled)
@@ -743,7 +758,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
 	nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
 
 	svm_switch_vmcb(svm, &svm->nested.vmcb02);
-	nested_vmcb02_prepare_control(svm);
+	nested_vmcb02_prepare_control(svm, vmcb12->save.rip);
 	nested_vmcb02_prepare_save(svm, vmcb12);
 
 	ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
@@ -1422,6 +1437,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
 	dst->nested_ctl           = from->nested_ctl;
 	dst->event_inj            = from->event_inj;
 	dst->event_inj_err        = from->event_inj_err;
+	dst->next_rip             = from->next_rip;
 	dst->nested_cr3           = from->nested_cr3;
 	dst->virt_ext              = from->virt_ext;
 	dst->pause_filter_count   = from->pause_filter_count;
@@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 	nested_copy_vmcb_control_to_cache(svm, ctl);
 
 	svm_switch_vmcb(svm, &svm->nested.vmcb02);
-	nested_vmcb02_prepare_control(svm);
+	nested_vmcb02_prepare_control(svm, save->rip);
 
 	/*
 	 * While the nested guest CR3 is already checked and set by
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index e246793cbeae..47e7427d0395 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -139,6 +139,7 @@ struct vmcb_ctrl_area_cached {
 	u64 nested_ctl;
 	u32 event_inj;
 	u32 event_inj_err;
+	u64 next_rip;
 	u64 nested_cr3;
 	u64 virt_ext;
 	u32 clean;
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH 2/8] KVM: SVM: Downgrade BUG_ON() to WARN_ON() in svm_inject_irq()
  2022-04-02  1:08 [PATCH 0/8] KVM: SVM: Fix soft int/ex re-injection Sean Christopherson
  2022-04-02  1:08 ` [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 Sean Christopherson
@ 2022-04-02  1:08 ` Sean Christopherson
  2022-04-02  1:08 ` [PATCH 3/8] KVM: SVM: Unwind "speculative" RIP advancement if INTn injection "fails" Sean Christopherson
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Sean Christopherson @ 2022-04-02  1:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Maciej S . Szmigiero

From: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>

There is no need to bring down the whole host just because there might be
some issue with respect to guest GIF handling in KVM.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/svm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 95b26dbfd561..2c86bd9176c6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3383,7 +3383,7 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
-	BUG_ON(!(gif_set(svm)));
+	WARN_ON(!gif_set(svm));
 
 	trace_kvm_inj_virq(vcpu->arch.interrupt.nr);
 	++vcpu->stat.irq_injections;
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH 3/8] KVM: SVM: Unwind "speculative" RIP advancement if INTn injection "fails"
  2022-04-02  1:08 [PATCH 0/8] KVM: SVM: Fix soft int/ex re-injection Sean Christopherson
  2022-04-02  1:08 ` [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 Sean Christopherson
  2022-04-02  1:08 ` [PATCH 2/8] KVM: SVM: Downgrade BUG_ON() to WARN_ON() in svm_inject_irq() Sean Christopherson
@ 2022-04-02  1:08 ` Sean Christopherson
  2022-04-04 10:03   ` Maxim Levitsky
  2022-04-20 15:01   ` Paolo Bonzini
  2022-04-02  1:08 ` [PATCH 4/8] KVM: SVM: Stuff next_rip on emulated INT3 injection if NRIPS is supported Sean Christopherson
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 45+ messages in thread
From: Sean Christopherson @ 2022-04-02  1:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Maciej S . Szmigiero

Unwind the RIP advancement done by svm_queue_exception() when injecting
an INT3 ultimately "fails" due to the CPU encountering a VM-Exit while
vectoring the injected event, even if the exception reported by the CPU
isn't the same event that was injected.  If vectoring INT3 encounters an
exception, e.g. #NP, and vectoring the #NP encounters an intercepted
exception, e.g. #PF when KVM is using shadow paging, then the #NP will
be reported as the event that was in-progress.

Note, this is still imperfect, as it will get a false positive if the
INT3 is cleanly injected, no VM-Exit occurs before the IRET from the INT3
handler in the guest, the instruction following the INT3 generates an
exception (directly or indirectly), _and_ vectoring that exception
encounters an exception that is intercepted by KVM.  The false positives
could theoretically be solved by further analyzing the vectoring event,
e.g. by comparing the error code against the expected error code were an
exception to occur when vectoring the original injected exception, but
SVM without NRIPS is a complete disaster; trying to make it 100% correct
is a waste of time.
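
A short worked example (hypothetical values, not the patch itself) of the
record-and-unwind arithmetic the fix relies on:

#include <assert.h>

int main(void)
{
	unsigned long old_rip = 0x1000;		/* the INT3 lives here     */
	unsigned long rip = old_rip + 1;	/* after the emulated skip */
	unsigned int int3_injected = rip - old_rip;	/* bytes to unwind */

	/*
	 * On an exception VM-Exit while vectoring from this linear RIP,
	 * unwind even though the reported vector may be an intermediate
	 * event (e.g. the #NP) rather than the injected #BP.
	 */
	rip -= int3_injected;
	assert(rip == old_rip);	/* back at the INT3, ready to re-inject */
	return 0;
}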

Fixes: 66b7138f9136 ("KVM: SVM: Emulate nRIP feature when reinjecting INT3")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/svm.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 2c86bd9176c6..30cef3b10838 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3699,6 +3699,18 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 	vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
 	type = exitintinfo & SVM_EXITINTINFO_TYPE_MASK;
 
+	/*
+	 * If NextRIP isn't enabled, KVM must manually advance RIP prior to
+	 * injecting the soft exception/interrupt.  That advancement needs to
+	 * be unwound if vectoring didn't complete.  Note, the _new_ event may
+	 * not be the injected event, e.g. if KVM injected an INTn, the INTn
+	 * hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
+	 * be the reported vectored event, but RIP still needs to be unwound.
+	 */
+	if (int3_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
+	   kvm_is_linear_rip(vcpu, svm->int3_rip))
+		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - int3_injected);
+
 	switch (type) {
 	case SVM_EXITINTINFO_TYPE_NMI:
 		vcpu->arch.nmi_injected = true;
@@ -3715,13 +3727,9 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 		 * but re-execute the instruction instead. Rewind RIP first
 		 * if we emulated INT3 before.
 		 */
-		if (kvm_exception_is_soft(vector)) {
-			if (vector == BP_VECTOR && int3_injected &&
-			    kvm_is_linear_rip(vcpu, svm->int3_rip))
-				kvm_rip_write(vcpu,
-					      kvm_rip_read(vcpu) - int3_injected);
+		if (kvm_exception_is_soft(vector))
 			break;
-		}
+
 		if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) {
 			u32 err = svm->vmcb->control.exit_int_info_err;
 			kvm_requeue_exception_e(vcpu, vector, err);
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH 4/8] KVM: SVM: Stuff next_rip on emulated INT3 injection if NRIPS is supported
  2022-04-02  1:08 [PATCH 0/8] KVM: SVM: Fix soft int/ex re-injection Sean Christopherson
                   ` (2 preceding siblings ...)
  2022-04-02  1:08 ` [PATCH 3/8] KVM: SVM: Unwind "speculative" RIP advancement if INTn injection "fails" Sean Christopherson
@ 2022-04-02  1:08 ` Sean Christopherson
  2022-04-04 12:00   ` Maxim Levitsky
  2022-04-02  1:09 ` [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction Sean Christopherson
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Sean Christopherson @ 2022-04-02  1:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Maciej S . Szmigiero

If NRIPS is supported in hardware but disabled in KVM, set next_rip to
the next RIP when advancing RIP as part of emulating INT3 injection.
There is no flag to tell the CPU that KVM isn't using next_rip, and so
leaving next_rip as-is will result in the CPU pushing garbage
onto the stack when vectoring the injected event.
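
A sketch of the combination being fixed (hypothetical struct, not the
patch): KVM emulates the skip because nrips is off in KVM, but
NRIPS-capable hardware still consumes next_rip when vectoring the
injected event, so both fields must agree:

#include <assert.h>

struct vmcb_model {
	unsigned long save_rip;		/* guest RIP                      */
	unsigned long ctl_next_rip;	/* consumed as the return address */
};

static void emulate_int3_skip(struct vmcb_model *v, int hw_has_nrips)
{
	v->save_rip += 1;		/* manual advance over the INT3 */
	if (hw_has_nrips)
		v->ctl_next_rip = v->save_rip;	/* else garbage is pushed */
}

int main(void)
{
	struct vmcb_model v = { .save_rip = 0x1000, .ctl_next_rip = 0xdead };

	emulate_int3_skip(&v, 1);
	assert(v.ctl_next_rip == 0x1001);	/* correct return address */
	return 0;
}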

Fixes: 66b7138f9136 ("KVM: SVM: Emulate nRIP feature when reinjecting INT3")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/svm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 30cef3b10838..6ea8f16e39ac 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -391,6 +391,10 @@ static void svm_queue_exception(struct kvm_vcpu *vcpu)
 		 */
 		(void)svm_skip_emulated_instruction(vcpu);
 		rip = kvm_rip_read(vcpu);
+
+		if (boot_cpu_has(X86_FEATURE_NRIPS))
+			svm->vmcb->control.next_rip = rip;
+
 		svm->int3_rip = rip + svm->vmcb->save.cs.base;
 		svm->int3_injected = rip - old_rip;
 	}
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-02  1:08 [PATCH 0/8] KVM: SVM: Fix soft int/ex re-injection Sean Christopherson
                   ` (3 preceding siblings ...)
  2022-04-02  1:08 ` [PATCH 4/8] KVM: SVM: Stuff next_rip on emulated INT3 injection if NRIPS is supported Sean Christopherson
@ 2022-04-02  1:09 ` Sean Christopherson
  2022-04-04 12:12   ` Maxim Levitsky
  2022-04-02  1:09 ` [PATCH 6/8] KVM: SVM: Re-inject INTn instead of retrying the insn on "failure" Sean Christopherson
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Sean Christopherson @ 2022-04-02  1:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Maciej S . Szmigiero

Re-inject INT3/INTO instead of retrying the instruction if the CPU
encountered an intercepted exception while vectoring the software
exception, e.g. if vectoring INT3 encounters a #PF and KVM is using
shadow paging.  Retrying the instruction is architecturally wrong, e.g.
will result in a spurious #DB if there's a code breakpoint on the INT3/O,
and lack of re-injection also breaks nested virtualization, e.g. if L1
injects a software exception and vectoring the injected exception
encounters an exception that is intercepted by L0 but not L1.

Due to, ahem, deficiencies in the SVM architecture, acquiring the next
RIP may require flowing through the emulator even if NRIPS is supported,
as the CPU clears next_rip if the VM-Exit is due to an exception other
than "exceptions caused by the INT3, INTO, and BOUND instructions".  To
deal with this, "skip" the instruction to calculate next_rip, and then
unwind the RIP write and any side effects (RFLAGS updates).
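
In sketch form (hypothetical names, simplified flags), the "skip without
committing side effects" approach snapshots what the emulated skip may
clobber and restores it once the next RIP is known:

#include <assert.h>

#define RFLAGS_RF	0x10000UL	/* resume flag, clobbered by a skip */

struct cpu_model {
	unsigned long rip;
	unsigned long rflags;
};

/* Pretend-skip an instruction just to learn the next RIP. */
static unsigned long peek_next_rip(struct cpu_model *c, int insn_len)
{
	unsigned long old_rip = c->rip, old_rflags = c->rflags;
	unsigned long next;

	c->rip += insn_len;		/* the "emulated skip"       */
	c->rflags &= ~RFLAGS_RF;	/* a side effect of skipping */
	next = c->rip;

	c->rip = old_rip;		/* unwind the RIP write...   */
	c->rflags = old_rflags;		/* ...and the side effects   */
	return next;
}

int main(void)
{
	struct cpu_model c = { .rip = 0x1000, .rflags = RFLAGS_RF | 0x2 };

	assert(peek_next_rip(&c, 2) == 0x1002);	/* e.g. a 2-byte INTn */
	assert(c.rip == 0x1000 && c.rflags == (RFLAGS_RF | 0x2));
	return 0;
}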

Reported-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/svm.c | 111 ++++++++++++++++++++++++++++-------------
 arch/x86/kvm/svm/svm.h |   4 +-
 2 files changed, 79 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 6ea8f16e39ac..ecc828d6921e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -341,9 +341,11 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
 
 }
 
-static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
+					   bool commit_side_effects)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	unsigned long old_rflags;
 
 	/*
 	 * SEV-ES does not expose the next RIP. The RIP update is controlled by
@@ -358,18 +360,71 @@ static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
 	}
 
 	if (!svm->next_rip) {
+		if (unlikely(!commit_side_effects))
+			old_rflags = svm->vmcb->save.rflags;
+
 		if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
 			return 0;
+
+		if (unlikely(!commit_side_effects))
+			svm->vmcb->save.rflags = old_rflags;
 	} else {
 		kvm_rip_write(vcpu, svm->next_rip);
 	}
 
 done:
-	svm_set_interrupt_shadow(vcpu, 0);
+	if (likely(commit_side_effects))
+		svm_set_interrupt_shadow(vcpu, 0);
 
 	return 1;
 }
 
+static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+{
+	return __svm_skip_emulated_instruction(vcpu, true);
+}
+
+static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
+{
+	unsigned long rip, old_rip = kvm_rip_read(vcpu);
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	/*
+	 * Due to architectural shortcomings, the CPU doesn't always provide
+	 * NextRIP, e.g. if KVM intercepted an exception that occurred while
+	 * the CPU was vectoring an INTO/INT3 in the guest.  Temporarily skip
+	 * the instruction even if NextRIP is supported to acquire the next
+	 * RIP so that it can be shoved into the NextRIP field, otherwise
+	 * hardware will fail to advance guest RIP during event injection.
+	 * Drop the exception/interrupt if emulation fails and effectively
+	 * retry the instruction, it's the least awful option.  If NRIPS is
+	 * in use, the skip must not commit any side effects such as clearing
+	 * the interrupt shadow or RFLAGS.RF.
+	 */
+	if (!__svm_skip_emulated_instruction(vcpu, !nrips))
+		return -EIO;
+
+	rip = kvm_rip_read(vcpu);
+
+	/*
+	 * If NextRIP is supported, rewind RIP and update NextRip.  If NextRip
+	 * isn't supported, keep the result of the skip as the CPU obviously
+	 * won't advance RIP, but stash away the injection information so that
+	 * RIP can be unwound if injection fails.
+	 */
+	if (nrips) {
+		kvm_rip_write(vcpu, old_rip);
+		svm->vmcb->control.next_rip = rip;
+	} else {
+		if (boot_cpu_has(X86_FEATURE_NRIPS))
+			svm->vmcb->control.next_rip = rip;
+
+		svm->soft_int_linear_rip = rip + svm->vmcb->save.cs.base;
+		svm->soft_int_injected = rip - old_rip;
+	}
+	return 0;
+}
+
 static void svm_queue_exception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -379,25 +434,9 @@ static void svm_queue_exception(struct kvm_vcpu *vcpu)
 
 	kvm_deliver_exception_payload(vcpu);
 
-	if (nr == BP_VECTOR && !nrips) {
-		unsigned long rip, old_rip = kvm_rip_read(vcpu);
-
-		/*
-		 * For guest debugging where we have to reinject #BP if some
-		 * INT3 is guest-owned:
-		 * Emulate nRIP by moving RIP forward. Will fail if injection
-		 * raises a fault that is not intercepted. Still better than
-		 * failing in all cases.
-		 */
-		(void)svm_skip_emulated_instruction(vcpu);
-		rip = kvm_rip_read(vcpu);
-
-		if (boot_cpu_has(X86_FEATURE_NRIPS))
-			svm->vmcb->control.next_rip = rip;
-
-		svm->int3_rip = rip + svm->vmcb->save.cs.base;
-		svm->int3_injected = rip - old_rip;
-	}
+	if (kvm_exception_is_soft(nr) &&
+	    svm_update_soft_interrupt_rip(vcpu))
+		return;
 
 	svm->vmcb->control.event_inj = nr
 		| SVM_EVTINJ_VALID
@@ -3676,9 +3715,9 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 	u8 vector;
 	int type;
 	u32 exitintinfo = svm->vmcb->control.exit_int_info;
-	unsigned int3_injected = svm->int3_injected;
+	unsigned soft_int_injected = svm->soft_int_injected;
 
-	svm->int3_injected = 0;
+	svm->soft_int_injected = 0;
 
 	/*
 	 * If we've made progress since setting HF_IRET_MASK, we've
@@ -3698,6 +3737,18 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 	if (!(exitintinfo & SVM_EXITINTINFO_VALID))
 		return;
 
+	/*
+	 * If NextRIP isn't enabled, KVM must manually advance RIP prior to
+	 * injecting the soft exception/interrupt.  That advancement needs to
+	 * be unwound if vectoring didn't complete.  Note, the _new_ event may
+	 * not be the injected event, e.g. if KVM injected an INTn, the INTn
+	 * hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
+	 * be the reported vectored event, but RIP still needs to be unwound.
+	 */
+	if (soft_int_injected &&
+	    kvm_is_linear_rip(vcpu, to_svm(vcpu)->soft_int_linear_rip))
+		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - soft_int_injected);
+
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 
 	vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
@@ -3711,9 +3762,9 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 	 * hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
 	 * be the reported vectored event, but RIP still needs to be unwound.
 	 */
-	if (int3_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
-	   kvm_is_linear_rip(vcpu, svm->int3_rip))
-		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - int3_injected);
+	if (soft_int_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
+	   kvm_is_linear_rip(vcpu, svm->soft_int_linear_rip))
+		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - soft_int_injected);
 
 	switch (type) {
 	case SVM_EXITINTINFO_TYPE_NMI:
@@ -3726,14 +3777,6 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 		if (vector == X86_TRAP_VC)
 			break;
 
-		/*
-		 * In case of software exceptions, do not reinject the vector,
-		 * but re-execute the instruction instead. Rewind RIP first
-		 * if we emulated INT3 before.
-		 */
-		if (kvm_exception_is_soft(vector))
-			break;
-
 		if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) {
 			u32 err = svm->vmcb->control.exit_int_info_err;
 			kvm_requeue_exception_e(vcpu, vector, err);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 47e7427d0395..a770a1c7ddd2 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -230,8 +230,8 @@ struct vcpu_svm {
 	bool nmi_singlestep;
 	u64 nmi_singlestep_guest_rflags;
 
-	unsigned int3_injected;
-	unsigned long int3_rip;
+	unsigned soft_int_injected;
+	unsigned long soft_int_linear_rip;
 
 	/* optional nested SVM features that are enabled for this guest  */
 	bool nrips_enabled                : 1;
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH 6/8] KVM: SVM: Re-inject INTn instead of retrying the insn on "failure"
  2022-04-02  1:08 [PATCH 0/8] KVM: SVM: Fix soft int/ex re-injection Sean Christopherson
                   ` (4 preceding siblings ...)
  2022-04-02  1:09 ` [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction Sean Christopherson
@ 2022-04-02  1:09 ` Sean Christopherson
  2022-04-04 17:14   ` Sean Christopherson
  2022-04-04 20:27   ` Maciej S. Szmigiero
  2022-04-02  1:09 ` [PATCH 7/8] KVM: x86: Trace re-injected exceptions Sean Christopherson
  2022-04-02  1:09 ` [PATCH 8/8] KVM: selftests: nSVM: Add svm_nested_soft_inject_test Sean Christopherson
  7 siblings, 2 replies; 45+ messages in thread
From: Sean Christopherson @ 2022-04-02  1:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Maciej S . Szmigiero

Re-inject INTn software interrupts instead of retrying the instruction if
the CPU encountered an intercepted exception while vectoring the INTn,
e.g. if KVM intercepted a #PF when utilizing shadow paging.  Retrying the
instruction is architecturally wrong, e.g. it will result in a spurious #DB
if there's a code breakpoint on the INTn, and lack of re-injection also
breaks nested virtualization, e.g. if L1 injects a software interrupt and
vectoring the injected interrupt encounters an exception that is
intercepted by L0 but not L1.
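
A minimal model (not the patch; names are invented) of what completing an
interrupted injection now does: requeue the event with its original type,
so a software interrupt is re-injected rather than re-executed:

#include <stdio.h>

enum inj_type { TYPE_INTR, TYPE_SOFT };

static void queue_interrupt(unsigned char vec, int soft)
{
	printf("requeue vector 0x%x as %s\n", (unsigned)vec,
	       soft ? "soft" : "hard");
}

static void complete_injection(enum inj_type type, unsigned char vector)
{
	switch (type) {
	case TYPE_INTR:
		queue_interrupt(vector, 0);	/* external interrupt */
		break;
	case TYPE_SOFT:
		/* INTn: re-inject on the next entry, don't retry the insn */
		queue_interrupt(vector, 1);
		break;
	}
}

int main(void)
{
	complete_injection(TYPE_SOFT, 0x20);	/* interrupted INT 0x20 */
	return 0;
}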

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/svm.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index ecc828d6921e..00b1399681d1 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3425,14 +3425,24 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
 static void svm_inject_irq(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	u32 type;
 
 	WARN_ON(!gif_set(svm));
 
+	if (vcpu->arch.interrupt.soft) {
+		if (svm_update_soft_interrupt_rip(vcpu))
+			return;
+
+		type = SVM_EVTINJ_TYPE_SOFT;
+	} else {
+		type = SVM_EVTINJ_TYPE_INTR;
+	}
+
 	trace_kvm_inj_virq(vcpu->arch.interrupt.nr);
 	++vcpu->stat.irq_injections;
 
 	svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
-		SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
+				       SVM_EVTINJ_VALID | type;
 }
 
 void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode,
@@ -3787,9 +3797,13 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 	case SVM_EXITINTINFO_TYPE_INTR:
 		kvm_queue_interrupt(vcpu, vector, false);
 		break;
+	case SVM_EXITINTINFO_TYPE_SOFT:
+		kvm_queue_interrupt(vcpu, vector, true);
+		break;
 	default:
 		break;
 	}
+
 }
 
 static void svm_cancel_injection(struct kvm_vcpu *vcpu)
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH 7/8] KVM: x86: Trace re-injected exceptions
  2022-04-02  1:08 [PATCH 0/8] KVM: SVM: Fix soft int/ex re-injection Sean Christopherson
                   ` (5 preceding siblings ...)
  2022-04-02  1:09 ` [PATCH 6/8] KVM: SVM: Re-inject INTn instead of retrying the insn on "failure" Sean Christopherson
@ 2022-04-02  1:09 ` Sean Christopherson
  2022-04-04 12:14   ` Maxim Levitsky
  2022-04-02  1:09 ` [PATCH 8/8] KVM: selftests: nSVM: Add svm_nested_soft_inject_test Sean Christopherson
  7 siblings, 1 reply; 45+ messages in thread
From: Sean Christopherson @ 2022-04-02  1:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Maciej S . Szmigiero

Trace exceptions that are re-injected, not just those that KVM is
injecting for the first time.  Debugging re-injection bugs is painful
enough as is; not having visibility into what KVM is doing only makes
things worse.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7a066cf92692..384091600bc2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9382,6 +9382,10 @@ int kvm_check_nested_events(struct kvm_vcpu *vcpu)
 
 static void kvm_inject_exception(struct kvm_vcpu *vcpu)
 {
+	trace_kvm_inj_exception(vcpu->arch.exception.nr,
+				vcpu->arch.exception.has_error_code,
+				vcpu->arch.exception.error_code);
+
 	if (vcpu->arch.exception.error_code && !is_protmode(vcpu))
 		vcpu->arch.exception.error_code = false;
 	static_call(kvm_x86_queue_exception)(vcpu);
@@ -9439,10 +9443,6 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
 
 	/* try to inject new event if pending */
 	if (vcpu->arch.exception.pending) {
-		trace_kvm_inj_exception(vcpu->arch.exception.nr,
-					vcpu->arch.exception.has_error_code,
-					vcpu->arch.exception.error_code);
-
 		vcpu->arch.exception.pending = false;
 		vcpu->arch.exception.injected = true;
 
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH 8/8] KVM: selftests: nSVM: Add svm_nested_soft_inject_test
  2022-04-02  1:08 [PATCH 0/8] KVM: SVM: Fix soft int/ex re-injection Sean Christopherson
                   ` (6 preceding siblings ...)
  2022-04-02  1:09 ` [PATCH 7/8] KVM: x86: Trace re-injected exceptions Sean Christopherson
@ 2022-04-02  1:09 ` Sean Christopherson
  2022-04-04 12:27   ` Maxim Levitsky
  7 siblings, 1 reply; 45+ messages in thread
From: Sean Christopherson @ 2022-04-02  1:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Maciej S . Szmigiero

From: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>

Add a KVM self-test that checks whether a nSVM L1 is able to successfully
inject a software interrupt and a soft exception into its L2 guest.

In practice, this tests both next_rip field consistency and the handling
of an L1-injected event when an L0 VMEXIT intervenes during its delivery:
the first nested VMRUN (which also tries to inject a software interrupt)
immediately triggers an L0 NPF.  That NPF arrives with zero in the
CPU-returned next_rip field; if KVM incorrectly reuses that value, the
guest will take a #PF when the interrupt handler tries to return to
address 0.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/svm_util.h   |   2 +
 .../kvm/x86_64/svm_nested_soft_inject_test.c  | 147 ++++++++++++++++++
 4 files changed, 151 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 1f1b6c978bf7..1354d3f4a362 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -33,6 +33,7 @@
 /x86_64/state_test
 /x86_64/svm_vmcall_test
 /x86_64/svm_int_ctl_test
+/x86_64/svm_nested_soft_inject_test
 /x86_64/sync_regs_test
 /x86_64/tsc_msrs_test
 /x86_64/userspace_io_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index c9cdbd248727..cef6d583044b 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -66,6 +66,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/state_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_preemption_timer_test
 TEST_GEN_PROGS_x86_64 += x86_64/svm_vmcall_test
 TEST_GEN_PROGS_x86_64 += x86_64/svm_int_ctl_test
+TEST_GEN_PROGS_x86_64 += x86_64/svm_nested_soft_inject_test
 TEST_GEN_PROGS_x86_64 += x86_64/tsc_scaling_sync
 TEST_GEN_PROGS_x86_64 += x86_64/sync_regs_test
 TEST_GEN_PROGS_x86_64 += x86_64/userspace_io_test
diff --git a/tools/testing/selftests/kvm/include/x86_64/svm_util.h b/tools/testing/selftests/kvm/include/x86_64/svm_util.h
index a25aabd8f5e7..d49f7c9b4564 100644
--- a/tools/testing/selftests/kvm/include/x86_64/svm_util.h
+++ b/tools/testing/selftests/kvm/include/x86_64/svm_util.h
@@ -16,6 +16,8 @@
 #define CPUID_SVM_BIT		2
 #define CPUID_SVM		BIT_ULL(CPUID_SVM_BIT)
 
+#define SVM_EXIT_EXCP_BASE	0x040
+#define SVM_EXIT_HLT		0x078
 #define SVM_EXIT_MSR		0x07c
 #define SVM_EXIT_VMMCALL	0x081
 
diff --git a/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c b/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
new file mode 100644
index 000000000000..d39be5d885c1
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
@@ -0,0 +1,147 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022 Oracle and/or its affiliates.
+ *
+ * Based on:
+ *   svm_int_ctl_test
+ *
+ *   Copyright (C) 2021, Red Hat, Inc.
+ *
+ */
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "svm_util.h"
+
+#define VCPU_ID		0
+#define INT_NR			0x20
+#define X86_FEATURE_NRIPS	BIT(3)
+
+#define vmcall()		\
+	__asm__ __volatile__(	\
+		"vmmcall\n"	\
+		)
+
+#define ud2()			\
+	__asm__ __volatile__(	\
+		"ud2\n"	\
+		)
+
+#define hlt()			\
+	__asm__ __volatile__(	\
+		"hlt\n"	\
+		)
+
+static unsigned int bp_fired;
+static void guest_bp_handler(struct ex_regs *regs)
+{
+	bp_fired++;
+}
+
+static unsigned int int_fired;
+static void guest_int_handler(struct ex_regs *regs)
+{
+	int_fired++;
+}
+
+static void l2_guest_code(void)
+{
+	GUEST_ASSERT(int_fired == 1);
+	vmcall();
+	ud2();
+
+	GUEST_ASSERT(bp_fired == 1);
+	hlt();
+}
+
+static void l1_guest_code(struct svm_test_data *svm)
+{
+	#define L2_GUEST_STACK_SIZE 64
+	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+	struct vmcb *vmcb = svm->vmcb;
+
+	/* Prepare for L2 execution. */
+	generic_svm_setup(svm, l2_guest_code,
+			  &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+
+	vmcb->control.intercept_exceptions |= BIT(PF_VECTOR) | BIT(UD_VECTOR);
+	vmcb->control.intercept |= BIT(INTERCEPT_HLT);
+
+	vmcb->control.event_inj = INT_NR | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_SOFT;
+	/* The return address pushed on stack */
+	vmcb->control.next_rip = vmcb->save.rip;
+
+	run_guest(vmcb, svm->vmcb_gpa);
+	GUEST_ASSERT_3(vmcb->control.exit_code == SVM_EXIT_VMMCALL,
+		       vmcb->control.exit_code,
+		       vmcb->control.exit_info_1, vmcb->control.exit_info_2);
+
+	/* Skip over VMCALL */
+	vmcb->save.rip += 3;
+
+	vmcb->control.event_inj = BP_VECTOR | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_EXEPT;
+	/* The return address pushed on stack, skip over UD2 */
+	vmcb->control.next_rip = vmcb->save.rip + 2;
+
+	run_guest(vmcb, svm->vmcb_gpa);
+	GUEST_ASSERT_3(vmcb->control.exit_code == SVM_EXIT_HLT,
+		       vmcb->control.exit_code,
+		       vmcb->control.exit_info_1, vmcb->control.exit_info_2);
+
+	GUEST_DONE();
+}
+
+int main(int argc, char *argv[])
+{
+	struct kvm_cpuid_entry2 *cpuid;
+	struct kvm_vm *vm;
+	vm_vaddr_t svm_gva;
+	struct kvm_guest_debug debug;
+
+	nested_svm_check_supported();
+
+	cpuid = kvm_get_supported_cpuid_entry(0x8000000a);
+	if (!(cpuid->edx & X86_FEATURE_NRIPS)) {
+		print_skip("nRIP Save unavailable");
+		exit(KSFT_SKIP);
+	}
+
+	vm = vm_create_default(VCPU_ID, 0, (void *) l1_guest_code);
+
+	vm_init_descriptor_tables(vm);
+	vcpu_init_descriptor_tables(vm, VCPU_ID);
+
+	vm_install_exception_handler(vm, BP_VECTOR, guest_bp_handler);
+	vm_install_exception_handler(vm, INT_NR, guest_int_handler);
+
+	vcpu_alloc_svm(vm, &svm_gva);
+	vcpu_args_set(vm, VCPU_ID, 1, svm_gva);
+
+	memset(&debug, 0, sizeof(debug));
+	vcpu_set_guest_debug(vm, VCPU_ID, &debug);
+
+	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+	struct ucall uc;
+
+	vcpu_run(vm, VCPU_ID);
+	TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+		    "Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
+		    run->exit_reason,
+		    exit_reason_str(run->exit_reason));
+
+	switch (get_ucall(vm, VCPU_ID, &uc)) {
+	case UCALL_ABORT:
+		TEST_FAIL("%s at %s:%ld, vals = 0x%lx 0x%lx 0x%lx", (const char *)uc.args[0],
+			  __FILE__, uc.args[1], uc.args[2], uc.args[3], uc.args[4]);
+		break;
+		/* NOT REACHED */
+	case UCALL_DONE:
+		goto done;
+	default:
+		TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
+	}
+done:
+	kvm_vm_free(vm);
+	return 0;
+}
-- 
2.35.1.1094.g7c7d902a7c-goog



* Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-04-02  1:08 ` [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 Sean Christopherson
@ 2022-04-04  9:54   ` Maxim Levitsky
  2022-04-04 16:50   ` Maciej S. Szmigiero
  1 sibling, 0 replies; 45+ messages in thread
From: Maxim Levitsky @ 2022-04-04  9:54 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maciej S . Szmigiero

On Sat, 2022-04-02 at 01:08 +0000, Sean Christopherson wrote:
> From: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> 
> The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
> This field value (instead of the saved guest RIP) is used by the CPU for
> the return address pushed on the stack when injecting a software interrupt
> or an INT3 or INTO exception.
> 
> Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
> loading a nested state and NRIPS is exposed to L1.  If NRIPS is supported
> in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff
> vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which
> saves RIP on the stack as-is).
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/nested.c | 22 +++++++++++++++++++---
>  arch/x86/kvm/svm/svm.h    |  1 +
>  2 files changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 73b545278f5f..9a6dc2b38fcf 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -369,6 +369,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
>  	to->nested_ctl          = from->nested_ctl;
>  	to->event_inj           = from->event_inj;
>  	to->event_inj_err       = from->event_inj_err;
> +	to->next_rip            = from->next_rip;
>  	to->nested_cr3          = from->nested_cr3;
>  	to->virt_ext            = from->virt_ext;
>  	to->pause_filter_count  = from->pause_filter_count;
> @@ -606,7 +607,8 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12
>  	}
>  }
>  
> -static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
> +static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
> +					  unsigned long vmcb12_rip)
>  {
>  	u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK;
>  	u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK;
> @@ -660,6 +662,19 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
>  	vmcb02->control.event_inj           = svm->nested.ctl.event_inj;
>  	vmcb02->control.event_inj_err       = svm->nested.ctl.event_inj_err;
>  
> +	/*
> +	 * next_rip is consumed on VMRUN as the return address pushed on the
> +	 * stack for injected soft exceptions/interrupts.  If nrips is exposed
> +	 * to L1, take it verbatim from vmcb12.  If nrips is supported in
> +	 * hardware but not exposed to L1, stuff the actual L2 RIP to emulate
> +	 * what a nrips=0 CPU would do (L1 is responsible for advancing RIP
> +	 * prior to injecting the event).
> +	 */
> +	if (svm->nrips_enabled)
> +		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
> +	else if (boot_cpu_has(X86_FEATURE_NRIPS))
> +		vmcb02->control.next_rip    = vmcb12_rip;
> +
>  	vmcb02->control.virt_ext            = vmcb01->control.virt_ext &
>  					      LBR_CTL_ENABLE_MASK;
>  	if (svm->lbrv_enabled)
> @@ -743,7 +758,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
>  	nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
>  
>  	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> -	nested_vmcb02_prepare_control(svm);
> +	nested_vmcb02_prepare_control(svm, vmcb12->save.rip);
>  	nested_vmcb02_prepare_save(svm, vmcb12);
>  
>  	ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
> @@ -1422,6 +1437,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
>  	dst->nested_ctl           = from->nested_ctl;
>  	dst->event_inj            = from->event_inj;
>  	dst->event_inj_err        = from->event_inj_err;
> +	dst->next_rip             = from->next_rip;
>  	dst->nested_cr3           = from->nested_cr3;
>  	dst->virt_ext              = from->virt_ext;
>  	dst->pause_filter_count   = from->pause_filter_count;
> @@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>  	nested_copy_vmcb_control_to_cache(svm, ctl);
>  
>  	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> -	nested_vmcb02_prepare_control(svm);
> +	nested_vmcb02_prepare_control(svm, save->rip);
>  
>  	/*
>  	 * While the nested guest CR3 is already checked and set by
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index e246793cbeae..47e7427d0395 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -139,6 +139,7 @@ struct vmcb_ctrl_area_cached {
>  	u64 nested_ctl;
>  	u32 event_inj;
>  	u32 event_inj_err;
> +	u64 next_rip;
>  	u64 nested_cr3;
>  	u64 virt_ext;
>  	u32 clean;

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



* Re: [PATCH 3/8] KVM: SVM: Unwind "speculative" RIP advancement if INTn injection "fails"
  2022-04-02  1:08 ` [PATCH 3/8] KVM: SVM: Unwind "speculative" RIP advancement if INTn injection "fails" Sean Christopherson
@ 2022-04-04 10:03   ` Maxim Levitsky
  2022-04-20 15:01   ` Paolo Bonzini
  1 sibling, 0 replies; 45+ messages in thread
From: Maxim Levitsky @ 2022-04-04 10:03 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maciej S . Szmigiero

On Sat, 2022-04-02 at 01:08 +0000, Sean Christopherson wrote:
> Unwind the RIP advancement done by svm_queue_exception() when injecting
> an INT3 ultimately "fails" due to the CPU encountering a VM-Exit while
> vectoring the injected event, even if the exception reported by the CPU
> isn't the same event that was injected.  If vectoring INT3 encounters an
> exception, e.g. #NP, and vectoring the #NP encounters an intercepted
> exception, e.g. #PF when KVM is using shadow paging, then the #NP will
> be reported as the event that was in-progress.
> 
> Note, this is still imperfect, as it will get a false positive if the
> INT3 is cleanly injected, no VM-Exit occurs before the IRET from the INT3
> handler in the guest, the instruction following the INT3 generates an
> exception (directly or indirectly), _and_ vectoring that exception
> encounters an exception that is intercepted by KVM.  The false positives
> could theoretically be solved by further analyzing the vectoring event,
> e.g. by comparing the error code against the expected error code were an
> exception to occur when vectoring the original injected exception, but
> SVM without NRIPS is a complete disaster; trying to make it 100% correct
> is a waste of time.

Makes sense.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky

> 
> Fixes: 66b7138f9136 ("KVM: SVM: Emulate nRIP feature when reinjecting INT3")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/svm.c | 20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 2c86bd9176c6..30cef3b10838 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3699,6 +3699,18 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>  	vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
>  	type = exitintinfo & SVM_EXITINTINFO_TYPE_MASK;
>  
> +	/*
> +	 * If NextRIP isn't enabled, KVM must manually advance RIP prior to
> +	 * injecting the soft exception/interrupt.  That advancement needs to
> +	 * be unwound if vectoring didn't complete.  Note, the _new_ event may
> +	 * not be the injected event, e.g. if KVM injected an INTn, the INTn
> +	 * hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
> +	 * be the reported vectored event, but RIP still needs to be unwound.
> +	 */
> +	if (int3_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
> +	   kvm_is_linear_rip(vcpu, svm->int3_rip))
> +		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - int3_injected);
> +
>  	switch (type) {
>  	case SVM_EXITINTINFO_TYPE_NMI:
>  		vcpu->arch.nmi_injected = true;
> @@ -3715,13 +3727,9 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>  		 * but re-execute the instruction instead. Rewind RIP first
>  		 * if we emulated INT3 before.
>  		 */
> -		if (kvm_exception_is_soft(vector)) {
> -			if (vector == BP_VECTOR && int3_injected &&
> -			    kvm_is_linear_rip(vcpu, svm->int3_rip))
> -				kvm_rip_write(vcpu,
> -					      kvm_rip_read(vcpu) - int3_injected);
> +		if (kvm_exception_is_soft(vector))
>  			break;
> -		}
> +
>  		if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) {
>  			u32 err = svm->vmcb->control.exit_int_info_err;
>  			kvm_requeue_exception_e(vcpu, vector, err);




* Re: [PATCH 4/8] KVM: SVM: Stuff next_rip on emulated INT3 injection if NRIPS is supported
  2022-04-02  1:08 ` [PATCH 4/8] KVM: SVM: Stuff next_rip on emulated INT3 injection if NRIPS is supported Sean Christopherson
@ 2022-04-04 12:00   ` Maxim Levitsky
  0 siblings, 0 replies; 45+ messages in thread
From: Maxim Levitsky @ 2022-04-04 12:00 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maciej S . Szmigiero

On Sat, 2022-04-02 at 01:08 +0000, Sean Christopherson wrote:
> If NRIPS is supported in hardware but disabled in KVM, set next_rip to
> the next RIP when advancing RIP as part of emulating INT3 injection.
> There is no flag to tell the CPU that KVM isn't using next_rip, and so
> leaving next_rip as-is will result in the CPU pushing garbage
> onto the stack when vectoring the injected event.
> 
> Fixes: 66b7138f9136 ("KVM: SVM: Emulate nRIP feature when reinjecting INT3")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/svm.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 30cef3b10838..6ea8f16e39ac 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -391,6 +391,10 @@ static void svm_queue_exception(struct kvm_vcpu *vcpu)
>  		 */
>  		(void)svm_skip_emulated_instruction(vcpu);
>  		rip = kvm_rip_read(vcpu);
> +
> +		if (boot_cpu_has(X86_FEATURE_NRIPS))
> +			svm->vmcb->control.next_rip = rip;
> +
>  		svm->int3_rip = rip + svm->vmcb->save.cs.base;
>  		svm->int3_injected = rip - old_rip;
>  	}

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-02  1:09 ` [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction Sean Christopherson
@ 2022-04-04 12:12   ` Maxim Levitsky
  2022-04-04 16:49     ` Sean Christopherson
  0 siblings, 1 reply; 45+ messages in thread
From: Maxim Levitsky @ 2022-04-04 12:12 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maciej S . Szmigiero

On Sat, 2022-04-02 at 01:09 +0000, Sean Christopherson wrote:
> Re-inject INT3/INTO instead of retrying the instruction if the CPU
> encountered an intercepted exception while vectoring the software
> exception, e.g. if vectoring INT3 encounters a #PF and KVM is using
> shadow paging.  Retrying the instruction is architecturally wrong, e.g.
> will result in a spurious #DB if there's a code breakpoint on the INT3/O,
> and lack of re-injection also breaks nested virtualization, e.g. if L1
> injects a software exception and vectoring the injected exception
> encounters an exception that is intercepted by L0 but not L1.
> 
> Due to, ahem, deficiencies in the SVM architecture, acquiring the next
> RIP may require flowing through the emulator even if NRIPS is supported,
> as the CPU clears next_rip if the VM-Exit is due to an exception other
> than "exceptions caused by the INT3, INTO, and BOUND instructions".  To
> deal with this, "skip" the instruction to calculate next_rip, and then
> unwind the RIP write and any side effects (RFLAGS updates).
> 
> Reported-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/svm.c | 111 ++++++++++++++++++++++++++++-------------
>  arch/x86/kvm/svm/svm.h |   4 +-
>  2 files changed, 79 insertions(+), 36 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 6ea8f16e39ac..ecc828d6921e 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -341,9 +341,11 @@ static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
>  
>  }
>  
> -static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
> +static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
> +					   bool commit_side_effects)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> +	unsigned long old_rflags;
>  
>  	/*
>  	 * SEV-ES does not expose the next RIP. The RIP update is controlled by
> @@ -358,18 +360,71 @@ static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
>  	}
>  
>  	if (!svm->next_rip) {
> +		if (unlikely(!commit_side_effects))
> +			old_rflags = svm->vmcb->save.rflags;
> +
>  		if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
>  			return 0;
> +
> +		if (unlikely(!commit_side_effects))
> +			svm->vmcb->save.rflags = old_rflags;
>  	} else {
>  		kvm_rip_write(vcpu, svm->next_rip);
>  	}
>  
>  done:
> -	svm_set_interrupt_shadow(vcpu, 0);
> +	if (likely(commit_side_effects))
> +		svm_set_interrupt_shadow(vcpu, 0);
>  
>  	return 1;
>  }
>  
> +static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
> +{
> +	return __svm_skip_emulated_instruction(vcpu, true);
> +}
> +
> +static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long rip, old_rip = kvm_rip_read(vcpu);
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
> +	/*
> +	 * Due to architectural shortcomings, the CPU doesn't always provide
> +	 * NextRIP, e.g. if KVM intercepted an exception that occurred while
> +	 * the CPU was vectoring an INTO/INT3 in the guest.  Temporarily skip
> +	 * the instruction even if NextRIP is supported to acquire the next
> +	 * RIP so that it can be shoved into the NextRIP field, otherwise
> +	 * hardware will fail to advance guest RIP during event injection.
> +	 * Drop the exception/interrupt if emulation fails and effectively
> +	 * retry the instruction, it's the least awful option.  If NRIPS is
> +	 * in use, the skip must not commit any side effects such as clearing
> +	 * the interrupt shadow or RFLAGS.RF.
> +	 */
> +	if (!__svm_skip_emulated_instruction(vcpu, !nrips))
> +		return -EIO;
> +
> +	rip = kvm_rip_read(vcpu);
> +
> +	/*
> +	 * If NextRIP is supported, rewind RIP and update NextRip.  If NextRip
> +	 * isn't supported, keep the result of the skip as the CPU obviously
> +	 * won't advance RIP, but stash away the injection information so that
> +	 * RIP can be unwound if injection fails.
> +	 */
> +	if (nrips) {
> +		kvm_rip_write(vcpu, old_rip);
> +		svm->vmcb->control.next_rip = rip;
> +	} else {
> +		if (boot_cpu_has(X86_FEATURE_NRIPS))
> +			svm->vmcb->control.next_rip = rip;
> +
> +		svm->soft_int_linear_rip = rip + svm->vmcb->save.cs.base;
> +		svm->soft_int_injected = rip - old_rip;
> +	}
> +	return 0;
> +}
> +
>  static void svm_queue_exception(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -379,25 +434,9 @@ static void svm_queue_exception(struct kvm_vcpu *vcpu)
>  
>  	kvm_deliver_exception_payload(vcpu);
>  
> -	if (nr == BP_VECTOR && !nrips) {
> -		unsigned long rip, old_rip = kvm_rip_read(vcpu);
> -
> -		/*
> -		 * For guest debugging where we have to reinject #BP if some
> -		 * INT3 is guest-owned:
> -		 * Emulate nRIP by moving RIP forward. Will fail if injection
> -		 * raises a fault that is not intercepted. Still better than
> -		 * failing in all cases.
> -		 */
> -		(void)svm_skip_emulated_instruction(vcpu);
> -		rip = kvm_rip_read(vcpu);
> -
> -		if (boot_cpu_has(X86_FEATURE_NRIPS))
> -			svm->vmcb->control.next_rip = rip;
> -
> -		svm->int3_rip = rip + svm->vmcb->save.cs.base;
> -		svm->int3_injected = rip - old_rip;
> -	}
> +	if (kvm_exception_is_soft(nr) &&
> +	    svm_update_soft_interrupt_rip(vcpu))
> +		return;
>  
>  	svm->vmcb->control.event_inj = nr
>  		| SVM_EVTINJ_VALID
> @@ -3676,9 +3715,9 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>  	u8 vector;
>  	int type;
>  	u32 exitintinfo = svm->vmcb->control.exit_int_info;
> -	unsigned int3_injected = svm->int3_injected;
> +	unsigned soft_int_injected = svm->soft_int_injected;
>  
> -	svm->int3_injected = 0;
> +	svm->soft_int_injected = 0;
>  
>  	/*
>  	 * If we've made progress since setting HF_IRET_MASK, we've
> @@ -3698,6 +3737,18 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>  	if (!(exitintinfo & SVM_EXITINTINFO_VALID))
>  		return;
>  
> +	/*
> +	 * If NextRIP isn't enabled, KVM must manually advance RIP prior to
> +	 * injecting the soft exception/interrupt.  That advancement needs to
> +	 * be unwound if vectoring didn't complete.  Note, the _new_ event may
> +	 * not be the injected event, e.g. if KVM injected an INTn, the INTn
> +	 * hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
> +	 * be the reported vectored event, but RIP still needs to be unwound.
> +	 */
> +	if (soft_int_injected &&
> +	    kvm_is_linear_rip(vcpu, to_svm(vcpu)->soft_int_linear_rip))
> +		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - soft_int_injected);
> +
>  	kvm_make_request(KVM_REQ_EVENT, vcpu);
>  
>  	vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
> @@ -3711,9 +3762,9 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>  	 * hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
>  	 * be the reported vectored event, but RIP still needs to be unwound.
>  	 */
> -	if (int3_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
> -	   kvm_is_linear_rip(vcpu, svm->int3_rip))
> -		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - int3_injected);
> +	if (soft_int_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
> +	   kvm_is_linear_rip(vcpu, svm->soft_int_linear_rip))
> +		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - soft_int_injected);
>  
>  	switch (type) {
>  	case SVM_EXITINTINFO_TYPE_NMI:
> @@ -3726,14 +3777,6 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>  		if (vector == X86_TRAP_VC)
>  			break;
>  
> -		/*
> -		 * In case of software exceptions, do not reinject the vector,
> -		 * but re-execute the instruction instead. Rewind RIP first
> -		 * if we emulated INT3 before.
> -		 */
> -		if (kvm_exception_is_soft(vector))
> -			break;
> -
>  		if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) {
>  			u32 err = svm->vmcb->control.exit_int_info_err;
>  			kvm_requeue_exception_e(vcpu, vector, err);
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 47e7427d0395..a770a1c7ddd2 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -230,8 +230,8 @@ struct vcpu_svm {
>  	bool nmi_singlestep;
>  	u64 nmi_singlestep_guest_rflags;
>  
> -	unsigned int3_injected;
> -	unsigned long int3_rip;
> +	unsigned soft_int_injected;
> +	unsigned long soft_int_linear_rip;
>  
>  	/* optional nested SVM features that are enabled for this guest  */
>  	bool nrips_enabled                : 1;


I mostly agree with this patch, but think that it doesn't address the original issue that
Maciej wanted to address:

Suppose that there is *no* instruction in L2 code which caused the software exception,
but rather L1 set arbitrary next_rip, and set EVENTINJ to software exception with some vector,
and that injection got interrupted.
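
For concreteness, a minimal sketch of that setup (the field values here are
hypothetical, modeled on the selftest later in this series):

	/*
	 * No INT3/INTO at save.rip: L1 fabricates both the event and the
	 * return address, so there is no instruction for L0 to retry.
	 */
	vmcb12->control.event_inj = BP_VECTOR | SVM_EVTINJ_VALID |
				    SVM_EVTINJ_TYPE_EXEPT;
	vmcb12->control.next_rip  = some_l2_rip;	/* arbitrary, chosen by L1 */
	vmcb12->save.rip          = some_l2_rip;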

I don't think that this code will support this.

I think that svm_complete_interrupts should store next_rip in some field like VMX does
(vcpu->arch.event_exit_inst_len).

That field also should be migrated, or we must prove that it works anyway.
E.g., what happens if we tried to inject an event, the injection was
interrupted by another exception, and then we migrate?

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] KVM: x86: Trace re-injected exceptions
  2022-04-02  1:09 ` [PATCH 7/8] KVM: x86: Trace re-injected exceptions Sean Christopherson
@ 2022-04-04 12:14   ` Maxim Levitsky
  2022-04-04 16:14     ` Sean Christopherson
  0 siblings, 1 reply; 45+ messages in thread
From: Maxim Levitsky @ 2022-04-04 12:14 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maciej S . Szmigiero

On Sat, 2022-04-02 at 01:09 +0000, Sean Christopherson wrote:
> Trace exceptions that are re-injected, not just those that KVM is
> injecting for the first time.  Debugging re-injection bugs is painful
> enough as is, not having visibility into what KVM is doing only makes
> things worse.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/x86.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7a066cf92692..384091600bc2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9382,6 +9382,10 @@ int kvm_check_nested_events(struct kvm_vcpu *vcpu)
>  
>  static void kvm_inject_exception(struct kvm_vcpu *vcpu)
>  {
> +	trace_kvm_inj_exception(vcpu->arch.exception.nr,
> +				vcpu->arch.exception.has_error_code,
> +				vcpu->arch.exception.error_code);
> +

Can we use a {new tracepoint / new parameter for this tracepoint} for this to avoid confusion?

Best regards,
	Maxim Levitsky



>  	if (vcpu->arch.exception.error_code && !is_protmode(vcpu))
>  		vcpu->arch.exception.error_code = false;
>  	static_call(kvm_x86_queue_exception)(vcpu);
> @@ -9439,10 +9443,6 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
>  
>  	/* try to inject new event if pending */
>  	if (vcpu->arch.exception.pending) {
> -		trace_kvm_inj_exception(vcpu->arch.exception.nr,
> -					vcpu->arch.exception.has_error_code,
> -					vcpu->arch.exception.error_code);
> -
>  		vcpu->arch.exception.pending = false;
>  		vcpu->arch.exception.injected = true;
>  



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 8/8] KVM: selftests: nSVM: Add svm_nested_soft_inject_test
  2022-04-02  1:09 ` [PATCH 8/8] KVM: selftests: nSVM: Add svm_nested_soft_inject_test Sean Christopherson
@ 2022-04-04 12:27   ` Maxim Levitsky
  2022-04-04 16:59     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 45+ messages in thread
From: Maxim Levitsky @ 2022-04-04 12:27 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maciej S . Szmigiero

On Sat, 2022-04-02 at 01:09 +0000, Sean Christopherson wrote:
> From: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> 
> Add a KVM self-test that checks whether a nSVM L1 is able to successfully
> inject a software interrupt and a soft exception into its L2 guest.
> 
> In practice, this tests both the next_rip field consistency and
> L1-injected event with intervening L0 VMEXIT during its delivery:
> the first nested VMRUN (that's also trying to inject a software interrupt)
> will immediately trigger a L0 NPF.
> This L0 NPF will have zero in its CPU-returned next_rip field, which if
> incorrectly reused by KVM will trigger a #PF when trying to return to
> such address 0 from the interrupt handler.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  tools/testing/selftests/kvm/.gitignore        |   1 +
>  tools/testing/selftests/kvm/Makefile          |   1 +
>  .../selftests/kvm/include/x86_64/svm_util.h   |   2 +
>  .../kvm/x86_64/svm_nested_soft_inject_test.c  | 147 ++++++++++++++++++
>  4 files changed, 151 insertions(+)
>  create mode 100644 tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
> 
> diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
> index 1f1b6c978bf7..1354d3f4a362 100644
> --- a/tools/testing/selftests/kvm/.gitignore
> +++ b/tools/testing/selftests/kvm/.gitignore
> @@ -33,6 +33,7 @@
>  /x86_64/state_test
>  /x86_64/svm_vmcall_test
>  /x86_64/svm_int_ctl_test
> +/x86_64/svm_nested_soft_inject_test
>  /x86_64/sync_regs_test
>  /x86_64/tsc_msrs_test
>  /x86_64/userspace_io_test
> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> index c9cdbd248727..cef6d583044b 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -66,6 +66,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/state_test
>  TEST_GEN_PROGS_x86_64 += x86_64/vmx_preemption_timer_test
>  TEST_GEN_PROGS_x86_64 += x86_64/svm_vmcall_test
>  TEST_GEN_PROGS_x86_64 += x86_64/svm_int_ctl_test
> +TEST_GEN_PROGS_x86_64 += x86_64/svm_nested_soft_inject_test
>  TEST_GEN_PROGS_x86_64 += x86_64/tsc_scaling_sync
>  TEST_GEN_PROGS_x86_64 += x86_64/sync_regs_test
>  TEST_GEN_PROGS_x86_64 += x86_64/userspace_io_test
> diff --git a/tools/testing/selftests/kvm/include/x86_64/svm_util.h b/tools/testing/selftests/kvm/include/x86_64/svm_util.h
> index a25aabd8f5e7..d49f7c9b4564 100644
> --- a/tools/testing/selftests/kvm/include/x86_64/svm_util.h
> +++ b/tools/testing/selftests/kvm/include/x86_64/svm_util.h
> @@ -16,6 +16,8 @@
>  #define CPUID_SVM_BIT		2
>  #define CPUID_SVM		BIT_ULL(CPUID_SVM_BIT)
>  
> +#define SVM_EXIT_EXCP_BASE	0x040
> +#define SVM_EXIT_HLT		0x078
>  #define SVM_EXIT_MSR		0x07c
>  #define SVM_EXIT_VMMCALL	0x081
>  
> diff --git a/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c b/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
> new file mode 100644
> index 000000000000..d39be5d885c1
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
> @@ -0,0 +1,147 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2022 Oracle and/or its affiliates.
> + *
> + * Based on:
> + *   svm_int_ctl_test
> + *
> + *   Copyright (C) 2021, Red Hat, Inc.
> + *
> + */
> +
> +#include "test_util.h"
> +#include "kvm_util.h"
> +#include "processor.h"
> +#include "svm_util.h"
> +
> +#define VCPU_ID		0
> +#define INT_NR			0x20
> +#define X86_FEATURE_NRIPS	BIT(3)
> +
> +#define vmcall()		\
> +	__asm__ __volatile__(	\
> +		"vmmcall\n"	\
> +		)
> +
> +#define ud2()			\
> +	__asm__ __volatile__(	\
> +		"ud2\n"	\
> +		)
> +
> +#define hlt()			\
> +	__asm__ __volatile__(	\
> +		"hlt\n"	\
> +		)

Minor nitpick: I guess it would be nice to put these in some common header file.


> +
> +static unsigned int bp_fired;
> +static void guest_bp_handler(struct ex_regs *regs)
> +{
> +	bp_fired++;
> +}
> +
> +static unsigned int int_fired;
> +static void guest_int_handler(struct ex_regs *regs)
> +{
> +	int_fired++;
> +}
> +
> +static void l2_guest_code(void)
> +{
> +	GUEST_ASSERT(int_fired == 1);
> +	vmcall();
> +	ud2();
> +
> +	GUEST_ASSERT(bp_fired == 1);
> +	hlt();
> +}


Yes, I was right - the test indeed has neither an INT 0x20 nor an INT 3 in the L2 guest code,
and yet this should work.

> +
> +static void l1_guest_code(struct svm_test_data *svm)
> +{
> +	#define L2_GUEST_STACK_SIZE 64
> +	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
> +	struct vmcb *vmcb = svm->vmcb;
> +
> +	/* Prepare for L2 execution. */
> +	generic_svm_setup(svm, l2_guest_code,
> +			  &l2_guest_stack[L2_GUEST_STACK_SIZE]);
> +
> +	vmcb->control.intercept_exceptions |= BIT(PF_VECTOR) | BIT(UD_VECTOR);
> +	vmcb->control.intercept |= BIT(INTERCEPT_HLT);
> +
> +	vmcb->control.event_inj = INT_NR | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_SOFT;
> +	/* The return address pushed on stack */
> +	vmcb->control.next_rip = vmcb->save.rip;

I would put something even more spicy here just to see that KVM preserves this field,
like say put two ud2 instructions at the start of the guest code, and here have

vmcb->control.next_rip = vmcb->save.rip + 4; // skip over two ud2 instructions.

That way KVM won't be able to skip over a single instruction to get the correct next_rip

> +
> +	run_guest(vmcb, svm->vmcb_gpa);
> +	GUEST_ASSERT_3(vmcb->control.exit_code == SVM_EXIT_VMMCALL,
> +		       vmcb->control.exit_code,
> +		       vmcb->control.exit_info_1, vmcb->control.exit_info_2);
> +
> +	/* Skip over VMCALL */
> +	vmcb->save.rip += 3;
> +
> +	vmcb->control.event_inj = BP_VECTOR | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_EXEPT;
> +	/* The return address pushed on stack, skip over UD2 */
> +	vmcb->control.next_rip = vmcb->save.rip + 2;
> +
> +	run_guest(vmcb, svm->vmcb_gpa);
> +	GUEST_ASSERT_3(vmcb->control.exit_code == SVM_EXIT_HLT,
> +		       vmcb->control.exit_code,
> +		       vmcb->control.exit_info_1, vmcb->control.exit_info_2);
> +
> +	GUEST_DONE();
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	struct kvm_cpuid_entry2 *cpuid;
> +	struct kvm_vm *vm;
> +	vm_vaddr_t svm_gva;
> +	struct kvm_guest_debug debug;
> +
> +	nested_svm_check_supported();
> +
> +	cpuid = kvm_get_supported_cpuid_entry(0x8000000a);
> +	if (!(cpuid->edx & X86_FEATURE_NRIPS)) {
> +		print_skip("nRIP Save unavailable");
> +		exit(KSFT_SKIP);
> +	}
> +
> +	vm = vm_create_default(VCPU_ID, 0, (void *) l1_guest_code);
> +
> +	vm_init_descriptor_tables(vm);
> +	vcpu_init_descriptor_tables(vm, VCPU_ID);
> +
> +	vm_install_exception_handler(vm, BP_VECTOR, guest_bp_handler);
> +	vm_install_exception_handler(vm, INT_NR, guest_int_handler);
> +
> +	vcpu_alloc_svm(vm, &svm_gva);
> +	vcpu_args_set(vm, VCPU_ID, 1, svm_gva);
> +
> +	memset(&debug, 0, sizeof(debug));
> +	vcpu_set_guest_debug(vm, VCPU_ID, &debug);
> +
> +	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
> +	struct ucall uc;
> +
> +	vcpu_run(vm, VCPU_ID);
> +	TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
> +		    "Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
> +		    run->exit_reason,
> +		    exit_reason_str(run->exit_reason));
> +
> +	switch (get_ucall(vm, VCPU_ID, &uc)) {
> +	case UCALL_ABORT:
> +		TEST_FAIL("%s at %s:%ld, vals = 0x%lx 0x%lx 0x%lx", (const char *)uc.args[0],
> +			  __FILE__, uc.args[1], uc.args[2], uc.args[3], uc.args[4]);
> +		break;
> +		/* NOT REACHED */
> +	case UCALL_DONE:
> +		goto done;
> +	default:
> +		TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
> +	}
> +done:
> +	kvm_vm_free(vm);
> +	return 0;
> +}


Other than nitpicks:

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] KVM: x86: Trace re-injected exceptions
  2022-04-04 12:14   ` Maxim Levitsky
@ 2022-04-04 16:14     ` Sean Christopherson
  0 siblings, 0 replies; 45+ messages in thread
From: Sean Christopherson @ 2022-04-04 16:14 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Maciej S . Szmigiero

On Mon, Apr 04, 2022, Maxim Levitsky wrote:
> On Sat, 2022-04-02 at 01:09 +0000, Sean Christopherson wrote:
> > Trace exceptions that are re-injected, not just those that KVM is
> > injecting for the first time.  Debugging re-injection bugs is painful
> > enough as is, not having visibility into what KVM is doing only makes
> > things worse.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/x86.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 7a066cf92692..384091600bc2 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9382,6 +9382,10 @@ int kvm_check_nested_events(struct kvm_vcpu *vcpu)
> >  
> >  static void kvm_inject_exception(struct kvm_vcpu *vcpu)
> >  {
> > +	trace_kvm_inj_exception(vcpu->arch.exception.nr,
> > +				vcpu->arch.exception.has_error_code,
> > +				vcpu->arch.exception.error_code);
> > +
> 
> Can we use a {new tracepoint / new parameter for this tracepoint} for this to
> avoid confusion?

Good idea, a param to capture re-injection would be very helpful.
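
Something like the below is what I have in mind (a sketch only; it'd also require
calling kvm_inject_exception() before flipping pending/injected in
inject_pending_event() so the flag still distinguishes the two paths, and the
tracepoint prototype would grow a matching bool):

	trace_kvm_inj_exception(vcpu->arch.exception.nr,
				vcpu->arch.exception.has_error_code,
				vcpu->arch.exception.error_code,
				vcpu->arch.exception.injected);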

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-04 12:12   ` Maxim Levitsky
@ 2022-04-04 16:49     ` Sean Christopherson
  2022-04-04 16:53       ` Maciej S. Szmigiero
  2022-04-04 20:44       ` Maciej S. Szmigiero
  0 siblings, 2 replies; 45+ messages in thread
From: Sean Christopherson @ 2022-04-04 16:49 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Maciej S . Szmigiero

On Mon, Apr 04, 2022, Maxim Levitsky wrote:
> On Sat, 2022-04-02 at 01:09 +0000, Sean Christopherson wrote:
> > Re-inject INT3/INTO instead of retrying the instruction if the CPU
> > encountered an intercepted exception while vectoring the software
> > exception, e.g. if vectoring INT3 encounters a #PF and KVM is using
> > shadow paging.  Retrying the instruction is architecturally wrong, e.g.
> > will result in a spurious #DB if there's a code breakpoint on the INT3/O,
> > and lack of re-injection also breaks nested virtualization, e.g. if L1
> > injects a software exception and vectoring the injected exception
> > encounters an exception that is intercepted by L0 but not L1.
> > 
> > Due to, ahem, deficiencies in the SVM architecture, acquiring the next
> > RIP may require flowing through the emulator even if NRIPS is supported,
> > as the CPU clears next_rip if the VM-Exit is due to an exception other
> > than "exceptions caused by the INT3, INTO, and BOUND instructions".  To
> > deal with this, "skip" the instruction to calculate next_rip, and then
> > unwind the RIP write and any side effects (RFLAGS updates).

...

> > @@ -3698,6 +3737,18 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
> >  	if (!(exitintinfo & SVM_EXITINTINFO_VALID))
> >  		return;
> >  
> > +	/*
> > +	 * If NextRIP isn't enabled, KVM must manually advance RIP prior to
> > +	 * injecting the soft exception/interrupt.  That advancement needs to
> > +	 * be unwound if vectoring didn't complete.  Note, the _new_ event may
> > +	 * not be the injected event, e.g. if KVM injected an INTn, the INTn
> > +	 * hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
> > +	 * be the reported vectored event, but RIP still needs to be unwound.
> > +	 */
> > +	if (soft_int_injected &&
> > +	    kvm_is_linear_rip(vcpu, to_svm(vcpu)->soft_int_linear_rip))
> > +		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - soft_int_injected);

Doh, I botched my last minute rebase.  This is duplicate code and needs to be
dropped.

> > +
> >  	kvm_make_request(KVM_REQ_EVENT, vcpu);
> >  
> >  	vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
> > @@ -3711,9 +3762,9 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
> >  	 * hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
> >  	 * be the reported vectored event, but RIP still needs to be unwound.
> >  	 */
> > -	if (int3_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
> > -	   kvm_is_linear_rip(vcpu, svm->int3_rip))
> > -		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - int3_injected);
> > +	if (soft_int_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
> > +	   kvm_is_linear_rip(vcpu, svm->soft_int_linear_rip))
> > +		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - soft_int_injected);
> >  
> >  	switch (type) {
> >  	case SVM_EXITINTINFO_TYPE_NMI:
> > @@ -3726,14 +3777,6 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
> >  		if (vector == X86_TRAP_VC)
> >  			break;
> >  
> > -		/*
> > -		 * In case of software exceptions, do not reinject the vector,
> > -		 * but re-execute the instruction instead. Rewind RIP first
> > -		 * if we emulated INT3 before.
> > -		 */
> > -		if (kvm_exception_is_soft(vector))
> > -			break;
> > -
> >  		if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) {
> >  			u32 err = svm->vmcb->control.exit_int_info_err;
> >  			kvm_requeue_exception_e(vcpu, vector, err);
> > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > index 47e7427d0395..a770a1c7ddd2 100644
> > --- a/arch/x86/kvm/svm/svm.h
> > +++ b/arch/x86/kvm/svm/svm.h
> > @@ -230,8 +230,8 @@ struct vcpu_svm {
> >  	bool nmi_singlestep;
> >  	u64 nmi_singlestep_guest_rflags;
> >  
> > -	unsigned int3_injected;
> > -	unsigned long int3_rip;
> > +	unsigned soft_int_injected;
> > +	unsigned long soft_int_linear_rip;
> >  
> >  	/* optional nested SVM features that are enabled for this guest  */
> >  	bool nrips_enabled                : 1;
> 
> 
> I mostly agree with this patch, but think that it doesn't address the
> original issue that Maciej wanted to address:
> 
> Suppose that there is *no* instruction in L2 code which caused the software
> exception, but rather L1 set arbitrary next_rip, and set EVENTINJ to software
> exception with some vector, and that injection got interrupted.
> 
> I don't think that this code will support this.

Argh, you're right.  Maciej's selftest injects without an instruction, but it doesn't
configure the scenario where that injection fails due to an exception+VM-Exit that
isn't intercepted by L1 and is handled by L0.  The event_inj test gets the coverage
for the latter, but always has a backing instruction. 

> I think that svm_complete_interrupts should store next_rip in some field
> like VMX does (vcpu->arch.event_exit_inst_len).

Yeah.  The ugly part is that because next_rip is guaranteed to be cleared on exit
(the exit is gauranteed to be due to a fault-like exception), KVM has to snapshot
next_rip during the "original" injection and use the linear_rip matching heuristic
to detect this scenario.

> That field also should be migrated, or we must prove that it works anyway.
> E.g., what happens if we tried to inject an event, the injection was
> interrupted by another exception, and then we migrate?

Ya, should Just Work if control.next_rip is used to cache the next rip.

Handling this doesn't seem to be too awful (haven't tested yet); it's largely the
same logic as the existing !nrips code.

In svm_update_soft_interrupt_rip(), snapshot all information regardless of whether
or not nrips is enabled:

	svm->soft_int_injected = true;
	svm->soft_int_csbase = svm->vmcb->save.cs.base;
	svm->soft_int_old_rip = old_rip;
	svm->soft_int_next_rip = rip;

	if (nrips)
		kvm_rip_write(vcpu, old_rip);

	if (static_cpu_has(X86_FEATURE_NRIPS))
		svm->vmcb->control.next_rip = rip;

and then in svm_complete_interrupts(), change the linear RIP matching code to look
for the old rip in the nrips case and stuff svm->vmcb->control.next_rip on match.

	bool soft_int_injected = svm->soft_int_injected;
	unsigned long soft_int_rip;

	svm->soft_int_injected = false;

	if (soft_int_injected) {
		if (nrips)
			soft_int_rip = svm->soft_int_old_rip;
		else
			soft_int_rip = svm->soft_int_next_rip;
	}

	...

	if (soft_int_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
	   kvm_is_linear_rip(vcpu, soft_int_rip + svm->soft_int_csbase)) {
		if (nrips)
			svm->vmcb->control.next_rip = svm->soft_int_next_rip;
		else
			kvm_rip_write(vcpu, svm->soft_int_old_rip);
	}




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-04-02  1:08 ` [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 Sean Christopherson
  2022-04-04  9:54   ` Maxim Levitsky
@ 2022-04-04 16:50   ` Maciej S. Szmigiero
  2022-04-04 17:21     ` Sean Christopherson
  1 sibling, 1 reply; 45+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-04 16:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

On 2.04.2022 03:08, Sean Christopherson wrote:
> From: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> 
> The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
> This field value (instead of the saved guest RIP) is used by the CPU for
> the return address pushed on stack when injecting a software interrupt or
> INT3 or INTO exception.
> 
> Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
> loading a nested state and NRIPS is exposed to L1.  If NRIPS is supported
> in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff
> vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which
> saves RIP on the stack as-is).
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/svm/nested.c | 22 +++++++++++++++++++---
>   arch/x86/kvm/svm/svm.h    |  1 +
>   2 files changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 73b545278f5f..9a6dc2b38fcf 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -369,6 +369,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
>   	to->nested_ctl          = from->nested_ctl;
>   	to->event_inj           = from->event_inj;
>   	to->event_inj_err       = from->event_inj_err;
> +	to->next_rip            = from->next_rip;
>   	to->nested_cr3          = from->nested_cr3;
>   	to->virt_ext            = from->virt_ext;
>   	to->pause_filter_count  = from->pause_filter_count;
> @@ -606,7 +607,8 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12
>   	}
>   }
>   
> -static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
> +static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
> +					  unsigned long vmcb12_rip)
>   {
>   	u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK;
>   	u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK;
> @@ -660,6 +662,19 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
>   	vmcb02->control.event_inj           = svm->nested.ctl.event_inj;
>   	vmcb02->control.event_inj_err       = svm->nested.ctl.event_inj_err;
>   
> +	/*
> +	 * next_rip is consumed on VMRUN as the return address pushed on the
> +	 * stack for injected soft exceptions/interrupts.  If nrips is exposed
> +	 * to L1, take it verbatim from vmcb12.  If nrips is supported in
> +	 * hardware but not exposed to L1, stuff the actual L2 RIP to emulate
> +	 * what a nrips=0 CPU would do (L1 is responsible for advancing RIP
> +	 * prior to injecting the event).
> +	 */
> +	if (svm->nrips_enabled)
> +		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
> +	else if (boot_cpu_has(X86_FEATURE_NRIPS))
> +		vmcb02->control.next_rip    = vmcb12_rip;
> +
>   	vmcb02->control.virt_ext            = vmcb01->control.virt_ext &
>   					      LBR_CTL_ENABLE_MASK;
>   	if (svm->lbrv_enabled)
> @@ -743,7 +758,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
>   	nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
>   
>   	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> -	nested_vmcb02_prepare_control(svm);
> +	nested_vmcb02_prepare_control(svm, vmcb12->save.rip);
>   	nested_vmcb02_prepare_save(svm, vmcb12);
>   
>   	ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
> @@ -1422,6 +1437,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
>   	dst->nested_ctl           = from->nested_ctl;
>   	dst->event_inj            = from->event_inj;
>   	dst->event_inj_err        = from->event_inj_err;
> +	dst->next_rip             = from->next_rip;
>   	dst->nested_cr3           = from->nested_cr3;
>   	dst->virt_ext              = from->virt_ext;
>   	dst->pause_filter_count   = from->pause_filter_count;
> @@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>   	nested_copy_vmcb_control_to_cache(svm, ctl);
>   
>   	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> -	nested_vmcb02_prepare_control(svm);
> +	nested_vmcb02_prepare_control(svm, save->rip);
>   

					   ^
I guess this should be "svm->vmcb->save.rip", since
KVM_{GET,SET}_NESTED_STATE "save" field contains vmcb01 data,
not vmcb{0,1}2 (in contrast to the "control" field).
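
I.e., presumably (sketch):

	nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip);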


Thanks,
Maciej

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-04 16:49     ` Sean Christopherson
@ 2022-04-04 16:53       ` Maciej S. Szmigiero
  2022-04-04 19:33         ` Sean Christopherson
  2022-04-04 20:44       ` Maciej S. Szmigiero
  1 sibling, 1 reply; 45+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-04 16:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

>>> index 47e7427d0395..a770a1c7ddd2 100644
>>> --- a/arch/x86/kvm/svm/svm.h
>>> +++ b/arch/x86/kvm/svm/svm.h
>>> @@ -230,8 +230,8 @@ struct vcpu_svm {
>>>   	bool nmi_singlestep;
>>>   	u64 nmi_singlestep_guest_rflags;
>>>   
>>> -	unsigned int3_injected;
>>> -	unsigned long int3_rip;
>>> +	unsigned soft_int_injected;
>>> +	unsigned long soft_int_linear_rip;
>>>   
>>>   	/* optional nested SVM features that are enabled for this guest  */
>>>   	bool nrips_enabled                : 1;
>>
>>
>> I mostly agree with this patch, but think that it doesn't address the
>> original issue that Maciej wanted to address:
>>
>> Suppose that there is *no* instruction in L2 code which caused the software
>> exception, but rather L1 set arbitrary next_rip, and set EVENTINJ to software
>> exception with some vector, and that injection got interrupted.
>>
>> I don't think that this code will support this.
> 
> Argh, you're right.  Maciej's selftest injects without an instruction, but it doesn't
> configure the scenario where that injection fails due to an exception+VM-Exit that
> isn't intercepted by L1 and is handled by L0.  The event_inj test gets the coverage
> for the latter, but always has a backing instruction.

Still reviewing the whole patch set, but want to clear this point quickly:
The selftest does have an implicit intervening NPF (handled by L0) while
injecting the first L1 -> L2 event.

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 8/8] KVM: selftests: nSVM: Add svm_nested_soft_inject_test
  2022-04-04 12:27   ` Maxim Levitsky
@ 2022-04-04 16:59     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 45+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-04 16:59 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On 4.04.2022 14:27, Maxim Levitsky wrote:
> On Sat, 2022-04-02 at 01:09 +0000, Sean Christopherson wrote:
>> From: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>>
>> Add a KVM self-test that checks whether a nSVM L1 is able to successfully
>> inject a software interrupt and a soft exception into its L2 guest.
>>
>> In practice, this tests both the next_rip field consistency and
>> L1-injected event with intervening L0 VMEXIT during its delivery:
>> the first nested VMRUN (that's also trying to inject a software interrupt)
>> will immediately trigger a L0 NPF.
>> This L0 NPF will have zero in its CPU-returned next_rip field, which if
>> incorrectly reused by KVM will trigger a #PF when trying to return to
>> such address 0 from the interrupt handler.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> ---
>>   tools/testing/selftests/kvm/.gitignore        |   1 +
>>   tools/testing/selftests/kvm/Makefile          |   1 +
>>   .../selftests/kvm/include/x86_64/svm_util.h   |   2 +
>>   .../kvm/x86_64/svm_nested_soft_inject_test.c  | 147 ++++++++++++++++++
>>   4 files changed, 151 insertions(+)
>>   create mode 100644 tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
>>
(..)
>> diff --git a/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c b/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
>> new file mode 100644
>> index 000000000000..d39be5d885c1
>> --- /dev/null
>> +++ b/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
>> @@ -0,0 +1,147 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2022 Oracle and/or its affiliates.
>> + *
>> + * Based on:
>> + *   svm_int_ctl_test
>> + *
>> + *   Copyright (C) 2021, Red Hat, Inc.
>> + *
>> + */
>> +
>> +#include "test_util.h"
>> +#include "kvm_util.h"
>> +#include "processor.h"
>> +#include "svm_util.h"
>> +
>> +#define VCPU_ID		0
>> +#define INT_NR			0x20
>> +#define X86_FEATURE_NRIPS	BIT(3)
>> +
>> +#define vmcall()		\
>> +	__asm__ __volatile__(	\
>> +		"vmmcall\n"	\
>> +		)
>> +
>> +#define ud2()			\
>> +	__asm__ __volatile__(	\
>> +		"ud2\n"	\
>> +		)
>> +
>> +#define hlt()			\
>> +	__asm__ __volatile__(	\
>> +		"hlt\n"	\
>> +		)
> 
> Minor nitpick: I guess it would be nice to put these in some common header file.

Will move these to include/x86_64/processor.h (that's probably the
best-matching header file).

>> +
>> +static void l1_guest_code(struct svm_test_data *svm)
>> +{
>> +	#define L2_GUEST_STACK_SIZE 64
>> +	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
>> +	struct vmcb *vmcb = svm->vmcb;
>> +
>> +	/* Prepare for L2 execution. */
>> +	generic_svm_setup(svm, l2_guest_code,
>> +			  &l2_guest_stack[L2_GUEST_STACK_SIZE]);
>> +
>> +	vmcb->control.intercept_exceptions |= BIT(PF_VECTOR) | BIT(UD_VECTOR);
>> +	vmcb->control.intercept |= BIT(INTERCEPT_HLT);
>> +
>> +	vmcb->control.event_inj = INT_NR | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_SOFT;
>> +	/* The return address pushed on stack */
>> +	vmcb->control.next_rip = vmcb->save.rip;
> 
> I would put something even more spicy here just to see that KVM preserves this field,
> like say put two ud2 instructions at the start of the guest code, and here have
> 
> vmcb->control.next_rip = vmcb->save.rip + 4; // skip over two ud2 instructions.
> 
> That way KVM won't be able to skip over a single instruction to get the correct next_rip

Good point, will add these two additional ud2s at the start of L2 guest code.

> 
> 
> Other than nitpicks:
> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Best regards,
> 	Maxim Levitsky
> 
> 

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 6/8] KVM: SVM: Re-inject INTn instead of retrying the insn on "failure"
  2022-04-02  1:09 ` [PATCH 6/8] KVM: SVM: Re-inject INTn instead of retrying the insn on "failure" Sean Christopherson
@ 2022-04-04 17:14   ` Sean Christopherson
  2022-04-04 20:27   ` Maciej S. Szmigiero
  1 sibling, 0 replies; 45+ messages in thread
From: Sean Christopherson @ 2022-04-04 17:14 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Maciej S . Szmigiero

On Sat, Apr 02, 2022, Sean Christopherson wrote:
> Re-inject INTn software interrupts instead of retrying the instruction if
> the CPU encountered an intercepted exception while vectoring the INTn,
> e.g. if KVM intercepted a #PF when utilizing shadow paging.  Retrying the
> instruction is architecturally wrong e.g. will result in a spurious #DB
> if there's a code breakpoint on the INT3/O, and lack of re-injection also
> breaks nested virtualization, e.g. if L1 injects a software interrupt and
> vectoring the injected interrupt encounters an exception that is
> intercepted by L0 but not L1.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/svm.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index ecc828d6921e..00b1399681d1 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3425,14 +3425,24 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
>  static void svm_inject_irq(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> +	u32 type;
>  
>  	WARN_ON(!gif_set(svm));
>  
> +	if (vcpu->arch.interrupt.soft) {
> +		if (svm_update_soft_interrupt_rip(vcpu))
> +			return;
> +
> +		type = SVM_EVTINJ_TYPE_SOFT;
> +	} else {
> +		type = SVM_EVTINJ_TYPE_INTR;
> +	}
> +
>  	trace_kvm_inj_virq(vcpu->arch.interrupt.nr);
>  	++vcpu->stat.irq_injections;
>  
>  	svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
> -		SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
> +				       SVM_EVTINJ_VALID | type;
>  }
>  
>  void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode,
> @@ -3787,9 +3797,13 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>  	case SVM_EXITINTINFO_TYPE_INTR:
>  		kvm_queue_interrupt(vcpu, vector, false);
>  		break;
> +	case SVM_EXITINTINFO_TYPE_SOFT:
> +		kvm_queue_interrupt(vcpu, vector, true);

I believe this patch is wrong; it also needs to tweak the soft_int_injected logic
to look for SVM_EXITINTINFO_TYPE_EXEPT _or_ SVM_EXITINTINFO_TYPE_SOFT, e.g. if the
SOFT-type interrupt doesn't complete due to a non-legacy-exception exit, e.g. #NPF.
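
I.e., something like this (sketch):

	if (soft_int_injected &&
	    (type == SVM_EXITINTINFO_TYPE_EXEPT ||
	     type == SVM_EXITINTINFO_TYPE_SOFT) &&
	    kvm_is_linear_rip(vcpu, svm->soft_int_linear_rip))
		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - soft_int_injected);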

> +		break;
>  	default:
>  		break;
>  	}
> +
>  }

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-04-04 16:50   ` Maciej S. Szmigiero
@ 2022-04-04 17:21     ` Sean Christopherson
  2022-04-04 17:45       ` Maciej S. Szmigiero
  2022-04-20 15:00       ` Paolo Bonzini
  0 siblings, 2 replies; 45+ messages in thread
From: Sean Christopherson @ 2022-04-04 17:21 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
> > @@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> >   	nested_copy_vmcb_control_to_cache(svm, ctl);
> >   	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> > -	nested_vmcb02_prepare_control(svm);
> > +	nested_vmcb02_prepare_control(svm, save->rip);
> 
> 					   ^
> I guess this should be "svm->vmcb->save.rip", since
> KVM_{GET,SET}_NESTED_STATE "save" field contains vmcb01 data,
> not vmcb{0,1}2 (in contrast to the "control" field).

Argh, yes.  Is userspace required to set L2 guest state prior to KVM_SET_NESTED_STATE?
If not, this will result in garbage being loaded into vmcb02.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-04-04 17:21     ` Sean Christopherson
@ 2022-04-04 17:45       ` Maciej S. Szmigiero
  2022-04-20 15:00       ` Paolo Bonzini
  1 sibling, 0 replies; 45+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-04 17:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

On 4.04.2022 19:21, Sean Christopherson wrote:
> On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
>>> @@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>>>    	nested_copy_vmcb_control_to_cache(svm, ctl);
>>>    	svm_switch_vmcb(svm, &svm->nested.vmcb02);
>>> -	nested_vmcb02_prepare_control(svm);
>>> +	nested_vmcb02_prepare_control(svm, save->rip);
>>
>> 					   ^
>> I guess this should be "svm->vmcb->save.rip", since
>> KVM_{GET,SET}_NESTED_STATE "save" field contains vmcb01 data,
>> not vmcb{0,1}2 (in contrast to the "control" field).
> 
> Argh, yes.  Is userspace required to set L2 guest state prior to KVM_SET_NESTED_STATE?
> If not, this will result in garbage being loaded into vmcb02.

I'm not sure about the exact KVM API guarantees,
but looking at the code I guess it is supposed to handle both cases:
1) VMM loads the usual basic KVM state via KVM_SET_{S,}REGS then immediately
issues KVM_SET_NESTED_STATE to load the remaining nested data.

Assuming that it was L2 that was running at save time,
the basic L2 state will first be loaded into vmcb01,
then at KVM_SET_NESTED_STATE time:
> if (is_guest_mode(vcpu))
>     svm_leave_nested(vcpu);
> else
>     svm->nested.vmcb02.ptr->save = svm->vmcb01.ptr->save;

The !is_guest_mode(vcpu) branch will be taken (since the new VM hasn't entered
guest mode yet), which will copy the basic L2 state from vmcb01 to vmcb02;
the remaining code will then restore the vmcb01 save area and vmcb{0,1}2
control normally and enter guest mode.

2) VMM first issues KVM_SET_NESTED_STATE then immediately loads the basic state.

Same as the above, only some initial VM state will be copied into vmcb02 from vmcb01
by the code mentioned above; then vmcb01 save and vmcb{0,1}2 control will be restored
and guest mode will be entered.
If the VMM then immediately issues KVM_SET_{S,}REGS then it will restore L2 basic state
straight into vmcb02.

However, this is all guesswork from just looking at the relevant code;
I haven't run any tests to make sure that I haven't missed something.

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-04 16:53       ` Maciej S. Szmigiero
@ 2022-04-04 19:33         ` Sean Christopherson
  2022-04-04 19:50           ` Maciej S. Szmigiero
  2022-04-04 19:54           ` Sean Christopherson
  0 siblings, 2 replies; 45+ messages in thread
From: Sean Christopherson @ 2022-04-04 19:33 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
> > > > index 47e7427d0395..a770a1c7ddd2 100644
> > > > --- a/arch/x86/kvm/svm/svm.h
> > > > +++ b/arch/x86/kvm/svm/svm.h
> > > > @@ -230,8 +230,8 @@ struct vcpu_svm {
> > > >   	bool nmi_singlestep;
> > > >   	u64 nmi_singlestep_guest_rflags;
> > > > -	unsigned int3_injected;
> > > > -	unsigned long int3_rip;
> > > > +	unsigned soft_int_injected;
> > > > +	unsigned long soft_int_linear_rip;
> > > >   	/* optional nested SVM features that are enabled for this guest  */
> > > >   	bool nrips_enabled                : 1;
> > > 
> > > 
> > > I mostly agree with this patch, but think that it doesn't address the
> > > original issue that Maciej wanted to address:
> > > 
> > > Suppose that there is *no* instruction in L2 code which caused the software
> > > exception, but rather L1 set arbitrary next_rip, and set EVENTINJ to software
> > > exception with some vector, and that injection got interrupted.
> > > 
> > > I don't think that this code will support this.
> > 
> > Argh, you're right.  Maciej's selftest injects without an instruction, but it doesn't
> > configure the scenario where that injection fails due to an exception+VM-Exit that
> > isn't intercepted by L1 and is handled by L0.  The event_inj test gets the coverage
> > for the latter, but always has a backing instruction.
> 
> Still reviewing the whole patch set, but want to clear this point quickly:
> The selftest does have an implicit intervening NPF (handled by L0) while
> injecting the first L1 -> L2 event.

I'll do some debug to figure out why the test passes for me.  I'm guessing I either
got lucky, e.g. IDT was faulted in already, or I screwed up and the test doesn't
actually pass.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-04 19:33         ` Sean Christopherson
@ 2022-04-04 19:50           ` Maciej S. Szmigiero
  2022-04-04 19:54           ` Sean Christopherson
  1 sibling, 0 replies; 45+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-04 19:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On 4.04.2022 21:33, Sean Christopherson wrote:
> On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
>>>>> index 47e7427d0395..a770a1c7ddd2 100644
>>>>> --- a/arch/x86/kvm/svm/svm.h
>>>>> +++ b/arch/x86/kvm/svm/svm.h
>>>>> @@ -230,8 +230,8 @@ struct vcpu_svm {
>>>>>    	bool nmi_singlestep;
>>>>>    	u64 nmi_singlestep_guest_rflags;
>>>>> -	unsigned int3_injected;
>>>>> -	unsigned long int3_rip;
>>>>> +	unsigned soft_int_injected;
>>>>> +	unsigned long soft_int_linear_rip;
>>>>>    	/* optional nested SVM features that are enabled for this guest  */
>>>>>    	bool nrips_enabled                : 1;
>>>>
>>>>
>>>> I mostly agree with this patch, but think that it doesn't address the
>>>> original issue that Maciej wanted to address:
>>>>
>>>> Suppose that there is *no* instruction in L2 code which caused the software
>>>> exception, but rather L1 set arbitrary next_rip, and set EVENTINJ to software
>>>> exception with some vector, and that injection got interrupted.
>>>>
>>>> I don't think that this code will support this.
>>>
>>> Argh, you're right.  Maciej's selftest injects without an instruction, but it doesn't
>>> configure the scenario where that injection fails due to an exception+VM-Exit that
>>> isn't intercepted by L1 and is handled by L0.  The event_inj test gets the coverage
>>> for the latter, but always has a backing instruction.
>>
>> Still reviewing the whole patch set, but want to clear this point quickly:
>> The selftest does have an implicit intervening NPF (handled by L0) while
>> injecting the first L1 -> L2 event.
> 
> I'll do some debug to figure out why the test passes for me.  I'm guessing I either
> got lucky, e.g. IDT was faulted in already, or I screwed up and the test doesn't
> actually pass.

Try applying Maxim's proposed change to this test (adding two additional ud2s at the
start of L2 guest code and adjusting just the next_rip field accordingly).

We might have been lucky in the patched KVM code and have next_rip = rip in this
re-injection case now (the next_rip field was zero in the original KVM code).
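
Roughly this (a sketch of Maxim's idea, untested):

	static void l2_guest_code(void)
	{
		/* Never executed; the injected INT 0x20 must return past them. */
		ud2();
		ud2();
		...
	}

	/* In l1_guest_code(), skip over both 2-byte ud2s via next_rip: */
	vmcb->control.next_rip = vmcb->save.rip + 4;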

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-04 19:33         ` Sean Christopherson
  2022-04-04 19:50           ` Maciej S. Szmigiero
@ 2022-04-04 19:54           ` Sean Christopherson
  2022-04-04 20:46             ` Maciej S. Szmigiero
  1 sibling, 1 reply; 45+ messages in thread
From: Sean Christopherson @ 2022-04-04 19:54 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On Mon, Apr 04, 2022, Sean Christopherson wrote:
> On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
> > > > > index 47e7427d0395..a770a1c7ddd2 100644
> > > > > --- a/arch/x86/kvm/svm/svm.h
> > > > > +++ b/arch/x86/kvm/svm/svm.h
> > > > > @@ -230,8 +230,8 @@ struct vcpu_svm {
> > > > >   	bool nmi_singlestep;
> > > > >   	u64 nmi_singlestep_guest_rflags;
> > > > > -	unsigned int3_injected;
> > > > > -	unsigned long int3_rip;
> > > > > +	unsigned soft_int_injected;
> > > > > +	unsigned long soft_int_linear_rip;
> > > > >   	/* optional nested SVM features that are enabled for this guest  */
> > > > >   	bool nrips_enabled                : 1;
> > > > 
> > > > 
> > > > I mostly agree with this patch, but think that it doesn't address the
> > > > original issue that Maciej wanted to address:
> > > > 
> > > > Suppose that there is *no* instruction in L2 code which caused the software
> > > > exception, but rather L1 set arbitrary next_rip, and set EVENTINJ to software
> > > > exception with some vector, and that injection got interrupted.
> > > > 
> > > > I don't think that this code will support this.
> > > 
> > > Argh, you're right.  Maciej's selftest injects without an instruction, but it doesn't
> > > configure the scenario where that injection fails due to an exception+VM-Exit that
> > > isn't intercepted by L1 and is handled by L0.  The event_inj test gets the coverage
> > > for the latter, but always has a backing instruction.
> > 
> > Still reviewing the whole patch set, but want to clear this point quickly:
> > The selftest does have an implicit intervening NPF (handled by L0) while
> > injecting the first L1 -> L2 event.
> 
> I'll do some debug to figure out why the test passes for me.  I'm guessing I either
> got lucky, e.g. IDT was faulted in already, or I screwed up and the test doesn't
> actually pass.

Well that was easy.  My code is indeed flawed and skips the wrong instruction;
the skipped instruction just so happens to be a (spurious?) adjustment of RSP.  The
L2 guest function never runs to completion and so the "bad" RSP is never consumed.
 
   KVM: incomplete injection for L2, vector 32 @ 401c70.  next_rip = 0
   KVM: injecting for L2, vector 0 @ 401c70.  next_rip = 401c74

0000000000401c70 <l2_guest_code>:
  401c70:       48 83 ec 08             sub    $0x8,%rsp
  401c74:       83 3d 75 a7 0e 00 01    cmpl   $0x1,0xea775(%rip)        # 4ec3f0 <int_fired>
  401c7b:       74 1e                   je     401c9b <l2_guest_code+0x2b>
  401c7d:       45 31 c0                xor    %r8d,%r8d
  401c80:       b9 32 00 00 00          mov    $0x32,%ecx
  401c85:       ba 90 40 4b 00          mov    $0x4b4090,%edx
  401c8a:       31 c0                   xor    %eax,%eax
  401c8c:       be 02 00 00 00          mov    $0x2,%esi
  401c91:       bf 02 00 00 00          mov    $0x2,%edi
  401c96:       e8 05 ae 00 00          call   40caa0 <ucall>
  401c9b:       0f 01 d9                vmmcall 
  401c9e:       0f 0b                   ud2    
  401ca0:       83 3d 4d a7 0e 00 01    cmpl   $0x1,0xea74d(%rip)        # 4ec3f4 <bp_fired>
  401ca7:       74 1e                   je     401cc7 <l2_guest_code+0x57>
  401ca9:       45 31 c0                xor    %r8d,%r8d
  401cac:       b9 36 00 00 00          mov    $0x36,%ecx
  401cb1:       ba b8 40 4b 00          mov    $0x4b40b8,%edx
  401cb6:       31 c0                   xor    %eax,%eax
  401cb8:       be 02 00 00 00          mov    $0x2,%esi
  401cbd:       bf 02 00 00 00          mov    $0x2,%edi
  401cc2:       e8 d9 ad 00 00          call   40caa0 <ucall>
  401cc7:       f4                      hlt    
  401cc8:       48 83 c4 08             add    $0x8,%rsp
  401ccc:       c3                      ret    
  401ccd:       0f 1f 00                nopl   (%rax)

I don't see why the compiler is creating room for a single variable, but it doesn't
really matter; the easiest way to detect this bug is to assert that the return RIP
in the INT 0x20 handler points at l2_guest_code, e.g. this fails:

diff --git a/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c b/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
index d39be5d885c1..257aa2280b5c 100644
--- a/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
+++ b/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
@@ -40,9 +40,13 @@ static void guest_bp_handler(struct ex_regs *regs)
 }

 static unsigned int int_fired;
+static void l2_guest_code(void);
+
 static void guest_int_handler(struct ex_regs *regs)
 {
        int_fired++;
+       GUEST_ASSERT_2(regs->rip == (unsigned long)l2_guest_code,
+                      regs->rip, (unsigned long)l2_guest_code);
 }

 static void l2_guest_code(void)

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 6/8] KVM: SVM: Re-inject INTn instead of retrying the insn on "failure"
  2022-04-02  1:09 ` [PATCH 6/8] KVM: SVM: Re-inject INTn instead of retrying the insn on "failure" Sean Christopherson
  2022-04-04 17:14   ` Sean Christopherson
@ 2022-04-04 20:27   ` Maciej S. Szmigiero
  1 sibling, 0 replies; 45+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-04 20:27 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel

On 2.04.2022 03:09, Sean Christopherson wrote:
> Re-inject INTn software interrupts instead of retrying the instruction if
> the CPU encountered an intercepted exception while vectoring the INTn,
> e.g. if KVM intercepted a #PF when utilizing shadow paging.  Retrying the
> instruction is architecturally wrong e.g. will result in a spurious #DB
> if there's a code breakpoint on the INT3/O, and lack of re-injection also
> breaks nested virtualization, e.g. if L1 injects a software interrupt and
> vectoring the injected interrupt encounters an exception that is
> intercepted by L0 but not L1.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/svm/svm.c | 16 +++++++++++++++-
>   1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index ecc828d6921e..00b1399681d1 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3425,14 +3425,24 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
>   static void svm_inject_irq(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_svm *svm = to_svm(vcpu);
> +	u32 type;
>   
>   	WARN_ON(!gif_set(svm));
>   
> +	if (vcpu->arch.interrupt.soft) {

It should be possible to inject soft interrupts even with GIF masked;
look at the relevant code in patch 3 of my series [1].
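
A minimal sketch of what I mean (untested; keeps the WARN for hard
interrupts only):

	/* GIF only gates hardware interrupt injection, not soft INTn. */
	WARN_ON(!vcpu->arch.interrupt.soft && !gif_set(svm));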

Thanks,
Maciej

[1]: https://lore.kernel.org/kvm/a28577564a7583c32f0029f2307f63ca8869cf22.1646944472.git.maciej.szmigiero@oracle.com/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-04 16:49     ` Sean Christopherson
  2022-04-04 16:53       ` Maciej S. Szmigiero
@ 2022-04-04 20:44       ` Maciej S. Szmigiero
  2022-04-06  1:48         ` Sean Christopherson
  1 sibling, 1 reply; 45+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-04 20:44 UTC (permalink / raw)
  To: Sean Christopherson, Maxim Levitsky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

On 4.04.2022 18:49, Sean Christopherson wrote:
> On Mon, Apr 04, 2022, Maxim Levitsky wrote:
>> On Sat, 2022-04-02 at 01:09 +0000, Sean Christopherson wrote:
>>> Re-inject INT3/INTO instead of retrying the instruction if the CPU
>>> encountered an intercepted exception while vectoring the software
>>> exception, e.g. if vectoring INT3 encounters a #PF and KVM is using
>>> shadow paging.  Retrying the instruction is architecturally wrong, e.g.
>>> will result in a spurious #DB if there's a code breakpoint on the INT3/O,
>>> and lack of re-injection also breaks nested virtualization, e.g. if L1
>>> injects a software exception and vectoring the injected exception
>>> encounters an exception that is intercepted by L0 but not L1.
>>>
>>> Due to, ahem, deficiencies in the SVM architecture, acquiring the next
>>> RIP may require flowing through the emulator even if NRIPS is supported,
>>> as the CPU clears next_rip if the VM-Exit is due to an exception other
>>> than "exceptions caused by the INT3, INTO, and BOUND instructions".  To
>>> deal with this, "skip" the instruction to calculate next_rip, and then
>>> unwind the RIP write and any side effects (RFLAGS updates).
> 
> ...
> 
(..)
>>> +
>>>   	kvm_make_request(KVM_REQ_EVENT, vcpu);
>>>   
>>>   	vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
>>> @@ -3711,9 +3762,9 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>>>   	 * hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
>>>   	 * be the reported vectored event, but RIP still needs to be unwound.
>>>   	 */
>>> -	if (int3_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
>>> -	   kvm_is_linear_rip(vcpu, svm->int3_rip))
>>> -		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - int3_injected);
>>> +	if (soft_int_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
>>> +	   kvm_is_linear_rip(vcpu, svm->soft_int_linear_rip))
>>> +		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - soft_int_injected);
>>>   
>>>   	switch (type) {
>>>   	case SVM_EXITINTINFO_TYPE_NMI:
>>> @@ -3726,14 +3777,6 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>>>   		if (vector == X86_TRAP_VC)
>>>   			break;
>>>   
>>> -		/*
>>> -		 * In case of software exceptions, do not reinject the vector,
>>> -		 * but re-execute the instruction instead. Rewind RIP first
>>> -		 * if we emulated INT3 before.
>>> -		 */
>>> -		if (kvm_exception_is_soft(vector))
>>> -			break;
>>> -
>>>   		if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) {
>>>   			u32 err = svm->vmcb->control.exit_int_info_err;
>>>   			kvm_requeue_exception_e(vcpu, vector, err);
>>> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
>>> index 47e7427d0395..a770a1c7ddd2 100644
>>> --- a/arch/x86/kvm/svm/svm.h
>>> +++ b/arch/x86/kvm/svm/svm.h
>>> @@ -230,8 +230,8 @@ struct vcpu_svm {
>>>   	bool nmi_singlestep;
>>>   	u64 nmi_singlestep_guest_rflags;
>>>   
>>> -	unsigned int3_injected;
>>> -	unsigned long int3_rip;
>>> +	unsigned soft_int_injected;
>>> +	unsigned long soft_int_linear_rip;
>>>   
>>>   	/* optional nested SVM features that are enabled for this guest  */
>>>   	bool nrips_enabled                : 1;
>>
>>
>> I mostly agree with this patch, but think that it doesn't address the
>> original issue that Maciej wanted to address:
>>
>> Suppose that there is *no* instruction in L2 code which caused the software
>> exception, but rather L1 set arbitrary next_rip, and set EVENTINJ to software
>> exception with some vector, and that injection got interrupted.
>>
>> I don't think that this code will support this.
> 
> Argh, you're right.  Maciej's selftest injects without an instruction, but it doesn't
> configure the scenario where that injection fails due to an exception+VM-Exit that
> isn't intercepted by L1 and is handled by L0.  The event_inj test gets the coverage
> for the latter, but always has a backing instruction.
> 
>> I think that svm_complete_interrupts should store next_rip in some field
>> like VMX does (vcpu->arch.event_exit_inst_len).
> 
> Yeah.  The ugly part is that because next_rip is guaranteed to be cleared on exit
> (the exit is guaranteed to be due to a fault-like exception), KVM has to snapshot
> next_rip during the "original" injection and use the linear_rip matching heuristic
> to detect this scenario.
> 
>> That field also should be migrated, or we must prove that it works anyway.
>> E.g., what happens if we tried to inject an event, the injection was
>> interrupted by another exception, and then we migrate?
> 
> Ya, should Just Work if control.next_rip is used to cache the next rip.
> 
> Handling this doesn't seem to be too awful (haven't tested yet); it's largely the
> same logic as the existing !nrips code.
> 
> In svm_update_soft_interrupt_rip(), snapshot all information regardless of whether
> or not nrips is enabled:
> 
> 	svm->soft_int_injected = true;
> 	svm->soft_int_csbase = svm->vmcb->save.cs.base;
> 	svm->soft_int_old_rip = old_rip;
> 	svm->soft_int_next_rip = rip;
> 
> 	if (nrips)
> 		kvm_rip_write(vcpu, old_rip);
> 
> 	if (static_cpu_has(X86_FEATURE_NRIPS))
> 		svm->vmcb->control.next_rip = rip;
> 
> and then in svm_complete_interrupts(), change the linear RIP matching code to look
> for the old rip in the nrips case and stuff svm->vmcb->control.next_rip on match.
> 
> 	bool soft_int_injected = svm->soft_int_injected;
> 	unsigned long soft_int_rip;
> 
> 	svm->soft_int_injected = false;
> 
> 	if (soft_int_injected) {
> 		if (nrips)
> 			soft_int_rip = svm->soft_int_old_rip;
> 		else
> 			soft_int_rip = svm->soft_int_next_rip;
> 	}
> 
> 	...
> 
> 	if (soft_int_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
> 	   kvm_is_linear_rip(vcpu, soft_int_rip + svm->soft_int_csbase)) {
> 		if (nrips)
> 			svm->vmcb->control.next_rip = svm->soft_int_next_rip;
> 		else
> 			kvm_rip_write(vcpu, svm->soft_int_old_rip);
> 	}
> 
> 
> 

Despite what the svm_update_soft_interrupt_rip() name might suggest, this
handles only *soft exceptions*, not *soft interrupts*
(which are injected by svm_inject_irq() and also need proper next_rip
management).

Also, I'm not sure that even the proposed updated code above will
actually restore the L1-requested next_rip correctly on L1 -> L2
re-injection (will review once the full version is available).

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-04 19:54           ` Sean Christopherson
@ 2022-04-04 20:46             ` Maciej S. Szmigiero
  0 siblings, 0 replies; 45+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-04 20:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On 4.04.2022 21:54, Sean Christopherson wrote:
> On Mon, Apr 04, 2022, Sean Christopherson wrote:
>> On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
>>>>>> index 47e7427d0395..a770a1c7ddd2 100644
>>>>>> --- a/arch/x86/kvm/svm/svm.h
>>>>>> +++ b/arch/x86/kvm/svm/svm.h
>>>>>> @@ -230,8 +230,8 @@ struct vcpu_svm {
>>>>>>    	bool nmi_singlestep;
>>>>>>    	u64 nmi_singlestep_guest_rflags;
>>>>>> -	unsigned int3_injected;
>>>>>> -	unsigned long int3_rip;
>>>>>> +	unsigned soft_int_injected;
>>>>>> +	unsigned long soft_int_linear_rip;
>>>>>>    	/* optional nested SVM features that are enabled for this guest  */
>>>>>>    	bool nrips_enabled                : 1;
>>>>>
>>>>>
>>>>> I mostly agree with this patch, but think that it doesn't address the
>>>>> original issue that Maciej wanted to address:
>>>>>
>>>>> Suppose that there is *no* instruction in L2 code which caused the software
>>>>> exception, but rather L1 set arbitrary next_rip, and set EVENTINJ to software
>>>>> exception with some vector, and that injection got interrupted.
>>>>>
>>>>> I don't think that this code will support this.
>>>>
>>>> Argh, you're right.  Maciej's selftest injects without an instruction, but it doesn't
>>>> configure the scenario where that injection fails due to an exception+VM-Exit that
>>>> isn't intercepted by L1 and is handled by L0.  The event_inj test gets the coverage
>>>> for the latter, but always has a backing instruction.
>>>
>>> Still reviewing the whole patch set, but want to clear this point quickly:
>>> The selftest does have an implicit intervening NPF (handled by L0) while
>>> injecting the first L1 -> L2 event.
>>
>> I'll do some debugging to figure out why the test passes for me.  I'm guessing I either
>> got lucky, e.g. the IDT was faulted in already, or I screwed up and the test doesn't
>> actually pass.
> 
> Well that was easy.  My code is indeed flawed and skips the wrong instruction;
> the skipped instruction just so happens to be a (spurious?) adjustment of RSP.  The
> L2 guest function never runs to completion and so the "bad" RSP is never consumed.
>   
>     KVM: incomplete injection for L2, vector 32 @ 401c70.  next_rip = 0
>     KVM: injecting for L2, vector 0 @ 401c70.  next_rip = 401c74
> 
> 0000000000401c70 <l2_guest_code>:
>    401c70:       48 83 ec 08             sub    $0x8,%rsp
>    401c74:       83 3d 75 a7 0e 00 01    cmpl   $0x1,0xea775(%rip)        # 4ec3f0 <int_fired>
>    401c7b:       74 1e                   je     401c9b <l2_guest_code+0x2b>
>    401c7d:       45 31 c0                xor    %r8d,%r8d
>    401c80:       b9 32 00 00 00          mov    $0x32,%ecx
>    401c85:       ba 90 40 4b 00          mov    $0x4b4090,%edx
>    401c8a:       31 c0                   xor    %eax,%eax
>    401c8c:       be 02 00 00 00          mov    $0x2,%esi
>    401c91:       bf 02 00 00 00          mov    $0x2,%edi
>    401c96:       e8 05 ae 00 00          call   40caa0 <ucall>
>    401c9b:       0f 01 d9                vmmcall
>    401c9e:       0f 0b                   ud2
>    401ca0:       83 3d 4d a7 0e 00 01    cmpl   $0x1,0xea74d(%rip)        # 4ec3f4 <bp_fired>
>    401ca7:       74 1e                   je     401cc7 <l2_guest_code+0x57>
>    401ca9:       45 31 c0                xor    %r8d,%r8d
>    401cac:       b9 36 00 00 00          mov    $0x36,%ecx
>    401cb1:       ba b8 40 4b 00          mov    $0x4b40b8,%edx
>    401cb6:       31 c0                   xor    %eax,%eax
>    401cb8:       be 02 00 00 00          mov    $0x2,%esi
>    401cbd:       bf 02 00 00 00          mov    $0x2,%edi
>    401cc2:       e8 d9 ad 00 00          call   40caa0 <ucall>
>    401cc7:       f4                      hlt
>    401cc8:       48 83 c4 08             add    $0x8,%rsp
>    401ccc:       c3                      ret
>    401ccd:       0f 1f 00                nopl   (%rax)
> 
> I don't see why the compiler is creating room for a single variable, but it doesn't
> really matter; the easiest way to detect this bug is to assert that the return RIP
> in the INT 0x20 handler points at l2_guest_code, e.g. this fails:
> 
> diff --git a/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c b/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
> index d39be5d885c1..257aa2280b5c 100644
> --- a/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
> +++ b/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
> @@ -40,9 +40,13 @@ static void guest_bp_handler(struct ex_regs *regs)
>   }
> 
>   static unsigned int int_fired;
> +static void l2_guest_code(void);
> +
>   static void guest_int_handler(struct ex_regs *regs)
>   {
>          int_fired++;
> +       GUEST_ASSERT_2(regs->rip == (unsigned long)l2_guest_code,
> +                      regs->rip, (unsigned long)l2_guest_code);
>   }
> 
>   static void l2_guest_code(void)

It totally makes sense to add the above as an additional assert to the
self test - the more checks the test has, the better it is at catching
bugs.

Thanks,
Maciej


* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-04 20:44       ` Maciej S. Szmigiero
@ 2022-04-06  1:48         ` Sean Christopherson
  2022-04-06 13:13           ` Maciej S. Szmigiero
  0 siblings, 1 reply; 45+ messages in thread
From: Sean Christopherson @ 2022-04-06  1:48 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
> On 4.04.2022 18:49, Sean Christopherson wrote:
> > On Mon, Apr 04, 2022, Maxim Levitsky wrote:
> > In svm_update_soft_interrupt_rip(), snapshot all information regardless of whether
> > or not nrips is enabled:
> > 
> > 	svm->soft_int_injected = true;
> > 	svm->soft_int_csbase = svm->vmcb->save.cs.base;
> > 	svm->soft_int_old_rip = old_rip;
> > 	svm->soft_int_next_rip = rip;
> > 
> > 	if (nrips)
> > 		kvm_rip_write(vcpu, old_rip);
> > 
> > 	if (static_cpu_has(X86_FEATURE_NRIPS))
> > 		svm->vmcb->control.next_rip = rip;
> > 
> > and then in svm_complete_interrupts(), change the linear RIP matching code to look
> > for the old rip in the nrips case and stuff svm->vmcb->control.next_rip on match.
> > 
> > 	bool soft_int_injected = svm->soft_int_injected;
> > 	unsigned long soft_int_rip;
> > 
> > 	svm->soft_int_injected = false;
> > 
> > 	if (soft_int_injected) {
> > 		if (nrips)
> > 			soft_int_rip = svm->soft_int_old_rip;
> > 		else
> > 			soft_int_rip = svm->soft_int_next_rip;
> > 	}
> > 
> > 	...
> > 
> > 	if (soft_int_injected && type == SVM_EXITINTINFO_TYPE_EXEPT &&
> > 	   kvm_is_linear_rip(vcpu, soft_int_rip + svm->soft_int_csbase)) {
> > 		if (nrips)
> > 			svm->vmcb->control.next_rip = svm->soft_int_next_rip;
> > 		else
> > 			kvm_rip_write(vcpu, svm->soft_int_old_rip);
> > 	}
> > 
> > 
> > 
> 
> Despite what the svm_update_soft_interrupt_rip() name might suggest, this
> handles only *soft exceptions*, not *soft interrupts*
> (which are injected by svm_inject_irq() and also need proper next_rip
> management).

Yeah, soft interrupts are handled in the next patch.  I couldn't come up with a
less awful name.

> Also, I'm not sure that even the proposed updated code above will
> actually restore the L1-requested next_rip correctly on L1 -> L2
> re-injection (will review once the full version is available).

Spoiler alert, it doesn't.  Save yourself the review time.  :-)

The missing piece is stashing away the injected event on nested VMRUN.  Those
events don't get routed through the normal interrupt/exception injection code and
so the next_rip info is lost on the subsequent #NPF.

Treating soft interrupts/exceptions like they were injected by KVM (which they
are, technically) works and doesn't seem too gross.  E.g. when prepping vmcb02

	if (svm->nrips_enabled)
		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
	else if (boot_cpu_has(X86_FEATURE_NRIPS))
		vmcb02->control.next_rip    = vmcb12_rip;

	if (is_evtinj_soft(vmcb02->control.event_inj)) {
		svm->soft_int_injected = true;
		svm->soft_int_csbase = svm->vmcb->save.cs.base;
		svm->soft_int_old_rip = vmcb12_rip;
		if (svm->nrips_enabled)
			svm->soft_int_next_rip = svm->nested.ctl.next_rip;
		else
			svm->soft_int_next_rip = vmcb12_rip;
	}

And then the VMRUN error path just needs to clear soft_int_injected.


* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-06  1:48         ` Sean Christopherson
@ 2022-04-06 13:13           ` Maciej S. Szmigiero
  2022-04-06 17:10             ` Sean Christopherson
  0 siblings, 1 reply; 45+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-06 13:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On 6.04.2022 03:48, Sean Christopherson wrote:
> On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
(..)
>> Also, I'm not sure that even the proposed updated code above will
>> actually restore the L1-requested next_rip correctly on L1 -> L2
>> re-injection (will review once the full version is available).
> 
> Spoiler alert, it doesn't.  Save yourself the review time.  :-)
> 
> The missing piece is stashing away the injected event on nested VMRUN.  Those
> events don't get routed through the normal interrupt/exception injection code and
> so the next_rip info is lost on the subsequent #NPF.
> 
> Treating soft interrupts/exceptions like they were injected by KVM (which they
> are, technically) works and doesn't seem too gross.  E.g. when prepping vmcb02
> 
> 	if (svm->nrips_enabled)
> 		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
> 	else if (boot_cpu_has(X86_FEATURE_NRIPS))
> 		vmcb02->control.next_rip    = vmcb12_rip;
> 
> 	if (is_evtinj_soft(vmcb02->control.event_inj)) {
> 		svm->soft_int_injected = true;
> 		svm->soft_int_csbase = svm->vmcb->save.cs.base;
> 		svm->soft_int_old_rip = vmcb12_rip;
> 		if (svm->nrips_enabled)
> 			svm->soft_int_next_rip = svm->nested.ctl.next_rip;
> 		else
> 			svm->soft_int_next_rip = vmcb12_rip;
> 	}
> 
> And then the VMRUN error path just needs to clear soft_int_injected.

I am also a fan of parsing EVENTINJ from VMCB12 into relevant KVM
injection structures (much like EXITINTINFO is parsed), as I said to
Maxim two days ago [1].
Not only for software {interrupts,exceptions} but for all incoming
events (again, just like EXITINTINFO).

However, there is another issue related to L1 -> L2 event re-injection
using standard KVM event injection mechanism: it mixes the L1 injection
state with the L2 one.

Specifically for SVM:
* When re-injecting an NMI into L2, NMI blocking is enabled in
vcpu->arch.hflags (shared between L1 and L2) and the IRET intercept is
enabled.

This is incorrect, since it is L1 that is responsible for enforcing NMI
blocking for NMIs that it injects into its L2.
Also, *L2* being the target of such injection definitely should not block
further NMIs for *L1*.

* When re-injecting a *hardware* IRQ into L2, GIF is checked (previously
even at the BUG_ON() level), while L1 should be able to inject even when
L2 GIF is off.

With the code in my previous patch set I planned to use
exit_during_event_injection() to detect such case, but if we implement
VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
comes from L1, so its normal injection side-effects should be skipped.

By the way, the relevant VMX code also looks rather suspicious,
especially for the !enable_vnmi case.

Thanks,
Maciej

[1]: https://lore.kernel.org/kvm/7d67bc6f-00ac-7c07-f6c2-c41b2f0d35a1@maciej.szmigiero.name/


* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-06 13:13           ` Maciej S. Szmigiero
@ 2022-04-06 17:10             ` Sean Christopherson
  2022-04-06 19:08               ` Maciej S. Szmigiero
  0 siblings, 1 reply; 45+ messages in thread
From: Sean Christopherson @ 2022-04-06 17:10 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> On 6.04.2022 03:48, Sean Christopherson wrote:
> > On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
> (..)
> > > Also, I'm not sure that even the proposed updated code above will
> > > actually restore the L1-requested next_rip correctly on L1 -> L2
> > > re-injection (will review once the full version is available).
> > 
> > Spoiler alert, it doesn't.  Save yourself the review time.  :-)
> > 
> > The missing piece is stashing away the injected event on nested VMRUN.  Those
> > events don't get routed through the normal interrupt/exception injection code and
> > so the next_rip info is lost on the subsequent #NPF.
> > 
> > Treating soft interrupts/exceptions like they were injected by KVM (which they
> > are, technically) works and doesn't seem too gross.  E.g. when prepping vmcb02
> > 
> > 	if (svm->nrips_enabled)
> > 		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
> > 	else if (boot_cpu_has(X86_FEATURE_NRIPS))
> > 		vmcb02->control.next_rip    = vmcb12_rip;
> > 
> > 	if (is_evtinj_soft(vmcb02->control.event_inj)) {
> > 		svm->soft_int_injected = true;
> > 		svm->soft_int_csbase = svm->vmcb->save.cs.base;
> > 		svm->soft_int_old_rip = vmcb12_rip;
> > 		if (svm->nrips_enabled)
> > 			svm->soft_int_next_rip = svm->nested.ctl.next_rip;
> > 		else
> > 			svm->soft_int_next_rip = vmcb12_rip;
> > 	}
> > 
> > And then the VMRUN error path just needs to clear soft_int_injected.
> 
> I am also a fan of parsing EVENTINJ from VMCB12 into relevant KVM
> injection structures (much like EXITINTINFO is parsed), as I said to
> Maxim two days ago [1].
> Not only for software {interrupts,exceptions} but for all incoming
> events (again, just like EXITINTINFO).

Ahh, I saw that fly by, but somehow I managed to misread what you intended.

I like the idea of populating vcpu->arch.interrupt/exception as "injected" events.
KVM prioritizes "injected" over other nested events, so in theory it should work
without too much fuss.  I've ran through a variety of edge cases in my head and
haven't found anything that would be fundamentally broken.  I think even live
migration would work.

I think I'd prefer to do that in a follow-up series so that nVMX can be converted
at the same time?  It's a bit absurd to add the above soft int code knowing that,
at least in theory, simply populating the right software structs would automagically
fix the bug.  But manually handling the soft int case first would be safer in the
sense that we'd still have a fix for the soft int case if it turns out that populating
> vcpu->arch.interrupt/exception isn't as straightforward as it seems.

> However, there is another issue related to L1 -> L2 event re-injection
> using standard KVM event injection mechanism: it mixes the L1 injection
> state with the L2 one.
> 
> Specifically for SVM:
> * When re-injecting a NMI into L2 NMI-blocking is enabled in
> vcpu->arch.hflags (shared between L1 and L2) and IRET intercept is
> enabled.
> 
> This is incorrect, since it is L1 that is responsible for enforcing NMI
> blocking for NMIs that it injects into its L2.

Ah, I see what you're saying.  I think :-)  IIUC, we can fix this bug without any
new flags, just skip the side effects if the NMI is being injected into L2.

@@ -3420,6 +3424,10 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
        struct vcpu_svm *svm = to_svm(vcpu);

        svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
+
+       if (is_guest_mode(vcpu))
+               return;
+
        vcpu->arch.hflags |= HF_NMI_MASK;
        if (!sev_es_guest(vcpu->kvm))
                svm_set_intercept(svm, INTERCEPT_IRET);

and for nVMX:

@@ -4598,6 +4598,9 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
 {
        struct vcpu_vmx *vmx = to_vmx(vcpu);

+       if (is_guest_mode(vcpu))
+               goto inject_nmi;
+
        if (!enable_vnmi) {
                /*
                 * Tracking the NMI-blocked state in software is built upon
@@ -4619,6 +4622,7 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
                return;
        }

+inject_nmi:
        vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
                        INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR);


> Also, *L2* being the target of such injection definitely should not block
> further NMIs for *L1*.

Actually, it should block NMIs for L1.  From L1's perspective, the injection is
part of VM-Entry.  That's a single gigantic instruction, thus there is no NMI window
> until VM-Entry completes from L1's perspective.  Any exit that occurs on vectoring
an injected event and is handled by L0 should not be visible to L1, because from
L1's perspective it's all part of VMRUN/VMLAUNCH/VMRESUME.  So blocking new events
because an NMI (or any event) needs to be reinjected for L2 is correct.

> * When re-injecting a *hardware* IRQ into L2 GIF is checked (previously
> even on the BUG_ON() level), while L1 should be able to inject even when
> L2 GIF is off,

Isn't that just a matter of tweaking the assertion to ignore GIF if L2 is
> active?  Hmm, or deleting the assertion altogether; it's likely doing more harm
than good at this point.

> With the code in my previous patch set I planned to use
> exit_during_event_injection() to detect such case, but if we implement
> VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
> comes from L1, so its normal injection side-effects should be skipped.

Do we still need a flag based on the above?  Honest question... I've been staring
at all this for the better part of an hour and may have lost track of things.

> By the way, the relevant VMX code also looks rather suspicious,
> especially for the !enable_vnmi case.

I think it's safe to say we can ignore edge cases for !enable_vnmi.  It might even
> be worth trying to remove that support again (Paolo tried years ago); IIRC the
only Intel CPUs that don't support virtual NMIs are some funky Yonah SKUs.

> Thanks,
> Maciej
> 
> [1]: https://lore.kernel.org/kvm/7d67bc6f-00ac-7c07-f6c2-c41b2f0d35a1@maciej.szmigiero.name/


* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-06 17:10             ` Sean Christopherson
@ 2022-04-06 19:08               ` Maciej S. Szmigiero
  2022-04-06 19:48                 ` Sean Christopherson
  0 siblings, 1 reply; 45+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-06 19:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On 6.04.2022 19:10, Sean Christopherson wrote:
> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>> On 6.04.2022 03:48, Sean Christopherson wrote:
>>> On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
>> (..)
>>>> Also, I'm not sure that even the proposed updated code above will
>>>> actually restore the L1-requested next_rip correctly on L1 -> L2
>>>> re-injection (will review once the full version is available).
>>>
>>> Spoiler alert, it doesn't.  Save yourself the review time.  :-)
>>>
>>> The missing piece is stashing away the injected event on nested VMRUN.  Those
>>> events don't get routed through the normal interrupt/exception injection code and
>>> so the next_rip info is lost on the subsequent #NPF.
>>>
>>> Treating soft interrupts/exceptions like they were injected by KVM (which they
>>> are, technically) works and doesn't seem too gross.  E.g. when prepping vmcb02
>>>
>>> 	if (svm->nrips_enabled)
>>> 		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
>>> 	else if (boot_cpu_has(X86_FEATURE_NRIPS))
>>> 		vmcb02->control.next_rip    = vmcb12_rip;
>>>
>>> 	if (is_evtinj_soft(vmcb02->control.event_inj)) {
>>> 		svm->soft_int_injected = true;
>>> 		svm->soft_int_csbase = svm->vmcb->save.cs.base;
>>> 		svm->soft_int_old_rip = vmcb12_rip;
>>> 		if (svm->nrips_enabled)
>>> 			svm->soft_int_next_rip = svm->nested.ctl.next_rip;
>>> 		else
>>> 			svm->soft_int_next_rip = vmcb12_rip;
>>> 	}
>>>
>>> And then the VMRUN error path just needs to clear soft_int_injected.
>>
>> I am also a fan of parsing EVENTINJ from VMCB12 into relevant KVM
>> injection structures (much like EXITINTINFO is parsed), as I said to
>> Maxim two days ago [1].
>> Not only for software {interrupts,exceptions} but for all incoming
>> events (again, just like EXITINTINFO).
> 
> Ahh, I saw that fly by, but somehow I managed to misread what you intended.
> 
> I like the idea of populating vcpu->arch.interrupt/exception as "injected" events.
> KVM prioritizes "injected" over other nested events, so in theory it should work
> without too much fuss.  I've run through a variety of edge cases in my head and
> haven't found anything that would be fundamentally broken.  I think even live
> migration would work.
> 
> I think I'd prefer to do that in a follow-up series so that nVMX can be converted
> at the same time?  It's a bit absurd to add the above soft int code knowing that,
> at least in theory, simply populating the right software structs would automagically
> fix the bug.  But manually handling the soft int case first would be safer in the
> sense that we'd still have a fix for the soft int case if it turns out that populating
>> vcpu->arch.interrupt/exception isn't as straightforward as it seems.

I don't see any problem with having two patch series, the second series
depending on the first.

I planned to at least fix the bugs that I described in my previous message
(the NMI one actually breaks a real Windows guest), but that requires
knowing how the event injection code will finally look after the
first series of fixes.

>> However, there is another issue related to L1 -> L2 event re-injection
>> using standard KVM event injection mechanism: it mixes the L1 injection
>> state with the L2 one.
>>
>> Specifically for SVM:
>> * When re-injecting a NMI into L2 NMI-blocking is enabled in
>> vcpu->arch.hflags (shared between L1 and L2) and IRET intercept is
>> enabled.
>>
>> This is incorrect, since it is L1 that is responsible for enforcing NMI
>> blocking for NMIs that it injects into its L2.
> 
> Ah, I see what you're saying.  I think :-)  IIUC, we can fix this bug without any
> new flags, just skip the side effects if the NMI is being injected into L2.
> 
> @@ -3420,6 +3424,10 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
>          struct vcpu_svm *svm = to_svm(vcpu);
> 
>          svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
> +
> +       if (is_guest_mode(vcpu))
> +               return;
> +
>          vcpu->arch.hflags |= HF_NMI_MASK;
>          if (!sev_es_guest(vcpu->kvm))
>                  svm_set_intercept(svm, INTERCEPT_IRET);
> 
> and for nVMX:
> 
> @@ -4598,6 +4598,9 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
>   {
>          struct vcpu_vmx *vmx = to_vmx(vcpu);
> 
> +       if (is_guest_mode(vcpu))
> +               goto inject_nmi;
> +
>          if (!enable_vnmi) {
>                  /*
>                   * Tracking the NMI-blocked state in software is built upon
> @@ -4619,6 +4622,7 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
>                  return;
>          }
> 
> +inject_nmi:
>          vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
>                          INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR);
> 

And what if it's L0 that is trying to inject an NMI into L2?
In this case is_guest_mode() is true, but the full NMI injection machinery
should be used.

>> Also, *L2* being the target of such injection definitely should not block
>> further NMIs for *L1*.
> 
> Actually, it should block NMIs for L1.  From L1's perspective, the injection is
> part of VM-Entry.  That's a single gigantic instruction, thus there is no NMI window
> until VM-Entry completes from L1's perspetive.  Any exit that occurs on vectoring
> an injected event and is handled by L0 should not be visible to L1, because from
> L1's perspective it's all part of VMRUN/VMLAUNCH/VMRESUME.  So blocking new events
> because an NMI (or any event) needs to be reinjected for L2 is correct.

I think this kind of NMI blocking will already be handled by having
the pending new NMI in vcpu->arch.nmi_pending but the one that needs
re-injecting in vcpu->arch.nmi_injected.

The pending new NMI in vcpu->arch.nmi_pending won't be handled until
vcpu->arch.nmi_injected gets cleared (that is, until re-injection is
successful).
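
Roughly, from memory, the relevant ordering in inject_pending_event()
(arch/x86/kvm/x86.c) looks like the simplified fragments below - not
verbatim code:

	/* An event that was already injected is always re-injected first. */
	if (vcpu->arch.nmi_injected) {
		static_call(kvm_x86_inject_nmi)(vcpu);
		can_inject = false;
	}

	...

	/* A new pending NMI is delivered only once nothing awaits re-injection. */
	if (vcpu->arch.nmi_pending && can_inject &&
	    static_call(kvm_x86_nmi_allowed)(vcpu, true) > 0) {
		--vcpu->arch.nmi_pending;
		vcpu->arch.nmi_injected = true;
		static_call(kvm_x86_inject_nmi)(vcpu);
	}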

It is incorrect, however, to wait for L2 to execute IRET to unblock
L0 -> L1 NMIs or L1 -> L2 NMIs, in these two cases we (L0) just need the CPU
to vector that L2 NMI so it no longer shows in EXITINTINFO.

It is also incorrect to block L1 -> L2 NMI injection because either L1
or L2 is currently under NMI blocking: the first case is obvious,
the second because it's L1 that is supposed to take care of proper NMI
blocking for L2 when injecting an NMI there.

>> * When re-injecting a *hardware* IRQ into L2 GIF is checked (previously
>> even on the BUG_ON() level), while L1 should be able to inject even when
>> L2 GIF is off,
> 
> Isn't that just a matter of tweaking the assertion to ignore GIF if L2 is
> active?  Hmm, or deleting the assertion altogether, it's likely doing more harm
> than good at this point.

I assume this assertion is meant to catch the case when KVM itself (L0) is
trying to erroneously inject a hardware interrupt into L1 or L2, so it will
need to be skipped only for L1 -> L2 event injection.

Whether this assertion's benefits outweigh its costs is debatable - don't have
a strong opinion here (BUG_ON() is for sure too strong, but WARN_ON_ONCE()
might make sense to catch latent bugs).

>> With the code in my previous patch set I planned to use
>> exit_during_event_injection() to detect such case, but if we implement
>> VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
>> comes from L1, so its normal injection side-effects should be skipped.
> 
> Do we still need a flag based on the above?  Honest question... I've been staring
> at all this for the better part of an hour and may have lost track of things.

If checking just is_guest_mode() is not enough due to reasons I described
above then we need to somehow determine in the NMI / IRQ injection handler
whether the event to be injected into L2 comes from L0 or L1.
For this (assuming we do VMCB12 EVENTINJ parsing) I think we need an extra flag.
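
Something along these lines, purely as a sketch (the "event_from_l1" field
name is invented here, nothing below is actual code):

	static void svm_inject_nmi(struct kvm_vcpu *vcpu)
	{
		struct vcpu_svm *svm = to_svm(vcpu);

		svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;

		/*
		 * Hypothetical new flag in kvm_vcpu_arch: set when the event
		 * was requested by L1 via VMCB12 EVENTINJ, in which case L1
		 * owns the NMI-blocking semantics and L0 must skip its own
		 * side effects.
		 */
		if (is_guest_mode(vcpu) && vcpu->arch.event_from_l1)
			return;

		vcpu->arch.hflags |= HF_NMI_MASK;
		if (!sev_es_guest(vcpu->kvm))
			svm_set_intercept(svm, INTERCEPT_IRET);
	}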

>> By the way, the relevant VMX code also looks rather suspicious,
>> especially for the !enable_vnmi case.
> 
> I think it's safe to say we can ignore edge cases for !enable_vnmi.  It might even
> be worth trying to remove that support again (Paolo tried years ago), IIRC the
> only Intel CPUs that don't support virtual NMIs are some funky Yonah SKUs.

Ack, we could at least disable nested on !enable_vnmi.
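
E.g. something like this in the VMX hardware setup (hypothetical sketch,
not actual code):

	/* The !enable_vnmi NMI-tracking hacks don't mix well with nesting. */
	if (nested && !enable_vnmi) {
		pr_warn("Virtual NMIs not available, disabling nested virtualization\n");
		nested = 0;
	}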

BTW, I think that besides Yonah cores very early Core 2 CPUs also lacked
vNMI support; that's why !enable_vnmi support was reinstated.
But that's hardware even older than !nrips AMD parts.

Thanks,
Maciej


* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-06 19:08               ` Maciej S. Szmigiero
@ 2022-04-06 19:48                 ` Sean Christopherson
  2022-04-06 20:30                   ` Maciej S. Szmigiero
  0 siblings, 1 reply; 45+ messages in thread
From: Sean Christopherson @ 2022-04-06 19:48 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> On 6.04.2022 19:10, Sean Christopherson wrote:
> > On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> And what if it's L0 that is trying to inject an NMI into L2?
> In this case is_guest_mode() is true, but the full NMI injection machinery
> should be used.

Gah, you're right, I got misled by a benign bug in nested_vmx_l1_wants_exit() and
was thinking that NMIs always exit.  The "L1 wants" part should be conditioned on
NMI exiting being enabled.  It's benign because KVM always wants "real" NMIs, and
so the path is never encountered.

@@ -5980,7 +6005,7 @@ static bool nested_vmx_l1_wants_exit(struct kvm_vcpu *vcpu,
        switch ((u16)exit_reason.basic) {
        case EXIT_REASON_EXCEPTION_NMI:
                intr_info = vmx_get_intr_info(vcpu);
-               if (is_nmi(intr_info))
+               if (is_nmi(intr_info) && nested_cpu_has_nmi_exiting(vmcs12))
                        return true;
                else if (is_page_fault(intr_info))
                        return true;


> > > Also, *L2* being the target of such injection definitely should not block
> > > further NMIs for *L1*.
> > 
> > Actually, it should block NMIs for L1.  From L1's perspective, the injection is
> > part of VM-Entry.  That's a single gigantic instruction, thus there is no NMI window
> > until VM-Entry completes from L1's perspective.  Any exit that occurs on vectoring
> > an injected event and is handled by L0 should not be visible to L1, because from
> > L1's perspective it's all part of VMRUN/VMLAUNCH/VMRESUME.  So blocking new events
> > because an NMI (or any event) needs to be reinjected for L2 is correct.
> 
> I think this kind of NMI blocking will already be handled by having
> the pending new NMI in vcpu->arch.nmi_pending but the one that needs
> re-injecting in vcpu->arch.nmi_injected.
> 
> The pending new NMI in vcpu->arch.nmi_pending won't be handled until
> vcpu->arch.nmi_injected gets cleared (that is, until re-injection is
> successful).

Yep.

> It is incorrect, however, to wait for L2 to execute IRET to unblock
> L0 -> L1 NMIs or L1 -> L2 NMIs, in these two cases we (L0) just need the CPU
> to vector that L2 NMI so it no longer shows in EXITINTINFO.

Yep, and the pending NMI should cause KVM to request an "immediate" exit that
occurs after vectoring completes.

> It is also incorrect to block L1 -> L2 NMI injection because either L1
> or L2 is currently under NMI blocking: the first case is obvious,
> the second because it's L1 that is supposed to take care of proper NMI
> blocking for L2 when injecting an NMI there.

Yep, but I don't think there's a bug here.  At least not for nVMX.

> > > * When re-injecting a *hardware* IRQ into L2 GIF is checked (previously
> > > even on the BUG_ON() level), while L1 should be able to inject even when
> > > L2 GIF is off,
> > 
> > Isn't that just a matter of tweaking the assertion to ignore GIF if L2 is
> > active?  Hmm, or deleting the assertion altogether, it's likely doing more harm
> > than good at this point.
> 
> I assume this assertion is meant to catch the case when KVM itself (L0) is
> trying to erroneously inject a hardware interrupt into L1 or L2, so it will
> need to be skipped only for L1 -> L2 event injection.

Yeah, that's what my git archaeology came up with too.

> Whether this assertion's benefits outweigh its costs is debatable - don't have
> a strong opinion here (BUG_ON() is for sure too strong, but WARN_ON_ONCE()
> might make sense to catch latent bugs).
> 
> > > With the code in my previous patch set I planned to use
> > > exit_during_event_injection() to detect such case, but if we implement
> > > VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
> > > comes from L1, so its normal injection side-effects should be skipped.
> > 
> > Do we still need a flag based on the above?  Honest question... I've been staring
> > at all this for the better part of an hour and may have lost track of things.
> 
> If checking just is_guest_mode() is not enough due to reasons I described
> above then we need to somehow determine in the NMI / IRQ injection handler
> whether the event to be injected into L2 comes from L0 or L1.
> For this (assuming we do VMCB12 EVENTINJ parsing) I think we need an extra flag.

Yes :-(  And I believe the extra flag would need to be handled by KVM_{G,S}ET_VCPU_EVENTS.
 
> > > By the way, the relevant VMX code also looks rather suspicious,
> > > especially for the !enable_vnmi case.
> > 
> > I think it's safe to say we can ignore edge cases for !enable_vnmi.  It might even
> > be worth trying to remove that support again (Paolo tried years ago), IIRC the
> > only Intel CPUs that don't support virtual NMIs are some funky Yonah SKUs.
> 
> Ack, we could at least disable nested on !enable_vnmi.
> 
> BTW, I think that besides Yonah cores very early Core 2 CPUs also lacked
> vNMI support; that's why !enable_vnmi support was reinstated.
> But that's hardware even older than !nrips AMD parts.


* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-06 19:48                 ` Sean Christopherson
@ 2022-04-06 20:30                   ` Maciej S. Szmigiero
  2022-04-06 20:52                     ` Sean Christopherson
  0 siblings, 1 reply; 45+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-06 20:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On 6.04.2022 21:48, Sean Christopherson wrote:
> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>> On 6.04.2022 19:10, Sean Christopherson wrote:
>>> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>> And what if it's L0 that is trying to inject an NMI into L2?
>> In this case is_guest_mode() is true, but the full NMI injection machinery
>> should be used.
> 
> Gah, you're right, I got misled by a benign bug in nested_vmx_l1_wants_exit() and
> was thinking that NMIs always exit.  The "L1 wants" part should be conditioned on
> NMI exiting being enabled.  It's benign because KVM always wants "real" NMIs, and
> so the path is never encountered.
> 
> @@ -5980,7 +6005,7 @@ static bool nested_vmx_l1_wants_exit(struct kvm_vcpu *vcpu,
>          switch ((u16)exit_reason.basic) {
>          case EXIT_REASON_EXCEPTION_NMI:
>                  intr_info = vmx_get_intr_info(vcpu);
> -               if (is_nmi(intr_info))
> +               if (is_nmi(intr_info) && nested_cpu_has_nmi_exiting(vmcs12))
>                          return true;
>                  else if (is_page_fault(intr_info))
>                          return true;
> 

I guess you mean "benign" when KVM is the L1 hypervisor, since other hypervisors may
let their L2s handle NMIs themselves.

>> It is also incorrect to block L1 -> L2 NMI injection because either L1
>> or L2 is currently under NMI blocking: the first case is obvious,
>> the second because it's L1 that is supposed to take care of proper NMI
>> blocking for L2 when injecting an NMI there.
> 
> Yep, but I don't think there's a bug here.  At least not for nVMX.

I agree this scenario should currently work (including on nSVM) - mentioned
it just as a constraint on solution space.

>>>> With the code in my previous patch set I planned to use
>>>> exit_during_event_injection() to detect such case, but if we implement
>>>> VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
>>>> comes from L1, so its normal injection side-effects should be skipped.
>>>
>>> Do we still need a flag based on the above?  Honest question... I've been staring
>>> at all this for the better part of an hour and may have lost track of things.
>>
>> If checking just is_guest_mode() is not enough due to reasons I described
>> above then we need to somehow determine in the NMI / IRQ injection handler
>> whether the event to be injected into L2 comes from L0 or L1.
>> For this (assuming we do VMCB12 EVENTINJ parsing) I think we need an extra flag.
> 
> Yes :-(  And I believe the extra flag would need to be handled by KVM_{G,S}ET_VCPU_EVENTS.
>

Another option for saving and restoring a VM would be to add it to
KVM_{GET,SET}_NESTED_STATE somewhere (maybe as a part of the saved VMCB12
control area?).

Thanks,
Maciej


* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-06 20:30                   ` Maciej S. Szmigiero
@ 2022-04-06 20:52                     ` Sean Christopherson
  2022-04-06 22:34                       ` Maciej S. Szmigiero
  0 siblings, 1 reply; 45+ messages in thread
From: Sean Christopherson @ 2022-04-06 20:52 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> On 6.04.2022 21:48, Sean Christopherson wrote:
> > On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> > > On 6.04.2022 19:10, Sean Christopherson wrote:
> > > > On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> > > And what if it's L0 that is trying to inject an NMI into L2?
> > > In this case is_guest_mode() is true, but the full NMI injection machinery
> > > should be used.
> > 
> > Gah, you're right, I got misled by a benign bug in nested_vmx_l1_wants_exit() and
> > was thinking that NMIs always exit.  The "L1 wants" part should be conditioned on
> > NMI exiting being enabled.  It's benign because KVM always wants "real" NMIs, and
> > so the path is never encountered.
> > 
> > @@ -5980,7 +6005,7 @@ static bool nested_vmx_l1_wants_exit(struct kvm_vcpu *vcpu,
> >          switch ((u16)exit_reason.basic) {
> >          case EXIT_REASON_EXCEPTION_NMI:
> >                  intr_info = vmx_get_intr_info(vcpu);
> > -               if (is_nmi(intr_info))
> > +               if (is_nmi(intr_info) && nested_cpu_has_nmi_exiting(vmcs12))
> >                          return true;
> >                  else if (is_page_fault(intr_info))
> >                          return true;
> > 
> 
> I guess you mean "benign" when KVM is the L1 hypervisor, since other hypervisors may
> let their L2s handle NMIs themselves.

No, this one's truly benign.  The nVMX exit processing is:

	if (nested_vmx_l0_wants_exit())
		handle in L0 / KVM;

	if (nested_vmx_l1_wants_exit())
		handle in L1

	handle in L0 / KVM

Since this is for actual hardware NMIs, the first "L0 wants" check always returns
true for NMIs, so the fact that KVM screws up L1's wants is a non-issue.
 
> > > It is also incorrect to block L1 -> L2 NMI injection because either L1
> > > or L2 is currently under NMI blocking: the first case is obvious,
> > > the second because it's L1 that is supposed to take care of proper NMI
> > > blocking for L2 when injecting an NMI there.
> > 
> > Yep, but I don't think there's a bug here.  At least not for nVMX.
> 
> I agree this scenario should currently work (including on nSVM) - mentioned
> it just as a constraint on solution space.
> 
> > > > > With the code in my previous patch set I planned to use
> > > > > exit_during_event_injection() to detect such case, but if we implement
> > > > > VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
> > > > > comes from L1, so its normal injection side-effects should be skipped.
> > > > 
> > > > Do we still need a flag based on the above?  Honest question... I've been staring
> > > > at all this for the better part of an hour and may have lost track of things.
> > > 
> > > If checking just is_guest_mode() is not enough due to reasons I described
> > > above then we need to somehow determine in the NMI / IRQ injection handler
> > > whether the event to be injected into L2 comes from L0 or L1.
> > > For this (assuming we do VMCB12 EVENTINJ parsing) I think we need an extra flag.
> > 
> > Yes :-(  And I believe the extra flag would need to be handled by KVM_{G,S}ET_VCPU_EVENTS.
> > 
> 
> Another option for saving and restoring a VM would be to add it to
> KVM_{GET,SET}_NESTED_STATE somewhere (maybe as a part of the saved VMCB12
> control area?).

Ooh.  What if we keep nested_run_pending=true until the injection completes?  Then
we don't even need an extra flag because nested_run_pending effectively says that
any and all injected events are for L1=>L2.  In KVM_GET_NESTED_STATE, shove the
to-be-injected event into the normal vmc*12 injection field, and ignore all
to-be-injected events in KVM_GET_VCPU_EVENTS if nested_run_pending=true.

That should work even for migrating to an older KVM, as keeping nested_run_pending
will cause the target to reprocess the event injection as if it were from nested
VM-Enter, which it technically is.
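
The KVM_GET_NESTED_STATE half might look something like this (hypothetical
sketch; svm_eventinj_from_injected_event() is an invented helper that would
rebuild an EVENTINJ value from vcpu->arch.{exception,interrupt,nmi}):

	if (svm->nested.nested_run_pending &&
	    kvm_event_needs_reinjection(vcpu))
		ctl.event_inj = svm_eventinj_from_injected_event(vcpu);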

We could probably get away with completely dropping the intermediate event as
the vmc*12 should still have the original event, but that technically could result
in architecturally incorrect behavior, e.g. if vectoring up to the point of
interception sets A/D bits in the guest.


* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-06 20:52                     ` Sean Christopherson
@ 2022-04-06 22:34                       ` Maciej S. Szmigiero
  2022-04-06 23:03                         ` Sean Christopherson
  0 siblings, 1 reply; 45+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-06 22:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On 6.04.2022 22:52, Sean Christopherson wrote:
> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>> On 6.04.2022 21:48, Sean Christopherson wrote:
>>> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>>>> On 6.04.2022 19:10, Sean Christopherson wrote:
>>>>> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>>>> And what if it's L0 that is trying to inject an NMI into L2?
>>>> In this case is_guest_mode() is true, but the full NMI injection machinery
>>>> should be used.
>>>
>>> Gah, you're right, I got misled by a benign bug in nested_vmx_l1_wants_exit() and
>>> was thinking that NMIs always exit.  The "L1 wants" part should be conditioned on
>>> NMI exiting being enabled.  It's benign because KVM always wants "real" NMIs, and
>>> so the path is never encountered.
>>>
>>> @@ -5980,7 +6005,7 @@ static bool nested_vmx_l1_wants_exit(struct kvm_vcpu *vcpu,
>>>           switch ((u16)exit_reason.basic) {
>>>           case EXIT_REASON_EXCEPTION_NMI:
>>>                   intr_info = vmx_get_intr_info(vcpu);
>>> -               if (is_nmi(intr_info))
>>> +               if (is_nmi(intr_info) && nested_cpu_has_nmi_exiting(vmcs12))
>>>                           return true;
>>>                   else if (is_page_fault(intr_info))
>>>                           return true;
>>>
>>
>> I guess you mean "benign" when KVM is the L1 hypervisor, since other hypervisors may
>> let their L2s handle NMIs themselves.
> 
> No, this one's truly benign.  The nVMX exit processing is:
> 
> 	if (nested_vmx_l0_wants_exit())
> 		handle in L0 / KVM;
> 
> 	if (nested_vmx_l1_wants_exit())
> 		handle in L1
> 
> 	handle in L0 / KVM
> 
> Since this is for actual hardware NMIs, the first "L0 wants" check always returns
> true for NMIs, so the fact that KVM screws up L1's wants is a non-issue.

Got it, forgot the path was for actual hardware NMIs, which obviously
can't go directly to L1 or L2.

>>>>>> With the code in my previous patch set I planned to use
>>>>>> exit_during_event_injection() to detect such case, but if we implement
>>>>>> VMCB12 EVENTINJ parsing we can simply add a flag that the relevant event
>>>>>> comes from L1, so its normal injection side-effects should be skipped.
>>>>>
>>>>> Do we still need a flag based on the above?  Honest question... I've been staring
>>>>> at all this for the better part of an hour and may have lost track of things.
>>>>
>>>> If checking just is_guest_mode() is not enough due to reasons I described
>>>> above then we need to somehow determine in the NMI / IRQ injection handler
>>>> whether the event to be injected into L2 comes from L0 or L1.
>>>> For this (assuming we do VMCB12 EVENTINJ parsing) I think we need an extra flag.
>>>
>>> Yes :-(  And I believe the extra flag would need to be handled by KVM_{G,S}ET_VCPU_EVENTS.
>>>
>>
>> Another option for saving and restoring a VM would be to add it to
>> KVM_{GET,SET}_NESTED_STATE somewhere (maybe as a part of the saved VMCB12
>> control area?).
> 
> Ooh.  What if we keep nested_run_pending=true until the injection completes?  Then
> we don't even need an extra flag because nested_run_pending effectively says that
> any and all injected events are for L1=>L2.  In KVM_GET_NESTED_STATE, shove the
> to-be-injected event into the normal vmc*12 injection field, and ignore all
> to-be-injected events in KVM_GET_VCPU_EVENTS if nested_run_pending=true.
> 
> That should work even for migrating to an older KVM, as keeping nested_run_pending
> will cause the target to reprocess the event injection as if it were from nested
> VM-Enter, which it technically is.

I guess here by "ignore all to-be-injected events in KVM_GET_VCPU_EVENTS" you
mean *moving* back the L1 -> L2 event to be injected from KVM internal data
structures like arch.nmi_injected (and so on) to the KVM_GET_NESTED_STATE-returned
VMCB12 EVENTINJ field (or its VMX equivalent).

But then the VMM will need to first call KVM_GET_NESTED_STATE (which will do
the moving), only then KVM_GET_VCPU_EVENTS (which will then no longer show
these events as pending).
And their setters in the opposite order when restoring the VM.

Although, if the VMCB12 EVENTINJ field contents perfectly match things like
arch.nmi_injected, there should be no problem in principle if these events are
restored by the VMM to both places by KVM_SET_NESTED_STATE and KVM_SET_VCPU_EVENTS.

Another option would be to ignore either the first KVM_SET_VCPU_EVENTS after
KVM_SET_NESTED_STATE with KVM_STATE_NESTED_RUN_PENDING, or every such call
while nested_run_pending is still true (but wouldn't either of these changes
break the KVM API?).

I'm not sure, however, that there isn't some corner case lurking here :(

> We could probably get away with completely dropping the intermediate event as
> the vmc*12 should still have the original event, but that technically could result
> in architecturally incorrect behavior, e.g. if vectoring up to the point of
> interception sets A/D bits in the guest.

Thanks,
Maciej


* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-06 22:34                       ` Maciej S. Szmigiero
@ 2022-04-06 23:03                         ` Sean Christopherson
  2022-04-07 15:32                           ` Maciej S. Szmigiero
  0 siblings, 1 reply; 45+ messages in thread
From: Sean Christopherson @ 2022-04-06 23:03 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On Thu, Apr 07, 2022, Maciej S. Szmigiero wrote:
> On 6.04.2022 22:52, Sean Christopherson wrote:
> > On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
> > > Another option for saving and restoring a VM would be to add it to
> > > KVM_{GET,SET}_NESTED_STATE somewhere (maybe as a part of the saved VMCB12
> > > control area?).
> > 
> > Ooh.  What if we keep nested_run_pending=true until the injection completes?  Then
> > we don't even need an extra flag because nested_run_pending effectively says that
> > any and all injected events are for L1=>L2.  In KVM_GET_NESTED_STATE, shove the
> > to-be-injected event into the normal vmc*12 injection field, and ignore all
> > to-be-injected events in KVM_GET_VCPU_EVENTS if nested_run_pending=true.
> > 
> > That should work even for migrating to an older KVM, as keeping nested_run_pending
> > will cause the target to reprocess the event injection as if it were from nested
> > VM-Enter, which it technically is.
> 
> I guess here by "ignore all to-be-injected events in KVM_GET_VCPU_EVENTS" you
> mean *moving* back the L1 -> L2 event to be injected from KVM internal data
> structures like arch.nmi_injected (and so on) to the KVM_GET_NESTED_STATE-returned
> VMCB12 EVENTINJ field (or its VMX equivalent).
> 
> But then the VMM will need to first call KVM_GET_NESTED_STATE (which will do
> the moving), only then KVM_GET_VCPU_EVENTS (which will then no longer show
> these events as pending).
> And their setters in the opposite order when restoring the VM.

I wasn't thinking of actually moving things in the source VM, only ignoring events
in KVM_GET_VCPU_EVENTS.  Getting state shouldn't be destructive, e.g. the source VM
should still be able to continue running.

Ahahahaha, and actually looking at the code, there's this gem in KVM_GET_VCPU_EVENTS

	/*
	 * The API doesn't provide the instruction length for software
	 * exceptions, so don't report them. As long as the guest RIP
	 * isn't advanced, we should expect to encounter the exception
	 * again.
	 */
	if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
		events->exception.injected = 0;
		events->exception.pending = 0;
	}

and again for soft interrupts

	events->interrupt.injected =
		vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;

so through KVM's own incompetency, it's already doing half the work.

This is roughly what I had in mind.  It will "require" moving nested_run_pending
to kvm_vcpu_arch, but I've been itching for an excuse to do that anyways.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eb71727acecb..62c48f6a0815 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4846,6 +4846,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
 static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
                                               struct kvm_vcpu_events *events)
 {
+       bool drop_injected_events = vcpu->arch.nested_run_pending;
+
        process_nmi(vcpu);

        if (kvm_check_request(KVM_REQ_SMI, vcpu))
@@ -4872,7 +4874,8 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
         * isn't advanced, we should expect to encounter the exception
         * again.
         */
-       if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
+       if (drop_injected_events ||
+           kvm_exception_is_soft(vcpu->arch.exception.nr)) {
                events->exception.injected = 0;
                events->exception.pending = 0;
        } else {
@@ -4893,13 +4896,14 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
        events->exception_has_payload = vcpu->arch.exception.has_payload;
        events->exception_payload = vcpu->arch.exception.payload;

-       events->interrupt.injected =
-               vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
+       events->interrupt.injected = vcpu->arch.interrupt.injected &&
+                                    !vcpu->arch.interrupt.soft &&
+                                    !drop_injected_events;
        events->interrupt.nr = vcpu->arch.interrupt.nr;
        events->interrupt.soft = 0;
        events->interrupt.shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);

-       events->nmi.injected = vcpu->arch.nmi_injected;
+       events->nmi.injected = vcpu->arch.nmi_injected && !drop_injected_events;
        events->nmi.pending = vcpu->arch.nmi_pending != 0;
        events->nmi.masked = static_call(kvm_x86_get_nmi_mask)(vcpu);
        events->nmi.pad = 0;



* Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
  2022-04-06 23:03                         ` Sean Christopherson
@ 2022-04-07 15:32                           ` Maciej S. Szmigiero
  0 siblings, 0 replies; 45+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-07 15:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On 7.04.2022 01:03, Sean Christopherson wrote:
> On Thu, Apr 07, 2022, Maciej S. Szmigiero wrote:
>> On 6.04.2022 22:52, Sean Christopherson wrote:
>>> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>>>> Another option for saving and restoring a VM would be to add it to
>>>> KVM_{GET,SET}_NESTED_STATE somewhere (maybe as a part of the saved VMCB12
>>>> control area?).
>>>
>>> Ooh.  What if we keep nested_run_pending=true until the injection completes?  Then
>>> we don't even need an extra flag because nested_run_pending effectively says that
>>> any and all injected events are for L1=>L2.  In KVM_GET_NESTED_STATE, shove the
>>> to-be-injected event into the normal vmc*12 injection field, and ignore all
>>> to-be-injected events in KVM_GET_VCPU_EVENTS if nested_run_pending=true.
>>>
>>> That should work even for migrating to an older KVM, as keeping nested_run_pending
>>> will cause the target to reprocess the event injection as if it were from nested
>>> VM-Enter, which it technically is.
>>
>> I guess here by "ignore all to-be-injected events in KVM_GET_VCPU_EVENTS" you
>> mean *moving* back the L1 -> L2 event to be injected from KVM internal data
>> structures like arch.nmi_injected (and so on) to the KVM_GET_NESTED_STATE-returned
>> VMCB12 EVENTINJ field (or its VMX equivalent).
>>
>> But then the VMM will need to first call KVM_GET_NESTED_STATE (which will do
>> the moving), only then KVM_GET_VCPU_EVENTS (which will then no longer show
>> these events as pending).
>> And their setters in the opposite order when restoring the VM.
> 
> I wasn't thinking of actually moving things in the source VM, only ignoring events
> in KVM_GET_VCPU_EVENTS.  Getting state shouldn't be destructive, e.g. the source VM
> should still be able to continue running.

Right, getters should not change state.

> Ahahahaha, and actually looking at the code, there's this gem in KVM_GET_VCPU_EVENTS
> 
> 	/*
> 	 * The API doesn't provide the instruction length for software
> 	 * exceptions, so don't report them. As long as the guest RIP
> 	 * isn't advanced, we should expect to encounter the exception
> 	 * again.
> 	 */
> 	if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
> 		events->exception.injected = 0;
> 		events->exception.pending = 0;
> 	}
> 
> and again for soft interrupts
> 
> 	events->interrupt.injected =
> 		vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
> 
> so through KVM's own incompetency, it's already doing half the work.

So KVM_GET_VCPU_EVENTS was basically already explicitly broken for this
case (where RIP does not point to an INT3/INTO/INT x instruction).

> This is roughly what I had in mind.  It will "require" moving nested_run_pending
> to kvm_vcpu_arch, but I've been itching for an excuse to do that anyways.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index eb71727acecb..62c48f6a0815 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4846,6 +4846,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
>   static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>                                                 struct kvm_vcpu_events *events)
>   {
> +       bool drop_injected_events = vcpu->arch.nested_run_pending;
> +
>          process_nmi(vcpu);
> 
>          if (kvm_check_request(KVM_REQ_SMI, vcpu))
> @@ -4872,7 +4874,8 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>           * isn't advanced, we should expect to encounter the exception
>           * again.
>           */
> -       if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
> +       if (drop_injected_events ||
> +           kvm_exception_is_soft(vcpu->arch.exception.nr)) {
>                  events->exception.injected = 0;
>                  events->exception.pending = 0;
>          } else {
> @@ -4893,13 +4896,14 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>          events->exception_has_payload = vcpu->arch.exception.has_payload;
>          events->exception_payload = vcpu->arch.exception.payload;
> 
> -       events->interrupt.injected =
> -               vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
> +       events->interrupt.injected = vcpu->arch.interrupt.injected &&
> +                                    !vcpu->arch.interrupt.soft &&
> +                                    !drop_injected_events;
>          events->interrupt.nr = vcpu->arch.interrupt.nr;
>          events->interrupt.soft = 0;
>          events->interrupt.shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
> 
> -       events->nmi.injected = vcpu->arch.nmi_injected;
> +       events->nmi.injected = vcpu->arch.nmi_injected && !drop_injected_events;
>          events->nmi.pending = vcpu->arch.nmi_pending != 0;
>          events->nmi.masked = static_call(kvm_x86_get_nmi_mask)(vcpu);
>          events->nmi.pad = 0;
> 

So the VMM will get a VMCB12 with the EVENTINJ field filled with the event
to re-inject from KVM_GET_NESTED_STATE, and events.{exception,interrupt,nmi}.injected
unset from KVM_GET_VCPU_EVENTS.

Let's say now that the VMM uses this data to restore a VM: it restores nested
state by using KVM_SET_NESTED_STATE and then events by calling KVM_SET_VCPU_EVENTS.

So it ends up with the VMCB12 EVENTINJ field filled, but the KVM injection
structures (arch.{exception,interrupt,nmi}.injected) zeroed by that later
KVM_SET_VCPU_EVENTS call.

Assuming that L1 -> L2 event injection is always based on the KVM injection
structures (like we discussed earlier), rather than on a direct copy of the
EVENTINJ field as it is now, KVM will need to re-parse the VMCB12 EVENTINJ
field into arch.{exception,interrupt,nmi}.injected before the first VM entry
to make this work properly.
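
To illustrate, a minimal sketch of that save/restore flow from the VMM
side (hypothetical wrapper struct and helper names, error handling and
exact buffer sizing elided; only the ioctls and structs are the real
<linux/kvm.h> ones):

#include <linux/kvm.h>
#include <sys/ioctl.h>

struct vcpu_snapshot {
	struct {
		struct kvm_nested_state state;
		char data[8192];	/* hypothetical upper bound on state.size */
	} nested;
	struct kvm_vcpu_events events;
};

static void save_vcpu(int vcpu_fd, struct vcpu_snapshot *s)
{
	s->nested.state.size = sizeof(s->nested);
	/* Both getters are non-destructive.  The event to re-inject comes
	 * back in the nested state (EVENTINJ), while KVM_GET_VCPU_EVENTS
	 * reports the corresponding .injected flags as clear. */
	ioctl(vcpu_fd, KVM_GET_NESTED_STATE, &s->nested);
	ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &s->events);
}

static void restore_vcpu(int vcpu_fd, struct vcpu_snapshot *s)
{
	ioctl(vcpu_fd, KVM_SET_NESTED_STATE, &s->nested);
	/* Zeroes arch.{exception,interrupt,nmi}.injected again, hence KVM
	 * must re-parse VMCB12 EVENTINJ on the first nested VM entry. */
	ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &s->events);
}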

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-04-04 17:21     ` Sean Christopherson
  2022-04-04 17:45       ` Maciej S. Szmigiero
@ 2022-04-20 15:00       ` Paolo Bonzini
  2022-04-20 15:05         ` Maciej S. Szmigiero
  1 sibling, 1 reply; 45+ messages in thread
From: Paolo Bonzini @ 2022-04-20 15:00 UTC (permalink / raw)
  To: Sean Christopherson, Maciej S. Szmigiero
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel

On 4/4/22 19:21, Sean Christopherson wrote:
> On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
>>> @@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>>>    	nested_copy_vmcb_control_to_cache(svm, ctl);
>>>    	svm_switch_vmcb(svm, &svm->nested.vmcb02);
>>> -	nested_vmcb02_prepare_control(svm);
>>> +	nested_vmcb02_prepare_control(svm, save->rip);
>>
>> 					   ^
>> I guess this should be "svm->vmcb->save.rip", since
>> KVM_{GET,SET}_NESTED_STATE "save" field contains vmcb01 data,
>> not vmcb{0,1}2 (in contrast to the "control" field).
> 
> Argh, yes.  Is userspace required to set L2 guest state prior to KVM_SET_NESTED_STATE?
> If not, this will result in garbage being loaded into vmcb02.
> 

Let's just require X86_FEATURE_NRIPS, either in general or just to
enable nested virtualization, i.e.:

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index fc1725b7d05f..f8fc8a1b09f1 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4904,10 +4904,12 @@ static __init int svm_hardware_setup(void)
  			goto err;
  	}
  
-	if (nrips) {
-		if (!boot_cpu_has(X86_FEATURE_NRIPS))
-			nrips = false;
-	}
+	if (!boot_cpu_has(X86_FEATURE_NRIPS))
+		nrips = false;
+	if (nested && !nrips) {
+		pr_warn("Next RIP Save not available, disabling nested virtualization\n");
+		nested = false;
+	}
  
  	enable_apicv = avic = avic && npt_enabled && (boot_cpu_has(X86_FEATURE_AVIC) || force_avic);
  

If I looked it up correctly, it was introduced around 2010-2011.
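
For reference, the bit lives in the SVM feature leaf; a minimal userspace
check (a sketch assuming GCC's <cpuid.h>; per my reading of the APM, NRIPS
is CPUID Fn8000_000Ah EDX bit 3):

#include <cpuid.h>
#include <stdbool.h>

static bool cpu_has_nrips(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid() returns 0 if the leaf is not supported. */
	if (!__get_cpuid(0x8000000a, &eax, &ebx, &ecx, &edx))
		return false;
	return edx & (1u << 3);	/* NRIPS: next sequential RIP save */
}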

Paolo

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/8] KVM: SVM: Unwind "speculative" RIP advancement if INTn injection "fails"
  2022-04-02  1:08 ` [PATCH 3/8] KVM: SVM: Unwind "speculative" RIP advancement if INTn injection "fails" Sean Christopherson
  2022-04-04 10:03   ` Maxim Levitsky
@ 2022-04-20 15:01   ` Paolo Bonzini
  1 sibling, 0 replies; 45+ messages in thread
From: Paolo Bonzini @ 2022-04-20 15:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maciej S . Szmigiero

On 4/2/22 03:08, Sean Christopherson wrote:
>   		 * but re-execute the instruction instead. Rewind RIP first
>   		 * if we emulated INT3 before.
>   		 */
> -		if (kvm_exception_is_soft(vector)) {
> -			if (vector == BP_VECTOR && int3_injected &&
> -			    kvm_is_linear_rip(vcpu, svm->int3_rip))
> -				kvm_rip_write(vcpu,
> -					      kvm_rip_read(vcpu) - int3_injected);
> +		if (kvm_exception_is_soft(vector))
>   			break;
> -		}

Stale comment.

Paolo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-04-20 15:00       ` Paolo Bonzini
@ 2022-04-20 15:05         ` Maciej S. Szmigiero
  2022-04-20 16:15           ` Sean Christopherson
  0 siblings, 1 reply; 45+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-20 15:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Sean Christopherson

On 20.04.2022 17:00, Paolo Bonzini wrote:
> On 4/4/22 19:21, Sean Christopherson wrote:
>> On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
>>>> @@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>>>>        nested_copy_vmcb_control_to_cache(svm, ctl);
>>>>        svm_switch_vmcb(svm, &svm->nested.vmcb02);
>>>> -    nested_vmcb02_prepare_control(svm);
>>>> +    nested_vmcb02_prepare_control(svm, save->rip);
>>>
>>>                        ^
>>> I guess this should be "svm->vmcb->save.rip", since
>>> KVM_{GET,SET}_NESTED_STATE "save" field contains vmcb01 data,
>>> not vmcb{0,1}2 (in contrast to the "control" field).
>>
>> Argh, yes.  Is userspace required to set L2 guest state prior to KVM_SET_NESTED_STATE?
>> If not, this will result in garbage being loaded into vmcb02.
>>
> 
> Let's just require X86_FEATURE_NRIPS, either in general or just to
> enable nested virtualization

👍

> 
> If I looked it up correctly it was introduced around 2010-2011.

A quick Internet search showed that the first CPUs with NextRIP were
the second-generation Family 10h CPUs (Phenom II, Athlon II, etc.).
They started being released in early 2009.

> Paolo

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-04-20 15:05         ` Maciej S. Szmigiero
@ 2022-04-20 16:15           ` Sean Christopherson
  2022-04-20 16:33             ` Paolo Bonzini
  0 siblings, 1 reply; 45+ messages in thread
From: Sean Christopherson @ 2022-04-20 16:15 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

On Wed, Apr 20, 2022, Maciej S. Szmigiero wrote:
> On 20.04.2022 17:00, Paolo Bonzini wrote:
> > On 4/4/22 19:21, Sean Christopherson wrote:
> > > On Mon, Apr 04, 2022, Maciej S. Szmigiero wrote:
> > > > > @@ -1606,7 +1622,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
> > > > >        nested_copy_vmcb_control_to_cache(svm, ctl);
> > > > >        svm_switch_vmcb(svm, &svm->nested.vmcb02);
> > > > > -    nested_vmcb02_prepare_control(svm);
> > > > > +    nested_vmcb02_prepare_control(svm, save->rip);
> > > > 
> > > >                        ^
> > > > I guess this should be "svm->vmcb->save.rip", since
> > > > KVM_{GET,SET}_NESTED_STATE "save" field contains vmcb01 data,
> > > > not vmcb{0,1}2 (in contrast to the "control" field).
> > > 
> > > Argh, yes.  Is userspace required to set L2 guest state prior to KVM_SET_NESTED_STATE?
> > > If not, this will result in garbage being loaded into vmcb02.
> > > 
> > 
> > Let's just require X86_FEATURE_NRIPS, either in general or just to
> > enable nested virtualization
> 
> 👍

Hmm, so requiring NRIPS for nested doesn't actually buy us anything.  KVM still
has to deal with userspace hiding NRIPS from L1, so unless I'm overlooking something,
the only change would be:

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index bdf8375a718b..7bed4e05aaea 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -686,7 +686,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
         */
        if (svm->nrips_enabled)
                vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
-       else if (boot_cpu_has(X86_FEATURE_NRIPS))
+       else
                vmcb02->control.next_rip    = vmcb12_rip;

        if (is_evtinj_soft(vmcb02->control.event_inj)) {

And sadly, because SVM doesn't provide the instruction length if an exit occurs
while vectoring a software interrupt/exception, making NRIPS mandatory doesn't buy
us much either.

I believe the below diff is the total savings (plus the above nested thing) against
this series if NRIPS is mandatory (ignoring the setup code, which is a wash).  It
does eliminate the rewind in svm_complete_soft_interrupt() and the funky logic in
svm_update_soft_interrupt_rip(), but that's it AFAICT.  The most obnoxious code of
having to unwind EMULTYPE_SKIP when retrieving the next RIP for software int/except
injection doesn't go away :-(

I'm not totally opposed to requiring NRIPS, but I'm not in favor of it either.

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 66cfd533aaf8..6b48af423246 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -354,7 +354,7 @@ static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
        if (sev_es_guest(vcpu->kvm))
                goto done;

-       if (nrips && svm->vmcb->control.next_rip != 0) {
+       if (svm->vmcb->control.next_rip != 0) {
                WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS));
                svm->next_rip = svm->vmcb->control.next_rip;
        }
@@ -401,7 +401,7 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
         * in use, the skip must not commit any side effects such as clearing
         * the interrupt shadow or RFLAGS.RF.
         */
-       if (!__svm_skip_emulated_instruction(vcpu, !nrips))
+       if (!__svm_skip_emulated_instruction(vcpu, false))
                return -EIO;

        rip = kvm_rip_read(vcpu);
@@ -420,11 +420,8 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
        svm->soft_int_old_rip = old_rip;
        svm->soft_int_next_rip = rip;

-       if (nrips)
-               kvm_rip_write(vcpu, old_rip);
-
-       if (static_cpu_has(X86_FEATURE_NRIPS))
-               svm->vmcb->control.next_rip = rip;
+       kvm_rip_write(vcpu, old_rip);
+       svm->vmcb->control.next_rip = rip;

        return 0;
 }
@@ -3738,20 +3735,9 @@ static void svm_complete_soft_interrupt(struct kvm_vcpu *vcpu, u8 vector,
         * the same event, i.e. if the event is a soft exception/interrupt,
         * otherwise next_rip is unused on VMRUN.
         */
-       if (nrips && (is_soft || (is_exception && kvm_exception_is_soft(vector))) &&
+       if ((is_soft || (is_exception && kvm_exception_is_soft(vector))) &&
            kvm_is_linear_rip(vcpu, svm->soft_int_old_rip + svm->soft_int_csbase))
                svm->vmcb->control.next_rip = svm->soft_int_next_rip;
-       /*
-        * If NextRIP isn't enabled, KVM must manually advance RIP prior to
-        * injecting the soft exception/interrupt.  That advancement needs to
-        * be unwound if vectoring didn't complete.  Note, the new event may
-        * not be the injected event, e.g. if KVM injected an INTn, the INTn
-        * hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
-        * be the reported vectored event, but RIP still needs to be unwound.
-        */
-       else if (!nrips && (is_soft || is_exception) &&
-                kvm_is_linear_rip(vcpu, svm->soft_int_next_rip + svm->soft_int_csbase))
-               kvm_rip_write(vcpu, svm->soft_int_old_rip);
 }

 static void svm_complete_interrupts(struct kvm_vcpu *vcpu)


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-04-20 16:15           ` Sean Christopherson
@ 2022-04-20 16:33             ` Paolo Bonzini
  2022-04-20 16:44               ` Sean Christopherson
  0 siblings, 1 reply; 45+ messages in thread
From: Paolo Bonzini @ 2022-04-20 16:33 UTC (permalink / raw)
  To: Sean Christopherson, Maciej S. Szmigiero
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel

On 4/20/22 18:15, Sean Christopherson wrote:
>>> Let's just require X86_FEATURE_NRIPS, either in general or just to
>>> enable nested virtualization
>> 👍
> Hmm, so requiring NRIPS for nested doesn't actually buy us anything.  KVM still
> has to deal with userspace hiding NRIPS from L1, so unless I'm overlooking something,
> the only change would be:
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index bdf8375a718b..7bed4e05aaea 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -686,7 +686,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
>           */
>          if (svm->nrips_enabled)
>                  vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
> -       else if (boot_cpu_has(X86_FEATURE_NRIPS))
> +       else
>                  vmcb02->control.next_rip    = vmcb12_rip;
> 
>          if (is_evtinj_soft(vmcb02->control.event_inj)) {
> 
> And sadly, because SVM doesn't provide the instruction length if an exit occurs
> while vectoring a software interrupt/exception, making NRIPS mandatory doesn't buy
> us much either.
> 
> I believe the below diff is the total savings (plus the above nested thing) against
> this series if NRIPS is mandatory (ignoring the setup code, which is a wash).  It
> does eliminate the rewind in svm_complete_soft_interrupt() and the funky logic in
> svm_update_soft_interrupt_rip(), but that's it AFAICT.  The most obnoxious code of
> having to unwind EMULTYPE_SKIP when retrieving the next RIP for software int/except
> injection doesn't go away :-(
> 
> I'm not totally opposed to requiring NRIPS, but I'm not in favor of it either.

Yeah, you're right.  However:

* the rewind might already be worth it;

* if we require NRIPS for nested, we can also assume that the SVM save 
state data has a valid next_rip, even if !svm->nrips_enabled.  There's 
the pesky issue of restoring from an old system that did not have NRIPS, 
but let's assume for now that NRIPS was set on the source as well.
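
Under that assumption, KVM_SET_NESTED_STATE could in principle sanity-check
the incoming control area, e.g. (untested sketch, reusing the
is_evtinj_soft() helper from this series):

	/* In svm_set_nested_state(): a soft event to re-inject with no
	 * next_rip suggests the source didn't have/expose NRIPS. */
	if (is_evtinj_soft(ctl->event_inj) && !ctl->next_rip)
		return -EINVAL;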

Paolo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-04-20 16:33             ` Paolo Bonzini
@ 2022-04-20 16:44               ` Sean Christopherson
  0 siblings, 0 replies; 45+ messages in thread
From: Sean Christopherson @ 2022-04-20 16:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Maciej S. Szmigiero, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

On Wed, Apr 20, 2022, Paolo Bonzini wrote:
> On 4/20/22 18:15, Sean Christopherson wrote:
> > > > Let's just require X86_FEATURE_NRIPS, either in general or just to
> > > > enable nested virtualization
> > > 👍
> > Hmm, so requiring NRIPS for nested doesn't actually buy us anything.  KVM still
> > has to deal with userspace hiding NRIPS from L1, so unless I'm overlooking something,
> > the only change would be:
> > 
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index bdf8375a718b..7bed4e05aaea 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -686,7 +686,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
> >           */
> >          if (svm->nrips_enabled)
> >                  vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
> > -       else if (boot_cpu_has(X86_FEATURE_NRIPS))
> > +       else
> >                  vmcb02->control.next_rip    = vmcb12_rip;
> > 
> >          if (is_evtinj_soft(vmcb02->control.event_inj)) {
> > 
> > And sadly, because SVM doesn't provide the instruction length if an exit occurs
> > while vectoring a software interrupt/exception, making NRIPS mandatory doesn't buy
> > us much either.
> > 
> > I believe the below diff is the total savings (plus the above nested thing) against
> > this series if NRIPS is mandatory (ignoring the setup code, which is a wash).  It
> > does eliminate the rewind in svm_complete_soft_interrupt() and the funky logic in
> > svm_update_soft_interrupt_rip(), but that's it AFAICT.  The most obnoxious code of
> > having to unwind EMULTYPE_SKIP when retrieving the next RIP for software int/except
> > injection doesn't go away :-(
> > 
> > I'm not totally opposed to requiring NRIPS, but I'm not in favor of it either.
> 
> Yeah, you're right.  However:
> 
> * the rewind might already be worth it;

FWIW, I don't actually care about supporting nrips=false; it's the ability to test
EMULTYPE_SKIP that I find valuable.  I also find the extra perspective of how RIP
interacts with software interrupts/exceptions to be very helpful/educational, though
that's of debatable value going forward.

> * if we require NRIPS for nested, we can also assume that the SVM save state
> data has a valid next_rip; even if !svm->nrips_enabled.  There's the pesky
> issue of restoring from an old system that did not have NRIPS, but let's
> assume for now that NRIPS was set on the source as well.

How about I throw a Not-signed-off-by patch on the end of the series to make NRIPS
mandatory, so that we (in theory) have a fully functional snapshot for nrips=false
if we want to go back in time.  And we probably need to give a deprecation grace
period, i.e. wait a release or two before disappearing nrips?
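
E.g. something like this for the grace period (rough sketch, not tested):

	/* Keep accepting the module param for now, but warn that forcing
	 * nrips off is deprecated and slated for removal. */
	static int nrips = true;
	module_param(nrips, int, 0444);
	...
	if (!boot_cpu_has(X86_FEATURE_NRIPS))
		nrips = false;
	else if (!nrips)
		pr_warn("nrips=0 is deprecated and will be removed in a future release\n");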

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2022-04-20 16:44 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-02  1:08 [PATCH 0/8] KVM: SVM: Fix soft int/ex re-injection Sean Christopherson
2022-04-02  1:08 ` [PATCH 1/8] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 Sean Christopherson
2022-04-04  9:54   ` Maxim Levitsky
2022-04-04 16:50   ` Maciej S. Szmigiero
2022-04-04 17:21     ` Sean Christopherson
2022-04-04 17:45       ` Maciej S. Szmigiero
2022-04-20 15:00       ` Paolo Bonzini
2022-04-20 15:05         ` Maciej S. Szmigiero
2022-04-20 16:15           ` Sean Christopherson
2022-04-20 16:33             ` Paolo Bonzini
2022-04-20 16:44               ` Sean Christopherson
2022-04-02  1:08 ` [PATCH 2/8] KVM: SVM: Downgrade BUG_ON() to WARN_ON() in svm_inject_irq() Sean Christopherson
2022-04-02  1:08 ` [PATCH 3/8] KVM: SVM: Unwind "speculative" RIP advancement if INTn injection "fails" Sean Christopherson
2022-04-04 10:03   ` Maxim Levitsky
2022-04-20 15:01   ` Paolo Bonzini
2022-04-02  1:08 ` [PATCH 4/8] KVM: SVM: Stuff next_rip on emualted INT3 injection if NRIPS is supported Sean Christopherson
2022-04-04 12:00   ` Maxim Levitsky
2022-04-02  1:09 ` [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction Sean Christopherson
2022-04-04 12:12   ` Maxim Levitsky
2022-04-04 16:49     ` Sean Christopherson
2022-04-04 16:53       ` Maciej S. Szmigiero
2022-04-04 19:33         ` Sean Christopherson
2022-04-04 19:50           ` Maciej S. Szmigiero
2022-04-04 19:54           ` Sean Christopherson
2022-04-04 20:46             ` Maciej S. Szmigiero
2022-04-04 20:44       ` Maciej S. Szmigiero
2022-04-06  1:48         ` Sean Christopherson
2022-04-06 13:13           ` Maciej S. Szmigiero
2022-04-06 17:10             ` Sean Christopherson
2022-04-06 19:08               ` Maciej S. Szmigiero
2022-04-06 19:48                 ` Sean Christopherson
2022-04-06 20:30                   ` Maciej S. Szmigiero
2022-04-06 20:52                     ` Sean Christopherson
2022-04-06 22:34                       ` Maciej S. Szmigiero
2022-04-06 23:03                         ` Sean Christopherson
2022-04-07 15:32                           ` Maciej S. Szmigiero
2022-04-02  1:09 ` [PATCH 6/8] KVM: SVM: Re-inject INTn instead of retrying the insn on "failure" Sean Christopherson
2022-04-04 17:14   ` Sean Christopherson
2022-04-04 20:27   ` Maciej S. Szmigiero
2022-04-02  1:09 ` [PATCH 7/8] KVM: x86: Trace re-injected exceptions Sean Christopherson
2022-04-04 12:14   ` Maxim Levitsky
2022-04-04 16:14     ` Sean Christopherson
2022-04-02  1:09 ` [PATCH 8/8] KVM: selftests: nSVM: Add svm_nested_soft_inject_test Sean Christopherson
2022-04-04 12:27   ` Maxim Levitsky
2022-04-04 16:59     ` Maciej S. Szmigiero
