* [PATCH 0/5] nSVM: L1 -> L2 event injection fixes and a self-test
@ 2022-03-10 21:38 Maciej S. Szmigiero
  2022-03-10 21:38 ` [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 Maciej S. Szmigiero
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Maciej S. Szmigiero @ 2022-03-10 21:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

There are some issues with respect to nSVM L1 -> L2 event injection.

First, the next_rip field of a VMCB is *not* an output-only field for a VMRUN.
This field's value (instead of the saved guest RIP) is used by the CPU for
the return address pushed on the stack when injecting a software interrupt
or an INT3 or INTO exception (this was confirmed by AMD).

On a VMRUN that performs event injection it serves a similar function to
VMX's VM_ENTRY_INSTRUCTION_LEN field, although, in contrast to VMX, it
holds an absolute RIP value, not a relative increment.

However, KVM seems to treat this field as a unidirectional hint from the
CPU to the hypervisor - there seems to be no specific effort to keep this
field consistent for such a VMRUN.

This is mostly visible when running a nested guest, with L1 trying to
inject an event into its L2.
In this case, we need to make sure the next_rip field gets synced from
vmcb12 to vmcb02.

Another issue is that pending L1 -> L2 events are forgotten if there is an
intervening L0 VMEXIT during their delivery.
We need to make sure they are remembered (including their desired next_rip
field value) until they are either re-injected into L2 successfully or
returned to L1 in the EXITINTINFO field upon a nested VMEXIT.

A new KVM self-test that checks for the nSVM issues described above is
included in this patch series.

These issues are SVM-specific - all the use cases described above already
work correctly with VMX.

This patch set was tested with both Linux and Windows nested guests.

  KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  KVM: SVM: Downgrade BUG_ON() to WARN_ON() in svm_inject_irq()
  KVM: nSVM: Don't forget about L1-injected events
  KVM: nSVM: Restore next_rip when doing L1 -> L2 event re-injection
  KVM: selftests: nSVM: Add svm_nested_soft_inject_test

 arch/x86/kvm/svm/nested.c                     |  69 +++++++-
 arch/x86/kvm/svm/svm.c                        |  60 ++++++-
 arch/x86/kvm/svm/svm.h                        |  48 ++++++
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/svm_util.h   |   2 +
 .../kvm/x86_64/svm_nested_soft_inject_test.c  | 147 ++++++++++++++++++
 7 files changed, 324 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-03-10 21:38 [PATCH 0/5] nSVM: L1 -> L2 event injection fixes and a self-test Maciej S. Szmigiero
@ 2022-03-10 21:38 ` Maciej S. Szmigiero
  2022-04-01 18:32   ` Sean Christopherson
  2022-03-10 21:38 ` [PATCH 2/5] KVM: SVM: Downgrade BUG_ON() to WARN_ON() in svm_inject_irq() Maciej S. Szmigiero
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Maciej S. Szmigiero @ 2022-03-10 21:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
This field's value (instead of the saved guest RIP) is used by the CPU for
the return address pushed on the stack when injecting a software interrupt
or an INT3 or INTO exception.

Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
loading a nested state.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 arch/x86/kvm/svm/nested.c | 4 ++++
 arch/x86/kvm/svm/svm.h    | 1 +
 2 files changed, 5 insertions(+)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index d736ec6514ca..9656f0d6815c 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -366,6 +366,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
 	to->nested_ctl          = from->nested_ctl;
 	to->event_inj           = from->event_inj;
 	to->event_inj_err       = from->event_inj_err;
+	to->next_rip            = from->next_rip;
 	to->nested_cr3          = from->nested_cr3;
 	to->virt_ext            = from->virt_ext;
 	to->pause_filter_count  = from->pause_filter_count;
@@ -638,6 +639,8 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
 	svm->vmcb->control.int_state           = svm->nested.ctl.int_state;
 	svm->vmcb->control.event_inj           = svm->nested.ctl.event_inj;
 	svm->vmcb->control.event_inj_err       = svm->nested.ctl.event_inj_err;
+	/* The return address pushed on stack by the CPU for some injected events */
+	svm->vmcb->control.next_rip            = svm->nested.ctl.next_rip;
 
 	if (!nested_vmcb_needs_vls_intercept(svm))
 		svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
@@ -1348,6 +1351,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
 	dst->nested_ctl           = from->nested_ctl;
 	dst->event_inj            = from->event_inj;
 	dst->event_inj_err        = from->event_inj_err;
+	dst->next_rip             = from->next_rip;
 	dst->nested_cr3           = from->nested_cr3;
 	dst->virt_ext              = from->virt_ext;
 	dst->pause_filter_count   = from->pause_filter_count;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 93502d2a52ce..f757400fc933 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -138,6 +138,7 @@ struct vmcb_ctrl_area_cached {
 	u64 nested_ctl;
 	u32 event_inj;
 	u32 event_inj_err;
+	u64 next_rip;
 	u64 nested_cr3;
 	u64 virt_ext;
 	u32 clean;


* [PATCH 2/5] KVM: SVM: Downgrade BUG_ON() to WARN_ON() in svm_inject_irq()
  2022-03-10 21:38 [PATCH 0/5] nSVM: L1 -> L2 event injection fixes and a self-test Maciej S. Szmigiero
  2022-03-10 21:38 ` [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 Maciej S. Szmigiero
@ 2022-03-10 21:38 ` Maciej S. Szmigiero
  2022-04-04  9:50   ` Maxim Levitsky
  2022-03-10 21:38 ` [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events Maciej S. Szmigiero
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Maciej S. Szmigiero @ 2022-03-10 21:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

There is no need to bring down the whole host just because there might be
some issue with respect to guest GIF handling in KVM.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 arch/x86/kvm/svm/svm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b069493ad5c7..1e5d904aeec3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3322,7 +3322,7 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
-	BUG_ON(!(gif_set(svm)));
+	WARN_ON(!gif_set(svm));
 
 	trace_kvm_inj_virq(vcpu->arch.interrupt.nr);
 	++vcpu->stat.irq_injections;


* [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events
  2022-03-10 21:38 [PATCH 0/5] nSVM: L1 -> L2 event injection fixes and a self-test Maciej S. Szmigiero
  2022-03-10 21:38 ` [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 Maciej S. Szmigiero
  2022-03-10 21:38 ` [PATCH 2/5] KVM: SVM: Downgrade BUG_ON() to WARN_ON() in svm_inject_irq() Maciej S. Szmigiero
@ 2022-03-10 21:38 ` Maciej S. Szmigiero
  2022-03-30 21:59   ` Sean Christopherson
  2022-04-04  9:53   ` Maxim Levitsky
  2022-03-10 21:38 ` [PATCH 4/5] KVM: nSVM: Restore next_rip when doing L1 -> L2 event re-injection Maciej S. Szmigiero
  2022-03-10 21:38 ` [PATCH 5/5] KVM: selftests: nSVM: Add svm_nested_soft_inject_test Maciej S. Szmigiero
  4 siblings, 2 replies; 20+ messages in thread
From: Maciej S. Szmigiero @ 2022-03-10 21:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

On SVM, synthetic software interrupts or INT3 or INTO exceptions that L1
wants to inject into its L2 guest are forgotten if there is an intervening
L0 VMEXIT during their delivery.

They are re-injected correctly with VMX, however.

This is because SVM code assumes that such exceptions will be re-delivered
by simply re-executing the current instruction.
That assumption does not hold for a synthetic exception injected by L1,
since in that case the re-executed instruction will be one already in L2,
not the VMRUN instruction in L1 that attempted the injection.

Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err} until
it is either re-injected successfully or returned to L1 upon a nested
VMEXIT.
Make sure to always re-queue such an event if it is returned in EXITINTINFO.

The handling of L0 -> {L1, L2} event re-injection is left as-is to avoid
unforeseen regressions.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 arch/x86/kvm/svm/nested.c | 65 +++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/svm/svm.c    | 17 ++++++++--
 arch/x86/kvm/svm/svm.h    | 47 ++++++++++++++++++++++++++++
 3 files changed, 125 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 9656f0d6815c..75017bf77955 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -420,8 +420,17 @@ void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
 void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
 {
 	u32 mask;
-	svm->nested.ctl.event_inj      = svm->vmcb->control.event_inj;
-	svm->nested.ctl.event_inj_err  = svm->vmcb->control.event_inj_err;
+
+	/*
+	 * Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err}
+	 * if its re-injection is needed
+	 */
+	if (!exit_during_event_injection(svm, svm->nested.ctl.event_inj,
+					 svm->nested.ctl.event_inj_err)) {
+		WARN_ON_ONCE(svm->vmcb->control.event_inj & SVM_EVTINJ_VALID);
+		svm->nested.ctl.event_inj      = svm->vmcb->control.event_inj;
+		svm->nested.ctl.event_inj_err  = svm->vmcb->control.event_inj_err;
+	}
 
 	/* Only a few fields of int_ctl are written by the processor.  */
 	mask = V_IRQ_MASK | V_TPR_MASK;
@@ -669,6 +678,54 @@ static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to
 	to_vmcb->save.spec_ctrl = from_vmcb->save.spec_ctrl;
 }
 
+void nested_svm_maybe_reinject(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	unsigned int vector, type;
+	u32 exitintinfo = svm->vmcb->control.exit_int_info;
+
+	if (WARN_ON_ONCE(!is_guest_mode(vcpu)))
+		return;
+
+	/*
+	 * No L1 -> L2 event to re-inject?
+	 *
+	 * In this case event_inj will be cleared by
+	 * nested_sync_control_from_vmcb02().
+	 */
+	if (!(svm->nested.ctl.event_inj & SVM_EVTINJ_VALID))
+		return;
+
+	/* If the last event injection was successful there shouldn't be any pending event */
+	if (WARN_ON_ONCE(!(exitintinfo & SVM_EXITINTINFO_VALID)))
+		return;
+
+	kvm_make_request(KVM_REQ_EVENT, vcpu);
+
+	vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
+	type = exitintinfo & SVM_EXITINTINFO_TYPE_MASK;
+
+	switch (type) {
+	case SVM_EXITINTINFO_TYPE_NMI:
+		vcpu->arch.nmi_injected = true;
+		break;
+	case SVM_EXITINTINFO_TYPE_EXEPT:
+		if (exitintinfo & SVM_EXITINTINFO_VALID_ERR)
+			kvm_requeue_exception_e(vcpu, vector,
+						svm->vmcb->control.exit_int_info_err);
+		else
+			kvm_requeue_exception(vcpu, vector);
+		break;
+	case SVM_EXITINTINFO_TYPE_SOFT:
+	case SVM_EXITINTINFO_TYPE_INTR:
+		kvm_queue_interrupt(vcpu, vector, type == SVM_EXITINTINFO_TYPE_SOFT);
+		break;
+	default:
+		vcpu_unimpl(vcpu, "unknown L1 -> L2 exitintinfo type 0x%x\n", type);
+		break;
+	}
+}
+
 int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
 			 struct vmcb *vmcb12, bool from_vmrun)
 {
@@ -898,6 +955,10 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 	if (svm->nrips_enabled)
 		vmcb12->control.next_rip  = vmcb->control.next_rip;
 
+	/* Forget about any pending L1 event injection since it's a L1 worry now */
+	svm->nested.ctl.event_inj = 0;
+	svm->nested.ctl.event_inj_err = 0;
+
 	vmcb12->control.int_ctl           = svm->nested.ctl.int_ctl;
 	vmcb12->control.tlb_ctl           = svm->nested.ctl.tlb_ctl;
 	vmcb12->control.event_inj         = svm->nested.ctl.event_inj;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 1e5d904aeec3..5b128baa5e57 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3322,13 +3322,18 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
-	WARN_ON(!gif_set(svm));
+	WARN_ON(!(vcpu->arch.interrupt.soft || gif_set(svm)));
 
 	trace_kvm_inj_virq(vcpu->arch.interrupt.nr);
 	++vcpu->stat.irq_injections;
 
 	svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
-		SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
+		SVM_EVTINJ_VALID;
+	if (vcpu->arch.interrupt.soft) {
+		svm->vmcb->control.event_inj |= SVM_EVTINJ_TYPE_SOFT;
+	} else {
+		svm->vmcb->control.event_inj |= SVM_EVTINJ_TYPE_INTR;
+	}
 }
 
 void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode,
@@ -3627,6 +3632,14 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 	if (!(exitintinfo & SVM_EXITINTINFO_VALID))
 		return;
 
+	/* L1 -> L2 event re-injection needs a different handling */
+	if (is_guest_mode(vcpu) &&
+	    exit_during_event_injection(svm, svm->nested.ctl.event_inj,
+					svm->nested.ctl.event_inj_err)) {
+		nested_svm_maybe_reinject(vcpu);
+		return;
+	}
+
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 
 	vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index f757400fc933..7cafc2e6c82a 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -488,6 +488,52 @@ static inline bool nested_npt_enabled(struct vcpu_svm *svm)
 	return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
 }
 
+static inline bool event_inj_same(u32 event_inj1, u32 event_inj_err1,
+				  u32 event_inj2, u32 event_inj_err2)
+{
+	unsigned int vector_1, vector_2, type_1, type_2;
+
+	/* Either of them not valid? */
+	if (!(event_inj1 & SVM_EVTINJ_VALID) ||
+	    !(event_inj2 & SVM_EVTINJ_VALID))
+		return false;
+
+	vector_1 = event_inj1 & SVM_EVTINJ_VEC_MASK;
+	type_1 = event_inj1 & SVM_EVTINJ_TYPE_MASK;
+	vector_2 = event_inj2 & SVM_EVTINJ_VEC_MASK;
+	type_2 = event_inj2 & SVM_EVTINJ_TYPE_MASK;
+
+	/* Different vector or type? */
+	if (vector_1 != vector_2 || type_1 != type_2)
+		return false;
+
+	/* Different error code presence flag? */
+	if ((event_inj1 & SVM_EVTINJ_VALID_ERR) !=
+	    (event_inj2 & SVM_EVTINJ_VALID_ERR))
+		return false;
+
+	/* No error code? */
+	if (!(event_inj1 & SVM_EVTINJ_VALID_ERR))
+		return true;
+
+	/* Same error code? */
+	return event_inj_err1 == event_inj_err2;
+}
+
+/* Did the last VMEXIT happen when attempting to inject that event? */
+static inline bool exit_during_event_injection(struct vcpu_svm *svm,
+					       u32 event_inj, u32 event_inj_err)
+{
+	BUILD_BUG_ON(SVM_EXITINTINFO_VEC_MASK != SVM_EVTINJ_VEC_MASK ||
+		     SVM_EXITINTINFO_TYPE_MASK != SVM_EVTINJ_TYPE_MASK ||
+		     SVM_EXITINTINFO_VALID != SVM_EVTINJ_VALID ||
+		     SVM_EXITINTINFO_VALID_ERR != SVM_EVTINJ_VALID_ERR);
+
+	return event_inj_same(svm->vmcb->control.exit_int_info,
+			      svm->vmcb->control.exit_int_info_err,
+			      event_inj, event_inj_err);
+}
+
 /* svm.c */
 #define MSR_INVALID				0xffffffffU
 
@@ -540,6 +586,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
 	return vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_NMI);
 }
 
+void nested_svm_maybe_reinject(struct kvm_vcpu *vcpu);
 int enter_svm_guest_mode(struct kvm_vcpu *vcpu,
 			 u64 vmcb_gpa, struct vmcb *vmcb12, bool from_vmrun);
 void svm_leave_nested(struct kvm_vcpu *vcpu);


* [PATCH 4/5] KVM: nSVM: Restore next_rip when doing L1 -> L2 event re-injection
  2022-03-10 21:38 [PATCH 0/5] nSVM: L1 -> L2 event injection fixes and a self-test Maciej S. Szmigiero
                   ` (2 preceding siblings ...)
  2022-03-10 21:38 ` [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events Maciej S. Szmigiero
@ 2022-03-10 21:38 ` Maciej S. Szmigiero
  2022-03-10 21:38 ` [PATCH 5/5] KVM: selftests: nSVM: Add svm_nested_soft_inject_test Maciej S. Szmigiero
  4 siblings, 0 replies; 20+ messages in thread
From: Maciej S. Szmigiero @ 2022-03-10 21:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

According to APM 15.7.1 "State Saved on Exit", the next_rip field can be
zero after a VMEXIT in some cases.
Yet, it is used by the CPU for the return address pushed on the stack when
injecting an INT3 or INTO exception or a software interrupt.

Restore this field to the L1-provided value if it was zeroed by the CPU
when re-injecting an L1-provided event into L2.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 arch/x86/kvm/svm/svm.c | 43 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 5b128baa5e57..760dd0e070ea 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -385,6 +385,44 @@ static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+/*
+ * According to APM 15.7.1 "State Saved on Exit" the next_rip field can
+ * be zero after a VMEXIT in some cases.
+ * Yet, it is used by the CPU for the return address pushed on stack when
+ * injecting INT3 or INTO exception or a software interrupt.
+ *
+ * Restore this field to the L1-provided value if zeroed by the CPU when
+ * re-injecting a L1-provided event into L2.
+ */
+static void maybe_fixup_next_rip(struct kvm_vcpu *vcpu, bool uses_err)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	u32 err_vmcb = uses_err ? svm->vmcb->control.event_inj_err : 0;
+	u32 err_inject = uses_err ? svm->nested.ctl.event_inj_err : 0;
+
+	/* No nRIP Save feature? Then nothing to fix up. */
+	if (!nrips)
+		return;
+
+	/* The fix only applies to event injection into a L2. */
+	if (!is_guest_mode(vcpu))
+		return;
+
+	/*
+	 * If the current next_rip field is already non-zero assume the CPU had
+	 * returned the correct address during the last VMEXIT.
+	 */
+	if (svm->vmcb->control.next_rip)
+		return;
+
+	/* Is this a L1 -> L2 event re-injection? */
+	if (!event_inj_same(svm->vmcb->control.event_inj, err_vmcb,
+			    svm->nested.ctl.event_inj, err_inject))
+		return;
+
+	svm->vmcb->control.next_rip = svm->nested.ctl.next_rip;
+}
+
 static void svm_queue_exception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -415,6 +453,9 @@ static void svm_queue_exception(struct kvm_vcpu *vcpu)
 		| (has_error_code ? SVM_EVTINJ_VALID_ERR : 0)
 		| SVM_EVTINJ_TYPE_EXEPT;
 	svm->vmcb->control.event_inj_err = error_code;
+
+	if (kvm_exception_is_soft(nr))
+		maybe_fixup_next_rip(vcpu, true);
 }
 
 static void svm_init_erratum_383(void)
@@ -3331,6 +3372,8 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu)
 		SVM_EVTINJ_VALID;
 	if (vcpu->arch.interrupt.soft) {
 		svm->vmcb->control.event_inj |= SVM_EVTINJ_TYPE_SOFT;
+
+		maybe_fixup_next_rip(vcpu, false);
 	} else {
 		svm->vmcb->control.event_inj |= SVM_EVTINJ_TYPE_INTR;
 	}


* [PATCH 5/5] KVM: selftests: nSVM: Add svm_nested_soft_inject_test
  2022-03-10 21:38 [PATCH 0/5] nSVM: L1 -> L2 event injection fixes and a self-test Maciej S. Szmigiero
                   ` (3 preceding siblings ...)
  2022-03-10 21:38 ` [PATCH 4/5] KVM: nSVM: Restore next_rip when doing L1 -> L2 event re-injection Maciej S. Szmigiero
@ 2022-03-10 21:38 ` Maciej S. Szmigiero
  4 siblings, 0 replies; 20+ messages in thread
From: Maciej S. Szmigiero @ 2022-03-10 21:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Add a KVM self-test that checks whether an nSVM L1 is able to successfully
inject a software interrupt and a soft exception into its L2 guest.

In practice, this tests both next_rip field consistency and the handling of
an L1-injected event with an intervening L0 VMEXIT during its delivery:
the first nested VMRUN (which also tries to inject a software interrupt)
will immediately trigger an L0 NPF.
This L0 NPF will have zero in its CPU-returned next_rip field; if KVM
incorrectly reuses this value, a #PF will be triggered when the interrupt
handler tries to return to address 0.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/svm_util.h   |   2 +
 .../kvm/x86_64/svm_nested_soft_inject_test.c  | 147 ++++++++++++++++++
 4 files changed, 151 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 9b67343dc4ab..bc7e2c5a8560 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -32,6 +32,7 @@
 /x86_64/state_test
 /x86_64/svm_vmcall_test
 /x86_64/svm_int_ctl_test
+/x86_64/svm_nested_soft_inject_test
 /x86_64/sync_regs_test
 /x86_64/tsc_msrs_test
 /x86_64/userspace_io_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 04099f453b59..ff63e3caac9b 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -65,6 +65,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/state_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_preemption_timer_test
 TEST_GEN_PROGS_x86_64 += x86_64/svm_vmcall_test
 TEST_GEN_PROGS_x86_64 += x86_64/svm_int_ctl_test
+TEST_GEN_PROGS_x86_64 += x86_64/svm_nested_soft_inject_test
 TEST_GEN_PROGS_x86_64 += x86_64/sync_regs_test
 TEST_GEN_PROGS_x86_64 += x86_64/userspace_io_test
 TEST_GEN_PROGS_x86_64 += x86_64/userspace_msr_exit_test
diff --git a/tools/testing/selftests/kvm/include/x86_64/svm_util.h b/tools/testing/selftests/kvm/include/x86_64/svm_util.h
index a25aabd8f5e7..d49f7c9b4564 100644
--- a/tools/testing/selftests/kvm/include/x86_64/svm_util.h
+++ b/tools/testing/selftests/kvm/include/x86_64/svm_util.h
@@ -16,6 +16,8 @@
 #define CPUID_SVM_BIT		2
 #define CPUID_SVM		BIT_ULL(CPUID_SVM_BIT)
 
+#define SVM_EXIT_EXCP_BASE	0x040
+#define SVM_EXIT_HLT		0x078
 #define SVM_EXIT_MSR		0x07c
 #define SVM_EXIT_VMMCALL	0x081
 
diff --git a/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c b/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
new file mode 100644
index 000000000000..d39be5d885c1
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/svm_nested_soft_inject_test.c
@@ -0,0 +1,147 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022 Oracle and/or its affiliates.
+ *
+ * Based on:
+ *   svm_int_ctl_test
+ *
+ *   Copyright (C) 2021, Red Hat, Inc.
+ *
+ */
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "svm_util.h"
+
+#define VCPU_ID		0
+#define INT_NR			0x20
+#define X86_FEATURE_NRIPS	BIT(3)
+
+#define vmcall()		\
+	__asm__ __volatile__(	\
+		"vmmcall\n"	\
+		)
+
+#define ud2()			\
+	__asm__ __volatile__(	\
+		"ud2\n"	\
+		)
+
+#define hlt()			\
+	__asm__ __volatile__(	\
+		"hlt\n"	\
+		)
+
+static unsigned int bp_fired;
+static void guest_bp_handler(struct ex_regs *regs)
+{
+	bp_fired++;
+}
+
+static unsigned int int_fired;
+static void guest_int_handler(struct ex_regs *regs)
+{
+	int_fired++;
+}
+
+static void l2_guest_code(void)
+{
+	GUEST_ASSERT(int_fired == 1);
+	vmcall();
+	ud2();
+
+	GUEST_ASSERT(bp_fired == 1);
+	hlt();
+}
+
+static void l1_guest_code(struct svm_test_data *svm)
+{
+	#define L2_GUEST_STACK_SIZE 64
+	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+	struct vmcb *vmcb = svm->vmcb;
+
+	/* Prepare for L2 execution. */
+	generic_svm_setup(svm, l2_guest_code,
+			  &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+
+	vmcb->control.intercept_exceptions |= BIT(PF_VECTOR) | BIT(UD_VECTOR);
+	vmcb->control.intercept |= BIT(INTERCEPT_HLT);
+
+	vmcb->control.event_inj = INT_NR | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_SOFT;
+	/* The return address pushed on stack */
+	vmcb->control.next_rip = vmcb->save.rip;
+
+	run_guest(vmcb, svm->vmcb_gpa);
+	GUEST_ASSERT_3(vmcb->control.exit_code == SVM_EXIT_VMMCALL,
+		       vmcb->control.exit_code,
+		       vmcb->control.exit_info_1, vmcb->control.exit_info_2);
+
+	/* Skip over VMCALL */
+	vmcb->save.rip += 3;
+
+	vmcb->control.event_inj = BP_VECTOR | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_EXEPT;
+	/* The return address pushed on stack, skip over UD2 */
+	vmcb->control.next_rip = vmcb->save.rip + 2;
+
+	run_guest(vmcb, svm->vmcb_gpa);
+	GUEST_ASSERT_3(vmcb->control.exit_code == SVM_EXIT_HLT,
+		       vmcb->control.exit_code,
+		       vmcb->control.exit_info_1, vmcb->control.exit_info_2);
+
+	GUEST_DONE();
+}
+
+int main(int argc, char *argv[])
+{
+	struct kvm_cpuid_entry2 *cpuid;
+	struct kvm_vm *vm;
+	vm_vaddr_t svm_gva;
+	struct kvm_guest_debug debug;
+
+	nested_svm_check_supported();
+
+	cpuid = kvm_get_supported_cpuid_entry(0x8000000a);
+	if (!(cpuid->edx & X86_FEATURE_NRIPS)) {
+		print_skip("nRIP Save unavailable");
+		exit(KSFT_SKIP);
+	}
+
+	vm = vm_create_default(VCPU_ID, 0, (void *) l1_guest_code);
+
+	vm_init_descriptor_tables(vm);
+	vcpu_init_descriptor_tables(vm, VCPU_ID);
+
+	vm_install_exception_handler(vm, BP_VECTOR, guest_bp_handler);
+	vm_install_exception_handler(vm, INT_NR, guest_int_handler);
+
+	vcpu_alloc_svm(vm, &svm_gva);
+	vcpu_args_set(vm, VCPU_ID, 1, svm_gva);
+
+	memset(&debug, 0, sizeof(debug));
+	vcpu_set_guest_debug(vm, VCPU_ID, &debug);
+
+	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+	struct ucall uc;
+
+	vcpu_run(vm, VCPU_ID);
+	TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+		    "Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
+		    run->exit_reason,
+		    exit_reason_str(run->exit_reason));
+
+	switch (get_ucall(vm, VCPU_ID, &uc)) {
+	case UCALL_ABORT:
+		TEST_FAIL("%s at %s:%ld, vals = 0x%lx 0x%lx 0x%lx", (const char *)uc.args[0],
+			  __FILE__, uc.args[1], uc.args[2], uc.args[3], uc.args[4]);
+		break;
+		/* NOT REACHED */
+	case UCALL_DONE:
+		goto done;
+	default:
+		TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
+	}
+done:
+	kvm_vm_free(vm);
+	return 0;
+}


* Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events
  2022-03-10 21:38 ` [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events Maciej S. Szmigiero
@ 2022-03-30 21:59   ` Sean Christopherson
  2022-03-30 22:16     ` Maciej S. Szmigiero
  2022-04-04  9:53   ` Maxim Levitsky
  1 sibling, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2022-03-30 21:59 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> On SVM, synthetic software interrupts or INT3 or INTO exceptions that L1
> wants to inject into its L2 guest are forgotten if there is an intervening
> L0 VMEXIT during their delivery.
> 
> They are re-injected correctly with VMX, however.
> 
> This is because SVM code assumes that such exceptions will be re-delivered
> by simply re-executing the current instruction.
> That assumption does not hold for a synthetic exception injected by L1,
> since in that case the re-executed instruction will be one already in L2,
> not the VMRUN instruction in L1 that attempted the injection.
> 
> Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err} until
> it is either re-injected successfully or returned to L1 upon a nested
> VMEXIT.
> Make sure to always re-queue such an event if it is returned in EXITINTINFO.
> 
> The handling of L0 -> {L1, L2} event re-injection is left as-is to avoid
> unforeseen regressions.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---

...

> @@ -3627,6 +3632,14 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>  	if (!(exitintinfo & SVM_EXITINTINFO_VALID))
>  		return;
>  
> +	/* L1 -> L2 event re-injection needs a different handling */
> +	if (is_guest_mode(vcpu) &&
> +	    exit_during_event_injection(svm, svm->nested.ctl.event_inj,
> +					svm->nested.ctl.event_inj_err)) {
> +		nested_svm_maybe_reinject(vcpu);

Why is this manually re-injecting?  More specifically, why does the below (out of
sight in the diff) code that re-queues the exception/interrupt not work?  The
re-queued event should be picked up by nested_save_pending_event_to_vmcb12() and
propagatred to vmcb12.


* Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events
  2022-03-30 21:59   ` Sean Christopherson
@ 2022-03-30 22:16     ` Maciej S. Szmigiero
  2022-03-30 23:20       ` Sean Christopherson
  0 siblings, 1 reply; 20+ messages in thread
From: Maciej S. Szmigiero @ 2022-03-30 22:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

On 30.03.2022 23:59, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> On SVM, synthetic software interrupts or INT3 or INTO exceptions that L1
>> wants to inject into its L2 guest are forgotten if there is an intervening
>> L0 VMEXIT during their delivery.
>>
>> They are re-injected correctly with VMX, however.
>>
>> This is because SVM code assumes that such exceptions will be re-delivered
>> by simply re-executing the current instruction.
>> That assumption does not hold for a synthetic exception injected by L1,
>> since in that case the re-executed instruction will be one already in L2,
>> not the VMRUN instruction in L1 that attempted the injection.
>>
>> Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err} until
>> it is either re-injected successfully or returned to L1 upon a nested
>> VMEXIT.
>> Make sure to always re-queue such an event if it is returned in EXITINTINFO.
>>
>> The handling of L0 -> {L1, L2} event re-injection is left as-is to avoid
>> unforeseen regressions.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
> 
> ...
> 
>> @@ -3627,6 +3632,14 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>>   	if (!(exitintinfo & SVM_EXITINTINFO_VALID))
>>   		return;
>>   
>> +	/* L1 -> L2 event re-injection needs a different handling */
>> +	if (is_guest_mode(vcpu) &&
>> +	    exit_during_event_injection(svm, svm->nested.ctl.event_inj,
>> +					svm->nested.ctl.event_inj_err)) {
>> +		nested_svm_maybe_reinject(vcpu);
> 
> Why is this manually re-injecting?  More specifically, why does the below (out of
> sight in the diff) code that re-queues the exception/interrupt not work?  The
> re-queued event should be picked up by nested_save_pending_event_to_vmcb12() and
> propagated to vmcb12.

An L1 -> L2 injected event should either be re-injected until successfully
injected into L2 or propagated to VMCB12 if there is a nested VMEXIT
during its delivery.

svm_complete_interrupts() does not do such re-injection in some cases
(soft interrupts, soft exceptions, #VC) - it is trying to resort to
emulation instead, which is incorrect in this case.

I think it's better to split out this L1 -> L2 nested case to a
separate function in nested.c rather than to fill
svm_complete_interrupts() in already very large svm.c with "if" blocks
here and there.

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events
  2022-03-30 22:16     ` Maciej S. Szmigiero
@ 2022-03-30 23:20       ` Sean Christopherson
  2022-03-31 23:09         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2022-03-30 23:20 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

On Thu, Mar 31, 2022, Maciej S. Szmigiero wrote:
> On 30.03.2022 23:59, Sean Christopherson wrote:
> > On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
> > > @@ -3627,6 +3632,14 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
> > >   	if (!(exitintinfo & SVM_EXITINTINFO_VALID))
> > >   		return;
> > > +	/* L1 -> L2 event re-injection needs a different handling */
> > > +	if (is_guest_mode(vcpu) &&
> > > +	    exit_during_event_injection(svm, svm->nested.ctl.event_inj,
> > > +					svm->nested.ctl.event_inj_err)) {
> > > +		nested_svm_maybe_reinject(vcpu);
> > 
> > Why is this manually re-injecting?  More specifically, why does the below (out of
> > sight in the diff) code that re-queues the exception/interrupt not work?  The
> > re-queued event should be picked up by nested_save_pending_event_to_vmcb12() and
> > propagated to vmcb12.
> 
> An L1 -> L2 injected event should either be re-injected until successfully
> injected into L2 or propagated to VMCB12 if there is a nested VMEXIT
> during its delivery.
> 
> svm_complete_interrupts() does not do such re-injection in some cases
> (soft interrupts, soft exceptions, #VC) - it is trying to resort to
> emulation instead, which is incorrect in this case.
> 
> I think it's better to split out this L1 -> L2 nested case to a
> separate function in nested.c rather than to fill
> svm_complete_interrupts() in already very large svm.c with "if" blocks
> here and there.

Ah, I see it now.  WTF.

Ugh, commit 66fd3f7f901f ("KVM: Do not re-execute INTn instruction.") fixed VMX,
but left SVM broken.

Re-executing the INTn is wrong, the instruction has already completed decode and
execution.  E.g. if there's a code breakpoint on the INTn, rewinding will
cause a spurious #DB.

KVM's INT3 shenanigans are bonkers, but I guess there's no better option given
that the APM says "Software interrupts cannot be properly injected if the processor
does not support the NextRIP field.".  What a mess.

Anyways, for the common nrips=true case, I strongly prefer that we properly fix
the non-nested case and re-inject software interrupts, which should in turn
naturally fix this nested case.  And for nrips=false, my vote is to either punt
and document it as a "KVM erratum", or straight up make nested require nrips.

Note, that also requires updating svm_queue_exception(), which assumes it will
only be handed hardware exceptions, i.e. hardcodes type EXEPT.  That's blatantly
wrong, e.g. if userspace injects a software exception via KVM_SET_VCPU_EVENTS.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events
  2022-03-30 23:20       ` Sean Christopherson
@ 2022-03-31 23:09         ` Maciej S. Szmigiero
  2022-04-01  0:08           ` Sean Christopherson
  0 siblings, 1 reply; 20+ messages in thread
From: Maciej S. Szmigiero @ 2022-03-31 23:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

On 31.03.2022 01:20, Sean Christopherson wrote:
> On Thu, Mar 31, 2022, Maciej S. Szmigiero wrote:
>> On 30.03.2022 23:59, Sean Christopherson wrote:
>>> On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
>>>> @@ -3627,6 +3632,14 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>>>>    	if (!(exitintinfo & SVM_EXITINTINFO_VALID))
>>>>    		return;
>>>> +	/* L1 -> L2 event re-injection needs a different handling */
>>>> +	if (is_guest_mode(vcpu) &&
>>>> +	    exit_during_event_injection(svm, svm->nested.ctl.event_inj,
>>>> +					svm->nested.ctl.event_inj_err)) {
>>>> +		nested_svm_maybe_reinject(vcpu);
>>>
>>> Why is this manually re-injecting?  More specifically, why does the below (out of
>>> sight in the diff) code that re-queues the exception/interrupt not work?  The
>>> re-queued event should be picked up by nested_save_pending_event_to_vmcb12() and
>>> propagated to vmcb12.
>>
>> An L1 -> L2 injected event should either be re-injected until successfully
>> injected into L2 or propagated to VMCB12 if there is a nested VMEXIT
>> during its delivery.
>>
>> svm_complete_interrupts() does not do such re-injection in some cases
>> (soft interrupts, soft exceptions, #VC) - it is trying to resort to
>> emulation instead, which is incorrect in this case.
>>
>> I think it's better to split out this L1 -> L2 nested case to a
>> separate function in nested.c rather than to fill
>> svm_complete_interrupts() in already very large svm.c with "if" blocks
>> here and there.
> 
> Ah, I see it now.  WTF.
> 
> Ugh, commit 66fd3f7f901f ("KVM: Do not re-execute INTn instruction.") fixed VMX,
> but left SVM broken.
> 
> Re-executing the INTn is wrong, the instruction has already completed decode and
> execution.  E.g. if there's a code breakpoint on the INTn, rewinding will
> cause a spurious #DB.
> 
> KVM's INT3 shenanigans are bonkers, but I guess there's no better option given
> that the APM says "Software interrupts cannot be properly injected if the processor
> does not support the NextRIP field.".  What a mess.

Note that KVM currently always tries to re-execute the current instruction
when asked to re-inject a #BP or a #OF, even when nrips are enabled.

Also, #BP (and #OF, too) is returned as type SVM_EXITINTINFO_TYPE_EXEPT,
not as SVM_EXITINTINFO_TYPE_SOFT (soft interrupt), so it should be
re-injected accordingly.

> Anyways, for the common nrips=true case, I strongly prefer that we properly fix
> the non-nested case and re-inject software interrupts, which should in turn
> naturally fix this nested case.  

This would also need making the #BP or #OF current instruction
re-execution conditional on (at least) nrips support.

I am not sure, however, whether this won't introduce any regressions.
That's why this patch set changed the behavior here only for the
L1 -> L2 case.

Another issue is whether an L1 hypervisor can legally inject a #VC
into its L2 (since these are never re-injected).

We still need L1 -> L2 event injection detection to restore the NextRIP
field when re-injecting an event that uses it.

> And for nrips=false, my vote is to either punt
> and document it as a "KVM erratum", or straight up make nested require nrips.

A quick Internet search shows that the first CPUs with NextRIP were the
second-generation Family 10h CPUs (Phenom II, Athlon II, etc.).
They started being released in early 2009, so we probably don't need to
worry about the non-nrips case too much.

For the nested case, a strict reading of the aforementioned APM sentence
would mean that an L1 hypervisor likewise is not allowed to make use of such
event injection in the non-nrips case.

> Note, that also requires updating svm_queue_exception(), which assumes it will
> only be handed hardware exceptions, i.e. hardcodes type EXEPT.  That's blatantly
> wrong, e.g. if userspace injects a software exception via KVM_SET_VCPU_EVENTS.

svm_queue_exception() uses SVM_EVTINJ_TYPE_EXEPT, which is correct even
for software exceptions (#BP or #OF).
It does indeed work, as the self-test included in this patch set
demonstrates.

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events
  2022-03-31 23:09         ` Maciej S. Szmigiero
@ 2022-04-01  0:08           ` Sean Christopherson
  2022-04-01 16:05             ` Maciej S. Szmigiero
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2022-04-01  0:08 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

On Fri, Apr 01, 2022, Maciej S. Szmigiero wrote:
> On 31.03.2022 01:20, Sean Christopherson wrote:
> > Re-executing the INTn is wrong, the instruction has already completed decode and
> > execution.  E.g. if there's a code breakpoint on the INTn, rewinding will
> > cause a spurious #DB.
> > 
> > KVM's INT3 shenanigans are bonkers, but I guess there's no better option given
> > that the APM says "Software interrupts cannot be properly injected if the processor
> > does not support the NextRIP field.".  What a mess.
> 
> Note that KVM currently always tries to re-execute the current instruction
> when asked to re-inject a #BP or a #OF, even when nrips are enabled.

Yep, and my vote is to fix that.

> Also, #BP (and #OF, too) is returned as type SVM_EXITINTINFO_TYPE_EXEPT,
> not as SVM_EXITINTINFO_TYPE_SOFT (soft interrupt), so it should be
> re-injected accordingly.

Ahhh, SVM doesn't differentiate between software exceptions and hardware exceptions.
Finally found the relevant section in the APM:

  Despite the instruction name, the events raised by the INT1 (also known as ICEBP),
  INT3 and INTO instructions (opcodes F1h, CCh and CEh) are considered exceptions for
  the purposes of EXITINTINFO, not software interrupts. Only events raised by the INTn
  instruction (opcode CDh) are considered software interrupts.

VMX has separate identifiers for software interrupts and for software exceptions,
whereas SVM unconditionally treats #BP and #OF as soft:

  Injecting an exception (TYPE = 3) with vectors 3 or 4 behaves like a trap raised by
  INT3 and INTO instructions

Now I'm curious why Intel doesn't do the same...

> > Anyways, for the common nrips=true case, I strongly prefer that we properly fix
> > the non-nested case and re-inject software interrupts, which should in turn
> > naturally fix this nested case.
> 
> This would also need making the #BP or #OF current instruction
> re-execution conditional on (at least) nrips support.
> 
> I am not sure, however, whether this won't introduce any regressions.
> That's why this patch set changed the behavior here only for the
> L1 -> L2 case.
> 
> Another issue is whether an L1 hypervisor can legally inject a #VC
> into its L2 (since these are never re-injected).

I would expect it to work, and it's easy to find out.  I know VMX allows injecting
non-existent exceptions, but the APM is vague as usual and says VMRUN will fail...

  If the VMM attempts to inject an event that is impossible for the guest mode

> We still need L1 -> L2 event injection detection to restore the NextRIP
> field when re-injecting an event that uses it.

You lost me on this one.  KVM L0 is only (and always!) responsible for saving the
relevant info into vmcb12, why does it need to detect where the vectoring exception
came from?

> > And for nrips=false, my vote is to either punt
> > and document it as a "KVM erratum", or straight up make nested require nrips.
> 
> A quick Internet search shows that the first CPUs with NextRIP were the
> second-generation Family 10h CPUs (Phenom II, Athlon II, etc.).
> They started being released in early 2009, so we probably don't need to
> worry about the non-nrips case too much.
> 
> For the nested case, a strict reading of the aforementioned APM sentence
> would mean that an L1 hypervisor likewise is not allowed to make use of such
> event injection in the non-nrips case.

Heh, my reading of it is that it's not disallowed, it just won't work correctly,
i.e. the INTn won't be skipped.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events
  2022-04-01  0:08           ` Sean Christopherson
@ 2022-04-01 16:05             ` Maciej S. Szmigiero
  2022-04-01 22:07               ` Sean Christopherson
  0 siblings, 1 reply; 20+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-01 16:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

On 1.04.2022 02:08, Sean Christopherson wrote:
> On Fri, Apr 01, 2022, Maciej S. Szmigiero wrote:
>> On 31.03.2022 01:20, Sean Christopherson wrote:
>>> Re-executing the INTn is wrong, the instruction has already completed decode and
>>> execution.  E.g. if there's a code breakpoint on the INTn, rewinding will
>>> cause a spurious #DB.
>>>
>>> KVM's INT3 shenanigans are bonkers, but I guess there's no better option given
>>> that the APM says "Software interrupts cannot be properly injected if the processor
>>> does not support the NextRIP field.".  What a mess.
>>
>> Note that KVM currently always tries to re-execute the current instruction
>> when asked to re-inject a #BP or a #OF, even when nrips are enabled.
> 
> Yep, and my vote is to fix that.

The only issue I have with making SVM re-inject a #BP or a #OF in the
nrips case is that this might introduce some regression with respect to
host-side guest debugging.

On the other hand, VMX does re-inject these unconditionally so it might
not be a problem after all.

>> Also, #BP (and #OF, too) is returned as type SVM_EXITINTINFO_TYPE_EXEPT,
>> not as SVM_EXITINTINFO_TYPE_SOFT (soft interrupt), so it should be
>> re-injected accordingly.
> 
> Ahhh, SVM doesn't differentiate between software exceptions and hardware exceptions.
> Finally found the relevant section in the APM:
> 
>    Despite the instruction name, the events raised by the INT1 (also known as ICEBP),
>    INT3 and INTO instructions (opcodes F1h, CCh and CEh) are considered exceptions for
>    the purposes of EXITINTINFO, not software interrupts. Only events raised by the INTn
>    instruction (opcode CDh) are considered software interrupts.
> 
> VMX has separate identifiers for software interrupts and for software exceptions,

I guess the sentence above was supposed to read "for *hardware exceptions*
and for software exceptions", just like in the previous paragraph about SVM.

> whereas SVM unconditionally treats #BP and #OF as soft:
> 
>    Injecting an exception (TYPE = 3) with vectors 3 or 4 behaves like a trap raised by
>    INT3 and INTO instructions
> 
> Now I'm curious why Intel doesn't do the same...

Perhaps it's just a result of VMX and SVM being developed independently,
in parallel.

>>> Anyways, for the common nrips=true case, I strongly prefer that we properly fix
>>> the non-nested case and re-inject software interrupts, which should in turn
>>> naturally fix this nested case.
>>
>> This would also need making the #BP or #OF current instruction
>> re-execution conditional on (at least) nrips support.
>>
>> I am not sure, however, whether this won't introduce any regressions.
>> That's why this patch set changed the behavior here only for the
>> L1 -> L2 case.
>>
>> Another issue is whether an L1 hypervisor can legally inject a #VC
>> into its L2 (since these are never re-injected).
> 
> I would expect it to work, and it's easy to find out.  I know VMX allows injecting
> non-existent exceptions, but the APM is vague as usual and says VMRUN will fail...

I've done a quick test right now and a VMRUN attempt with #VC event
injection does seem to fail.
So we don't need to worry about not re-injecting a #VC.

>    If the VMM attempts to inject an event that is impossible for the guest mode
> 
>> We still need L1 -> L2 event injection detection to restore the NextRIP
>> field when re-injecting an event that uses it.
> 
> You lost me on this one.  KVM L0 is only (and always!) responsible for saving the
> relevant info into vmcb12, why does it need to detect where the vectoring exception
> came from?

Look at the description of patch 4 of this patch set - after some
L2 VMEXITs handled by L0 (in other words, which do *not* result in
a nested VMEXIT to L1) the VMCB02 NextRIP field will be zero
(APM 15.7.1 "State Saved on Exit" describes when this happens).

If we attempt to re-inject some types of events with the NextRIP field
being zero the return address pushed on stack will also be zero, which
will obviously do bad things to the L2 when it returns from
an (exception|interrupt) handler.

>>> And for nrips=false, my vote is to either punt
>>> and document it as a "KVM erratum", or straight up make nested require nrips.
>>
>> A quick Internet search shows that the first CPUs with NextRIP were the
>> second-generation Family 10h CPUs (Phenom II, Athlon II, etc.).
>> They started being released in early 2009, so we probably don't need to
>> worry about the non-nrips case too much.
>>
>> For the nested case, a strict reading of the aforementioned APM sentence
>> would mean that an L1 hypervisor likewise is not allowed to make use of such
>> event injection in the non-nrips case.
> 
> Heh, my reading of it is that it's not disallowed, it just won't work correctly,
> i.e. the INTn won't be skipped.

Either way, we probably don't need to worry that this case doesn't get handled
100% right.

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-03-10 21:38 ` [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 Maciej S. Szmigiero
@ 2022-04-01 18:32   ` Sean Christopherson
  2022-04-01 19:08     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2022-04-01 18:32 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
> This field value (instead of the saved guest RIP) is used by the CPU for
> the return address pushed on stack when injecting a software interrupt or
> INT3 or INTO exception.
> 
> Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
> loading a nested state.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>  arch/x86/kvm/svm/nested.c | 4 ++++
>  arch/x86/kvm/svm/svm.h    | 1 +
>  2 files changed, 5 insertions(+)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index d736ec6514ca..9656f0d6815c 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -366,6 +366,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
>  	to->nested_ctl          = from->nested_ctl;
>  	to->event_inj           = from->event_inj;
>  	to->event_inj_err       = from->event_inj_err;
> +	to->next_rip            = from->next_rip;
>  	to->nested_cr3          = from->nested_cr3;
>  	to->virt_ext            = from->virt_ext;
>  	to->pause_filter_count  = from->pause_filter_count;
> @@ -638,6 +639,8 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
>  	svm->vmcb->control.int_state           = svm->nested.ctl.int_state;
>  	svm->vmcb->control.event_inj           = svm->nested.ctl.event_inj;
>  	svm->vmcb->control.event_inj_err       = svm->nested.ctl.event_inj_err;
> +	/* The return address pushed on stack by the CPU for some injected events */
> +	svm->vmcb->control.next_rip            = svm->nested.ctl.next_rip;

This needs to be gated by nrips being enabled _and_ exposed to L1, i.e.

	if (svm->nrips_enabled)
		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;

though I don't see any reason to add the condition to the copy to/from cache flows.

>  	if (!nested_vmcb_needs_vls_intercept(svm))
>  		svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
> @@ -1348,6 +1351,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
>  	dst->nested_ctl           = from->nested_ctl;
>  	dst->event_inj            = from->event_inj;
>  	dst->event_inj_err        = from->event_inj_err;
> +	dst->next_rip             = from->next_rip;
>  	dst->nested_cr3           = from->nested_cr3;
>  	dst->virt_ext              = from->virt_ext;
>  	dst->pause_filter_count   = from->pause_filter_count;
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 93502d2a52ce..f757400fc933 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -138,6 +138,7 @@ struct vmcb_ctrl_area_cached {
>  	u64 nested_ctl;
>  	u32 event_inj;
>  	u32 event_inj_err;
> +	u64 next_rip;
>  	u64 nested_cr3;
>  	u64 virt_ext;
>  	u32 clean;

I don't know why this struct has

	u8 reserved_sw[32];

but presumably it's for padding, i.e. probably should be reduced to 24 bytes.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-04-01 18:32   ` Sean Christopherson
@ 2022-04-01 19:08     ` Maciej S. Szmigiero
  2022-04-01 21:51       ` Sean Christopherson
  0 siblings, 1 reply; 20+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-01 19:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

On 1.04.2022 20:32, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
>> This field value (instead of the saved guest RIP) is used by the CPU for
>> the return address pushed on stack when injecting a software interrupt or
>> INT3 or INTO exception.
>>
>> Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
>> loading a nested state.
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   arch/x86/kvm/svm/nested.c | 4 ++++
>>   arch/x86/kvm/svm/svm.h    | 1 +
>>   2 files changed, 5 insertions(+)
>>
>> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
>> index d736ec6514ca..9656f0d6815c 100644
>> --- a/arch/x86/kvm/svm/nested.c
>> +++ b/arch/x86/kvm/svm/nested.c
>> @@ -366,6 +366,7 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
>>   	to->nested_ctl          = from->nested_ctl;
>>   	to->event_inj           = from->event_inj;
>>   	to->event_inj_err       = from->event_inj_err;
>> +	to->next_rip            = from->next_rip;
>>   	to->nested_cr3          = from->nested_cr3;
>>   	to->virt_ext            = from->virt_ext;
>>   	to->pause_filter_count  = from->pause_filter_count;
>> @@ -638,6 +639,8 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
>>   	svm->vmcb->control.int_state           = svm->nested.ctl.int_state;
>>   	svm->vmcb->control.event_inj           = svm->nested.ctl.event_inj;
>>   	svm->vmcb->control.event_inj_err       = svm->nested.ctl.event_inj_err;
>> +	/* The return address pushed on stack by the CPU for some injected events */
>> +	svm->vmcb->control.next_rip            = svm->nested.ctl.next_rip;
> 
> This needs to be gated by nrips being enabled _and_ exposed to L1, i.e.
> 
> 	if (svm->nrips_enabled)
> 		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;

It can be done, however what if we run on a nrips-capable CPU,
but don't expose this capability to the L1?

The CPU will then push whatever value was left in this field as
the return address for some L1 injected events.

Although without the nrips feature the L1 shouldn't even attempt event
injection, copying this field anyway will make it work if L1 just
expects this capability based on the current CPU model rather than
by checking specific CPUID feature bits.

> though I don't see any reason to add the condition to the copy to/from cache flows.
>
>>   	if (!nested_vmcb_needs_vls_intercept(svm))
>>   		svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
>> @@ -1348,6 +1351,7 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
>>   	dst->nested_ctl           = from->nested_ctl;
>>   	dst->event_inj            = from->event_inj;
>>   	dst->event_inj_err        = from->event_inj_err;
>> +	dst->next_rip             = from->next_rip;
>>   	dst->nested_cr3           = from->nested_cr3;
>>   	dst->virt_ext              = from->virt_ext;
>>   	dst->pause_filter_count   = from->pause_filter_count;
>> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
>> index 93502d2a52ce..f757400fc933 100644
>> --- a/arch/x86/kvm/svm/svm.h
>> +++ b/arch/x86/kvm/svm/svm.h
>> @@ -138,6 +138,7 @@ struct vmcb_ctrl_area_cached {
>>   	u64 nested_ctl;
>>   	u32 event_inj;
>>   	u32 event_inj_err;
>> +	u64 next_rip;
>>   	u64 nested_cr3;
>>   	u64 virt_ext;
>>   	u32 clean;
> 
> I don't know why this struct has
> 
> 	u8 reserved_sw[32];
> 
> but presumably it's for padding, i.e. probably should be reduced to 24 bytes.

Apparently the "reserved_sw" field stores Hyper-V enlightenments state -
see commit 66c03a926f18 ("KVM: nSVM: Implement Enlightened MSR-Bitmap feature")
and nested_svm_vmrun_msrpm() in nested.c.

Thanks,
Maciej

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-04-01 19:08     ` Maciej S. Szmigiero
@ 2022-04-01 21:51       ` Sean Christopherson
  2022-04-04  9:50         ` Maxim Levitsky
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2022-04-01 21:51 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

On Fri, Apr 01, 2022, Maciej S. Szmigiero wrote:
> On 1.04.2022 20:32, Sean Christopherson wrote:
> > On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
> > > +	/* The return address pushed on stack by the CPU for some injected events */
> > > +	svm->vmcb->control.next_rip            = svm->nested.ctl.next_rip;
> > 
> > This needs to be gated by nrips being enabled _and_ exposed to L1, i.e.
> > 
> > 	if (svm->nrips_enabled)
> > 		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
> 
> It can be done, however what if we run on a nrips-capable CPU,
> but don't expose this capability to the L1?

Oh, right, because the field will be populated by the CPU on VM-Exit.  Ah, the
correct behavior is to grab RIP from vmcb12 to emulate nrips=0 hardware simply
not updating RIP.  E.g. zeroing it out would send L2 into the weeds on IRET due
the CPU pushing '0' on the stack when vectoring the injected event.

	if (svm->nrips_enabled)
		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
	else if (boot_cpu_has(X86_FEATURE_NRIPS))
		vmcb02->control.next_rip    = vmcb12_rip;

> The CPU will then push whatever value was left in this field as
> the return address for some L1 injected events.
> 
> Although without the nrips feature the L1 shouldn't even attempt event
> injection, copying this field anyway will make it work if L1 just
> expects this capability based on the current CPU model rather than
> by checking specific CPUID feature bits.

L1 may still inject the exception; it just advances the RIP manually.  As above,
the really messy thing is that, because there's no flag to say "don't use NextRIP!",
the CPU will still consume NextRIP and push '0' on the stack for the return RIP
from the INTn/INT3/INTO.  Yay.

I found that out the hard way (patch in-progress).  The way to handle event
injection if KVM is loaded with nrips=0 but nrips is supported in hardware is to
stuff NextRIP on event injection even if nrips=0; otherwise the guest is hosed.

> > > +	u64 next_rip;
> > >   	u64 nested_cr3;
> > >   	u64 virt_ext;
> > >   	u32 clean;
> > 
> > I don't know why this struct has
> > 
> > 	u8 reserved_sw[32];
> > 
> > but presumably it's for padding, i.e. probably should be reduced to 24 bytes.
> 
> Apparently the "reserved_sw" field stores Hyper-V enlightenments state -
> see commit 66c03a926f18 ("KVM: nSVM: Implement Enlightened MSR-Bitmap feature")
> and nested_svm_vmrun_msrpm() in nested.c.

Argh, that's a terrible name.  Thanks for doing the homework, I was being lazy.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events
  2022-04-01 16:05             ` Maciej S. Szmigiero
@ 2022-04-01 22:07               ` Sean Christopherson
  0 siblings, 0 replies; 20+ messages in thread
From: Sean Christopherson @ 2022-04-01 22:07 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2830 bytes --]

On Fri, Apr 01, 2022, Maciej S. Szmigiero wrote:
> On 1.04.2022 02:08, Sean Christopherson wrote:
> > whereas SVM unconditionally treats #BP and #OF as soft:
> > 
> >    Injecting an exception (TYPE = 3) with vectors 3 or 4 behaves like a trap raised by
> >    INT3 and INTO instructions
> > 
> > Now I'm curious why Intel doesn't do the same...
> 
> Perhaps it's just a result of VMX and SVM being developed independently,
> in parallel.

That definitely factors in, but nothing in VMX is spurious/random, there's always
a reason/motivation behind the architecture, even if that reason isn't obvious.
It bugs me that I can't figure out the "why" :-)

Ugh, but SVM still needs to fix injection for software interrupts, i.e. svm_inject_irq()
is broken.

> > > We still need L1 -> L2 event injection detection to restore the NextRIP
> > > field when re-injecting an event that uses it.
> > 
> > You lost me on this one.  KVM L0 is only (and always!) responsible for saving the
> > relevant info into vmcb12, why does it need to detect where the vectoring exception
> > came from?
> 
> Look at the description of patch 4 of this patch set - after some
> L2 VMEXITs handled by L0 (in other words, which do *not* result in
> a nested VMEXIT to L1) the VMCB02 NextRIP field will be zero
> (APM 15.7.1 "State Saved on Exit" describes when this happens).
> 
> If we attempt to re-inject some types of events with the NextRIP field
> being zero the return address pushed on stack will also be zero, which
> will obviously do bad things to the L2 when it returns from
> an (exception|interrupt) handler.

Right, but that issue isn't unique to L2 exits handled by L0.  E.g. if KVM is using
shadow paging and a #PF "owned" by KVM occurs while vectoring INT3/INTO, then KVM
needs to restore NextRIP after resolving the #PF.

Argh, but NextRIP being zeroed makes it painful to handle the scenario where the
INT3/INTO isn't injected, because KVM doesn't have a record of the instruction length.
That's a big fail on SVM's part :-/

Huh, actually, it's not too bad to support.  SVM already has the code to compute
the next RIP via the emulator, it's easy enough to reuse that code to calculate
NextRIP and then unwind RIP.

With your patch 1, your selftest and all KVM unit tests pass[*] with the attached
patch.  I'll split it up and send out a proper series with your other fixes and
selftests.  I apologize for usurping your series, I was sketching up a diff to
show what I had in mind for reinjecting and it ended up being less code than I
expected to actually get it working.

[*] The new SVM KUT tests don't pass, but that's a different mess, and anything
    that uses INVPCID doesn't pass with nrips=0 && npt=0 because KVM's emulator
    doesn't know how to decode INVPCID, but KVM needs to intercept INVPCID when
    using shadow paging.

[-- Attachment #2: 0001-KVM-SVM-Re-inject-soft-interrupts-instead-of-retryin.patch --]
[-- Type: text/x-diff, Size: 6506 bytes --]

From 45c0408ca6738a98a3284837d9383be966e86901 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 1 Apr 2022 10:01:18 -0700
Subject: [PATCH] KVM: SVM: Re-inject soft interrupts instead of retrying
 instruction

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/svm.c | 101 +++++++++++++++++++++++++++++------------
 arch/x86/kvm/svm/svm.h |   4 +-
 2 files changed, 73 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 2c86bd9176c6..c534d00ae194 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -370,6 +370,45 @@ static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
+{
+	unsigned long rip, old_rip = kvm_rip_read(vcpu);
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	/*
+	 * Due to architectural shortcomings, the CPU doesn't always provide
+	 * NextRIP, e.g. if KVM intercepted an exception that occurred while
+	 * the CPU was vectoring an INTO/INT3 in the guest.  Temporarily skip
+	 * the instruction even if NextRIP is supported to acquire the next
+	 * RIP so that it can be shoved into the NextRIP field, otherwise
+	 * hardware will fail to advance guest RIP during event injection.
+	 * Drop the exception/interrupt if emulation fails and effectively
+	 * retry the instruction, it's the least awful option.
+	 */
+	if (!svm_skip_emulated_instruction(vcpu))
+		return -EIO;
+
+	rip = kvm_rip_read(vcpu);
+
+	/*
+	 * If NextRIP is supported, rewind RIP and update NextRip.  If NextRip
+	 * isn't supported, keep the result of the skip as the CPU obviously
+	 * won't advance RIP, but stash away the injection information so that
+	 * RIP can be unwound if injection fails.
+	 */
+	if (nrips) {
+		kvm_rip_write(vcpu, old_rip);
+		svm->vmcb->control.next_rip = rip;
+	} else {
+		if (static_cpu_has(X86_FEATURE_NRIPS))
+			svm->vmcb->control.next_rip = rip;
+
+		svm->soft_int_linear_rip = rip + svm->vmcb->save.cs.base;
+		svm->soft_int_injected = rip - old_rip;
+	}
+	return 0;
+}
+
 static void svm_queue_exception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -379,21 +418,9 @@ static void svm_queue_exception(struct kvm_vcpu *vcpu)
 
 	kvm_deliver_exception_payload(vcpu);
 
-	if (nr == BP_VECTOR && !nrips) {
-		unsigned long rip, old_rip = kvm_rip_read(vcpu);
-
-		/*
-		 * For guest debugging where we have to reinject #BP if some
-		 * INT3 is guest-owned:
-		 * Emulate nRIP by moving RIP forward. Will fail if injection
-		 * raises a fault that is not intercepted. Still better than
-		 * failing in all cases.
-		 */
-		(void)svm_skip_emulated_instruction(vcpu);
-		rip = kvm_rip_read(vcpu);
-		svm->int3_rip = rip + svm->vmcb->save.cs.base;
-		svm->int3_injected = rip - old_rip;
-	}
+	if (kvm_exception_is_soft(nr) &&
+	    svm_update_soft_interrupt_rip(vcpu))
+		return;
 
 	svm->vmcb->control.event_inj = nr
 		| SVM_EVTINJ_VALID
@@ -3382,14 +3409,24 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
 static void svm_inject_irq(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	u32 type;
 
 	WARN_ON(!gif_set(svm));
 
+	if (vcpu->arch.interrupt.soft) {
+		if (svm_update_soft_interrupt_rip(vcpu))
+			return;
+
+		type = SVM_EVTINJ_TYPE_SOFT;
+	} else {
+		type = SVM_EVTINJ_TYPE_INTR;
+	}
+
 	trace_kvm_inj_virq(vcpu->arch.interrupt.nr);
 	++vcpu->stat.irq_injections;
 
 	svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
-		SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
+				       SVM_EVTINJ_VALID | type;
 }
 
 void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode,
@@ -3672,9 +3709,9 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 	u8 vector;
 	int type;
 	u32 exitintinfo = svm->vmcb->control.exit_int_info;
-	unsigned int3_injected = svm->int3_injected;
+	unsigned soft_int_injected = svm->soft_int_injected;
 
-	svm->int3_injected = 0;
+	svm->soft_int_injected = 0;
 
 	/*
 	 * If we've made progress since setting HF_IRET_MASK, we've
@@ -3694,6 +3731,18 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 	if (!(exitintinfo & SVM_EXITINTINFO_VALID))
 		return;
 
+	/*
+	 * If NextRIP isn't enabled, KVM must manually advance RIP prior to
+	 * injecting the soft exception/interrupt.  That advancement needs to
+	 * be unwound if vectoring didn't complete.  Note, the _new_ event may
+	 * not be the injected event, e.g. if KVM injected an INTn, the INTn
+	 * hit a #NP in the guest, and the #NP encountered a #PF, the #NP will
+	 * be the reported vectored event, but RIP still needs to be unwound.
+	 */
+	if (soft_int_injected &&
+	    kvm_is_linear_rip(vcpu, to_svm(vcpu)->soft_int_linear_rip))
+		kvm_rip_write(vcpu, kvm_rip_read(vcpu) - soft_int_injected);
+
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 
 	vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
@@ -3710,18 +3759,6 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 		if (vector == X86_TRAP_VC)
 			break;
 
-		/*
-		 * In case of software exceptions, do not reinject the vector,
-		 * but re-execute the instruction instead. Rewind RIP first
-		 * if we emulated INT3 before.
-		 */
-		if (kvm_exception_is_soft(vector)) {
-			if (vector == BP_VECTOR && int3_injected &&
-			    kvm_is_linear_rip(vcpu, svm->int3_rip))
-				kvm_rip_write(vcpu,
-					      kvm_rip_read(vcpu) - int3_injected);
-			break;
-		}
 		if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) {
 			u32 err = svm->vmcb->control.exit_int_info_err;
 			kvm_requeue_exception_e(vcpu, vector, err);
@@ -3732,9 +3769,13 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 	case SVM_EXITINTINFO_TYPE_INTR:
 		kvm_queue_interrupt(vcpu, vector, false);
 		break;
+	case SVM_EXITINTINFO_TYPE_SOFT:
+		kvm_queue_interrupt(vcpu, vector, true);
+		break;
 	default:
 		break;
 	}
+
 }
 
 static void svm_cancel_injection(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 47e7427d0395..a770a1c7ddd2 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -230,8 +230,8 @@ struct vcpu_svm {
 	bool nmi_singlestep;
 	u64 nmi_singlestep_guest_rflags;
 
-	unsigned int3_injected;
-	unsigned long int3_rip;
+	unsigned soft_int_injected;
+	unsigned long soft_int_linear_rip;
 
 	/* optional nested SVM features that are enabled for this guest  */
 	bool nrips_enabled                : 1;

base-commit: 26f97f8db06dc08a2b6a48692cdc1d89b288905d
-- 
2.35.1.1094.g7c7d902a7c-goog



* Re: [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
  2022-04-01 21:51       ` Sean Christopherson
@ 2022-04-04  9:50         ` Maxim Levitsky
  0 siblings, 0 replies; 20+ messages in thread
From: Maxim Levitsky @ 2022-04-04  9:50 UTC (permalink / raw)
  To: Sean Christopherson, Maciej S. Szmigiero
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

On Fri, 2022-04-01 at 21:51 +0000, Sean Christopherson wrote:
> On Fri, Apr 01, 2022, Maciej S. Szmigiero wrote:
> > On 1.04.2022 20:32, Sean Christopherson wrote:
> > > On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote:
> > > > +	/* The return address pushed on stack by the CPU for some injected events */
> > > > +	svm->vmcb->control.next_rip            = svm->nested.ctl.next_rip;
> > > 
> > > This needs to be gated by nrips being enabled _and_ exposed to L1, i.e.
> > > 
> > > 	if (svm->nrips_enabled)
> > > 		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
> > 
> > It can be done, however what if we run on a nrips-capable CPU,
> > but don't expose this capability to the L1?
> 
> Oh, right, because the field will be populated by the CPU on VM-Exit.  Ah, the
> correct behavior is to grab RIP from vmcb12 to emulate nrips=0 hardware simply
> not updating RIP.  E.g. zeroing it out would send L2 into the weeds on IRET due
> the CPU pushing '0' on the stack when vectoring the injected event.
> 
> 	if (svm->nrips_enabled)
> 		vmcb02->control.next_rip    = svm->nested.ctl.next_rip;
> 	else if (boot_cpu_has(X86_FEATURE_NRIPS))
> 		vmcb02->control.next_rip    = vmcb12_rip;
> 
> > The CPU will then push whatever value was left in this field as
> > the return address for some L1 injected events.

This makes sense.

Note that even AMD's PRM has a note about this:

"
15.20 Event Injection
...
Software interrupts cannot be properly injected if the processor does not support the NextRIP field.
Support is indicated by CPUID Fn8000_000A_EDX[NRIPS] = 1. Hypervisor software should
emulate the event injection of software interrupts if NextRIP is not supported
"



> > 
> > Although without nrips feature the L1 shouldn't even attempt event
> > injection, copying this field anyway will make it work if L1 just
> > expects this capability based on the current CPU model rather than
> > by checking specific CPUID feature bits.

The guest really ought to check CPUID bits. Plus the CPU model is also
usually virtualized (for named machine types in Qemu for example).

> 
> L1 may still inject the exception, it just advances the RIP manually.  As above,
> the really messy thing is that, because there's no flag to say "don't use NextRIP!",
> the CPU will still consume NextRIP and push '0' on the stack for the return RIP
> from the INTn/INT3/INTO.  Yay.
> 
> I found that out the hard way (patch in-progress).  The way to handle event
> injection if KVM is loaded with nrips=0 but nrips is supported in hardware is to
> stuff NextRIP on event injection even if nrips=0, otherwise the guest is hosed.
> 
> > > > +	u64 next_rip;
> > > >   	u64 nested_cr3;
> > > >   	u64 virt_ext;
> > > >   	u32 clean;
> > > 
> > > I don't know why this struct has
> > > 
> > > 	u8 reserved_sw[32];
> > > 
> > > but presumably it's for padding, i.e. probably should be reduced to 24 bytes.
> > 
> > Apparently the "reserved_sw" field stores Hyper-V enlightenments state -
> > see commit 66c03a926f18 ("KVM: nSVM: Implement Enlightened MSR-Bitmap feature")
> > and nested_svm_vmrun_msrpm() in nested.c.
> 
> Argh, that's a terrible name.  Thanks for doing the homework, I was being lazy.


That was added around the commit 

1183646a67d01 ("KVM: SVM: hyper-v: Direct Virtual Flush support")

Seems to be used by HV to store 'struct hv_enlightenments',
but I don't know 100% if that is the only thing that can be stored
in this area.



Best regards,
	Maxim Levitsky


> 




* Re: [PATCH 2/5] KVM: SVM: Downgrade BUG_ON() to WARN_ON() in svm_inject_irq()
  2022-03-10 21:38 ` [PATCH 2/5] KVM: SVM: Downgrade BUG_ON() to WARN_ON() in svm_inject_irq() Maciej S. Szmigiero
@ 2022-04-04  9:50   ` Maxim Levitsky
  0 siblings, 0 replies; 20+ messages in thread
From: Maxim Levitsky @ 2022-04-04  9:50 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

On Thu, 2022-03-10 at 22:38 +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> There is no need to bring down the whole host just because there might be
> some issue with respect to guest GIF handling in KVM.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>  arch/x86/kvm/svm/svm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index b069493ad5c7..1e5d904aeec3 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3322,7 +3322,7 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
>  
> -	BUG_ON(!(gif_set(svm)));
> +	WARN_ON(!gif_set(svm));
>  
>  	trace_kvm_inj_virq(vcpu->arch.interrupt.nr);
>  	++vcpu->stat.irq_injections;
> 
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



* Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events
  2022-03-10 21:38 ` [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events Maciej S. Szmigiero
  2022-03-30 21:59   ` Sean Christopherson
@ 2022-04-04  9:53   ` Maxim Levitsky
  2022-04-04 21:05     ` Maciej S. Szmigiero
  1 sibling, 1 reply; 20+ messages in thread
From: Maxim Levitsky @ 2022-04-04  9:53 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel

On Thu, 2022-03-10 at 22:38 +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> In SVM synthetic software interrupts or INT3 or INTO exception that L1
> wants to inject into its L2 guest are forgotten if there is an intervening
> L0 VMEXIT during their delivery.
> 
> They are re-injected correctly with VMX, however.
> 
> This is because there is an assumption in SVM that such exceptions will be
> re-delivered by simply re-executing the current instruction.
> Which might not be true if this is a synthetic exception injected by L1,
> since in this case the re-executed instruction will be one already in L2,
> not the VMRUN instruction in L1 that attempted the injection.
> 
> Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err} until
> it is either re-injected successfully or returned to L1 upon a nested
> VMEXIT.
> Make sure to always re-queue such event if returned in EXITINTINFO.
> 
> The handling of L0 -> {L1, L2} event re-injection is left as-is to avoid
> unforeseen regressions.

Some time ago I noticed this too, but haven't dug into it too much.
I remember I even had some half-baked patch for this I never posted,
because I didn't think about the possibility of this synthetic injection.

Just to be clear that I understand this correctly:

1. What is happening is that L1 is injecting INTn/INTO/INT3 but the L2 code
   doesn't actually contain an INTn/INTO/INT3 instruction.
   This is a weird but legal thing to do.
   Again, if L2 actually contained the instruction, it would have worked?


2. When an actual INTn/INTO/INT3 is intercepted on SVM, then
   save.RIP points to the address of the instruction, and control.next_rip
   points to the address of the next instruction (as expected)

3. When EVENTINJ is used with '(TYPE = 3) with vectors 3 or 4'
   or 'TYPE=4', then next_rip is pushed on the stack, while save.RIP is
   pretty much ignored, and execution jumps to the handler in the IDT.
   Also, at least for INT3/INTO, the PRM states that the IDT entry's DPL
   field is checked before dispatch, meaning that we can get a legit #GP
   during delivery.
   (This looks like another legit reason to fix exception merging in KVM.)


Best regards,
	Maxim Levitsky


> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>  arch/x86/kvm/svm/nested.c | 65 +++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/svm/svm.c    | 17 ++++++++--
>  arch/x86/kvm/svm/svm.h    | 47 ++++++++++++++++++++++++++++
>  3 files changed, 125 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 9656f0d6815c..75017bf77955 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -420,8 +420,17 @@ void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
>  void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
>  {
>  	u32 mask;
> -	svm->nested.ctl.event_inj      = svm->vmcb->control.event_inj;
> -	svm->nested.ctl.event_inj_err  = svm->vmcb->control.event_inj_err;
> +
> +	/*
> +	 * Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err}
> +	 * if its re-injection is needed
> +	 */
> +	if (!exit_during_event_injection(svm, svm->nested.ctl.event_inj,
> +					 svm->nested.ctl.event_inj_err)) {
> +		WARN_ON_ONCE(svm->vmcb->control.event_inj & SVM_EVTINJ_VALID);
> +		svm->nested.ctl.event_inj      = svm->vmcb->control.event_inj;
> +		svm->nested.ctl.event_inj_err  = svm->vmcb->control.event_inj_err;
> +	}

Beware that this could backfire in regard to nested migration.

I once chased a nasty bug related to this.

The bug was:

- L1 does VMRUN with injection (EVENTINJ set)

- VMRUN exit handler is 'executed' by KVM,
  This copies EVENTINJ fields from VMCB12 to VMCB02

- Once VMRUN exit handler is done executing, we exit to userspace to start migration
  (basically static_call(kvm_x86_handle_exit)(...) handles the SVM_EXIT_VMRUN,
   and that is all, vcpu_enter_guest isn't called again, so injection is not canceled)

- migration happens and it migrates the control area of vmcb02 with EVENTINJ fields set.

- on the migration target, we inject another interrupt into the guest via EVENTINJ
  (svm_check_nested_events checks nested_run_pending to avoid doing this,
  but nested_run_pending was not migrated correctly),
  which overwrites the EVENTINJ - the injection is lost.

Paolo back then proposed to me that instead of doing a direct copy from VMCB12 to VMCB02
we should go through 'vcpu->arch.interrupt' and such.
I had a prototype of this but never got around to cleaning it up to be accepted
upstream, knowing that the current way also works.

  


>  
>  	/* Only a few fields of int_ctl are written by the processor.  */
>  	mask = V_IRQ_MASK | V_TPR_MASK;
> @@ -669,6 +678,54 @@ static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to
>  	to_vmcb->save.spec_ctrl = from_vmcb->save.spec_ctrl;
>  }
>  
> +void nested_svm_maybe_reinject(struct kvm_vcpu *vcpu)

A personal taste note: I don't like the 'maybe' for some reason.
But I won't fight over this.

> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +	unsigned int vector, type;
> +	u32 exitintinfo = svm->vmcb->control.exit_int_info;
> +
> +	if (WARN_ON_ONCE(!is_guest_mode(vcpu)))
> +		return;
> +
> +	/*
> +	 * No L1 -> L2 event to re-inject?
> +	 *
> +	 * In this case event_inj will be cleared by
> +	 * nested_sync_control_from_vmcb02().
> +	 */
> +	if (!(svm->nested.ctl.event_inj & SVM_EVTINJ_VALID))
> +		return;
> +
> +	/* If the last event injection was successful there shouldn't be any pending event */
> +	if (WARN_ON_ONCE(!(exitintinfo & SVM_EXITINTINFO_VALID)))
> +		return;
> +
> +	kvm_make_request(KVM_REQ_EVENT, vcpu);
> +
> +	vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
> +	type = exitintinfo & SVM_EXITINTINFO_TYPE_MASK;
> +
> +	switch (type) {
> +	case SVM_EXITINTINFO_TYPE_NMI:
> +		vcpu->arch.nmi_injected = true;
> +		break;
> +	case SVM_EXITINTINFO_TYPE_EXEPT:
> +		if (exitintinfo & SVM_EXITINTINFO_VALID_ERR)
> +			kvm_requeue_exception_e(vcpu, vector,
> +						svm->vmcb->control.exit_int_info_err);
> +		else
> +			kvm_requeue_exception(vcpu, vector);
> +		break;
> +	case SVM_EXITINTINFO_TYPE_SOFT:

Note that AFAIK, SVM_EXITINTINFO_TYPE_SOFT is only for INTn instructions,
while INT3 and INTO are considered normal exceptions, but the EVENTINJ
handling has a special case for them.


On VMX it is a bit cleaner:
It has:

3 - normal stock exception caused by CPU itself, like #PF and such
      
4 - INTn
      * does DPL check and uses VM_EXIT_INSTRUCTION_LEN
       
5 - ICEBP/INT1, which SVM doesn't support re-injecting
      * doesn't do a DPL check, but uses VM_EXIT_INSTRUCTION_LEN I think

6 - software exception (INT3/INTO)
      * does DPL check and uses VM_EXIT_INSTRUCTION_LEN as well

I don't know if there is any difference between 4 and 6.




> +	case SVM_EXITINTINFO_TYPE_INTR:
> +		kvm_queue_interrupt(vcpu, vector, type == SVM_EXITINTINFO_TYPE_SOFT);
> +		break;
> +	default:
> +		vcpu_unimpl(vcpu, "unknown L1 -> L2 exitintinfo type 0x%x\n", type);
> +		break;
> +	}
> +}
> +
>  int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
>  			 struct vmcb *vmcb12, bool from_vmrun)
>  {
> @@ -898,6 +955,10 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
>  	if (svm->nrips_enabled)
>  		vmcb12->control.next_rip  = vmcb->control.next_rip;
>  
> +	/* Forget about any pending L1 event injection since it's a L1 worry now */
> +	svm->nested.ctl.event_inj = 0;
> +	svm->nested.ctl.event_inj_err = 0;
> +
>  	vmcb12->control.int_ctl           = svm->nested.ctl.int_ctl;
>  	vmcb12->control.tlb_ctl           = svm->nested.ctl.tlb_ctl;
>  	vmcb12->control.event_inj         = svm->nested.ctl.event_inj;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 1e5d904aeec3..5b128baa5e57 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3322,13 +3322,18 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
>  
> -	WARN_ON(!gif_set(svm));
> +	WARN_ON(!(vcpu->arch.interrupt.soft || gif_set(svm)));
>  
>  	trace_kvm_inj_virq(vcpu->arch.interrupt.nr);
>  	++vcpu->stat.irq_injections;
>  
>  	svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
> -		SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
> +		SVM_EVTINJ_VALID;
> +	if (vcpu->arch.interrupt.soft) {
> +		svm->vmcb->control.event_inj |= SVM_EVTINJ_TYPE_SOFT;
> +	} else {
> +		svm->vmcb->control.event_inj |= SVM_EVTINJ_TYPE_INTR;
> +	}
>  }
>  
>  void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode,
> @@ -3627,6 +3632,14 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
>  	if (!(exitintinfo & SVM_EXITINTINFO_VALID))
>  		return;
>  
> +	/* L1 -> L2 event re-injection needs a different handling */
> +	if (is_guest_mode(vcpu) &&
> +	    exit_during_event_injection(svm, svm->nested.ctl.event_inj,
> +					svm->nested.ctl.event_inj_err)) {
> +		nested_svm_maybe_reinject(vcpu);
> +		return;
> +	}
> +
>  	kvm_make_request(KVM_REQ_EVENT, vcpu);
>  
>  	vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index f757400fc933..7cafc2e6c82a 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -488,6 +488,52 @@ static inline bool nested_npt_enabled(struct vcpu_svm *svm)
>  	return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
>  }
>  
> +static inline bool event_inj_same(u32 event_inj1, u32 event_inj_err1,
> +				  u32 event_inj2, u32 event_inj_err2)
> +{
> +	unsigned int vector_1, vector_2, type_1, type_2;
> +
> +	/* Either of them not valid? */
> +	if (!(event_inj1 & SVM_EVTINJ_VALID) ||
> +	    !(event_inj2 & SVM_EVTINJ_VALID))
> +		return false;
> +
> +	vector_1 = event_inj1 & SVM_EVTINJ_VEC_MASK;
> +	type_1 = event_inj1 & SVM_EVTINJ_TYPE_MASK;
> +	vector_2 = event_inj2 & SVM_EVTINJ_VEC_MASK;
> +	type_2 = event_inj2 & SVM_EVTINJ_TYPE_MASK;
> +
> +	/* Different vector or type? */
> +	if (vector_1 != vector_2 || type_1 != type_2)
> +		return false;
> +
> +	/* Different error code presence flag? */
> +	if ((event_inj1 & SVM_EVTINJ_VALID_ERR) !=
> +	    (event_inj2 & SVM_EVTINJ_VALID_ERR))
> +		return false;
> +
> +	/* No error code? */
> +	if (!(event_inj1 & SVM_EVTINJ_VALID_ERR))
> +		return true;
> +
> +	/* Same error code? */
> +	return event_inj_err1 == event_inj_err2;
> +}
> +
> +/* Did the last VMEXIT happen when attempting to inject that event? */
> +static inline bool exit_during_event_injection(struct vcpu_svm *svm,
> +					       u32 event_inj, u32 event_inj_err)
> +{
> +	BUILD_BUG_ON(SVM_EXITINTINFO_VEC_MASK != SVM_EVTINJ_VEC_MASK ||
> +		     SVM_EXITINTINFO_TYPE_MASK != SVM_EVTINJ_TYPE_MASK ||
> +		     SVM_EXITINTINFO_VALID != SVM_EVTINJ_VALID ||
> +		     SVM_EXITINTINFO_VALID_ERR != SVM_EVTINJ_VALID_ERR);
> +
> +	return event_inj_same(svm->vmcb->control.exit_int_info,
> +			      svm->vmcb->control.exit_int_info_err,
> +			      event_inj, event_inj_err);
> +}
> +
>  /* svm.c */
>  #define MSR_INVALID				0xffffffffU
>  
> @@ -540,6 +586,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
>  	return vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_NMI);
>  }
>  
> +void nested_svm_maybe_reinject(struct kvm_vcpu *vcpu);
>  int enter_svm_guest_mode(struct kvm_vcpu *vcpu,
>  			 u64 vmcb_gpa, struct vmcb *vmcb12, bool from_vmrun);
>  void svm_leave_nested(struct kvm_vcpu *vcpu);
> 


I will also review Sean's take on this, let's see which one is simpler.


Best regards,
	Maxim Levitsky





* Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events
  2022-04-04  9:53   ` Maxim Levitsky
@ 2022-04-04 21:05     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 20+ messages in thread
From: Maciej S. Szmigiero @ 2022-04-04 21:05 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Tom Lendacky, Brijesh Singh, Jon Grimm,
	David Kaplan, Boris Ostrovsky, Liam Merwick, kvm, linux-kernel,
	Paolo Bonzini

On 4.04.2022 11:53, Maxim Levitsky wrote:
> On Thu, 2022-03-10 at 22:38 +0100, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> In SVM synthetic software interrupts or INT3 or INTO exception that L1
>> wants to inject into its L2 guest are forgotten if there is an intervening
>> L0 VMEXIT during their delivery.
>>
>> They are re-injected correctly with VMX, however.
>>
>> This is because there is an assumption in SVM that such exceptions will be
>> re-delivered by simply re-executing the current instruction.
>> Which might not be true if this is a synthetic exception injected by L1,
>> since in this case the re-executed instruction will be one already in L2,
>> not the VMRUN instruction in L1 that attempted the injection.
>>
>> Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err} until
>> it is either re-injected successfully or returned to L1 upon a nested
>> VMEXIT.
>> Make sure to always re-queue such event if returned in EXITINTINFO.
>>
>> The handling of L0 -> {L1, L2} event re-injection is left as-is to avoid
>> unforeseen regressions.
> 
> Some time ago I noticed this too, but haven't dug into it too much.
> I remember I even had some half-baked patch for this I never posted,
> because I didn't think about the possibility of this synthetic injection.
> 
> Just to be clear that I understand this correctly:
> 
> 1. What is happening is that L1 is injecting INTn/INTO/INT3 but the L2 code
>     doesn't actually contain an INTn/INTO/INT3 instruction.
>     This is a weird but legal thing to do.
>     Again, if L2 actually contained the instruction, it would have worked?

I think so (haven't tested it though).

> 
> 2. When an actual INTn/INTO/INT3 is intercepted on SVM, then
>     save.RIP points to the address of the instruction, and control.next_rip
>     points to the address of the next instruction (as expected)

Yes.

> 3. When EVENTINJ is used with '(TYPE = 3) with vectors 3 or 4'
>     or 'TYPE=4', then next_rip is pushed on the stack, while save.RIP is
>     pretty much ignored, and execution jumps to the handler in the IDT.

Yes.

>     Also, at least for INT3/INTO, the PRM states that the IDT entry's DPL
>     field is checked before dispatch, meaning that we can get a legit #GP
>     during delivery.
>     (This looks like another legit reason to fix exception merging in KVM.)
> 

That's right.

> Best regards,
> 	Maxim Levitsky
> 
> 
>>
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   arch/x86/kvm/svm/nested.c | 65 +++++++++++++++++++++++++++++++++++++--
>>   arch/x86/kvm/svm/svm.c    | 17 ++++++++--
>>   arch/x86/kvm/svm/svm.h    | 47 ++++++++++++++++++++++++++++
>>   3 files changed, 125 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
>> index 9656f0d6815c..75017bf77955 100644
>> --- a/arch/x86/kvm/svm/nested.c
>> +++ b/arch/x86/kvm/svm/nested.c
>> @@ -420,8 +420,17 @@ void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
>>   void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
>>   {
>>   	u32 mask;
>> -	svm->nested.ctl.event_inj      = svm->vmcb->control.event_inj;
>> -	svm->nested.ctl.event_inj_err  = svm->vmcb->control.event_inj_err;
>> +
>> +	/*
>> +	 * Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err}
>> +	 * if its re-injection is needed
>> +	 */
>> +	if (!exit_during_event_injection(svm, svm->nested.ctl.event_inj,
>> +					 svm->nested.ctl.event_inj_err)) {
>> +		WARN_ON_ONCE(svm->vmcb->control.event_inj & SVM_EVTINJ_VALID);
>> +		svm->nested.ctl.event_inj      = svm->vmcb->control.event_inj;
>> +		svm->nested.ctl.event_inj_err  = svm->vmcb->control.event_inj_err;
>> +	}
> 
> Beware that this could backfire in regard to nested migration.
> 
> I once chased a nasty bug related to this.
> 
> The bug was:
> 
> - L1 does VMRUN with injection (EVENTINJ set)
> 
> - VMRUN exit handler is 'executed' by KVM,
>    This copies EVENTINJ fields from VMCB12 to VMCB02
> 
> - Once VMRUN exit handler is done executing, we exit to userspace to start migration
>    (basically static_call(kvm_x86_handle_exit)(...) handles the SVM_EXIT_VMRUN,
>     and that is all, vcpu_enter_guest isn't called again, so injection is not canceled)
> 
> - migration happens and it migrates the control area of vmcb02 with EVENTINJ fields set.
> 
> - on the migration target, we inject another interrupt into the guest via EVENTINJ
>    (svm_check_nested_events checks nested_run_pending to avoid doing this,
>    but nested_run_pending was not migrated correctly),
>    which overwrites the EVENTINJ - the injection is lost.
> 
> Paolo back then proposed to me that instead of doing a direct copy from VMCB12 to VMCB02
> we should go through 'vcpu->arch.interrupt' and such.
> I had a prototype of this but never got around to cleaning it up to be accepted
> upstream, knowing that the current way also works.
> 

This sounds like a valid, but different, bug - to be honest, it would
look much cleaner to me, too, if EVENTINJ was parsed from VMCB12 into
relevant KVM injection structures on a nested VMRUN rather than following
a hybrid approach:
1) Copy the field from VMCB12 to VMCB02 directly on a nested VMRUN,

2) Parse the EXITINTINFO into KVM injection structures when re-injecting.

>>   
>>   	/* Only a few fields of int_ctl are written by the processor.  */
>>   	mask = V_IRQ_MASK | V_TPR_MASK;
>> @@ -669,6 +678,54 @@ static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to
>>   	to_vmcb->save.spec_ctrl = from_vmcb->save.spec_ctrl;
>>   }
>>   
>> +void nested_svm_maybe_reinject(struct kvm_vcpu *vcpu)
> 
> A personal taste note: I don't like the 'maybe' for some reason.
> But I won't fight over this.

What's your proposed name, then?

>> +{
>> +	struct vcpu_svm *svm = to_svm(vcpu);
>> +	unsigned int vector, type;
>> +	u32 exitintinfo = svm->vmcb->control.exit_int_info;
>> +
>> +	if (WARN_ON_ONCE(!is_guest_mode(vcpu)))
>> +		return;
>> +
>> +	/*
>> +	 * No L1 -> L2 event to re-inject?
>> +	 *
>> +	 * In this case event_inj will be cleared by
>> +	 * nested_sync_control_from_vmcb02().
>> +	 */
>> +	if (!(svm->nested.ctl.event_inj & SVM_EVTINJ_VALID))
>> +		return;
>> +
>> +	/* If the last event injection was successful there shouldn't be any pending event */
>> +	if (WARN_ON_ONCE(!(exitintinfo & SVM_EXITINTINFO_VALID)))
>> +		return;
>> +
>> +	kvm_make_request(KVM_REQ_EVENT, vcpu);
>> +
>> +	vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
>> +	type = exitintinfo & SVM_EXITINTINFO_TYPE_MASK;
>> +
>> +	switch (type) {
>> +	case SVM_EXITINTINFO_TYPE_NMI:
>> +		vcpu->arch.nmi_injected = true;
>> +		break;
>> +	case SVM_EXITINTINFO_TYPE_EXEPT:
>> +		if (exitintinfo & SVM_EXITINTINFO_VALID_ERR)
>> +			kvm_requeue_exception_e(vcpu, vector,
>> +						svm->vmcb->control.exit_int_info_err);
>> +		else
>> +			kvm_requeue_exception(vcpu, vector);
>> +		break;
>> +	case SVM_EXITINTINFO_TYPE_SOFT:
> 
> Note that AFAIK, SVM_EXITINTINFO_TYPE_SOFT is only for INTn instructions,
> while INT3 and INTO are considered normal exceptions, but EVENTINJ
> handling has a special case for them.
> 
That's right.
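The asymmetry can be summed up in a tiny sketch (again userspace-only, for
illustration; the encodings follow arch/x86/include/asm/svm.h): INTn is
re-injected as TYPE_SOFT, while INT3/INTO go in as TYPE_EXEPT with vectors
3 and 4, and ICEBP/INT1 has no re-injectable representation on SVM at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum soft_event { SOFT_INTN, SOFT_INT3, SOFT_INTO, SOFT_ICEBP };

#define EVTINJ_TYPE_SHIFT 8
#define EVTINJ_TYPE_EXEPT (3u << EVTINJ_TYPE_SHIFT)
#define EVTINJ_TYPE_SOFT  (4u << EVTINJ_TYPE_SHIFT)

/*
 * Returns the EVENTINJ type|vector encoding for re-injecting a
 * software-generated event on SVM, or 0 if SVM cannot re-inject it
 * (ICEBP).
 */
static uint32_t svm_soft_event_encoding(enum soft_event ev, uint8_t intn_vector)
{
	switch (ev) {
	case SOFT_INTN:
		return EVTINJ_TYPE_SOFT | intn_vector;
	case SOFT_INT3:
		return EVTINJ_TYPE_EXEPT | 3;   /* #BP */
	case SOFT_INTO:
		return EVTINJ_TYPE_EXEPT | 4;   /* #OF */
	case SOFT_ICEBP:
	default:
		return 0;
	}
}

/*
 * All three re-injectable kinds still need next_rip programmed, since
 * the CPU pushes next_rip (an absolute RIP, unlike VMX's length field)
 * as the return address for them.
 */
static bool svm_soft_event_needs_next_rip(enum soft_event ev)
{
	return ev != SOFT_ICEBP;
}
```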

> On VMX it is a bit cleaner:
> It has:
> 
> 3 - normal stock exception caused by CPU itself, like #PF and such
>        
> 4 - INTn
>        * does DPL check and uses VM_EXIT_INSTRUCTION_LEN
>         
> 5 - ICEBP/INT1, which SVM doesn't support re-injecting
>        * doesn't do DPL check, but uses VM_EXIT_INSTRUCTION_LEN I think
> 
> 6 - software exception (INT3/INTO)
>        * does DPL check and uses VM_EXIT_INSTRUCTION_LEN as well
> 
> I don't know if there is any difference between 4 and 6.
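For reference, the VMX interruption types listed above map to these
encodings (matching arch/x86/include/asm/vmx.h); the sketch just captures
which types consume VM_ENTRY_INSTRUCTION_LEN on injection - the role
next_rip plays on SVM:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* VMX event-injection types (bits 10:8 of the interruption-info field). */
#define INTR_TYPE_EXT_INTR          (0u << 8)  /* external interrupt */
#define INTR_TYPE_NMI_INTR          (2u << 8)  /* NMI */
#define INTR_TYPE_HARD_EXCEPTION    (3u << 8)  /* CPU-generated, e.g. #PF */
#define INTR_TYPE_SOFT_INTR         (4u << 8)  /* INTn */
#define INTR_TYPE_PRIV_SW_EXCEPTION (5u << 8)  /* ICEBP/INT1 */
#define INTR_TYPE_SOFT_EXCEPTION    (6u << 8)  /* INT3/INTO */

/*
 * Per the SDM, injecting types 4, 5 and 6 requires a valid
 * VM_ENTRY_INSTRUCTION_LEN so the CPU can compute the return RIP to
 * push (relative increment, unlike SVM's absolute next_rip).
 */
static bool vmx_injection_needs_insn_len(uint32_t type)
{
	return type == INTR_TYPE_SOFT_INTR ||
	       type == INTR_TYPE_PRIV_SW_EXCEPTION ||
	       type == INTR_TYPE_SOFT_EXCEPTION;
}
```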
> 
> 
> 
> 
(..)
> 
> 
> I will also review Sean's take on this, let's see which one is simpler.

Since Sean's patch set is supposed to supersede this one let's continue
the discussion there.

> Best regards,
> 	Maxim Levitsky

Thanks,
Maciej


Thread overview: 20+ messages
2022-03-10 21:38 [PATCH 0/5] nSVM: L1 -> L2 event injection fixes and a self-test Maciej S. Szmigiero
2022-03-10 21:38 ` [PATCH 1/5] KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 Maciej S. Szmigiero
2022-04-01 18:32   ` Sean Christopherson
2022-04-01 19:08     ` Maciej S. Szmigiero
2022-04-01 21:51       ` Sean Christopherson
2022-04-04  9:50         ` Maxim Levitsky
2022-03-10 21:38 ` [PATCH 2/5] KVM: SVM: Downgrade BUG_ON() to WARN_ON() in svm_inject_irq() Maciej S. Szmigiero
2022-04-04  9:50   ` Maxim Levitsky
2022-03-10 21:38 ` [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events Maciej S. Szmigiero
2022-03-30 21:59   ` Sean Christopherson
2022-03-30 22:16     ` Maciej S. Szmigiero
2022-03-30 23:20       ` Sean Christopherson
2022-03-31 23:09         ` Maciej S. Szmigiero
2022-04-01  0:08           ` Sean Christopherson
2022-04-01 16:05             ` Maciej S. Szmigiero
2022-04-01 22:07               ` Sean Christopherson
2022-04-04  9:53   ` Maxim Levitsky
2022-04-04 21:05     ` Maciej S. Szmigiero
2022-03-10 21:38 ` [PATCH 4/5] KVM: nSVM: Restore next_rip when doing L1 -> L2 event re-injection Maciej S. Szmigiero
2022-03-10 21:38 ` [PATCH 5/5] KVM: selftests: nSVM: Add svm_nested_soft_inject_test Maciej S. Szmigiero
