* [PATCH v2 0/8] KVM: Fix multiple issues in handling pending/injected events
@ 2017-11-21 15:30 Liran Alon
  2017-11-21 15:30 ` [PATCH v2 1/8] KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt Liran Alon
                   ` (7 more replies)
  0 siblings, 8 replies; 32+ messages in thread
From: Liran Alon @ 2017-11-21 15:30 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm; +Cc: jmattson, wanpeng.li, idan.brown

Hi,

This series aims to fix multiple issues in how KVM defines & handles
pending vs. injected events, both in non-nested and nested
scenarios.

The first patch removes incorrect and unnecessary code that clears a
pending NMI/interrupt when injecting an event in real-mode from one
of the VMCS event-injection methods (vmx_queue_exception(),
vmx_inject_nmi(), vmx_inject_irq()).

The second patch renames interrupt.pending to interrupt.injected in
order to better represent its meaning and align it with the meaning
of "pending" & "injected" events for exceptions & NMIs.
A "pending" event should represent an event whose side-effects have
not been applied yet. In contrast, an "injected" event should
represent an event whose side-effects have already been applied.
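
For reference, this is roughly how that convention looks in the
exception bookkeeping (a simplified sketch of the queued-exception
state in struct kvm_vcpu_arch; see kvm_host.h for the authoritative
definition):

	struct kvm_queued_exception {
		bool pending;		/* side-effects not yet applied */
		bool injected;		/* side-effects already applied */
		bool has_error_code;
		u8 nr;			/* exception vector */
		u32 error_code;
	} exception;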

The third through fifth patches make the set/get_events ioctls
consider a userspace-injected exception to be in "injected" state
instead of "pending" state. This aligns its handling with how
userspace-injected interrupts are handled.

The sixth patch fixes some misleading comments in
inject_pending_event() regarding why exception.pending blocks
re-injection of an NMI/interrupt and why it is evaluated first among
the pending events.

The seventh patch fixes a critical bug in nVMX which causes L1 to
lose an IPI when it is sent at the exact moment the destination vCPU,
currently running L2, exits to L0 due to event-delivery.

The last patch removes now-irrelevant code which sets KVM_REQ_EVENT
in conditions where it is unnecessary.

I would appreciate it if you could review all patches in the series
even if you encounter issues with earlier patches, because the series
addresses multiple independent issues in current KVM code.

Regards,
-Liran Alon


* [PATCH v2 1/8] KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt
  2017-11-21 15:30 [PATCH v2 0/8] KVM: Fix multiple issues in handling pending/injected events Liran Alon
@ 2017-11-21 15:30 ` Liran Alon
  2017-12-01 23:45   ` Jim Mattson
  2017-11-21 15:30 ` [PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected Liran Alon
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Liran Alon @ 2017-11-21 15:30 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Krish Sadhukhan

kvm_inject_realmode_interrupt() is called from one of the injection
functions which write event-injection information to the VMCS:
vmx_queue_exception(), vmx_inject_irq() and vmx_inject_nmi().

All these functions are called just to cause an event-injection into
the guest. They are not responsible for manipulating the event-pending
flag. The only purpose of kvm_inject_realmode_interrupt() should be
to emulate real-mode interrupt-injection.

Clearing the pending flag here was also simply incorrect when reached
from vmx_queue_exception(), which injects an exception rather than an
interrupt.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
---
 arch/x86/kvm/x86.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 34c85aa2e2d1..c85fc4406a7d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5411,11 +5411,6 @@ int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip)
 	kvm_rip_write(vcpu, ctxt->eip);
 	kvm_set_rflags(vcpu, ctxt->eflags);
 
-	if (irq == NMI_VECTOR)
-		vcpu->arch.nmi_pending = 0;
-	else
-		vcpu->arch.interrupt.pending = false;
-
 	return EMULATE_DONE;
 }
 EXPORT_SYMBOL_GPL(kvm_inject_realmode_interrupt);
-- 
1.9.1


* [PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected
  2017-11-21 15:30 [PATCH v2 0/8] KVM: Fix multiple issues in handling pending/injected events Liran Alon
  2017-11-21 15:30 ` [PATCH v2 1/8] KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt Liran Alon
@ 2017-11-21 15:30 ` Liran Alon
  2017-11-28 17:02   ` Jim Mattson
  2017-11-21 15:30 ` [PATCH v2 3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions Liran Alon
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Liran Alon @ 2017-11-21 15:30 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Krish Sadhukhan

For exception & NMI events, KVM code uses the following coding
convention:
*) "pending" represents an event that should be injected into the
guest at some point, but whose side-effects have not yet taken place.
*) "injected" represents an event whose side-effects have already
taken place.

However, interrupts don't conform to this coding convention.
All current code flows mark interrupt.pending when its side-effects
have already taken place (for example, the bit has been moved from the
LAPIC IRR to the ISR). Therefore, it makes sense to just rename
interrupt.pending to interrupt.injected.

This change follows the logic of commit 664f8e26b00c ("KVM: X86:
Fix loss of exception which has not yet been injected") which changed
exceptions to follow this coding convention as well.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 arch/x86/kvm/irq.c              | 4 ++--
 arch/x86/kvm/vmx.c              | 2 +-
 arch/x86/kvm/x86.c              | 8 ++++----
 arch/x86/kvm/x86.h              | 6 +++---
 5 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1bfb99770c34..f8ad3ca11a3a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -556,7 +556,7 @@ struct kvm_vcpu_arch {
 	} exception;
 
 	struct kvm_queued_interrupt {
-		bool pending;
+		bool injected;
 		bool soft;
 		u8 nr;
 	} interrupt;
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 5c24811e8b0b..1f7f37d1c8b9 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -74,7 +74,7 @@ static int kvm_cpu_has_extint(struct kvm_vcpu *v)
 int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
 {
 	if (!lapic_in_kernel(v))
-		return v->arch.interrupt.pending;
+		return v->arch.interrupt.injected;
 
 	if (kvm_cpu_has_extint(v))
 		return 1;
@@ -92,7 +92,7 @@ int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
 int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
 {
 	if (!lapic_in_kernel(v))
-		return v->arch.interrupt.pending;
+		return v->arch.interrupt.injected;
 
 	if (kvm_cpu_has_extint(v))
 		return 1;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c8a7bcc1bbd4..d939ed84f136 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -11007,7 +11007,7 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
 	} else if (vcpu->arch.nmi_injected) {
 		vmcs12->idt_vectoring_info_field =
 			INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR;
-	} else if (vcpu->arch.interrupt.pending) {
+	} else if (vcpu->arch.interrupt.injected) {
 		nr = vcpu->arch.interrupt.nr;
 		idt_vectoring = nr | VECTORING_INFO_VALID_MASK;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c85fc4406a7d..45baba8bc02e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3111,7 +3111,7 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 	events->exception.error_code = vcpu->arch.exception.error_code;
 
 	events->interrupt.injected =
-		vcpu->arch.interrupt.pending && !vcpu->arch.interrupt.soft;
+		vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
 	events->interrupt.nr = vcpu->arch.interrupt.nr;
 	events->interrupt.soft = 0;
 	events->interrupt.shadow = kvm_x86_ops->get_interrupt_shadow(vcpu);
@@ -3164,7 +3164,7 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 	vcpu->arch.exception.has_error_code = events->exception.has_error_code;
 	vcpu->arch.exception.error_code = events->exception.error_code;
 
-	vcpu->arch.interrupt.pending = events->interrupt.injected;
+	vcpu->arch.interrupt.injected = events->interrupt.injected;
 	vcpu->arch.interrupt.nr = events->interrupt.nr;
 	vcpu->arch.interrupt.soft = events->interrupt.soft;
 	if (events->flags & KVM_VCPUEVENT_VALID_SHADOW)
@@ -6406,7 +6406,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
 			return 0;
 		}
 
-		if (vcpu->arch.interrupt.pending) {
+		if (vcpu->arch.interrupt.injected) {
 			kvm_x86_ops->set_irq(vcpu);
 			return 0;
 		}
@@ -7413,7 +7413,7 @@ int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 
 	memset(sregs->interrupt_bitmap, 0, sizeof sregs->interrupt_bitmap);
 
-	if (vcpu->arch.interrupt.pending && !vcpu->arch.interrupt.soft)
+	if (vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft)
 		set_bit(vcpu->arch.interrupt.nr,
 			(unsigned long *)sregs->interrupt_bitmap);
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 6d112d8f799c..4eab2bae5937 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -19,19 +19,19 @@ static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
 static inline void kvm_queue_interrupt(struct kvm_vcpu *vcpu, u8 vector,
 	bool soft)
 {
-	vcpu->arch.interrupt.pending = true;
+	vcpu->arch.interrupt.injected = true;
 	vcpu->arch.interrupt.soft = soft;
 	vcpu->arch.interrupt.nr = vector;
 }
 
 static inline void kvm_clear_interrupt_queue(struct kvm_vcpu *vcpu)
 {
-	vcpu->arch.interrupt.pending = false;
+	vcpu->arch.interrupt.injected = false;
 }
 
 static inline bool kvm_event_needs_reinjection(struct kvm_vcpu *vcpu)
 {
-	return vcpu->arch.exception.injected || vcpu->arch.interrupt.pending ||
+	return vcpu->arch.exception.injected || vcpu->arch.interrupt.injected ||
 		vcpu->arch.nmi_injected;
 }
 
-- 
1.9.1


* [PATCH v2 3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions
  2017-11-21 15:30 [PATCH v2 0/8] KVM: Fix multiple issues in handling pending/injected events Liran Alon
  2017-11-21 15:30 ` [PATCH v2 1/8] KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt Liran Alon
  2017-11-21 15:30 ` [PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected Liran Alon
@ 2017-11-21 15:30 ` Liran Alon
  2017-11-22 20:25   ` Radim Krčmář
  2017-11-22 21:00   ` Jim Mattson
  2017-11-21 15:30 ` [PATCH v2 4/8] KVM: x86: Warn if userspace overrides existing injected exception/interrupt Liran Alon
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 32+ messages in thread
From: Liran Alon @ 2017-11-21 15:30 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Krish Sadhukhan

Do not consider a pending exception when returning an injected
exception to user-mode. A "pending" exception means its side-effects
have not been applied yet. In contrast, an "injected" exception means
its side-effects have already been applied.
Therefore, we only need to report injected exceptions to user-mode.
This is aligned with how interrupts are reported in the same ioctl.

For symmetry, an exception injected from user-mode should be
considered to be in "injected" state and not in "pending" state.

Fixes: 664f8e26b00c ("KVM: X86: Fix loss of exception which has not yet been injected")

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
---
 arch/x86/kvm/x86.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 45baba8bc02e..1490da89de4b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3096,14 +3096,9 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 					       struct kvm_vcpu_events *events)
 {
 	process_nmi(vcpu);
-	/*
-	 * FIXME: pass injected and pending separately.  This is only
-	 * needed for nested virtualization, whose state cannot be
-	 * migrated yet.  For now we can combine them.
-	 */
+
 	events->exception.injected =
-		(vcpu->arch.exception.pending ||
-		 vcpu->arch.exception.injected) &&
+		vcpu->arch.exception.injected &&
 		!kvm_exception_is_soft(vcpu->arch.exception.nr);
 	events->exception.nr = vcpu->arch.exception.nr;
 	events->exception.has_error_code = vcpu->arch.exception.has_error_code;
@@ -3158,8 +3153,8 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 		return -EINVAL;
 
 	process_nmi(vcpu);
-	vcpu->arch.exception.injected = false;
-	vcpu->arch.exception.pending = events->exception.injected;
+	vcpu->arch.exception.injected = events->exception.injected;
+	vcpu->arch.exception.pending = false;
 	vcpu->arch.exception.nr = events->exception.nr;
 	vcpu->arch.exception.has_error_code = events->exception.has_error_code;
 	vcpu->arch.exception.error_code = events->exception.error_code;
-- 
1.9.1


* [PATCH v2 4/8] KVM: x86: Warn if userspace overrides existing injected exception/interrupt
  2017-11-21 15:30 [PATCH v2 0/8] KVM: Fix multiple issues in handling pending/injected events Liran Alon
                   ` (2 preceding siblings ...)
  2017-11-21 15:30 ` [PATCH v2 3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions Liran Alon
@ 2017-11-21 15:30 ` Liran Alon
  2017-11-22 20:34   ` Radim Krčmář
  2017-11-21 15:30 ` [PATCH v2 5/8] Revert "kvm: nVMX: Disallow userspace-injected exceptions in guest mode" Liran Alon
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Liran Alon @ 2017-11-21 15:30 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Krish Sadhukhan

Warn if userspace overrides an existing injected exception or
interrupt via the KVM_SET_VCPU_EVENTS ioctl. An alternative would
have been to return -EBUSY in this case. For now, we decided to just
override the existing event and warn on such an attempt.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
---
 arch/x86/kvm/x86.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1490da89de4b..c8cec7c39c1c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3153,12 +3153,25 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 		return -EINVAL;
 
 	process_nmi(vcpu);
+
+	/*
+	 * Warn if userspace is overriding existing
+	 * injected exception
+	 */
+	WARN_ON_ONCE(vcpu->arch.exception.injected &&
+		     events->exception.injected);
 	vcpu->arch.exception.injected = events->exception.injected;
 	vcpu->arch.exception.pending = false;
 	vcpu->arch.exception.nr = events->exception.nr;
 	vcpu->arch.exception.has_error_code = events->exception.has_error_code;
 	vcpu->arch.exception.error_code = events->exception.error_code;
 
+	/*
+	 * Warn if userspace is overriding existing
+	 * injected interrupt
+	 */
+	WARN_ON_ONCE(vcpu->arch.interrupt.injected &&
+		     events->interrupt.injected);
 	vcpu->arch.interrupt.injected = events->interrupt.injected;
 	vcpu->arch.interrupt.nr = events->interrupt.nr;
 	vcpu->arch.interrupt.soft = events->interrupt.soft;
-- 
1.9.1


* [PATCH v2 5/8] Revert "kvm: nVMX: Disallow userspace-injected exceptions in guest mode"
  2017-11-21 15:30 [PATCH v2 0/8] KVM: Fix multiple issues in handling pending/injected events Liran Alon
                   ` (3 preceding siblings ...)
  2017-11-21 15:30 ` [PATCH v2 4/8] KVM: x86: Warn if userspace overrides existing injected exception/interrupt Liran Alon
@ 2017-11-21 15:30 ` Liran Alon
  2017-11-21 15:30 ` [PATCH v2 6/8] KVM: x86: Fix misleading comments on handling pending exceptions Liran Alon
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Liran Alon @ 2017-11-21 15:30 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Krish Sadhukhan

The reverted commit seems to assume that userspace-injected exceptions
are in "pending" state and not in "injected" state.
However, they should be treated as in "injected" state. Therefore,
their side-effects should be assumed to have already taken place.

This is aligned with how userspace-injected interrupts are handled in
the same ioctl.

Because userspace-injected exceptions should be treated as being in
"injected" state, they cannot be intercepted by L1 when running in
L2, and therefore none of the cases the reverted commit mentions
apply.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
---
 arch/x86/kvm/x86.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c8cec7c39c1c..f4a623ba087e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3142,8 +3142,7 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 		return -EINVAL;
 
 	if (events->exception.injected &&
-	    (events->exception.nr > 31 || events->exception.nr == NMI_VECTOR ||
-	     is_guest_mode(vcpu)))
+	    (events->exception.nr > 31 || events->exception.nr == NMI_VECTOR))
 		return -EINVAL;
 
 	/* INITs are latched while in SMM */
-- 
1.9.1


* [PATCH v2 6/8] KVM: x86: Fix misleading comments on handling pending exceptions
  2017-11-21 15:30 [PATCH v2 0/8] KVM: Fix multiple issues in handling pending/injected events Liran Alon
                   ` (4 preceding siblings ...)
  2017-11-21 15:30 ` [PATCH v2 5/8] Revert "kvm: nVMX: Disallow userspace-injected exceptions in guest mode" Liran Alon
@ 2017-11-21 15:30 ` Liran Alon
  2017-11-21 15:30 ` [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending Liran Alon
  2017-11-21 15:30 ` [PATCH v2 8/8] KVM: nVMX: Optimization: Don't set KVM_REQ_EVENT when VMExit with nested_run_pending Liran Alon
  7 siblings, 0 replies; 32+ messages in thread
From: Liran Alon @ 2017-11-21 15:30 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Krish Sadhukhan

The reason that exception.pending should block re-injection of an
NMI/interrupt is not described correctly by the comment in the code.
Instead, the comment describes why a pending exception should be
injected before a pending NMI/interrupt.

Therefore, move the current comment to the code-block evaluating a
new pending event, where it explains why exception.pending is
evaluated first.
In addition, add a new comment describing that exception.pending
blocks re-injection of an NMI/interrupt because the exception may
have been queued while handling the exit caused by the NMI/interrupt
delivery.

Fixes: 664f8e26b00c ("KVM: X86: Fix loss of exception which has not yet been injected")

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
---
 arch/x86/kvm/x86.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f4a623ba087e..a4b5a013064b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6404,8 +6404,9 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
 	}
 
 	/*
-	 * Exceptions must be injected immediately, or the exception
-	 * frame will have the address of the NMI or interrupt handler.
+	 * NMI/interrupt must not be injected if an exception is
+	 * pending, because the latter may have been queued by
+	 * handling exit due to NMI/interrupt delivery.
 	 */
 	if (!vcpu->arch.exception.pending) {
 		if (vcpu->arch.nmi_injected) {
@@ -6426,6 +6427,12 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
 	}
 
 	/* try to inject new event if pending */
+
+	/*
+	 * Exception must be injected before NMI/interrupt,
+	 * otherwise the exception frame will have the address of the
+	 * NMI or interrupt handler.
+	 */
 	if (vcpu->arch.exception.pending) {
 		trace_kvm_inj_exception(vcpu->arch.exception.nr,
 					vcpu->arch.exception.has_error_code,
-- 
1.9.1


* [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
  2017-11-21 15:30 [PATCH v2 0/8] KVM: Fix multiple issues in handling pending/injected events Liran Alon
                   ` (5 preceding siblings ...)
  2017-11-21 15:30 ` [PATCH v2 6/8] KVM: x86: Fix misleading comments on handling pending exceptions Liran Alon
@ 2017-11-21 15:30 ` Liran Alon
  2017-11-27 20:48   ` Jim Mattson
  2017-11-28  6:39   ` Jim Mattson
  2017-11-21 15:30 ` [PATCH v2 8/8] KVM: nVMX: Optimization: Don't set KVM_REQ_EVENT when VMExit with nested_run_pending Liran Alon
  7 siblings, 2 replies; 32+ messages in thread
From: Liran Alon @ 2017-11-21 15:30 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Konrad Rzeszutek Wilk

In case of a VMExit from L2 to L0 during event-delivery, vmcs02 is
filled with IDT-vectoring-info, which vmx_complete_interrupts() makes
sure to reinject before the next resume of L2.

While L0 is handling the VMExit, an IPI could be sent by another L1
vCPU to the L1 vCPU which currently runs L2 and exited to L0.

When L0 reaches vcpu_enter_guest() and calls inject_pending_event(),
it will note that a previous event was re-injected to L2 (via
IDT-vectoring-info) and therefore won't check whether there are
pending L1 events which require an exit from L2 to L1. Thus, L0
enters L2 without an immediate VMExit even though there are pending
L1 events!

This commit fixes the issue by making sure to check for pending L1
events even if a previous event was reinjected to L2, and by bailing
out of inject_pending_event() before evaluating a new pending event
in case an event was already reinjected.

The bug was observed in the following setup:
* L0 is a 64-CPU machine which runs KVM.
* L1 is a 16-CPU machine which runs KVM.
* L0 & L1 run with APICv disabled.
(Also reproduced with APICv enabled, but it is easier to analyze the
info below with APICv disabled.)
* L1 runs a 16-CPU L2 Windows Server 2012 R2 guest.
During L2 boot, L1 hangs completely, and analyzing the hang reveals
that one L1 vCPU is holding KVM's mmu_lock and is waiting forever on
an IPI that it sent to another L1 vCPU. All other L1 vCPUs are
attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
forever (as L1 runs with kernel-preemption disabled).

Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
series of events:
(1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
(2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
vec 252 (Fixed|edge)
(3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
(4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
(5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
rip 0xffffe00069202690 info 83 0
(6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
(7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
ext_int: 0x00000000 ext_int_err: 0x00000000
(8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15

Which can be analyzed as follows:
(1) L2 VMExits to L0 on EPT_VIOLATION during delivery of vector 0xd2.
Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and
reinject a pending interrupt of 0xd2.
(2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to
destination vCPU 15. This will set the relevant bit in the LAPIC's
IRR and set KVM_REQ_EVENT.
(3) L0 reaches vcpu_enter_guest(), which calls inject_pending_event(),
which notes that interrupt 0xd2 was reinjected and therefore calls
vmx_inject_irq() and returns, without checking for pending L1 events!
Note that at this point, KVM_REQ_EVENT was cleared by
vcpu_enter_guest() before calling inject_pending_event().
(4) L0 resumes L2 without an immediate-exit even though there is a
pending L1 event (the IPI pending in the LAPIC's IRR).

We have already reached the buggy scenario, but the events can be
analyzed further:
(5+6) L2 VMExits to L0 on EPT_VIOLATION, this time not during
event-delivery.
(7) L0 decides to forward the VMExit to L1 for further handling.
(8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared,
the LAPIC's IRR is not examined and therefore the IPI is still not
delivered into L1!

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/kvm/x86.c | 33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a4b5a013064b..63359ab0e798 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
 	int r;
 
 	/* try to reinject previous events if any */
-	if (vcpu->arch.exception.injected) {
-		kvm_x86_ops->queue_exception(vcpu);
-		return 0;
-	}
 
+	if (vcpu->arch.exception.injected)
+		kvm_x86_ops->queue_exception(vcpu);
 	/*
 	 * NMI/interrupt must not be injected if an exception is
 	 * pending, because the latter may have been queued by
 	 * handling exit due to NMI/interrupt delivery.
 	 */
-	if (!vcpu->arch.exception.pending) {
-		if (vcpu->arch.nmi_injected) {
+	else if (!vcpu->arch.exception.pending) {
+		if (vcpu->arch.nmi_injected)
 			kvm_x86_ops->set_nmi(vcpu);
-			return 0;
-		}
-
-		if (vcpu->arch.interrupt.injected) {
+		else if (vcpu->arch.interrupt.injected)
 			kvm_x86_ops->set_irq(vcpu);
-			return 0;
-		}
 	}
 
+	/*
+	 * Call check_nested_events() even if we reinjected a previous event
+	 * in order for caller to determine if it should require immediate-exit
+	 * from L2 to L1 due to pending L1 events which require exit
+	 * from L2 to L1.
+	 */
 	if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
 		r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
 		if (r != 0)
@@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
 					vcpu->arch.exception.has_error_code,
 					vcpu->arch.exception.error_code);
 
+		WARN_ON_ONCE(vcpu->arch.exception.injected);
 		vcpu->arch.exception.pending = false;
 		vcpu->arch.exception.injected = true;
 
@@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
 		}
 
 		kvm_x86_ops->queue_exception(vcpu);
-	} else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) {
+	}
+
+	/* Don't consider new event if we re-injected an event */
+	if (kvm_event_needs_reinjection(vcpu))
+		return 0;
+
+	if (vcpu->arch.smi_pending && !is_smm(vcpu) &&
+	    kvm_x86_ops->smi_allowed(vcpu)) {
 		vcpu->arch.smi_pending = false;
 		enter_smm(vcpu);
 	} else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) {
-- 
1.9.1


* [PATCH v2 8/8] KVM: nVMX: Optimization: Don't set KVM_REQ_EVENT when VMExit with nested_run_pending
  2017-11-21 15:30 [PATCH v2 0/8] KVM: Fix multiple issues in handling pending/injected events Liran Alon
                   ` (6 preceding siblings ...)
  2017-11-21 15:30 ` [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending Liran Alon
@ 2017-11-21 15:30 ` Liran Alon
  7 siblings, 0 replies; 32+ messages in thread
From: Liran Alon @ 2017-11-21 15:30 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Konrad Rzeszutek Wilk

When a vCPU runs L2, there is a pending event that requires an exit
from L2 to L1, and nested_run_pending=1, vcpu_enter_guest() will
request an immediate-exit from L2 (see req_immediate_exit).

Since handling of req_immediate_exit now also makes sure to set
KVM_REQ_EVENT, there is no need to also set it in vmx_vcpu_run() when
nested_run_pending=1.

This optimizes cases where VMRESUME was executed by L1 to enter L2
and there are no pending events that require an exit from L2 to L1.
Previously, this would have set KVM_REQ_EVENT unnecessarily.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/kvm/vmx.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d939ed84f136..7c7187da7fd1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9440,14 +9440,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 			__write_pkru(vmx->host_pkru);
 	}
 
-	/*
-	 * the KVM_REQ_EVENT optimization bit is only on for one entry, and if
-	 * we did not inject a still-pending event to L1 now because of
-	 * nested_run_pending, we need to re-enable this bit.
-	 */
-	if (vmx->nested.nested_run_pending)
-		kvm_make_request(KVM_REQ_EVENT, vcpu);
-
 	vmx->nested.nested_run_pending = 0;
 	vmx->idt_vectoring_info = 0;
 
-- 
1.9.1


* Re: [PATCH v2 3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions
  2017-11-21 15:30 ` [PATCH v2 3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions Liran Alon
@ 2017-11-22 20:25   ` Radim Krčmář
  2017-11-23 18:20     ` Liran Alon
  2017-11-22 21:00   ` Jim Mattson
  1 sibling, 1 reply; 32+ messages in thread
From: Radim Krčmář @ 2017-11-22 20:25 UTC (permalink / raw)
  To: Liran Alon
  Cc: pbonzini, kvm, jmattson, wanpeng.li, idan.brown, Krish Sadhukhan

2017-11-21 17:30+0200, Liran Alon:
> Do not consider a pending exception when returning an injected
> exception to user-mode. A "pending" exception means its side-effects
> have not been applied yet. In contrast, an "injected" exception means
> its side-effects have already been applied.
> Therefore, we only need to report injected exceptions to user-mode.
> This is aligned with how interrupts are reported in the same ioctl.

Pending interrupts are stored in the IRR, but we don't have anything
for exceptions -- we would lose a trap exception that was made pending
after handling inject_pending_event() if the VCPU got a userspace
signal and was saved+restored before starting the next
vcpu_enter_guest() cycle.
(Non-trap exceptions should be generated again when re-executing, so
 losing them isn't that bad.)

I think we should add state for pending exceptions in kvm_vcpu_events,
like the FIXME says.  Pending and injected are actually exclusive (for
now?), so maybe we can use only one bit for that,

thanks.

An alternative, probably unattainable, would be to process the
side-effects as we hit the exception.  Using IRR to store pending
interrupts also seems possible, but I'd expect more problems down the
road.


* Re: [PATCH v2 4/8] KVM: x86: Warn if userspace overrides existing injected exception/interrupt
  2017-11-21 15:30 ` [PATCH v2 4/8] KVM: x86: Warn if userspace overrides existing injected exception/interrupt Liran Alon
@ 2017-11-22 20:34   ` Radim Krčmář
  2017-11-22 22:27     ` Liran Alon
  0 siblings, 1 reply; 32+ messages in thread
From: Radim Krčmář @ 2017-11-22 20:34 UTC (permalink / raw)
  To: Liran Alon
  Cc: pbonzini, kvm, jmattson, wanpeng.li, idan.brown, Krish Sadhukhan

2017-11-21 17:30+0200, Liran Alon:
> An alternative would have been to return -EBUSY in this case.
> For now, we decided to just override the existing event and warn on
> such an attempt.
> 
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> ---
>  arch/x86/kvm/x86.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1490da89de4b..c8cec7c39c1c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3153,12 +3153,25 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
>  		return -EINVAL;
>  
>  	process_nmi(vcpu);
> +
> +	/*
> +	 * Warn if userspace is overriding existing
> +	 * injected exception
> +	 */
> +	WARN_ON_ONCE(vcpu->arch.exception.injected &&
> +		     events->exception.injected);

I think that overwriting the injected exception/interrupt is a perfectly
valid operation -- userspace could have rolled back the state to a time
of the previous injection.

Syzkaller would complain sooner or later and I don't see it as a useful
printk, so dropping this patch would be preferred,

thanks.


* Re: [PATCH v2 3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions
  2017-11-21 15:30 ` [PATCH v2 3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions Liran Alon
  2017-11-22 20:25   ` Radim Krčmář
@ 2017-11-22 21:00   ` Jim Mattson
  2017-11-23 18:45     ` Liran Alon
  1 sibling, 1 reply; 32+ messages in thread
From: Jim Mattson @ 2017-11-22 21:00 UTC (permalink / raw)
  To: Liran Alon
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Krish Sadhukhan

On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:

> For symmetry, an exception injected from user-mode should be
> considered to be in "injected" state and not in "pending" state.

I disagree with this contention. Suppose, for example, that we are
executing in a nested context (i.e. vmcs02 is live) and usermode
"injects" a page-fault. The page fault may be delivered on L2's IDT or
it may cause an emulated VM-exit from L2 to L1, depending on the page
fault error code, the exception bitmap in the vmcs12, and the
page-fault error-code mask and match fields in the vmcs12. If the page
fault is delivered on L2's IDT, then the exception can be considered
as "injected," with the CR2 side effect already processed. However, if
the page fault causes an emulated VM-exit from L2 to L1, then it must
be considered as "pending" and the CR2 side effect must not have been
processed.

So, where do we find the new CR2 value? Admittedly, the existing
KVM_SET_VCPU_EVENTS API is broken and appears to assume that the side
effects have already been performed for exceptions injected from
userspace, though this assumption isn't documented AFAICT.
Fortunately, there's enough padding in the kvm_vcpu_events structure
to fix the API (with the possible exception of injected #VE*): one bit
to indicate that userspace is providing the side effects in the events
structure, and 64 bits for the new CR2 value (#PF) or the new DR6 bits
(#DB).
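
A rough carving of that padding might look like this (the names here
are made up for illustration, not a concrete uapi proposal):

	/* new bit in kvm_vcpu_events.flags */
	#define KVM_VCPUEVENT_VALID_PAYLOAD	0x00000010

	/* replaces the first two __u32 words of reserved[] */
	__u64 exception_payload;	/* CR2 for #PF, new DR6 bits for #DB */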

* One could argue that userspace should not be delivering a #VE
directly, but should be injecting an EPT Violation instead.


* Re: [PATCH v2 4/8] KVM: x86: Warn if userspace overrides existing injected exception/interrupt
  2017-11-22 20:34   ` Radim Krčmář
@ 2017-11-22 22:27     ` Liran Alon
  0 siblings, 0 replies; 32+ messages in thread
From: Liran Alon @ 2017-11-22 22:27 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: pbonzini, kvm, jmattson, wanpeng.li, idan.brown, Krish Sadhukhan



On 22/11/17 22:34, Radim Krčmář wrote:
> 2017-11-21 17:30+0200, Liran Alon:
>> An alternative would have been to return -EBUSY in this case.
>> For now, we decided to just override the existing event and warn on
>> such an attempt.
>>
>> Signed-off-by: Liran Alon <liran.alon@oracle.com>
>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
>> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
>> ---
>>   arch/x86/kvm/x86.c | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 1490da89de4b..c8cec7c39c1c 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -3153,12 +3153,25 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
>>   		return -EINVAL;
>>
>>   	process_nmi(vcpu);
>> +
>> +	/*
>> +	 * Warn if userspace is overriding existing
>> +	 * injected exception
>> +	 */
>> +	WARN_ON_ONCE(vcpu->arch.exception.injected &&
>> +		     events->exception.injected);
>
> I think that overwriting the injected exception/interrupt is a perfectly
> valid operation -- userspace could have rolled back the state to a time
> of the previous injection.
>
> Syzkaller would complain sooner or later and I don't see it as a useful
> printk, so dropping this patch would be preferred,
>
> thanks.
>
Hmm, I hadn't thought about this use-case. I agree. This patch should
be dropped from the series.

Thanks for spotting this.

-Liran


* Re: [PATCH v2 3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions
  2017-11-22 20:25   ` Radim Krčmář
@ 2017-11-23 18:20     ` Liran Alon
  0 siblings, 0 replies; 32+ messages in thread
From: Liran Alon @ 2017-11-23 18:20 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: pbonzini, kvm, jmattson, wanpeng.li, idan.brown, Krish Sadhukhan



On 22/11/17 22:25, Radim Krčmář wrote:
> 2017-11-21 17:30+0200, Liran Alon:
>> Do not consider a pending exception when returning an injected
>> exception to user-mode. A "pending" exception means its side-effects
>> have not been applied yet. In contrast, an "injected" exception means
>> its side-effects have already been applied.
>> Therefore, we only need to report injected exceptions to user-mode.
>> This is aligned with how interrupts are reported in the same ioctl.
>
> Pending interrupts are stored in the IRR, but we don't have anything
> for exceptions -- we would lose a trap exception that was made pending
> after handling inject_pending_event() if the VCPU got a userspace
> signal and was saved+restored before starting the next
> vcpu_enter_guest() cycle.
> (Non-trap exceptions should be generated again when re-executing, so
>  losing them isn't that bad.)
I see your point. I was indeed not correctly considering what would
happen with trap exceptions.
>
> I think we should add state for pending exceptions in kvm_vcpu_events,
> like the FIXME says.  Pending and injected are actually exclusive (for
> now?), so maybe we can use only one bit for that,
The FIXME is a bit incorrect, as it states "This is only needed for
nested virtualization". As I understand it, that is not true: the
issue relates to handling trap exceptions in general. But it is true
that we should return extra state here.

I also observed that pending & injected should be exclusive, but I
didn't want to change the code too much in one shot.

You are right that in order to handle trap exceptions correctly, we
should also return the exception to user-space if it is in either
pending or injected state.

I can change struct kvm_vcpu_events.exception such that:
1. The "injected" field will remain the logical OR of
"exception.pending | exception.injected".
2. "pad" will become a "flags" field which contains a single flag
indicating whether the exception is pending or not.
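
A minimal sketch of the resulting layout (the flag name is
illustrative):

	struct {
		__u8 injected;		/* pending | injected, as today */
		__u8 nr;
		__u8 has_error_code;
		__u8 flags;		/* was "pad"; bit 0: exception is pending */
	} exception;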

That way, I think I don't break any backwards compatibility.
What do you think?

Regards,
-Liran

P.S:
I would be glad if you could also review the rest of the patches in
this series, as they are independent of this change and also a bit
complex. Especially "[PATCH 7/8]: KVM: nVMX: Require immediate-exit
when event reinjected to L2 and L1 event pending".
>
> thanks.
>
> An alternative, probably unattainable, would be to process the
> side-effects as we hit the exception.  Using IRR to store pending
> interrupts also seems possible, but I'd expect more problems down the
> road.
>


* Re: [PATCH v2 3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions
  2017-11-22 21:00   ` Jim Mattson
@ 2017-11-23 18:45     ` Liran Alon
  2017-11-23 23:05       ` Paolo Bonzini
  2017-11-27 17:26       ` Jim Mattson
  0 siblings, 2 replies; 32+ messages in thread
From: Liran Alon @ 2017-11-23 18:45 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Krish Sadhukhan



On 22/11/17 23:00, Jim Mattson wrote:
> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
>
>> For symmetry, an exception injected from user-mode should be
>> considered to be in "injected" state and not in "pending" state.
>
> I disagree with this contention. Suppose, for example, that we are
> executing in a nested context (i.e. vmcs02 is live) and usermode
> "injects" a page-fault. The page fault may be delivered on L2's IDT or
> it may cause an emulated VM-exit from L2 to L1, depending on the page
> fault error code, the exception bitmap in the vmcs12, and the
> page-fault error-code mask and match fields in the vmcs12. If the page
> fault is delivered on L2's IDT, then the exception can be considered
> as "injected," with the CR2 side effect already processed. However, if
> the page fault causes an emulated VM-exit from L2 to L1, then it must
> be considered as "pending" and the CR2 side effect must not have been
> processed.
I understand you are referring here to the FIXME comment present in
nested_vmx_check_exception()?
>
> So, where do we find the new CR2 value? Admittedly, the existing
> KVM_SET_VCPU_EVENTS API is broken and appears to assume that the side
> effects have already been performed for exceptions injected from
> userspace, though this assumption isn't documented AFAICT.
> Fortunately, there's enough padding in the kvm_vcpu_events structure
> to fix the API (with the possible exception of injected #VE*): one bit
> to indicate that userspace is providing the side effects in the events
> structure, and 64 bits for the new CR2 value (#PF) or the new DR6 bits
> (#DB).
I see. Then what do you think about the following change:
1. Rename kvm_vcpu_events.exception.injected to
kvm_vcpu_events.exception.raised and have it remain the logical OR
of "exception.pending | exception.injected" as it is today (to not
break backwards compatibility).
2. Add a new flag to kvm_vcpu_events.flags to indicate whether
kvm_vcpu_events.exception.raised is actually exception.pending or
exception.injected (they are mutually exclusive). A value of 0 will
be considered "injected" to preserve backwards compatibility.
3. Use kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1]
for an exception_extra_info which will be either CR2 for #PF or DR6
for #DB.
4. Add to struct kvm_queued_exception a u64 exception_extra_info that
will be either CR2 for #PF or DR6 for #DB. Make sure it is set in the
relevant places and written to vcpu.arch.cr2/VMCS only when
inject_pending_event() injects a pending exception.
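
For point 4, the injection-time side could look roughly like this
(sketch only; the extra_info field is the proposed addition):

	if (vcpu->arch.exception.nr == PF_VECTOR)
		vcpu->arch.cr2 = vcpu->arch.exception.extra_info;
	else if (vcpu->arch.exception.nr == DB_VECTOR)
		vcpu->arch.dr6 |= vcpu->arch.exception.extra_info;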

I think that in the above proposal, we don't need to be conditional
on a new capability, because old user-space shouldn't be affected.
(It will still get the same value in kvm_vcpu_events.exception.raised,
and the remaining fields were previously ignored.)

What do you think?

Thanks for the excellent review!
-Liran
>
> * One could argue that userspace should not be delivering a #VE
> directly, but should be injecting an EPT Violation instead.
>


* Re: [PATCH v2 3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions
  2017-11-23 18:45     ` Liran Alon
@ 2017-11-23 23:05       ` Paolo Bonzini
  2017-11-27 17:26       ` Jim Mattson
  1 sibling, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2017-11-23 23:05 UTC (permalink / raw)
  To: Liran Alon, Jim Mattson
  Cc: Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Krish Sadhukhan

On 23/11/2017 19:45, Liran Alon wrote:
> 3. Use kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1] for
> exception_extra_info which will be either CR2 for #PF or DR6 for #DB.
> 4. Add to kvm_queued_exception() a u64 exception_extra_info that will
> either be CR2 for #PF or DR6 for #DB. Make sure that these will be set
> on relevant places and filled to vcpu.arch.cr2/VMCS only on
> inject_pending_event() injection of a pending exception.

An aside: just to complicate things a bit, AMD overwrites the actual DR6
(and thus the "guest" DR6) even if you intercept the #DB vmexit.

Paolo


* Re: [PATCH v2 3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions
  2017-11-23 18:45     ` Liran Alon
  2017-11-23 23:05       ` Paolo Bonzini
@ 2017-11-27 17:26       ` Jim Mattson
  2017-11-27 18:30         ` Liran Alon
  1 sibling, 1 reply; 32+ messages in thread
From: Jim Mattson @ 2017-11-27 17:26 UTC (permalink / raw)
  To: Liran Alon
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Krish Sadhukhan

I think we need some way for userspace to indicate whether or not it
can deal with pending events (with side effects recorded in
kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1]) when it
issues a KVM_GET_VCPU_EVENTS ioctl. That seems to argue for a new flag
bit in kvm_vcpu_events.flags (as input to KVM_GET_VCPU_EVENTS) and a
capability indicating that the flag can be set by userspace.
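
From the userspace side, that could look roughly like this (the
capability and flag names are made up for illustration):

	struct kvm_vcpu_events events = {};

	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_EXCEPTION_PENDING) > 0)
		events.flags |= KVM_VCPUEVENT_VALID_PENDING;
	ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events);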



* Re: [PATCH v2 3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions
  2017-11-27 17:26       ` Jim Mattson
@ 2017-11-27 18:30         ` Liran Alon
  0 siblings, 0 replies; 32+ messages in thread
From: Liran Alon @ 2017-11-27 18:30 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Krish Sadhukhan



On 27/11/17 19:26, Jim Mattson wrote:
> I think we need some way for userspace to indicate whether or not it
> can deal with pending events (with side effects recorded in
> kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1]) when it
> issues a KVM_GET_VCPU_EVENTS ioctl. That seems to argue for a new flag
> bit in kvm_vcpu_events.flags (as input to KVM_GET_VCPU_EVENTS) and a
> capability indicating that the flag can be set by userspace.
Yep, I agree. I will make the relevant changes for the next version
of this series.

Thanks,
-Liran


* Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
  2017-11-21 15:30 ` [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending Liran Alon
@ 2017-11-27 20:48   ` Jim Mattson
  2017-11-27 22:42     ` Liran Alon
  2017-11-28  6:39   ` Jim Mattson
  1 sibling, 1 reply; 32+ messages in thread
From: Jim Mattson @ 2017-11-27 20:48 UTC (permalink / raw)
  To: Liran Alon
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Konrad Rzeszutek Wilk

I am concerned about the possible conflation of L1 and L2 events.
vmx_check_nested_events() reads as though vcpu->arch.exception.pending
is always associated with an L1 event, yet inject_pending_event()
reads as though vcpu->arch.exception.injected is associated with the
current level. That seems odd, to split this structure that way.

Worse, __vmx_complete_interrupts() calls kvm_queue_interrupt() to set
vcpu->arch.interrupt.pending if IDT vectoring for an L2 virtual
interrupt is interrupted by another event. But
vmx_check_nested_events() calls kvm_cpu_has_interrupt() to see if an
L1 "physical" interrupt should cause a VM-exit from L2 to L1, and
kvm_cpu_has_interrupt() looks at vcpu->arch.interrupt.pending when
!lapic_in_kernel(). Clearly, there is some disagreement here over what
vcpu->arch.interrupt.pending means.

I think there may be some more fundamental problems lurking here.

On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
> In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with
> IDT-vectoring-info which vmx_complete_interrupts() makes sure to
> reinject before next resume of L2.
>
> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
> to the L1 vCPU which currently runs L2 and exited to L0.
>
> When L0 will reach vcpu_enter_guest() and call inject_pending_event(),
> it will note that a previous event was re-injected to L2 (by
> IDT-vectoring-info) and therefore won't check if there are pending L1
> events which require exit from L2 to L1. Thus, L0 enters L2 without
> immediate VMExit even though there are pending L1 events!
>
> This commit fixes the issue by making sure to check for L1 pending
> events even if a previous event was reinjected to L2 and bailing out
> from inject_pending_event() before evaluating a new pending event in
> case an event was already reinjected.
>
> The bug was observed by the following setup:
> * L0 is a 64-CPU machine which runs KVM.
> * L1 is a 16-CPU machine which runs KVM.
> * L0 & L1 run with APICv disabled.
> (Also reproduced with APICv enabled, but the info below is easier to
> analyze with APICv disabled.)
> * L1 runs a 16-CPU L2 Windows Server 2012 R2 guest.
> During L2 boot, L1 hangs completely, and analyzing the hang reveals
> that one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an
> IPI that it has sent to another L1 vCPU, while all other L1 vCPUs are
> attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck forever
> (as L1 runs with kernel preemption disabled).
>
> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
> series of events:
> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
> vec 252 (Fixed|edge)
> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
> rip 0xffffe00069202690 info 83 0
> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
> ext_int: 0x00000000 ext_int_err: 0x00000000
> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>
> Which can be analyzed as follows:
> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
> a pending-interrupt of 0xd2.
> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
> vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT.
> (3) L0 reaches vcpu_enter_guest(), which calls inject_pending_event(),
> which notes that interrupt 0xd2 was reinjected and therefore calls
> vmx_inject_irq() and returns, without checking for pending L1 events!
> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
> before calling inject_pending_event().
> (4) L0 resumes L2 without immediate-exit even though there is a pending
> L1 event (The IPI pending in LAPIC's IRR).
>
> We have already reached the buggy scenario, but the events can be
> analyzed further:
> (5+6) L2 VMExit to L0 on EPT_VIOLATION, this time not during
> event-delivery.
> (7) L0 decides to forward the VMExit to L1 for further handling.
> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
> LAPIC's IRR is not examined and therefore the IPI is still not delivered
> into L1!
>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  arch/x86/kvm/x86.c | 33 ++++++++++++++++++++-------------
>  1 file changed, 20 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a4b5a013064b..63359ab0e798 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>         int r;
>
>         /* try to reinject previous events if any */
> -       if (vcpu->arch.exception.injected) {
> -               kvm_x86_ops->queue_exception(vcpu);
> -               return 0;
> -       }
>
> +       if (vcpu->arch.exception.injected)
> +               kvm_x86_ops->queue_exception(vcpu);
>         /*
>          * NMI/interrupt must not be injected if an exception is
>          * pending, because the latter may have been queued by
>          * handling exit due to NMI/interrupt delivery.
>          */
> -       if (!vcpu->arch.exception.pending) {
> -               if (vcpu->arch.nmi_injected) {
> +       else if (!vcpu->arch.exception.pending) {
> +               if (vcpu->arch.nmi_injected)
>                         kvm_x86_ops->set_nmi(vcpu);
> -                       return 0;
> -               }
> -
> -               if (vcpu->arch.interrupt.injected) {
> +               else if (vcpu->arch.interrupt.injected)
>                         kvm_x86_ops->set_irq(vcpu);
> -                       return 0;
> -               }
>         }
>
> +       /*
> +        * Call check_nested_events() even if we reinjected a previous event
> +        * in order for caller to determine if it should require immediate-exit
> +        * from L2 to L1 due to pending L1 events which require exit
> +        * from L2 to L1.
> +        */
>         if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
>                 r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
>                 if (r != 0)
> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>                                         vcpu->arch.exception.has_error_code,
>                                         vcpu->arch.exception.error_code);
>
> +               WARN_ON_ONCE(vcpu->arch.exception.injected);
>                 vcpu->arch.exception.pending = false;
>                 vcpu->arch.exception.injected = true;
>
> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>                 }
>
>                 kvm_x86_ops->queue_exception(vcpu);
> -       } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) {
> +       }
> +
> +       /* Don't consider new event if we re-injected an event */
> +       if (kvm_event_needs_reinjection(vcpu))
> +               return 0;
> +
> +       if (vcpu->arch.smi_pending && !is_smm(vcpu) &&
> +           kvm_x86_ops->smi_allowed(vcpu)) {
>                 vcpu->arch.smi_pending = false;
>                 enter_smm(vcpu);
>         } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) {
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
  2017-11-27 20:48   ` Jim Mattson
@ 2017-11-27 22:42     ` Liran Alon
  2017-11-28  4:55       ` Jim Mattson
  0 siblings, 1 reply; 32+ messages in thread
From: Liran Alon @ 2017-11-27 22:42 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Konrad Rzeszutek Wilk



On 27/11/17 22:48, Jim Mattson wrote:
> I am concerned about the possible conflation of L1 and L2 events.
> Vmx_check_nested_events() reads as though vcpu->arch.exception.pending
> is always associated with an L1 event, yet inject_pending_event()
> reads as though vcpu->arch.exception.injected is associated with the
> current level. That seems odd, to split this structure that way.

I think you misunderstand the code (understandably, as it is
confusing...). vmx_check_nested_events() treats exception.pending as
related to the current level (L1 or L2). Specifically, because
vmx_check_nested_events() is always called when
is_guest_mode(vcpu)==true, it treats it as a "pending L2 exception".
This is also why it checks whether L1 intercepts the "pending L2
exception" and, if so, performs an exit from L2 to L1.

I understand the structures in the following way:
1. "injected" status is always associated with the current level.
(Same for interrupt.pending, because it is actually interrupt.injected;
see the comment below marked with (*).)
2. exception.pending is associated with the current level.
3. nmi_pending is associated with L1.
4. In general, all the exception/nmi/interrupt structures are supposed
to follow the same convention:
(a) A "pending" status means that the event should be injected to the
guest, but its side-effects have not occurred yet (not yet injected to
the guest).
(b) An "injected" status means that the event was already "processed"
and therefore injected to the guest along with its side-effects.

(*) Currently, interrupt.pending is a confusing, misleading name. It
actually represents "interrupt.injected". This is why I made a patch
which renames this variable:
("[PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected")
https://marc.info/?l=kvm&m=151127826204179&w=2
(BTW, the reason why interrupt doesn't have an "interrupt.pending" flag
is that its "pending" status can be considered to be held in the LAPIC
IRR. This is why, for example, vmx_check_nested_events() calls
kvm_cpu_has_interrupt() to determine if there is a "pending" L1
interrupt that may cause an exit from L2 to L1.)
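
To make the convention concrete, this is the relevant state, simplified
from struct kvm_vcpu_arch in arch/x86/include/asm/kvm_host.h (the
comments are mine):

        struct kvm_queued_exception {
                bool pending;   /* should be injected; side-effects
                                 * not applied yet */
                bool injected;  /* already injected; side-effects
                                 * already applied */
                bool has_error_code;
                u8 nr;
                u32 error_code;
        } exception;

        struct kvm_queued_interrupt {
                bool pending;   /* misnamed: really "injected",
                                 * see patch 2/8 */
                bool soft;
                u8 nr;
        } interrupt;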

>
> Worse, __vmx_complete_interrupts() calls kvm_queue_interrupt() to set
> vcpu->arch.interrupt.pending if IDT vectoring for an L2 virtual
> interrupt is interrupted by another event. But
> vmx_check_nested_events() calls kvm_cpu_has_interrupt() to see if an
> L1 "physical" interrupt should cause a VM-exit from L2 to L1, and
> kvm_cpu_has_interrupt() looks at vcpu->arch.interrupt.pending when
> !lapic_in_kernel(). Clearly, there is some disagreement here over what
> vcpu->arch.interrupt.pending means.

Again, there is some confusion here.

__vmx_complete_interrupts() only re-queues an "injected" event (either
due to an exit during event-delivery, and therefore a valid
IDT-vectoring-info, or because of injection cancellation in
vcpu_enter_guest()). Therefore, this function should only re-insert
events in "injected" state, as it really does:
1. NMI is re-injected with nmi_injected status.
2. An exception is injected using kvm_requeue_exception(), which marks
exception.injected=true.
3. An interrupt is injected using kvm_queue_interrupt(), which indeed
marks interrupt.pending. But as stated above, this is just a misleading
name, and it actually means "interrupt.injected".
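
A rough sketch of that re-queue logic, abridged from
__vmx_complete_interrupts() in arch/x86/kvm/vmx.c (error-code and
instruction-length handling elided):

        vector = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
        type = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;

        switch (type) {
        case INTR_TYPE_NMI_INTR:
                /* 1. NMI re-injected with nmi_injected status */
                vcpu->arch.nmi_injected = true;
                break;
        case INTR_TYPE_SOFT_EXCEPTION:
        case INTR_TYPE_HARD_EXCEPTION:
                /* 2. marks exception.injected = true */
                kvm_requeue_exception(vcpu, vector);
                break;
        case INTR_TYPE_SOFT_INTR:
        case INTR_TYPE_EXT_INTR:
                /* 3. sets interrupt.pending, which really means
                 * "injected" */
                kvm_queue_interrupt(vcpu, vector,
                                    type == INTR_TYPE_SOFT_INTR);
                break;
        }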

>
> I think there may be some more fundamental problems lurking here.
Yep. You are indeed right.
We have found several more issues revolving around the treatment of
pending events with regard to nested virtualization.
I am about to post another patch series which handles some such cases
relating to nested-posted-interrupt handling. Stay tuned. Your review
will be extremely valuable :)

Regards,
-Liran

>
> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
>> [snip: full patch quoted above]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
  2017-11-27 22:42     ` Liran Alon
@ 2017-11-28  4:55       ` Jim Mattson
  2017-11-28 11:14         ` Paolo Bonzini
  2017-11-28 11:36         ` Liran Alon
  0 siblings, 2 replies; 32+ messages in thread
From: Jim Mattson @ 2017-11-28  4:55 UTC (permalink / raw)
  To: Liran Alon
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Konrad Rzeszutek Wilk

On Mon, Nov 27, 2017 at 2:42 PM, Liran Alon <LIRAN.ALON@oracle.com> wrote:
>
>
> On 27/11/17 22:48, Jim Mattson wrote:
>>
>> I am concerned about the possible conflation of L1 and L2 events.
>> Vmx_check_nested_events() reads as though vcpu->arch.exception.pending
>> is always associated with an L1 event, yet inject_pending_event()
>> reads as though vcpu->arch.exception.injected is associated with the
>> current level. That seems odd, to split this structure that way.
>
>
> I think you misunderstand code (rightfully as it is confusing...).
> vmx_check_nested_events() treats exception.pending as related to
> current-level (L1 or L2). Specifically, because vmx_check_nested_events() is
> always called when is_guest_mode(vcpu)==true, then it treats it as a
> "pending L2 exception". This is also why it checks if L1 intercept the
> "pending L2 exception" and if yes, perform an exit from L2 to L1.
>
> I understand the structures in the following way:
> 1. "injected" status is always associated with the current-level.
> (Same for interrupt.pending because it is actually interrupt.injected, see
> comment below marked with (*))
> 2. exception.pending is associated with current-level.
> 3. nmi_pending is associated with L1.
> 4. In general, all the exception/nmi/interrupt structures are "supposed" to
> follow the same convention:
> (a) A "pending" status means that the event should be injected to guest but
> it's side-effects have not occurred yet (not yet injected to guest).
> (b) An "injected" status means that the event was already "processed" and
> therefore injected to guest along with it's side-effects.
>
> (*) Currently, interrupt.pending is a confusing wrong name. It actually
> represents "interrupt.injected". This is why I made the a patch which
> renames this variable:
> ("[PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected")
> https://marc.info/?l=kvm&m=151127826204179&w=2
> (BTW, the reason why interrupt doesn't have an "interrupt.pending" flag is
> because it's "pending" status can be considered to be held in LAPIC IRR.
> This is why for example vmx_check_nested_events() call
> kvm_cpu_has_interrupt() to determine if there is a "pending" L1 interrupt
> that may cause exit from L2 to L1).
>
>>
>> Worse, __vmx_complete_interrupts() calls kvm_queue_interrupt() to set
>> vcpu->arch.interrupt.pending if IDT vectoring for an L2 virtual
>> interrupt is interrupted by another event. But
>> vmx_check_nested_events() calls kvm_cpu_has_interrupt() to see if an
>> L1 "physical" interrupt should cause a VM-exit from L2 to L1, and
>> kvm_cpu_has_interrupt() looks at vcpu->arch.interrupt.pending when
>> !lapic_in_kernel(). Clearly, there is some disagreement here over what
>> vcpu->arch.interrupt.pending means.
>
>
> Again, there is some confusion here.
>
> __vmx_complete_interrupts() only re-queues an "injected" event. (Either due
> to exit during event-delivery and therefore valid IDT-vectoring-info or
> because of injection-cancellation in vcpu_enter_guest()).
> Therefore, this function should only re-insert events in "injected" state.
> As it really does:
> 1. NMI is re-injected with nmi_injected status.
> 2. exception is injected using kvm_requeue_exception() which will mark
> exception.injected=true.
> 3. interrupt is injected using kvm_queue_interrupt() which indeed mark
> interrupt.pending. But as stated above, this is just a wrong name and it is
> actually "interrupt.injected".

kvm_queue_interrupt() begins as follows:
        vcpu->arch.interrupt.pending = true;

kvm_cpu_has_interrupt() begins as follows:
        if (!lapic_in_kernel(v))
                return v->arch.interrupt.pending;

In the referenced [patch 2/8], you change interrupt.pending to
interrupt.injected, but the same field is still referenced by these
two functions.
It's clear that __vmx_complete_interrupts sets this field to indicate
an interrupted L2 virtual interrupt, and I agree with your assertion
above that kvm_cpu_has_interrupt() *should* determine if there is a
"pending" L1 physical interrupt that may cause exit from L2 to L1, but
it seems to me that the injected L2 event and the pending L1 event
have been conflated in this field, at least when lapic_in_kernel() is
false.

>>
>> I think there may be some more fundamental problems lurking here.
>
> Yep. You are indeed right.
> We have found several more issues revolving treatments of pending events in
> regard to nested-virtualization.
> I am about to post another patch series which handles some such cases which
> relates to nested-posted-interrupts handling. Stay tuned. Your review will
> be extremely valuable :)
>
> Regards,
> -Liran
>
>
>>
>> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
>>> [snip: full patch quoted above]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
  2017-11-21 15:30 ` [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending Liran Alon
  2017-11-27 20:48   ` Jim Mattson
@ 2017-11-28  6:39   ` Jim Mattson
  2017-11-28 18:26     ` Jim Mattson
  2017-11-28 19:33     ` Liran Alon
  1 sibling, 2 replies; 32+ messages in thread
From: Jim Mattson @ 2017-11-28  6:39 UTC (permalink / raw)
  To: Liran Alon
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Konrad Rzeszutek Wilk

On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
> In case of L2 VMExit to L0 during event-delivery, VMCS02 is filled with
> IDT-vectoring-info, which vmx_complete_interrupts() makes sure to
> reinject before the next resume of L2.
>
> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
> to the L1 vCPU which currently runs L2 and exited to L0.
>
> When L0 reaches vcpu_enter_guest() and calls inject_pending_event(),
> it will note that a previous event was re-injected to L2 (by
> IDT-vectoring-info) and therefore won't check if there are pending L1
> events which require an exit from L2 to L1. Thus, L0 enters L2 without
> an immediate VMExit even though there are pending L1 events!
>
> This commit fixes the issue by making sure to check for L1 pending
> events even if a previous event was reinjected to L2, and by bailing out
> from inject_pending_event() before evaluating a new pending event in
> case an event was already reinjected.
>
> The bug was observed by the following setup:
> * L0 is a 64-CPU machine which runs KVM.
> * L1 is a 16-CPU machine which runs KVM.
> * L0 & L1 run with APICv disabled.
> (Also reproduced with APICv enabled, but the info below is easier to
> analyze with APICv disabled.)
> * L1 runs a 16-CPU L2 Windows Server 2012 R2 guest.
> During L2 boot, L1 hangs completely, and analyzing the hang reveals
> that one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an
> IPI that it has sent to another L1 vCPU, while all other L1 vCPUs are
> attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck forever
> (as L1 runs with kernel preemption disabled).
>
> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
> series of events:
> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000

This looks like an EPT violation for writing a stack page that wasn't
present in the vmcs02 shadow EPT tables. I assume that L0 filled in
the missing shadow EPT table entry (or entries) and dismissed this EPT
violation itself rather than forwarding it to L1, because L1's EPT
tables have a valid entry for this L2 guest-physical address.

> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
> vec 252 (Fixed|edge)

This is the source of the bug, right here. An interrupt can only be
accepted at an instruction boundary, and the virtual CPU is in the
midst of IDT-vectoring, which is not at an instruction boundary. The
EPT violation from L2 to L0 is an L0 artifact that should be
completely invisible to L1/L2. Interrupt 252 should remain pending
until we are at the first instruction of the IDT handler for interrupt
210.

To be more precise, if interrupt 210 was delivered via VM-entry event
injection on L1 VM-entry to L2, then we need to:
a) complete the event injection (i.e. vector through the L2 IDT for vector 210),
b) handle virtual VMX-preemption timer expiration during VM-entry (if any),
c) handle NMI-window exiting events (if any),
d) handle L0 NMI (as an L2 to L1 VM-exit or by vectoring through the
L2 IDT, depending on the NMI-exiting control),
e) handle interrupt-window exiting events,
f) handle L0 maskable interrupts (as an L2 to L1 VM-exit or by
vectoring through the L2 IDT, depending on the external-interrupt
exiting control).
...

Only when we get to step (f) can we accept a new IRQ from L0's local APIC.
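
As a sketch (not actual KVM code; the enumerators just restate the list
above), the required ordering is:

        /* Order in which L0 may act on events while an L2 event
         * injection is in flight; a new IRQ from L0's local APIC may
         * only be accepted at the last step. */
        enum l2_event_step {
                STEP_COMPLETE_INJECTION,   /* (a) vector through L2 IDT */
                STEP_VMX_PREEMPTION_TIMER, /* (b) virtual timer expiry */
                STEP_NMI_WINDOW,           /* (c) NMI-window exiting */
                STEP_L0_NMI,               /* (d) VM-exit or L2 IDT */
                STEP_INTR_WINDOW,          /* (e) intr-window exiting */
                STEP_L0_INTERRUPT,         /* (f) new L0 maskable IRQ */
        };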

> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
> rip 0xffffe00069202690 info 83 0
> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
> ext_int: 0x00000000 ext_int_err: 0x00000000
> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>
> [snip: remainder of the patch quoted in full above]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
  2017-11-28  4:55       ` Jim Mattson
@ 2017-11-28 11:14         ` Paolo Bonzini
  2017-11-28 13:59           ` Liran Alon
  2017-11-28 11:36         ` Liran Alon
  1 sibling, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2017-11-28 11:14 UTC (permalink / raw)
  To: Jim Mattson, Liran Alon
  Cc: Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Konrad Rzeszutek Wilk

On 28/11/2017 05:55, Jim Mattson wrote:
> kvm_queue_interrupt() begins as follows:
>         vcpu->arch.interrupt.pending = true;
> 
> kvm_cpu_has_interrupt() begins as follows:
>         if (!lapic_in_kernel(v))
>                 return v->arch.interrupt.pending;
> 
> In the referenced [patch 2/8], you change interrupt.pending to
> interrupt.injected, but the same field is still referenced by these
> two functions.

We cannot remove the !lapic_in_kernel(v) case, but it's okay if we
restrict nested VMX/SVM in CPUID when the in-kernel LAPIC is disabled
(that is, check for !lapic_in_kernel in nested_svm_check_permissions
and nested_vmx_allowed, so that setting VMXE and SVME will fail).
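
A minimal sketch of such a gate (hypothetical; the actual function
bodies upstream may differ):

        /* vmx.c: refuse to expose nested VMX without an in-kernel
         * LAPIC, since interrupt.pending/injected is overloaded in
         * that configuration. */
        static bool nested_vmx_allowed(struct kvm_vcpu *vcpu)
        {
                return nested && lapic_in_kernel(vcpu) &&
                       guest_cpuid_has(vcpu, X86_FEATURE_VMX);
        }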

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
  2017-11-28  4:55       ` Jim Mattson
  2017-11-28 11:14         ` Paolo Bonzini
@ 2017-11-28 11:36         ` Liran Alon
  1 sibling, 0 replies; 32+ messages in thread
From: Liran Alon @ 2017-11-28 11:36 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Konrad Rzeszutek Wilk



On 28/11/17 06:55, Jim Mattson wrote:
> On Mon, Nov 27, 2017 at 2:42 PM, Liran Alon <LIRAN.ALON@oracle.com> wrote:
>>
>>
>> On 27/11/17 22:48, Jim Mattson wrote:
>>>
>>> I am concerned about the possible conflation of L1 and L2 events.
>>> Vmx_check_nested_events() reads as though vcpu->arch.exception.pending
>>> is always associated with an L1 event, yet inject_pending_event()
>>> reads as though vcpu->arch.exception.injected is associated with the
>>> current level. That seems odd, to split this structure that way.
>>
>>
>> I think you misunderstand code (rightfully as it is confusing...).
>> vmx_check_nested_events() treats exception.pending as related to
>> current-level (L1 or L2). Specifically, because vmx_check_nested_events() is
>> always called when is_guest_mode(vcpu)==true, then it treats it as a
>> "pending L2 exception". This is also why it checks if L1 intercept the
>> "pending L2 exception" and if yes, perform an exit from L2 to L1.
>>
>> I understand the structures in the following way:
>> 1. "injected" status is always associated with the current-level.
>> (Same for interrupt.pending because it is actually interrupt.injected, see
>> comment below marked with (*))
>> 2. exception.pending is associated with current-level.
>> 3. nmi_pending is associated with L1.
>> 4. In general, all the exception/nmi/interrupt structures are "supposed" to
>> follow the same convention:
>> (a) A "pending" status means that the event should be injected to guest but
>> it's side-effects have not occurred yet (not yet injected to guest).
>> (b) An "injected" status means that the event was already "processed" and
>> therefore injected to guest along with it's side-effects.
>>
>> (*) Currently, interrupt.pending is a confusing wrong name. It actually
>> represents "interrupt.injected". This is why I made the a patch which
>> renames this variable:
>> ("[PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected")
>> https://marc.info/?l=kvm&m=151127826204179&w=2
>> (BTW, the reason why interrupt doesn't have an "interrupt.pending" flag is
>> because it's "pending" status can be considered to be held in LAPIC IRR.
>> This is why for example vmx_check_nested_events() call
>> kvm_cpu_has_interrupt() to determine if there is a "pending" L1 interrupt
>> that may cause exit from L2 to L1).
>>
>>>
>>> Worse, __vmx_complete_interrupts() calls kvm_queue_interrupt() to set
>>> vcpu->arch.interrupt.pending if IDT vectoring for an L2 virtual
>>> interrupt is interrupted by another event. But
>>> vmx_check_nested_events() calls kvm_cpu_has_interrupt() to see if an
>>> L1 "physical" interrupt should cause a VM-exit from L2 to L1, and
>>> kvm_cpu_has_interrupt() looks at vcpu->arch.interrupt.pending when
>>> !lapic_in_kernel(). Clearly, there is some disagreement here over what
>>> vcpu->arch.interrupt.pending means.
>>
>>
>> Again, there is some confusion here.
>>
>> __vmx_complete_interrupts() only re-queues an "injected" event. (Either due
>> to exit during event-delivery and therefore valid IDT-vectoring-info or
>> because of injection-cancellation in vcpu_enter_guest()).
>> Therefore, this function should only re-insert events in "injected" state.
>> As it really does:
>> 1. NMI is re-injected with nmi_injected status.
>> 2. exception is injected using kvm_requeue_exception() which will mark
>> exception.injected=true.
>> 3. interrupt is injected using kvm_queue_interrupt() which indeed mark
>> interrupt.pending. But as stated above, this is just a wrong name and it is
>> actually "interrupt.injected".
>
> kvm_queue_interrupt() begins as follows:
>          vcpu->arch.interrupt.pending = true;
>
> kvm_cpu_has_interrupt() begins as follows:
>          if (!lapic_in_kernel(v))
>                  return v->arch.interrupt.pending;
>
> In the referenced [patch 2/8], you change interrupt.pending to
> interrupt.injected, but the same field is still referenced by these
> two functions.
> It's clear that __vmx_complete_interrupts sets this field to indicate
> an interrupted L2 virtual interrupt, and I agree with your assertion
> above that kvm_cpu_has_interrupt() *should* determine if there is a
> "pending" L1 physical interrupt that may cause exit from L2 to L1, but
> it seems to me that the injected L2 event and the pending L1 event
> have been conflated in this field, at least when lapic_in_kernel() is
> false.

First of all, I hope we agree that when lapic_in_kernel()==true,
interrupt.pending actually represents interrupt.injected.

With regard to lapic_in_kernel()==false:
"interrupt.pending" can currently only be set by user-mode in the
following ways:
1. KVM_INTERRUPT ioctl.
2. KVM_SET_VCPU_EVENTS ioctl.
3. KVM_SET_SREGS ioctl.
Looking at how QEMU uses these ioctls, one can see that it uses them
either to re-set an "interrupt.injected" state it has received from KVM
(via KVM_GET_VCPU_EVENTS interrupt.injected or via KVM_GET_SREGS
interrupt_bitmap), or to dispatch a new interrupt from QEMU's emulated
LAPIC, which resets the bit in IRR and sets the bit in ISR before
sending the ioctl to KVM.

So it is true that "interrupt.pending" in KVM always represents
"interrupt.injected".

One could argue, however, that kvm_cpu_has_interrupt() &
kvm_cpu_has_injectable_intr() specifically misuse
interrupt.injected in order to report whether there is a pending
interrupt when lapic_in_kernel()==false. I agree with this. However,
all other places in KVM code, as far as I can tell, don't misuse this
flag and treat it as "interrupt.injected".

Because the issue with these functions has always existed, and it is
unrelated to the fixes introduced by this series (and especially this
nVMX fix patch), I suggest modifying my patch ("[PATCH v2 2/8] KVM: x86:
Rename interrupt.pending to interrupt.injected") to also add a FIXME
comment on top of these places.

What do you think?

-Liran

> [snip: remainder of quoted thread, unchanged from the messages above]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
  2017-11-28 11:14         ` Paolo Bonzini
@ 2017-11-28 13:59           ` Liran Alon
  0 siblings, 0 replies; 32+ messages in thread
From: Liran Alon @ 2017-11-28 13:59 UTC (permalink / raw)
  To: Paolo Bonzini, Jim Mattson
  Cc: Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Konrad Rzeszutek Wilk



On 28/11/17 13:14, Paolo Bonzini wrote:
> On 28/11/2017 05:55, Jim Mattson wrote:
>> kvm_queue_interrupt() begins as follows:
>>          vcpu->arch.interrupt.pending = true;
>>
>> kvm_cpu_has_interrupt() begins as follows:
>>          if (!lapic_in_kernel(v))
>>                  return v->arch.interrupt.pending;
>>
>> In the referenced [patch 2/8], you change interrupt.pending to
>> interrupt.injected, but the same field is still referenced by these
>> two functions.
>
> We cannot remove the !lapic_in_kernel(v) case, but it's okay if we
> restrict nested VMX/SVM in CPUID when it is disabled (that is, check for
> !lapic_in_kernel in nested_svm_check_permissions and nested_vmx_allowed,
> so that setting VMXE and SVME will fail).

I agree with this suggestion.

I think it is best for now to add a commit before ("[PATCH v2 2/8]
KVM: x86: Rename interrupt.pending to interrupt.injected") that does
two things: add FIXME comments to kvm_cpu_has_interrupt() &
kvm_cpu_has_injectable_intr() explaining why their use of
interrupt.pending is problematic in nVMX/nSVM scenarios, and add the
checks Paolo mentioned above to nested_svm_check_permissions() and
nested_vmx_allowed().
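
Something like the following minimal sketch is what I have in mind for
the nVMX side (placement inside nested_vmx_allowed() is just my
assumption; the existing body is kept as-is, and nSVM would get the
equivalent test in nested_svm_check_permissions()):

	/*
	 * Sketch only: refuse nested VMX when there is no in-kernel
	 * LAPIC, so the !lapic_in_kernel() fast paths in
	 * kvm_cpu_has_interrupt() / kvm_cpu_has_injectable_intr()
	 * can never be reached from nVMX code.
	 */
	static inline bool nested_vmx_allowed(struct kvm_vcpu *vcpu)
	{
		if (!lapic_in_kernel(vcpu))
			return false;

		return nested && guest_cpuid_has(vcpu, X86_FEATURE_VMX);
	}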

Regards,
-Liran
>
> Thanks,
>
> Paolo
>

* Re: [PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected
  2017-11-21 15:30 ` [PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected Liran Alon
@ 2017-11-28 17:02   ` Jim Mattson
  0 siblings, 0 replies; 32+ messages in thread
From: Jim Mattson @ 2017-11-28 17:02 UTC (permalink / raw)
  To: Liran Alon
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Krish Sadhukhan

This makes sense to me, though as we've discussed elsewhere, the
existing uses in x86.h are questionable.

Reviewed-by: Jim Mattson <jmattson@google.com>

On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
> For exception & NMI events, KVM code uses the following
> coding convention:
> *) "pending" represents an event that should be injected into the
> guest at some point, but whose side-effects have not yet occurred.
> *) "injected" represents an event whose side-effects have already
> occurred.
>
> However, interrupts don't conform to this coding convention.
> All current code flows mark interrupt.pending when its side-effects
> have already taken place (for example, the bit moved from the LAPIC
> IRR to the ISR). Therefore, it makes sense to just rename
> interrupt.pending to interrupt.injected.
>
> This change follows logic of previous commit 664f8e26b00c ("KVM: X86:
> Fix loss of exception which has not yet been injected") which changed
> exception to follow this coding convention as well.
>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 2 +-
>  arch/x86/kvm/irq.c              | 4 ++--
>  arch/x86/kvm/vmx.c              | 2 +-
>  arch/x86/kvm/x86.c              | 8 ++++----
>  arch/x86/kvm/x86.h              | 6 +++---
>  5 files changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1bfb99770c34..f8ad3ca11a3a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -556,7 +556,7 @@ struct kvm_vcpu_arch {
>         } exception;
>
>         struct kvm_queued_interrupt {
> -               bool pending;
> +               bool injected;
>                 bool soft;
>                 u8 nr;
>         } interrupt;
> diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
> index 5c24811e8b0b..1f7f37d1c8b9 100644
> --- a/arch/x86/kvm/irq.c
> +++ b/arch/x86/kvm/irq.c
> @@ -74,7 +74,7 @@ static int kvm_cpu_has_extint(struct kvm_vcpu *v)
>  int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
>  {
>         if (!lapic_in_kernel(v))
> -               return v->arch.interrupt.pending;
> +               return v->arch.interrupt.injected;
>
>         if (kvm_cpu_has_extint(v))
>                 return 1;
> @@ -92,7 +92,7 @@ int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
>  int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
>  {
>         if (!lapic_in_kernel(v))
> -               return v->arch.interrupt.pending;
> +               return v->arch.interrupt.injected;
>
>         if (kvm_cpu_has_extint(v))
>                 return 1;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index c8a7bcc1bbd4..d939ed84f136 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -11007,7 +11007,7 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
>         } else if (vcpu->arch.nmi_injected) {
>                 vmcs12->idt_vectoring_info_field =
>                         INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR;
> -       } else if (vcpu->arch.interrupt.pending) {
> +       } else if (vcpu->arch.interrupt.injected) {
>                 nr = vcpu->arch.interrupt.nr;
>                 idt_vectoring = nr | VECTORING_INFO_VALID_MASK;
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c85fc4406a7d..45baba8bc02e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3111,7 +3111,7 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>         events->exception.error_code = vcpu->arch.exception.error_code;
>
>         events->interrupt.injected =
> -               vcpu->arch.interrupt.pending && !vcpu->arch.interrupt.soft;
> +               vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
>         events->interrupt.nr = vcpu->arch.interrupt.nr;
>         events->interrupt.soft = 0;
>         events->interrupt.shadow = kvm_x86_ops->get_interrupt_shadow(vcpu);
> @@ -3164,7 +3164,7 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
>         vcpu->arch.exception.has_error_code = events->exception.has_error_code;
>         vcpu->arch.exception.error_code = events->exception.error_code;
>
> -       vcpu->arch.interrupt.pending = events->interrupt.injected;
> +       vcpu->arch.interrupt.injected = events->interrupt.injected;
>         vcpu->arch.interrupt.nr = events->interrupt.nr;
>         vcpu->arch.interrupt.soft = events->interrupt.soft;
>         if (events->flags & KVM_VCPUEVENT_VALID_SHADOW)
> @@ -6406,7 +6406,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>                         return 0;
>                 }
>
> -               if (vcpu->arch.interrupt.pending) {
> +               if (vcpu->arch.interrupt.injected) {
>                         kvm_x86_ops->set_irq(vcpu);
>                         return 0;
>                 }
> @@ -7413,7 +7413,7 @@ int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
>
>         memset(sregs->interrupt_bitmap, 0, sizeof sregs->interrupt_bitmap);
>
> -       if (vcpu->arch.interrupt.pending && !vcpu->arch.interrupt.soft)
> +       if (vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft)
>                 set_bit(vcpu->arch.interrupt.nr,
>                         (unsigned long *)sregs->interrupt_bitmap);
>
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 6d112d8f799c..4eab2bae5937 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -19,19 +19,19 @@ static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
>  static inline void kvm_queue_interrupt(struct kvm_vcpu *vcpu, u8 vector,
>         bool soft)
>  {
> -       vcpu->arch.interrupt.pending = true;
> +       vcpu->arch.interrupt.injected = true;
>         vcpu->arch.interrupt.soft = soft;
>         vcpu->arch.interrupt.nr = vector;
>  }
>
>  static inline void kvm_clear_interrupt_queue(struct kvm_vcpu *vcpu)
>  {
> -       vcpu->arch.interrupt.pending = false;
> +       vcpu->arch.interrupt.injected = false;
>  }
>
>  static inline bool kvm_event_needs_reinjection(struct kvm_vcpu *vcpu)
>  {
> -       return vcpu->arch.exception.injected || vcpu->arch.interrupt.pending ||
> +       return vcpu->arch.exception.injected || vcpu->arch.interrupt.injected ||
>                 vcpu->arch.nmi_injected;
>  }
>
> --
> 1.9.1
>

* Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
  2017-11-28  6:39   ` Jim Mattson
@ 2017-11-28 18:26     ` Jim Mattson
  2017-11-28 19:45       ` Liran Alon
  2017-11-28 19:33     ` Liran Alon
  1 sibling, 1 reply; 32+ messages in thread
From: Jim Mattson @ 2017-11-28 18:26 UTC (permalink / raw)
  To: Liran Alon
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Konrad Rzeszutek Wilk

One other thing to note is that there is no place to put the
interrupted IDT vectoring information for L2 virtual interrupt 210
when we synthesize a VM-exit from L2 to L1 for L1 "physical" interrupt
252. I see that vmcs12_save_pending_event() shoves this information
into the VMCS12 IDT-vectoring information field, but it should not do
so when the VMCS12 exit reason is "exception or non-maskable
interrupt."

From the SDM, volume 3, section 27.2.3:

Section 24.9.3 defined fields containing information for VM exits that
occur while delivering an event through the IDT and as a result of any
of the following cases:
o  A fault occurs during event delivery and causes a VM exit (because
the bit associated with the fault is set to 1 in the exception
bitmap).
o  A task switch is invoked through a task gate in the IDT. The VM
exit occurs due to the task switch only after the initial checks of
the task switch pass (see Section 25.4.2).
o  Event delivery causes an APIC-access VM exit (see Section 29.4).
o  An EPT violation, EPT misconfiguration, or page-modification
log-full event that occurs during event delivery.

For all other exit reasons (including "exception or non-maskable
interrupt"), the valid bit of the IDT-vectoring info field should be
clear.
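
So vmcs12_save_pending_event() probably wants a guard along these lines
(a sketch of the idea only, not a tested patch; the exact set of exit
reasons to cover may need more thought):

	/*
	 * Per SDM vol. 3, 27.2.3: for an "exception or NMI" exit the
	 * valid bit of vmcs12's IDT-vectoring info field must be clear.
	 */
	if (vmcs12->vm_exit_reason == EXIT_REASON_EXCEPTION_NMI) {
		vmcs12->idt_vectoring_info_field = 0;
		return;
	}
	/* ... otherwise fill idt_vectoring_info_field as today ... */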

On Mon, Nov 27, 2017 at 10:39 PM, Jim Mattson <jmattson@google.com> wrote:
> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
>> In case of L2 VMExit to L0 during event-delivery, VMCS02 is filled with
>> IDT-vectoring-info which vmx_complete_interrupts() makes sure to
>> reinject before next resume of L2.
>>
>> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
>> to the L1 vCPU which currently runs L2 and exited to L0.
>>
>> When L0 reaches vcpu_enter_guest() and calls inject_pending_event(),
>> it will note that a previous event was re-injected to L2 (by
>> IDT-vectoring-info) and therefore won't check if there are pending L1
>> events which require exit from L2 to L1. Thus, L0 enters L2 without
>> immediate VMExit even though there are pending L1 events!
>>
>> This commit fixes the issue by making sure to check for L1 pending
>> events even if a previous event was reinjected to L2 and bailing out
>> from inject_pending_event() before evaluating a new pending event in
>> case an event was already reinjected.
>>
>> The bug was observed by the following setup:
>> * L0 is a 64CPU machine which runs KVM.
>> * L1 is a 16CPU machine which runs KVM.
>> * L0 & L1 runs with APICv disabled.
>> (Also reproduced with APICv enabled but easier to analyze below info
>> with APICv disabled)
>> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest.
>> During L2 boot, L1 hangs completely and analyzing the hang reveals that
>> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
>> that it has sent to another L1 vCPU. And all other L1 vCPUs are
>> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
>> forever (as L1 runs with kernel-preemption disabled).
>>
>> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
>> series of events:
>> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
>> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
>
> This looks like an EPT violation for writing a stack page that wasn't
> present in the vmcs02 shadow EPT tables. I assume that L0 filled in
> the missing shadow EPT table entry (or entries) and dismissed this EPT
> violation itself rather than forwarding it to L1, because L1's EPT
> tables have a valid entry for this L2 guest-physical address.
>
>> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
>> vec 252 (Fixed|edge)
>
> This is the source of the bug, right here. An interrupt can only be
> accepted at an instruction boundary, and the virtual CPU is in the
> midst of IDT-vectoring, which is not at an instruction boundary. The
> EPT violation from L2 to L0 is an L0 artifact that should be
> completely invisible to L1/L2. Interrupt 252 should remain pending
> until we are at the first instruction of the IDT handler for interrupt
> 210.
>
> To be more precise, if interrupt 210 was delivered via VM-entry event
> injection on L1 VM-entry to L2, then we need to:
> a) complete the event injection (i.e. vector through the L2 IDT for vector 210),
> b) handle virtual VMX-preemption timer expiration during VM-entry (if any),
> c) handle NMI-window exiting events (if any),
> d) handle L0 NMI (as an L2 to L1 VM-exit or by vectoring through the
> L2 IDT, depending on the NMI-exiting control),
> e) handle interrupt-window exiting events,
> f) handle L0 maskable interrupts (as an L2 to L1 VM-exit or by
> vectoring through the L2 IDT, depending on the external-interrupt
> exiting control).
> ...
>
> Only when we get to step (f) can we accept a new IRQ from L0's local APIC.
>
>> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
>> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
>> rip 0xffffe00069202690 info 83 0
>> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
>> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
>> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
>> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
>> ext_int: 0x00000000 ext_int_err: 0x00000000
>> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>>
>> Which can be analyzed as follows:
>> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
>> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
>> a pending-interrupt of 0xd2.
>> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
>> vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT.
>> (3) L0 reaches vcpu_enter_guest() which calls inject_pending_event() which
>> notes that interrupt 0xd2 was reinjected and therefore calls
>> vmx_inject_irq() and returns. Without checking for pending L1 events!
>> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
>> before calling inject_pending_event().
>> (4) L0 resumes L2 without immediate-exit even though there is a pending
>> L1 event (The IPI pending in LAPIC's IRR).
>>
>> We have already reached the buggy scenario but events could be
>> further analyzed:
>> (5+6) L2 VMExit to L0 on EPT_VIOLATION.  This time not during
>> event-delivery.
>> (7) L0 decides to forward the VMExit to L1 for further handling.
>> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
>> LAPIC's IRR is not examined and therefore the IPI is still not delivered
>> into L1!
>>
>> Signed-off-by: Liran Alon <liran.alon@oracle.com>
>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> ---
>>  arch/x86/kvm/x86.c | 33 ++++++++++++++++++++-------------
>>  1 file changed, 20 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index a4b5a013064b..63359ab0e798 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>         int r;
>>
>>         /* try to reinject previous events if any */
>> -       if (vcpu->arch.exception.injected) {
>> -               kvm_x86_ops->queue_exception(vcpu);
>> -               return 0;
>> -       }
>>
>> +       if (vcpu->arch.exception.injected)
>> +               kvm_x86_ops->queue_exception(vcpu);
>>         /*
>>          * NMI/interrupt must not be injected if an exception is
>>          * pending, because the latter may have been queued by
>>          * handling exit due to NMI/interrupt delivery.
>>          */
>> -       if (!vcpu->arch.exception.pending) {
>> -               if (vcpu->arch.nmi_injected) {
>> +       else if (!vcpu->arch.exception.pending) {
>> +               if (vcpu->arch.nmi_injected)
>>                         kvm_x86_ops->set_nmi(vcpu);
>> -                       return 0;
>> -               }
>> -
>> -               if (vcpu->arch.interrupt.injected) {
>> +               else if (vcpu->arch.interrupt.injected)
>>                         kvm_x86_ops->set_irq(vcpu);
>> -                       return 0;
>> -               }
>>         }
>>
>> +       /*
>> +        * Call check_nested_events() even if we reinjected a previous event
>> +        * in order for caller to determine if it should require immediate-exit
>> +        * from L2 to L1 due to pending L1 events which require exit
>> +        * from L2 to L1.
>> +        */
>>         if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
>>                 r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
>>                 if (r != 0)
>> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>                                         vcpu->arch.exception.has_error_code,
>>                                         vcpu->arch.exception.error_code);
>>
>> +               WARN_ON_ONCE(vcpu->arch.exception.injected);
>>                 vcpu->arch.exception.pending = false;
>>                 vcpu->arch.exception.injected = true;
>>
>> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>                 }
>>
>>                 kvm_x86_ops->queue_exception(vcpu);
>> -       } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) {
>> +       }
>> +
>> +       /* Don't consider new event if we re-injected an event */
>> +       if (kvm_event_needs_reinjection(vcpu))
>> +               return 0;
>> +
>> +       if (vcpu->arch.smi_pending && !is_smm(vcpu) &&
>> +           kvm_x86_ops->smi_allowed(vcpu)) {
>>                 vcpu->arch.smi_pending = false;
>>                 enter_smm(vcpu);
>>         } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) {
>> --
>> 1.9.1
>>

* Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
  2017-11-28  6:39   ` Jim Mattson
  2017-11-28 18:26     ` Jim Mattson
@ 2017-11-28 19:33     ` Liran Alon
  1 sibling, 0 replies; 32+ messages in thread
From: Liran Alon @ 2017-11-28 19:33 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Konrad Rzeszutek Wilk



On 28/11/17 08:39, Jim Mattson wrote:
> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
>> In case of L2 VMExit to L0 during event-delivery, VMCS02 is filled with
>> IDT-vectoring-info which vmx_complete_interrupts() makes sure to
>> reinject before next resume of L2.
>>
>> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
>> to the L1 vCPU which currently runs L2 and exited to L0.
>>
>> When L0 reaches vcpu_enter_guest() and calls inject_pending_event(),
>> it will note that a previous event was re-injected to L2 (by
>> IDT-vectoring-info) and therefore won't check if there are pending L1
>> events which require exit from L2 to L1. Thus, L0 enters L2 without
>> immediate VMExit even though there are pending L1 events!
>>
>> This commit fixes the issue by making sure to check for L1 pending
>> events even if a previous event was reinjected to L2 and bailing out
>> from inject_pending_event() before evaluating a new pending event in
>> case an event was already reinjected.
>>
>> The bug was observed by the following setup:
>> * L0 is a 64CPU machine which runs KVM.
>> * L1 is a 16CPU machine which runs KVM.
>> * L0 & L1 runs with APICv disabled.
>> (Also reproduced with APICv enabled but easier to analyze below info
>> with APICv disabled)
>> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest.
>> During L2 boot, L1 hangs completely and analyzing the hang reveals that
>> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
>> that it has sent to another L1 vCPU. And all other L1 vCPUs are
>> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
>> forever (as L1 runs with kernel-preemption disabled).
>>
>> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
>> series of events:
>> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
>> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
>
> This looks like an EPT violation for writing a stack page that wasn't
> present in the vmcs02 shadow EPT tables. I assume that L0 filled in
> the missing shadow EPT table entry (or entries) and dismissed this EPT
> violation itself rather than forwarding it to L1, because L1's EPT
> tables have a valid entry for this L2 guest-physical address.
Yes. This is how I see it as well. One can see in (4) that L0 resumed L2 
without exiting to L1.

>
>> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
>> vec 252 (Fixed|edge)
>
> This is the source of the bug, right here. An interrupt can only be
> accepted at an instruction boundary, and the virtual CPU is in the
> midst of IDT-vectoring, which is not at an instruction boundary. The
> EPT violation from L2 to L0 is an L0 artifact that should be
> completely invisible to L1/L2. Interrupt 252 should remain pending
> until we are at the first instruction of the IDT handler for interrupt
> 210.
The trace kvm_apic_accept_irq is printed from __apic_accept_irq().
This function is called by the IPI *sender* CPU after it has found the
vCPU LAPIC to which the IPI should be delivered.
See how the function is called in the following code path:
kvm_lapic_reg_write() APIC_ICR handler -> apic_send_ipi() ->
kvm_irq_delivery_to_apic() -> kvm_apic_set_irq() -> __apic_accept_irq().

In addition, following __apic_accept_irq()'s APIC_DM_FIXED handler, one
can see how the IPI is actually delivered to the dest CPU. In this case
APICv is disabled and therefore the IPI is basically delivered by
setting the relevant bit in the dest LAPIC's IRR, setting KVM_REQ_EVENT
and kicking the dest vCPU in case it is blocked / inside the guest.
This will cause the dest CPU to eventually reach vcpu_enter_guest(),
which will consume KVM_REQ_EVENT and call inject_pending_event(). Only
then is it noticed that kvm_cpu_has_injectable_intr() returns true (as
the bit is set in the LAPIC IRR), and kvm_x86_ops->set_irq() is called
to actually inject the IRQ into the guest by writing it to the VMCS.

Therefore, this is not the source of the bug in my opinion. The trace
only shows the interrupt being marked pending in the IRR; it isn't
really accepted at that point at all, let alone at an instruction
boundary...
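
To make the path concrete, this is roughly what the APIC_DM_FIXED arm
of __apic_accept_irq() does when APICv is disabled (trimmed sketch,
trigger-mode and statistics details elided):

	case APIC_DM_FIXED:
		/* sender records the vector in the dest LAPIC's IRR */
		kvm_lapic_set_irr(vector, apic);
		/* ask the dest vCPU to re-evaluate pending events */
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		/* kick it out of guest-mode / blocked state */
		kvm_vcpu_kick(vcpu);
		break;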
>
> To be more precise, if interrupt 210 was delivered via VM-entry event
> injection on L1 VM-entry to L2, then we need to:
> a) complete the event injection (i.e. vector through the L2 IDT for vector 210),
> b) handle virtual VMX-preemption timer expiration during VM-entry (if any),
> c) handle NMI-window exiting events (if any),
> d) handle L0 NMI (as an L2 to L1 VM-exit or by vectoring through the
> L2 IDT, depending on the NMI-exiting control),
> e) handle interrupt-window exiting events,
> f) handle L0 maskable interrupts (as an L2 to L1 VM-exit or by
> vectoring through the L2 IDT, depending on the external-interrupt
> exiting control).
> ...
>
> Only when we get to step (f) can we accept a new IRQ from L0's local APIC.

This is exactly what inject_pending_event() should be doing after this 
commit.

It starts by checking if there is some event to "re-inject", which is
basically step (a). Then it calls check_nested_events(), which basically
handles steps (b)+(c)+(d)+(e). Then it evaluates whether there is a
pending exception to inject (it could have been queued while handling an
exit due to event-delivery) and, if so, injects it into the guest.
Only then do we check if we have already injected some event into the
guest. If so, we resume the guest with the injected event already
written in the VMCS. Otherwise, we continue evaluating more pending
events such as pending L1 SMIs, NMIs or LAPIC interrupts and inject
them if possible.
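
Schematically, inject_pending_event() after this patch looks roughly
like this (condensed sketch; error handling and the exception
pending->injected bookkeeping are elided):

	static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
	{
		int r;

		/* step (a): finish re-injecting a previously delivered event */
		if (vcpu->arch.exception.injected)
			kvm_x86_ops->queue_exception(vcpu);
		else if (vcpu->arch.nmi_injected)
			kvm_x86_ops->set_nmi(vcpu);
		else if (vcpu->arch.interrupt.injected)
			kvm_x86_ops->set_irq(vcpu);

		/*
		 * steps (b)-(e): pending L1 events may require an exit from
		 * L2 to L1; a non-zero return (-EBUSY) makes the caller arm
		 * an immediate-exit. Called even if we re-injected above.
		 */
		if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
			r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
			if (r != 0)
				return r;
		}

		/* inject an exception queued while handling the exit */
		if (vcpu->arch.exception.pending)
			kvm_x86_ops->queue_exception(vcpu);

		/* a re-injected event blocks evaluation of new events */
		if (kvm_event_needs_reinjection(vcpu))
			return 0;

		/* step (f) etc.: new pending SMI / NMI / LAPIC interrupt */
		/* ... */
		return 0;
	}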

The problem this patch attempts to fix is that before this patch, after
performing step (a), if we indeed "re-injected" some event, we would not
evaluate at all whether there is some pending event which should require
us to exit from L2 to L1.
In other words, what happened here is that completing the "re-injection"
of the event whose delivery caused the EPT_VIOLATION blocked us from
evaluating the pending L1 IPI which was sent by another CPU, and
therefore we missed an exit on EXTERNAL_INTERRUPT from L2 to L1.

Regards,
-Liran
>
>> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
>> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
>> rip 0xffffe00069202690 info 83 0
>> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
>> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
>> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
>> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
>> ext_int: 0x00000000 ext_int_err: 0x00000000
>> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>>
>> Which can be analyzed as follows:
>> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
>> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
>> a pending-interrupt of 0xd2.
>> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
>> vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT.
>> (3) L0 reaches vcpu_enter_guest() which calls inject_pending_event() which
>> notes that interrupt 0xd2 was reinjected and therefore calls
>> vmx_inject_irq() and returns. Without checking for pending L1 events!
>> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
>> before calling inject_pending_event().
>> (4) L0 resumes L2 without immediate-exit even though there is a pending
>> L1 event (The IPI pending in LAPIC's IRR).
>>
>> We have already reached the buggy scenario but events could be
>> further analyzed:
>> (5+6) L2 VMExit to L0 on EPT_VIOLATION.  This time not during
>> event-delivery.
>> (7) L0 decides to forward the VMExit to L1 for further handling.
>> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
>> LAPIC's IRR is not examined and therefore the IPI is still not delivered
>> into L1!
>>
>> Signed-off-by: Liran Alon <liran.alon@oracle.com>
>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> ---
>>   arch/x86/kvm/x86.c | 33 ++++++++++++++++++++-------------
>>   1 file changed, 20 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index a4b5a013064b..63359ab0e798 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>          int r;
>>
>>          /* try to reinject previous events if any */
>> -       if (vcpu->arch.exception.injected) {
>> -               kvm_x86_ops->queue_exception(vcpu);
>> -               return 0;
>> -       }
>>
>> +       if (vcpu->arch.exception.injected)
>> +               kvm_x86_ops->queue_exception(vcpu);
>>          /*
>>           * NMI/interrupt must not be injected if an exception is
>>           * pending, because the latter may have been queued by
>>           * handling exit due to NMI/interrupt delivery.
>>           */
>> -       if (!vcpu->arch.exception.pending) {
>> -               if (vcpu->arch.nmi_injected) {
>> +       else if (!vcpu->arch.exception.pending) {
>> +               if (vcpu->arch.nmi_injected)
>>                          kvm_x86_ops->set_nmi(vcpu);
>> -                       return 0;
>> -               }
>> -
>> -               if (vcpu->arch.interrupt.injected) {
>> +               else if (vcpu->arch.interrupt.injected)
>>                          kvm_x86_ops->set_irq(vcpu);
>> -                       return 0;
>> -               }
>>          }
>>
>> +       /*
>> +        * Call check_nested_events() even if we reinjected a previous event
>> +        * in order for caller to determine if it should require immediate-exit
>> +        * from L2 to L1 due to pending L1 events which require exit
>> +        * from L2 to L1.
>> +        */
>>          if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
>>                  r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
>>                  if (r != 0)
>> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>                                          vcpu->arch.exception.has_error_code,
>>                                          vcpu->arch.exception.error_code);
>>
>> +               WARN_ON_ONCE(vcpu->arch.exception.injected);
>>                  vcpu->arch.exception.pending = false;
>>                  vcpu->arch.exception.injected = true;
>>
>> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>                  }
>>
>>                  kvm_x86_ops->queue_exception(vcpu);
>> -       } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) {
>> +       }
>> +
>> +       /* Don't consider new event if we re-injected an event */
>> +       if (kvm_event_needs_reinjection(vcpu))
>> +               return 0;
>> +
>> +       if (vcpu->arch.smi_pending && !is_smm(vcpu) &&
>> +           kvm_x86_ops->smi_allowed(vcpu)) {
>>                  vcpu->arch.smi_pending = false;
>>                  enter_smm(vcpu);
>>          } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) {
>> --
>> 1.9.1
>>

* Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
  2017-11-28 18:26     ` Jim Mattson
@ 2017-11-28 19:45       ` Liran Alon
  2017-11-28 21:04         ` Jim Mattson
  0 siblings, 1 reply; 32+ messages in thread
From: Liran Alon @ 2017-11-28 19:45 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Konrad Rzeszutek Wilk



On 28/11/17 20:26, Jim Mattson wrote:
> One other thing to note is that there is no place to put the
> interrupted IDT vectoring information for L2 virtual interrupt 210
> when we synthesize a VM-exit from L2 to L1 for L1 "physical" interrupt
> 252. I see that vmcs12_save_pending_event() shoves this information
> into the VMCS12 IDT-vectoring information field, but it should not do
> so when the VMCS12 exit reason is "exception or non-maskable
> interrupt."
>
> From the SDM, volume 3, section 27.2.3:
>
> Section 24.9.3 defined fields containing information for VM exits that
> occur while delivering an event through the IDT and as a result of any
> of the following cases:
> o  A fault occurs during event delivery and causes a VM exit (because
> the bit associated with the fault is set to 1 in the exception
> bitmap).
> o  A task switch is invoked through a task gate in the IDT. The VM
> exit occurs due to the task switch only after the initial checks of
> the task switch pass (see Section 25.4.2).
> o  Event delivery causes an APIC-access VM exit (see Section 29.4).
> o  An EPT violation, EPT misconfiguration, or page-modification
> log-full event that occurs during event delivery.
>
> For all other exit reasons (including "exception or non-maskable
> interrupt"), the valid bit of the IDT-vectoring info field should be
> clear.

Interesting.

In the case described in the commit-message of this patch, this is not
relevant: because kvm_event_needs_reinjection()==true,
vmx_check_nested_events() won't really do an exit from L2 to L1 on
EXTERNAL_INTERRUPT in this case. Rather, it will return -EBUSY, which
will cause vcpu_enter_guest() to request an "immediate-exit" (via
self-IPI; see the code in vcpu_enter_guest() regarding the
req_immediate_exit variable for more info). This will make the guest
complete the event-injection of interrupt 210 but immediately
afterwards exit to L0, which will then exit from L2 to L1 on
EXTERNAL_INTERRUPT because of L1 interrupt 252.
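
For reference, the relevant vcpu_enter_guest() logic is roughly
(trimmed excerpt):

	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
		/* ... */
		if (inject_pending_event(vcpu, req_int_win) != 0)
			req_immediate_exit = true;
		/* ... */
	}
	/* ... */
	if (req_immediate_exit) {
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		/* self-IPI: forces a VMExit right after VM-entry */
		smp_send_reschedule(vcpu->cpu);
	}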

However, you made me notice there is another issue here in the
vmx_check_nested_events() handling. If the event-delivery had caused a
fault that needs to cause an exit from L2 to L1, then currently
vmx_check_nested_events() would see that
kvm_event_needs_reinjection()==true and therefore not exit from L2 to
L1 even though it should.
So this is just another bug to fix, I think...
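
Schematically, the problematic pattern is something like this
(pseudocode; the fault-check helper name is illustrative, not an
actual KVM symbol):

	if (l1_intercepts_pending_fault(vcpu)) {
		if (kvm_event_needs_reinjection(vcpu))
			return -EBUSY;	/* never performs the L2->L1 exit here */
		/* should instead do: nested_vmx_vmexit(vcpu, ...); */
	}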

Nice catch.
-Liran

>
> On Mon, Nov 27, 2017 at 10:39 PM, Jim Mattson <jmattson@google.com> wrote:
>> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
>>> In case of L2 VMExit to L0 during event-delivery, VMCS02 is filled with
>>> IDT-vectoring-info which vmx_complete_interrupts() makes sure to
>>> reinject before next resume of L2.
>>>
>>> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
>>> to the L1 vCPU which currently runs L2 and exited to L0.
>>>
>>> When L0 reaches vcpu_enter_guest() and calls inject_pending_event(),
>>> it will note that a previous event was re-injected to L2 (by
>>> IDT-vectoring-info) and therefore won't check if there are pending L1
>>> events which require exit from L2 to L1. Thus, L0 enters L2 without
>>> immediate VMExit even though there are pending L1 events!
>>>
>>> This commit fixes the issue by making sure to check for L1 pending
>>> events even if a previous event was reinjected to L2 and bailing out
>>> from inject_pending_event() before evaluating a new pending event in
>>> case an event was already reinjected.
>>>
>>> The bug was observed by the following setup:
>>> * L0 is a 64CPU machine which runs KVM.
>>> * L1 is a 16CPU machine which runs KVM.
>>> * L0 & L1 runs with APICv disabled.
>>> (Also reproduced with APICv enabled but easier to analyze below info
>>> with APICv disabled)
>>> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest.
>>> During L2 boot, L1 hangs completely and analyzing the hang reveals that
>>> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
>>> that it has sent to another L1 vCPU. And all other L1 vCPUs are
>>> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
>>> forever (as L1 runs with kernel-preemption disabled).
>>>
>>> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
>>> series of events:
>>> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>>> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
>>> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
>>
>> This looks like an EPT violation for writing a stack page that wasn't
>> present in the vmcs02 shadow EPT tables. I assume that L0 filled in
>> the missing shadow EPT table entry (or entries) and dismissed this EPT
>> violation itself rather than forwarding it to L1, because L1's EPT
>> tables have a valid entry for this L2 guest-physical address.
>>
>>> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
>>> vec 252 (Fixed|edge)
>>
>> This is the source of the bug, right here. An interrupt can only be
>> accepted at an instruction boundary, and the virtual CPU is in the
>> midst of IDT-vectoring, which is not at an instruction boundary. The
>> EPT violation from L2 to L0 is an L0 artifact that should be
>> completely invisible to L1/L2. Interrupt 252 should remain pending
>> until we are at the first instruction of the IDT handler for interrupt
>> 210.
>>
>> To be more precise, if interrupt 210 was delivered via VM-entry event
>> injection on L1 VM-entry to L2, then we need to:
>> a) complete the event injection (i.e. vector through the L2 IDT for vector 210),
>> b) handle virtual VMX-preemption timer expiration during VM-entry (if any),
>> c) handle NMI-window exiting events (if any),
>> d) handle L0 NMI (as an L2 to L1 VM-exit or by vectoring through the
>> L2 IDT, depending on the NMI-exiting control),
>> e) handle interrupt-window exiting events,
>> f) handle L0 maskable interrupts (as an L2 to L1 VM-exit or by
>> vectoring through the L2 IDT, depending on the external-interrupt
>> exiting control).
>> ...
>>
>> Only when we get to step (f) can we accept a new IRQ from L0's local APIC.
>>
>>> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
>>> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>>> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
>>> rip 0xffffe00069202690 info 83 0
>>> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>>> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
>>> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
>>> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
>>> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
>>> ext_int: 0x00000000 ext_int_err: 0x00000000
>>> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>>>
>>> Which can be analyzed as follows:
>>> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
>>> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
>>> a pending-interrupt of 0xd2.
>>> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
>>> vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT.
>>> (3) L0 reaches vcpu_enter_guest() which calls inject_pending_event() which
>>> notes that interrupt 0xd2 was reinjected and therefore calls
>>> vmx_inject_irq() and returns. Without checking for pending L1 events!
>>> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
>>> before calling inject_pending_event().
>>> (4) L0 resumes L2 without immediate-exit even though there is a pending
>>> L1 event (The IPI pending in LAPIC's IRR).
>>>
>>> We have already reached the buggy scenario but events could be
>>> further analyzed:
>>> (5+6) L2 VMExit to L0 on EPT_VIOLATION.  This time not during
>>> event-delivery.
>>> (7) L0 decides to forward the VMExit to L1 for further handling.
>>> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
>>> LAPIC's IRR is not examined and therefore the IPI is still not delivered
>>> into L1!
>>>
>>> Signed-off-by: Liran Alon <liran.alon@oracle.com>
>>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>>> ---
>>>   arch/x86/kvm/x86.c | 33 ++++++++++++++++++++-------------
>>>   1 file changed, 20 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index a4b5a013064b..63359ab0e798 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>>          int r;
>>>
>>>          /* try to reinject previous events if any */
>>> -       if (vcpu->arch.exception.injected) {
>>> -               kvm_x86_ops->queue_exception(vcpu);
>>> -               return 0;
>>> -       }
>>>
>>> +       if (vcpu->arch.exception.injected)
>>> +               kvm_x86_ops->queue_exception(vcpu);
>>>          /*
>>>           * NMI/interrupt must not be injected if an exception is
>>>           * pending, because the latter may have been queued by
>>>           * handling exit due to NMI/interrupt delivery.
>>>           */
>>> -       if (!vcpu->arch.exception.pending) {
>>> -               if (vcpu->arch.nmi_injected) {
>>> +       else if (!vcpu->arch.exception.pending) {
>>> +               if (vcpu->arch.nmi_injected)
>>>                          kvm_x86_ops->set_nmi(vcpu);
>>> -                       return 0;
>>> -               }
>>> -
>>> -               if (vcpu->arch.interrupt.injected) {
>>> +               else if (vcpu->arch.interrupt.injected)
>>>                          kvm_x86_ops->set_irq(vcpu);
>>> -                       return 0;
>>> -               }
>>>          }
>>>
>>> +       /*
>>> +        * Call check_nested_events() even if we reinjected a previous event
>>> +        * in order for caller to determine if it should require immediate-exit
>>> +        * from L2 to L1 due to pending L1 events which require exit
>>> +        * from L2 to L1.
>>> +        */
>>>          if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
>>>                  r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
>>>                  if (r != 0)
>>> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>>                                          vcpu->arch.exception.has_error_code,
>>>                                          vcpu->arch.exception.error_code);
>>>
>>> +               WARN_ON_ONCE(vcpu->arch.exception.injected);
>>>                  vcpu->arch.exception.pending = false;
>>>                  vcpu->arch.exception.injected = true;
>>>
>>> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>>                  }
>>>
>>>                  kvm_x86_ops->queue_exception(vcpu);
>>> -       } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) {
>>> +       }
>>> +
>>> +       /* Don't consider new event if we re-injected an event */
>>> +       if (kvm_event_needs_reinjection(vcpu))
>>> +               return 0;
>>> +
>>> +       if (vcpu->arch.smi_pending && !is_smm(vcpu) &&
>>> +           kvm_x86_ops->smi_allowed(vcpu)) {
>>>                  vcpu->arch.smi_pending = false;
>>>                  enter_smm(vcpu);
>>>          } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) {
>>> --
>>> 1.9.1
>>>

* Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
  2017-11-28 19:45       ` Liran Alon
@ 2017-11-28 21:04         ` Jim Mattson
  0 siblings, 0 replies; 32+ messages in thread
From: Jim Mattson @ 2017-11-28 21:04 UTC (permalink / raw)
  To: Liran Alon
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Konrad Rzeszutek Wilk

Thanks for the explanation (and for your patience)!

Reviewed-by: Jim Mattson <jmattson@google.com>

On Tue, Nov 28, 2017 at 11:45 AM, Liran Alon <LIRAN.ALON@oracle.com> wrote:
>
>
> On 28/11/17 20:26, Jim Mattson wrote:
>>
>> One other thing to note is that there is no place to put the
>> interrupted IDT vectoring information for L2 virtual interrupt 210
>> when we synthesize a VM-exit from L2 to L1 for L1 "physical" interrupt
>> 252. I see that vmcs12_save_pending_event() shoves this information
>> into the VMCS12 IDT-vectoring information field, but it should not do
>> so when the VMCS12 exit reason is "exception or non-maskable
>> interrupt."
>>
>> From the SDM, volume 3, section 27.2.3:
>>
>> Section 24.9.3 defined fields containing information for VM exits that
>> occur while delivering an event through the IDT and as a result of any
>> of the following cases:
>> o  A fault occurs during event delivery and causes a VM exit (because
>> the bit associated with the fault is set to 1 in the exception
>> bitmap).
>> o  A task switch is invoked through a task gate in the IDT. The VM
>> exit occurs due to the task switch only after the initial checks of
>> the task switch pass (see Section 25.4.2).
>> o  Event delivery causes an APIC-access VM exit (see Section 29.4).
>> o  An EPT violation, EPT misconfiguration, or page-modification
>> log-full event that occurs during event delivery.
>>
>> For all other exit reasons (including "exception or non-maskable
>> interrupt"), the valid bit of the IDT-vectoring info field should be
>> clear.
>
>
> Interesting.
>
> In the case described in the commit-message of this patch, this is not
> relevant: because kvm_event_needs_reinjection()==true,
> vmx_check_nested_events() won't really do an exit from L2 to L1 on
> EXTERNAL_INTERRUPT in this case. But rather it will return -EBUSY which will
> cause vcpu_enter_guest() to request an "immediate-exit" (via self-IPI. See
> code in vcpu_enter_guest() regarding req_immediate_exit variable for more
> info). This will make guest complete the event-injection of interrupt 210
> but immediately afterwards exit to L0 which will then exit from L2 to L1 on
> EXTERNAL_INTERRUPT because of L1 interrupt 252.
>
> However, you made me notice there is another issue here in
> vmx_check_nested_events() handling. If the event-delivery would have caused
> a fault that needs to cause exit from L2 to L1, then currently
> vmx_check_nested_events() will see that kvm_event_needs_reinjection()==true
> and therefore not exit from L2 to L1 even though it should.
> So this is just another bug to fix, I think...
>
> Nice catch.
> -Liran
>
>
>>
>> On Mon, Nov 27, 2017 at 10:39 PM, Jim Mattson <jmattson@google.com> wrote:
>>>
>>> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com>
>>> wrote:
>>>>
>>>> In case of L2 VMExit to L0 during event-delivery, VMCS02 is filled with
>>>> IDT-vectoring-info which vmx_complete_interrupts() makes sure to
>>>> reinject before next resume of L2.
>>>>
>>>> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
>>>> to the L1 vCPU which currently runs L2 and exited to L0.
>>>>
>>>> When L0 reaches vcpu_enter_guest() and calls inject_pending_event(),
>>>> it will note that a previous event was re-injected to L2 (by
>>>> IDT-vectoring-info) and therefore won't check if there are pending L1
>>>> events which require exit from L2 to L1. Thus, L0 enters L2 without
>>>> immediate VMExit even though there are pending L1 events!
>>>>
>>>> This commit fixes the issue by making sure to check for L1 pending
>>>> events even if a previous event was reinjected to L2 and bailing out
>>>> from inject_pending_event() before evaluating a new pending event in
>>>> case an event was already reinjected.
>>>>
>>>> The bug was observed by the following setup:
>>>> * L0 is a 64CPU machine which runs KVM.
>>>> * L1 is a 16CPU machine which runs KVM.
>>>> * L0 & L1 runs with APICv disabled.
>>>> (Also reproduced with APICv enabled but easier to analyze below info
>>>> with APICv disabled)
>>>> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest.
>>>> During L2 boot, L1 hangs completely and analyzing the hang reveals that
>>>> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
>>>> that it has sent to another L1 vCPU. And all other L1 vCPUs are
>>>> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
>>>> forever (as L1 runs with kernel-preemption disabled).
>>>>
>>>> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
>>>> series of events:
>>>> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>>>> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
>>>> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
>>>
>>>
>>> This looks like an EPT violation for writing a stack page that wasn't
>>> present in the vmcs02 shadow EPT tables. I assume that L0 filled in
>>> the missing shadow EPT table entry (or entries) and dismissed this EPT
>>> violation itself rather than forwarding it to L1, because L1's EPT
>>> tables have a valid entry for this L2 guest-physical address.
>>>
>>>> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
>>>> vec 252 (Fixed|edge)
>>>
>>>
>>> This is the source of the bug, right here. An interrupt can only be
>>> accepted at an instruction boundary, and the virtual CPU is in the
>>> midst of IDT-vectoring, which is not at an instruction boundary. The
>>> EPT violation from L2 to L0 is an L0 artifact that should be
>>> completely invisible to L1/L2. Interrupt 252 should remain pending
>>> until we are at the first instruction of the IDT handler for interrupt
>>> 210.
>>>
>>> To be more precise, if interrupt 210 was delivered via VM-entry event
>>> injection on L1 VM-entry to L2, then we need to:
>>> a) complete the event injection (i.e. vector through the L2 IDT for
>>> vector 210),
>>> b) handle virtual VMX-preemption timer expiration during VM-entry (if
>>> any),
>>> c) handle NMI-window exiting events (if any),
>>> d) handle L0 NMI (as an L2 to L1 VM-exit or by vectoring through the
>>> L2 IDT, depending on the NMI-exiting control),
>>> e) handle interrupt-window exiting events,
>>> f) handle L0 maskable interrupts (as an L2 to L1 VM-exit or by
>>> vectoring through the L2 IDT, depending on the external-interrupt
>>> exiting control).
>>> ...
>>>
>>> Only when we get to step (f) can we accept a new IRQ from L0's local
>>> APIC.
>>>
>>>> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
>>>> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>>>> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
>>>> rip 0xffffe00069202690 info 83 0
>>>> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>>>> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
>>>> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
>>>> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
>>>> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
>>>> ext_int: 0x00000000 ext_int_err: 0x00000000
>>>> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>>>>
>>>> Which can be analyzed as follows:
>>>> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
>>>> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
>>>> a pending-interrupt of 0xd2.
>>>> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
>>>> vCPU 15. This will set relevant bit in LAPIC's IRR and set
>>>> KVM_REQ_EVENT.
>>>> (3) L0 reaches vcpu_enter_guest() which calls inject_pending_event() which
>>>> notes that interrupt 0xd2 was reinjected and therefore calls
>>>> vmx_inject_irq() and returns. Without checking for pending L1 events!
>>>> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
>>>> before calling inject_pending_event().
>>>> (4) L0 resumes L2 without immediate-exit even though there is a pending
>>>> L1 event (The IPI pending in LAPIC's IRR).
>>>>
>>>> We have already reached the buggy scenario but events could be
>>>> further analyzed:
>>>> (5+6) L2 VMExit to L0 on EPT_VIOLATION.  This time not during
>>>> event-delivery.
>>>> (7) L0 decides to forward the VMExit to L1 for further handling.
>>>> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
>>>> LAPIC's IRR is not examined and therefore the IPI is still not delivered
>>>> into L1!
>>>>
>>>> Signed-off-by: Liran Alon <liran.alon@oracle.com>
>>>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>>>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>>>> ---
>>>>   arch/x86/kvm/x86.c | 33 ++++++++++++++++++++-------------
>>>>   1 file changed, 20 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index a4b5a013064b..63359ab0e798 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu
>>>> *vcpu, bool req_int_win)
>>>>          int r;
>>>>
>>>>          /* try to reinject previous events if any */
>>>> -       if (vcpu->arch.exception.injected) {
>>>> -               kvm_x86_ops->queue_exception(vcpu);
>>>> -               return 0;
>>>> -       }
>>>>
>>>> +       if (vcpu->arch.exception.injected)
>>>> +               kvm_x86_ops->queue_exception(vcpu);
>>>>          /*
>>>>           * NMI/interrupt must not be injected if an exception is
>>>>           * pending, because the latter may have been queued by
>>>>           * handling exit due to NMI/interrupt delivery.
>>>>           */
>>>> -       if (!vcpu->arch.exception.pending) {
>>>> -               if (vcpu->arch.nmi_injected) {
>>>> +       else if (!vcpu->arch.exception.pending) {
>>>> +               if (vcpu->arch.nmi_injected)
>>>>                          kvm_x86_ops->set_nmi(vcpu);
>>>> -                       return 0;
>>>> -               }
>>>> -
>>>> -               if (vcpu->arch.interrupt.injected) {
>>>> +               else if (vcpu->arch.interrupt.injected)
>>>>                          kvm_x86_ops->set_irq(vcpu);
>>>> -                       return 0;
>>>> -               }
>>>>          }
>>>>
>>>> +       /*
>>>> +        * Call check_nested_events() even if we reinjected a previous
>>>> event
>>>> +        * in order for caller to determine if it should require
>>>> immediate-exit
>>>> +        * from L2 to L1 due to pending L1 events which require exit
>>>> +        * from L2 to L1.
>>>> +        */
>>>>          if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
>>>>                  r = kvm_x86_ops->check_nested_events(vcpu,
>>>> req_int_win);
>>>>                  if (r != 0)
>>>> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu
>>>> *vcpu, bool req_int_win)
>>>>
>>>> vcpu->arch.exception.has_error_code,
>>>>
>>>> vcpu->arch.exception.error_code);
>>>>
>>>> +               WARN_ON_ONCE(vcpu->arch.exception.injected);
>>>>                  vcpu->arch.exception.pending = false;
>>>>                  vcpu->arch.exception.injected = true;
>>>>
>>>> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu
>>>> *vcpu, bool req_int_win)
>>>>                  }
>>>>
>>>>                  kvm_x86_ops->queue_exception(vcpu);
>>>> -       } else if (vcpu->arch.smi_pending && !is_smm(vcpu) &&
>>>> kvm_x86_ops->smi_allowed(vcpu)) {
>>>> +       }
>>>> +
>>>> +       /* Don't consider new event if we re-injected an event */
>>>> +       if (kvm_event_needs_reinjection(vcpu))
>>>> +               return 0;
>>>> +
>>>> +       if (vcpu->arch.smi_pending && !is_smm(vcpu) &&
>>>> +           kvm_x86_ops->smi_allowed(vcpu)) {
>>>>                  vcpu->arch.smi_pending = false;
>>>>                  enter_smm(vcpu);
>>>>          } else if (vcpu->arch.nmi_pending &&
>>>> kvm_x86_ops->nmi_allowed(vcpu)) {
>>>> --
>>>> 1.9.1
>>>>
>

* Re: [PATCH v2 1/8] KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt
  2017-11-21 15:30 ` [PATCH v2 1/8] KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt Liran Alon
@ 2017-12-01 23:45   ` Jim Mattson
  2017-12-02  0:19     ` Liran Alon
  0 siblings, 1 reply; 32+ messages in thread
From: Jim Mattson @ 2017-12-01 23:45 UTC (permalink / raw)
  To: Liran Alon
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Krish Sadhukhan

Reviewed-by: Jim Mattson <jmattson@google.com>

On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
> kvm_inject_realmode_interrupt() is called from one of the injection
> functions which write event-injection info to the VMCS:
> vmx_queue_exception(), vmx_inject_irq() and vmx_inject_nmi().
>
> All these functions are called just to cause an event-injection to
> the guest. They are not responsible for manipulating the event-pending
> flag. The only purpose of kvm_inject_realmode_interrupt() should be
> to emulate real-mode interrupt-injection.
>
> This was also incorrect when called from vmx_queue_exception().
>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> ---
>  arch/x86/kvm/x86.c | 5 -----
>  1 file changed, 5 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 34c85aa2e2d1..c85fc4406a7d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5411,11 +5411,6 @@ int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip)
>         kvm_rip_write(vcpu, ctxt->eip);
>         kvm_set_rflags(vcpu, ctxt->eflags);
>
> -       if (irq == NMI_VECTOR)
> -               vcpu->arch.nmi_pending = 0;
> -       else
> -               vcpu->arch.interrupt.pending = false;
> -
>         return EMULATE_DONE;
>  }
>  EXPORT_SYMBOL_GPL(kvm_inject_realmode_interrupt);
> --
> 1.9.1
>

* Re: [PATCH v2 1/8] KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt
  2017-12-01 23:45   ` Jim Mattson
@ 2017-12-02  0:19     ` Liran Alon
  0 siblings, 0 replies; 32+ messages in thread
From: Liran Alon @ 2017-12-02  0:19 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Radim Krčmář,
	kvm list, Wanpeng Li, Idan Brown, Krish Sadhukhan



On 02/12/17 01:45, Jim Mattson wrote:
> Reviewed-by: Jim Mattson <jmattson@google.com>
Thanks for the review.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2017-12-02  0:20 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-21 15:30 [PATCH v2 0/8] KVM: Fix multiple issues in handling pending/injected events Liran Alon
2017-11-21 15:30 ` [PATCH v2 1/8] KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt Liran Alon
2017-12-01 23:45   ` Jim Mattson
2017-12-02  0:19     ` Liran Alon
2017-11-21 15:30 ` [PATCH v2 2/8] KVM: x86: Rename interrupt.pending to interrupt.injected Liran Alon
2017-11-28 17:02   ` Jim Mattson
2017-11-21 15:30 ` [PATCH v2 3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions Liran Alon
2017-11-22 20:25   ` Radim Krčmář
2017-11-23 18:20     ` Liran Alon
2017-11-22 21:00   ` Jim Mattson
2017-11-23 18:45     ` Liran Alon
2017-11-23 23:05       ` Paolo Bonzini
2017-11-27 17:26       ` Jim Mattson
2017-11-27 18:30         ` Liran Alon
2017-11-21 15:30 ` [PATCH v2 4/8] KVM: x86: Warn if userspace overrides existing injected exception/interrupt Liran Alon
2017-11-22 20:34   ` Radim Krčmář
2017-11-22 22:27     ` Liran Alon
2017-11-21 15:30 ` [PATCH v2 5/8] Revert "kvm: nVMX: Disallow userspace-injected exceptions in guest mode" Liran Alon
2017-11-21 15:30 ` [PATCH v2 6/8] KVM: x86: Fix misleading comments on handling pending exceptions Liran Alon
2017-11-21 15:30 ` [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending Liran Alon
2017-11-27 20:48   ` Jim Mattson
2017-11-27 22:42     ` Liran Alon
2017-11-28  4:55       ` Jim Mattson
2017-11-28 11:14         ` Paolo Bonzini
2017-11-28 13:59           ` Liran Alon
2017-11-28 11:36         ` Liran Alon
2017-11-28  6:39   ` Jim Mattson
2017-11-28 18:26     ` Jim Mattson
2017-11-28 19:45       ` Liran Alon
2017-11-28 21:04         ` Jim Mattson
2017-11-28 19:33     ` Liran Alon
2017-11-21 15:30 ` [PATCH v2 8/8] KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending Liran Alon
