* [PATCH v3 00/11] KVM: nVMX: Fix multiple issues in nested posted-interrupts
@ 2017-12-24 16:12 Liran Alon
  2017-12-24 16:12 ` [PATCH v3 01/11] KVM: x86: Optimization: Create SVM stubs for sync_pir_to_irr() Liran Alon
                   ` (10 more replies)
  0 siblings, 11 replies; 57+ messages in thread
From: Liran Alon @ 2017-12-24 16:12 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm; +Cc: jmattson, wanpeng.li, idan.brown

Hi,

This series aims to fix multiple issues with nested-posted-interrupts.

The 1st patch is a simple optimization of the vcpu_enter_guest() hot-path
that removes the need to check whether sync_pir_to_irr != NULL.

The 2nd & 3rd patches fix an issue of not re-evaluating what should be
done with a new pending L1 interrupt that is discovered by syncing PIR
to IRR just before resuming the L2 guest. In most cases this pending L1
event should result in an external-interrupt exit from L2 to L1.
Currently, however, we just continue resuming L2, which is wrong.

The 4th patch cleans up and fixes the handling of directly injecting an
L1 interrupt into L2 when L1 doesn't intercept external-interrupts. The
current handling of this case doesn't correctly consider the LAPIC TPR
and doesn't update its IRR & ISR after injecting the interrupt into L2.
Fix this by using the standard interrupt-injection code-path in this
scenario as well.

The 5th-9th patches fix multiple issues in sending &
dispatching nested posted-interrupts. The patches fix these issues by
checking whether nested posted-interrupts are pending before each
vmentry and, if so, using a self-IPI to make the hardware dispatch them
instead of emulating the behavior in software.

The 10th & 11th patches fix bugs where a halted L2 is not woken up when
L1 sends it a nested posted-interrupt and L1 doesn't intercept HLT.

Regards,
-Liran Alon

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v3 01/11] KVM: x86: Optimization: Create SVM stubs for sync_pir_to_irr()
  2017-12-24 16:12 [PATCH v3 00/11] KVM: nVMX: Fix multiple issues in nested posted-interrupts Liran Alon
@ 2017-12-24 16:12 ` Liran Alon
  2017-12-27  9:56   ` Paolo Bonzini
  2017-12-24 16:12 ` [PATCH v3 02/11] KVM: x86: Change __kvm_apic_update_irr() to also return if max IRR updated Liran Alon
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 57+ messages in thread
From: Liran Alon @ 2017-12-24 16:12 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Liam Merwick

sync_pir_to_irr() is only called when vcpu->arch.apicv_active is true.
When it is false, VMX code makes sure to set sync_pir_to_irr
to NULL.

Therefore, adding SVM stubs allows removing the sync_pir_to_irr != NULL
check from all call sites.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
---
 arch/x86/kvm/lapic.c |  2 +-
 arch/x86/kvm/svm.c   |  6 ++++++
 arch/x86/kvm/x86.c   | 10 ++++------
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e2c1fb8d35ce..0928608750e3 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -581,7 +581,7 @@ static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu)
 static int apic_has_interrupt_for_ppr(struct kvm_lapic *apic, u32 ppr)
 {
 	int highest_irr;
-	if (kvm_x86_ops->sync_pir_to_irr && apic->vcpu->arch.apicv_active)
+	if (apic->vcpu->arch.apicv_active)
 		highest_irr = kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
 	else
 		highest_irr = apic_find_highest_irr(apic);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index eb714f1cdf7e..99c42deb742b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4449,6 +4449,11 @@ static void svm_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
 {
 }
 
+static int svm_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+{
+	return -1;
+}
+
 /* Note: Currently only used by Hyper-V. */
 static void svm_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
@@ -5581,6 +5586,7 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
 	.load_eoi_exitmap = svm_load_eoi_exitmap,
 	.hwapic_irr_update = svm_hwapic_irr_update,
 	.hwapic_isr_update = svm_hwapic_isr_update,
+	.sync_pir_to_irr = svm_sync_pir_to_irr,
 	.apicv_post_state_restore = avic_post_state_restore,
 
 	.set_tss_addr = svm_set_tss_addr,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index faf843c9b916..82750791153e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2943,7 +2943,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
 				    struct kvm_lapic_state *s)
 {
-	if (kvm_x86_ops->sync_pir_to_irr && vcpu->arch.apicv_active)
+	if (vcpu->arch.apicv_active)
 		kvm_x86_ops->sync_pir_to_irr(vcpu);
 
 	return kvm_apic_get_state(vcpu, s);
@@ -6749,7 +6749,7 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
 	if (irqchip_split(vcpu->kvm))
 		kvm_scan_ioapic_routes(vcpu, vcpu->arch.ioapic_handled_vectors);
 	else {
-		if (kvm_x86_ops->sync_pir_to_irr && vcpu->arch.apicv_active)
+		if (vcpu->arch.apicv_active)
 			kvm_x86_ops->sync_pir_to_irr(vcpu);
 		kvm_ioapic_scan_entry(vcpu, vcpu->arch.ioapic_handled_vectors);
 	}
@@ -6981,10 +6981,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	 * This handles the case where a posted interrupt was
 	 * notified with kvm_vcpu_kick.
 	 */
-	if (kvm_lapic_enabled(vcpu)) {
-		if (kvm_x86_ops->sync_pir_to_irr && vcpu->arch.apicv_active)
-			kvm_x86_ops->sync_pir_to_irr(vcpu);
-	}
+	if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active)
+		kvm_x86_ops->sync_pir_to_irr(vcpu);
 
 	if (vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu)
 	    || need_resched() || signal_pending(current)) {
-- 
1.9.1


* [PATCH v3 02/11] KVM: x86: Change __kvm_apic_update_irr() to also return if max IRR updated
  2017-12-24 16:12 [PATCH v3 00/11] KVM: nVMX: Fix multiple issues in nested posted-interrupts Liran Alon
  2017-12-24 16:12 ` [PATCH v3 01/11] KVM: x86: Optimization: Create SVM stubs for sync_pir_to_irr() Liran Alon
@ 2017-12-24 16:12 ` Liran Alon
  2018-01-02  1:51   ` Quan Xu
  2017-12-24 16:12 ` [PATCH v3 03/11] KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got posted-interrupt Liran Alon
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 57+ messages in thread
From: Liran Alon @ 2017-12-24 16:12 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Liam Merwick

This commit doesn't change semantics.
It is done as a preparation for future commits.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
---
 arch/x86/kvm/lapic.c | 23 ++++++++++++++++-------
 arch/x86/kvm/lapic.h |  4 ++--
 arch/x86/kvm/vmx.c   |  5 +++--
 3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0928608750e3..924ac8ce9d50 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -364,32 +364,41 @@ static u8 count_vectors(void *bitmap)
 	return count;
 }
 
-int __kvm_apic_update_irr(u32 *pir, void *regs)
+bool __kvm_apic_update_irr(u32 *pir, void *regs, int *max_irr)
 {
 	u32 i, vec;
-	u32 pir_val, irr_val;
-	int max_irr = -1;
+	u32 pir_val, irr_val, prev_irr_val;
+	int max_updated_irr;
+
+	max_updated_irr = -1;
+	*max_irr = -1;
 
 	for (i = vec = 0; i <= 7; i++, vec += 32) {
 		pir_val = READ_ONCE(pir[i]);
 		irr_val = *((u32 *)(regs + APIC_IRR + i * 0x10));
 		if (pir_val) {
+			prev_irr_val = irr_val;
 			irr_val |= xchg(&pir[i], 0);
 			*((u32 *)(regs + APIC_IRR + i * 0x10)) = irr_val;
+			if (prev_irr_val != irr_val) {
+				max_updated_irr =
+					__fls(irr_val ^ prev_irr_val) + vec;
+			}
 		}
 		if (irr_val)
-			max_irr = __fls(irr_val) + vec;
+			*max_irr = __fls(irr_val) + vec;
 	}
 
-	return max_irr;
+	return ((max_updated_irr != -1) &&
+		(max_updated_irr == *max_irr));
 }
 EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);
 
-int kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
+bool kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir, int *max_irr)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
 
-	return __kvm_apic_update_irr(pir, apic->regs);
+	return __kvm_apic_update_irr(pir, apic->regs, max_irr);
 }
 EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
 
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 4b9935a38347..56c36014f7b7 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -75,8 +75,8 @@ int kvm_lapic_reg_read(struct kvm_lapic *apic, u32 offset, int len,
 bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
 			   int short_hand, unsigned int dest, int dest_mode);
 
-int __kvm_apic_update_irr(u32 *pir, void *regs);
-int kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir);
+bool __kvm_apic_update_irr(u32 *pir, void *regs, int *max_irr);
+bool kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir, int *max_irr);
 void kvm_apic_update_ppr(struct kvm_vcpu *vcpu);
 int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
 		     struct dest_map *dest_map);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8eba631c4dbd..325608a1ed65 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5062,7 +5062,8 @@ static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
 	max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
 	if (max_irr != 256) {
 		vapic_page = kmap(vmx->nested.virtual_apic_page);
-		__kvm_apic_update_irr(vmx->nested.pi_desc->pir, vapic_page);
+		__kvm_apic_update_irr(vmx->nested.pi_desc->pir,
+			vapic_page, &max_irr);
 		kunmap(vmx->nested.virtual_apic_page);
 
 		status = vmcs_read16(GUEST_INTR_STATUS);
@@ -9040,7 +9041,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 		 * But on x86 this is just a compiler barrier anyway.
 		 */
 		smp_mb__after_atomic();
-		max_irr = kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
+		kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr);
 	} else {
 		max_irr = kvm_lapic_find_highest_irr(vcpu);
 	}
-- 
1.9.1


* [PATCH v3 03/11] KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got posted-interrupt
  2017-12-24 16:12 [PATCH v3 00/11] KVM: nVMX: Fix multiple issues in nested posted-interrupts Liran Alon
  2017-12-24 16:12 ` [PATCH v3 01/11] KVM: x86: Optimization: Create SVM stubs for sync_pir_to_irr() Liran Alon
  2017-12-24 16:12 ` [PATCH v3 02/11] KVM: x86: Change __kvm_apic_update_irr() to also return if max IRR updated Liran Alon
@ 2017-12-24 16:12 ` Liran Alon
  2018-01-02  2:45   ` Quan Xu
  2018-01-03  5:35   ` Quan Xu
  2017-12-24 16:12 ` [PATCH v3 04/11] KVM: nVMX: Fix injection to L2 when L1 don't intercept external-interrupts Liran Alon
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 57+ messages in thread
From: Liran Alon @ 2017-12-24 16:12 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Liam Merwick,
	Konrad Rzeszutek Wilk

In case a posted-interrupt is delivered to the CPU while it is in host
context (outside the guest), posted-interrupt delivery is done by
calling sync_pir_to_irr() at vmentry, after interrupts are disabled.

sync_pir_to_irr() checks the ON bit of vmx->pi_desc.control and, if
set, syncs vmx->pi_desc.pir to the IRR and then updates RVI to ensure
virtual-interrupt-delivery dispatches the interrupt to the guest.

However, it is possible that L1 receives a posted-interrupt while the
CPU runs in the host and is about to enter L2. In this case, the call
to sync_pir_to_irr() will indeed update L1's APIC IRR, but
vcpu_enter_guest() will then just resume the L2 guest without
re-evaluating whether it should exit from L2 to L1 as a result of the
new pending L1 event.

To address this case, if sync_pir_to_irr() finds a new injectable L1
interrupt and the CPU is running L2, we force an exit from GUEST_MODE.
This results in another iteration of the vcpu_run() loop, which calls
kvm_vcpu_running(), which calls check_nested_events(), which handles
the pending L1 event properly.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/kvm/vmx.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 325608a1ed65..d307bf26462a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9032,6 +9032,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	int max_irr;
+	bool max_irr_updated;
 
 	WARN_ON(!vcpu->arch.apicv_active);
 	if (pi_test_on(&vmx->pi_desc)) {
@@ -9041,7 +9042,16 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 		 * But on x86 this is just a compiler barrier anyway.
 		 */
 		smp_mb__after_atomic();
-		kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr);
+		max_irr_updated =
+			kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr);
+
+		/*
+		 * If we are running L2 and L1 has a new pending interrupt
+		 * which can be injected, we should re-evaluate
+		 * what should be done with this new L1 interrupt.
+		 */
+		if (is_guest_mode(vcpu) && max_irr_updated)
+			kvm_vcpu_exiting_guest_mode(vcpu);
 	} else {
 		max_irr = kvm_lapic_find_highest_irr(vcpu);
 	}
-- 
1.9.1


* [PATCH v3 04/11] KVM: nVMX: Fix injection to L2 when L1 don't intercept external-interrupts
  2017-12-24 16:12 [PATCH v3 00/11] KVM: nVMX: Fix multiple issues in nested posted-interrupts Liran Alon
                   ` (2 preceding siblings ...)
  2017-12-24 16:12 ` [PATCH v3 03/11] KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got posted-interrupt Liran Alon
@ 2017-12-24 16:12 ` Liran Alon
  2017-12-24 16:12 ` [PATCH v3 05/11] KVM: x86: Rename functions which saves vCPU in per-cpu var Liran Alon
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 57+ messages in thread
From: Liran Alon @ 2017-12-24 16:12 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Liam Merwick,
	Konrad Rzeszutek Wilk

Before each vmentry to the guest, vcpu_enter_guest() calls
sync_pir_to_irr(), which calls vmx_hwapic_irr_update() to update RVI.
Currently, vmx_hwapic_irr_update() contains a tweak for the case where
it is called while the CPU is running L2 and L1 doesn't intercept
external-interrupts. In that case, the code injects the interrupt
directly into L2 instead of updating RVI.

Besides being hacky (one wouldn't expect a function that updates RVI to
also inject an interrupt), it also doesn't handle this case correctly.
The code contains several issues:
1. When the code calls kvm_queue_interrupt() it just passes max_irr,
which is the highest IRR currently pending in the L1 LAPIC.
This is problematic because the interrupt is injected into the guest,
but its bit is still set in the LAPIC IRR instead of being cleared from
the IRR and set in the ISR.
2. The code doesn't check whether the LAPIC PPR allows accepting an
interrupt of max_irr priority. It only checks whether interrupts are
enabled in the guest with vmx_interrupt_allowed().

To fix the above issues:
1. Simplify vmx_hwapic_irr_update() to only update RVI.
Note that this shouldn't happen when the CPU is running L2
(see comment in code).
2. Since vmx_hwapic_irr_update() now only handles L1
virtual-interrupt-delivery, inject_pending_event() becomes responsible
for injecting the interrupt directly into L2.
Therefore, change kvm_cpu_has_injectable_intr() to check the L1
LAPIC when the CPU is running L2.
3. Change vmx_sync_pir_to_irr() to set KVM_REQ_EVENT when L1
has a pending injectable interrupt.

Fixes: 963fee165660 ("KVM: nVMX: Fix virtual interrupt delivery injection")

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/kvm/irq.c |  2 +-
 arch/x86/kvm/vmx.c | 41 +++++++++++++++++------------------------
 2 files changed, 18 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 5c24811e8b0b..f171051eecf3 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -79,7 +79,7 @@ int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
 	if (kvm_cpu_has_extint(v))
 		return 1;
 
-	if (kvm_vcpu_apicv_active(v))
+	if (!is_guest_mode(v) && kvm_vcpu_apicv_active(v))
 		return 0;
 
 	return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d307bf26462a..41daf73cc528 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9002,30 +9002,16 @@ static void vmx_set_rvi(int vector)
 
 static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
 {
-	if (!is_guest_mode(vcpu)) {
-		vmx_set_rvi(max_irr);
-		return;
-	}
-
-	if (max_irr == -1)
-		return;
-
 	/*
-	 * In guest mode.  If a vmexit is needed, vmx_check_nested_events
-	 * handles it.
+	 * When running L2, updating RVI is only relevant when
+	 * vmcs12 virtual-interrupt-delivery enabled.
+	 * However, it can be enabled only when L1 also
+	 * intercepts external-interrupts and in that case
+	 * we should not update vmcs02 RVI but instead intercept
+	 * interrupt. Therefore, do nothing when running L2.
 	 */
-	if (nested_exit_on_intr(vcpu))
-		return;
-
-	/*
-	 * Else, fall back to pre-APICv interrupt injection since L2
-	 * is run without virtual interrupt delivery.
-	 */
-	if (!kvm_event_needs_reinjection(vcpu) &&
-	    vmx_interrupt_allowed(vcpu)) {
-		kvm_queue_interrupt(vcpu, max_irr, false);
-		vmx_inject_irq(vcpu);
-	}
+	if (!is_guest_mode(vcpu))
+		vmx_set_rvi(max_irr);
 }
 
 static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
@@ -9049,9 +9035,16 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 		 * If we are running L2 and L1 has a new pending interrupt
 		 * which can be injected, we should re-evaluate
 		 * what should be done with this new L1 interrupt.
+		 * If L1 intercepts external-interrupts, we should
+		 * exit from L2 to L1. Otherwise, interrupt should be
+		 * delivered directly to L2.
 		 */
-		if (is_guest_mode(vcpu) && max_irr_updated)
-			kvm_vcpu_exiting_guest_mode(vcpu);
+		if (is_guest_mode(vcpu) && max_irr_updated) {
+			if (nested_exit_on_intr(vcpu))
+				kvm_vcpu_exiting_guest_mode(vcpu);
+			else
+				kvm_make_request(KVM_REQ_EVENT, vcpu);
+		}
 	} else {
 		max_irr = kvm_lapic_find_highest_irr(vcpu);
 	}
-- 
1.9.1


* [PATCH v3 05/11] KVM: x86: Rename functions which saves vCPU in per-cpu var
  2017-12-24 16:12 [PATCH v3 00/11] KVM: nVMX: Fix multiple issues in nested posted-interrupts Liran Alon
                   ` (3 preceding siblings ...)
  2017-12-24 16:12 ` [PATCH v3 04/11] KVM: nVMX: Fix injection to L2 when L1 don't intercept external-interrupts Liran Alon
@ 2017-12-24 16:12 ` Liran Alon
  2017-12-24 16:12 ` [PATCH v3 06/11] KVM: x86: Set current_vcpu per-cpu var before enabling interrupts at host Liran Alon
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 57+ messages in thread
From: Liran Alon @ 2017-12-24 16:12 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Liam Merwick

This commit doesn't change semantics.
It is done as a preparation for future commits.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
---
 arch/x86/kvm/svm.c | 4 ++--
 arch/x86/kvm/vmx.c | 4 ++--
 arch/x86/kvm/x86.c | 8 ++++----
 arch/x86/kvm/x86.h | 4 ++--
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 99c42deb742b..cf51471d993b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -5039,14 +5039,14 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 	vcpu->arch.regs[VCPU_REGS_RIP] = svm->vmcb->save.rip;
 
 	if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
-		kvm_before_handle_nmi(&svm->vcpu);
+		kvm_before_handle_host_interrupts(&svm->vcpu);
 
 	stgi();
 
 	/* Any pending NMI will happen here */
 
 	if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
-		kvm_after_handle_nmi(&svm->vcpu);
+		kvm_after_handle_host_interrupts(&svm->vcpu);
 
 	sync_cr8_to_lapic(vcpu);
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 41daf73cc528..34ffe929fb41 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9095,9 +9095,9 @@ static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
 
 	/* We need to handle NMIs before interrupts are enabled */
 	if (is_nmi(exit_intr_info)) {
-		kvm_before_handle_nmi(&vmx->vcpu);
+		kvm_before_handle_host_interrupts(&vmx->vcpu);
 		asm("int $2");
-		kvm_after_handle_nmi(&vmx->vcpu);
+		kvm_after_handle_host_interrupts(&vmx->vcpu);
 	}
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 82750791153e..8dfcdc8a0878 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6041,17 +6041,17 @@ static unsigned long kvm_get_guest_ip(void)
 	.get_guest_ip		= kvm_get_guest_ip,
 };
 
-void kvm_before_handle_nmi(struct kvm_vcpu *vcpu)
+void kvm_before_handle_host_interrupts(struct kvm_vcpu *vcpu)
 {
 	__this_cpu_write(current_vcpu, vcpu);
 }
-EXPORT_SYMBOL_GPL(kvm_before_handle_nmi);
+EXPORT_SYMBOL_GPL(kvm_before_handle_host_interrupts);
 
-void kvm_after_handle_nmi(struct kvm_vcpu *vcpu)
+void kvm_after_handle_host_interrupts(struct kvm_vcpu *vcpu)
 {
 	__this_cpu_write(current_vcpu, NULL);
 }
-EXPORT_SYMBOL_GPL(kvm_after_handle_nmi);
+EXPORT_SYMBOL_GPL(kvm_after_handle_host_interrupts);
 
 static void kvm_set_mmio_spte_mask(void)
 {
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index d0b95b7a90b4..3737884cc64d 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -204,8 +204,8 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
 	return !(kvm->arch.disabled_quirks & quirk);
 }
 
-void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
-void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
+void kvm_before_handle_host_interrupts(struct kvm_vcpu *vcpu);
+void kvm_after_handle_host_interrupts(struct kvm_vcpu *vcpu);
 void kvm_set_pending_timer(struct kvm_vcpu *vcpu);
 int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
 
-- 
1.9.1


* [PATCH v3 06/11] KVM: x86: Set current_vcpu per-cpu var before enabling interrupts at host
  2017-12-24 16:12 [PATCH v3 00/11] KVM: nVMX: Fix multiple issues in nested posted-interrupts Liran Alon
                   ` (4 preceding siblings ...)
  2017-12-24 16:12 ` [PATCH v3 05/11] KVM: x86: Rename functions which saves vCPU in per-cpu var Liran Alon
@ 2017-12-24 16:12 ` Liran Alon
  2017-12-27 10:06   ` Paolo Bonzini
  2017-12-24 16:12 ` [PATCH v3 07/11] KVM: x86: Add util for getting current vCPU running on CPU Liran Alon
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 57+ messages in thread
From: Liran Alon @ 2017-12-24 16:12 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Liam Merwick

This commit doesn't change semantics.
It is done as a preparation for future commits.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
---
 arch/x86/kvm/x86.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8dfcdc8a0878..ccc5a10eff3d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6988,7 +6988,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	    || need_resched() || signal_pending(current)) {
 		vcpu->mode = OUTSIDE_GUEST_MODE;
 		smp_wmb();
+		kvm_before_handle_host_interrupts(vcpu);
 		local_irq_enable();
+		kvm_after_handle_host_interrupts(vcpu);
 		preempt_enable();
 		vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
 		r = 1;
@@ -7056,7 +7058,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	guest_exit_irqoff();
 
+	kvm_before_handle_host_interrupts(vcpu);
 	local_irq_enable();
+	kvm_after_handle_host_interrupts(vcpu);
 	preempt_enable();
 
 	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-- 
1.9.1


* [PATCH v3 07/11] KVM: x86: Add util for getting current vCPU running on CPU
  2017-12-24 16:12 [PATCH v3 00/11] KVM: nVMX: Fix multiple issues in nested posted-interrupts Liran Alon
                   ` (5 preceding siblings ...)
  2017-12-24 16:12 ` [PATCH v3 06/11] KVM: x86: Set current_vcpu per-cpu var before enabling interrupts at host Liran Alon
@ 2017-12-24 16:12 ` Liran Alon
  2017-12-24 16:13 ` [PATCH v3 08/11] KVM: x86: Register empty handler for POSTED_INTR_NESTED_VECTOR IPI Liran Alon
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 57+ messages in thread
From: Liran Alon @ 2017-12-24 16:12 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Liam Merwick

This commit doesn't change semantics.
It is done as a preparation for future commits.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/x86.c              | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 516798431328..90c54d079bc1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1408,6 +1408,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
 int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu);
 int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err);
 
+struct kvm_vcpu *kvm_get_current_vcpu(void);
 int kvm_is_in_guest(void);
 
 int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ccc5a10eff3d..fc08f2cb7aa2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6010,6 +6010,12 @@ static void kvm_timer_init(void)
 
 static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
 
+struct kvm_vcpu *kvm_get_current_vcpu(void)
+{
+	return __this_cpu_read(current_vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_get_current_vcpu);
+
 int kvm_is_in_guest(void)
 {
 	return __this_cpu_read(current_vcpu) != NULL;
-- 
1.9.1


* [PATCH v3 08/11] KVM: x86: Register empty handler for POSTED_INTR_NESTED_VECTOR IPI
  2017-12-24 16:12 [PATCH v3 00/11] KVM: nVMX: Fix multiple issues in nested posted-interrupts Liran Alon
                   ` (6 preceding siblings ...)
  2017-12-24 16:12 ` [PATCH v3 07/11] KVM: x86: Add util for getting current vCPU running on CPU Liran Alon
@ 2017-12-24 16:13 ` Liran Alon
  2017-12-24 16:13 ` [PATCH v3 09/11] KVM: nVMX: Deliver missed nested-PI notification-vector via self-IPI while interrupts disabled Liran Alon
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 57+ messages in thread
From: Liran Alon @ 2017-12-24 16:13 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Liam Merwick

This commit doesn't change semantics.
It is done as a preparation for future commits.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
---
 arch/x86/include/asm/irq.h |  4 +++-
 arch/x86/kernel/irq.c      | 26 ++++++++++++++++++--------
 arch/x86/kvm/vmx.c         | 13 +++++++++++--
 3 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 2395bb794c7b..1a3f5cb27453 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -29,7 +29,9 @@ static inline int irq_canonicalize(int irq)
 extern void fixup_irqs(void);
 
 #ifdef CONFIG_HAVE_KVM
-extern void kvm_set_posted_intr_wakeup_handler(void (*handler)(void));
+extern void kvm_set_posted_intr_handlers(
+	void (*handler)(void),
+	void (*nested_handler)(void));
 #endif
 
 extern void (*x86_platform_ipi_callback)(void);
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 49cfd9fe7589..1cfe7856eea3 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -279,16 +279,25 @@ __visible void __irq_entry smp_x86_platform_ipi(struct pt_regs *regs)
 
 #ifdef CONFIG_HAVE_KVM
 static void dummy_handler(void) {}
-static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
+static void (*kvm_posted_intr_handler)(void) = dummy_handler;
+static void (*kvm_nested_posted_intr_handler)(void) = dummy_handler;
 
-void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
+void kvm_set_posted_intr_handlers(
+	void (*handler)(void),
+	void (*nested_handler)(void))
 {
 	if (handler)
-		kvm_posted_intr_wakeup_handler = handler;
+		kvm_posted_intr_handler = handler;
 	else
-		kvm_posted_intr_wakeup_handler = dummy_handler;
+		kvm_posted_intr_handler = dummy_handler;
+
+	if (nested_handler)
+		kvm_nested_posted_intr_handler = nested_handler;
+	else
+		kvm_nested_posted_intr_handler = dummy_handler;
+
 }
-EXPORT_SYMBOL_GPL(kvm_set_posted_intr_wakeup_handler);
+EXPORT_SYMBOL_GPL(kvm_set_posted_intr_handlers);
 
 /*
  * Handler for POSTED_INTERRUPT_VECTOR.
@@ -304,7 +313,7 @@ __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
 }
 
 /*
- * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
+ * Handler for POSTED_INTR_VECTOR.
  */
 __visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs)
 {
@@ -312,13 +321,13 @@ __visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs)
 
 	entering_ack_irq();
 	inc_irq_stat(kvm_posted_intr_wakeup_ipis);
-	kvm_posted_intr_wakeup_handler();
+	kvm_posted_intr_handler();
 	exiting_irq();
 	set_irq_regs(old_regs);
 }
 
 /*
- * Handler for POSTED_INTERRUPT_NESTED_VECTOR.
+ * Handler for POSTED_INTR_NESTED_VECTOR.
  */
 __visible void smp_kvm_posted_intr_nested_ipi(struct pt_regs *regs)
 {
@@ -326,6 +335,7 @@ __visible void smp_kvm_posted_intr_nested_ipi(struct pt_regs *regs)
 
 	entering_ack_irq();
 	inc_irq_stat(kvm_posted_intr_nested_ipis);
+	kvm_nested_posted_intr_handler();
 	exiting_irq();
 	set_irq_regs(old_regs);
 }
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 34ffe929fb41..cb8c52fb0591 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6706,7 +6706,8 @@ static void update_ple_window_actual_max(void)
 }
 
 /*
- * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
+ * Wake-up sleeping vCPUs on current physical CPU as a result
+ * of handling POSTED_INTR_VECTOR at host-side.
  */
 static void wakeup_handler(void)
 {
@@ -6724,6 +6725,13 @@ static void wakeup_handler(void)
 	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
 }
 
+/*
+ * Handler for POSTED_INTR_NESTED_VECTOR.
+ */
+static void nested_posted_intr_handler(void)
+{
+}
+
 void vmx_enable_tdp(void)
 {
 	kvm_mmu_set_mask_ptes(VMX_EPT_READABLE_MASK,
@@ -6898,7 +6906,8 @@ static __init int hardware_setup(void)
 		kvm_x86_ops->cancel_hv_timer = NULL;
 	}
 
-	kvm_set_posted_intr_wakeup_handler(wakeup_handler);
+	kvm_set_posted_intr_handlers(
+			wakeup_handler, nested_posted_intr_handler);
 
 	kvm_mce_cap_supported |= MCG_LMCE_P;
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v3 09/11] KVM: nVMX: Deliver missed nested-PI notification-vector via self-IPI while interrupts disabled
  2017-12-24 16:12 [PATCH v3 00/11] KVM: nVMX: Fix multiple issues in nested posted-interrupts Liran Alon
                   ` (7 preceding siblings ...)
  2017-12-24 16:13 ` [PATCH v3 08/11] KVM: x86: Register empty handler for POSTED_INTR_NESTED_VECTOR IPI Liran Alon
@ 2017-12-24 16:13 ` Liran Alon
  2017-12-24 16:13 ` [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt Liran Alon
  2017-12-24 16:13 ` [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending Liran Alon
  10 siblings, 0 replies; 57+ messages in thread
From: Liran Alon @ 2017-12-24 16:13 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Liam Merwick,
	Konrad Rzeszutek Wilk

When L1 wants to send a posted-interrupt to another L1 CPU running L2,
it sets the relevant bit in vmx->nested.pi_desc->pir and the ON bit in
vmx->nested.pi_desc->control. Then it attempts to send a
notification-vector IPI to the dest L1 CPU.
This IPI attempt exits to L0, which reaches
vmx_deliver_nested_posted_interrupt(). That function does the following:
1. If the dest L0 CPU is currently running the guest (vcpu->mode ==
IN_GUEST_MODE), it sends a physical IPI with the PI nested-notification-vector.
2. It sets KVM_REQ_EVENT on the dest vCPU. This is done so that if the dest
L0 CPU exits from guest to host and therefore doesn't recognize the physical
IPI (or it wasn't sent), then KVM_REQ_EVENT will be consumed on the next
vmentry, which will call vmx_check_nested_events(), which should
(in theory) call vmx_complete_nested_posted_interrupt(). That function
should see that the ON bit in vmx->nested.pi_desc->control is set and
therefore "emulate" posted-interrupt delivery for L1 (sync PIR to IRR in
the L1 virtual-apic-page & update vmcs02 RVI).

The above nested-posted-interrupts logic contains multiple issues:

A) Race-condition:
In step (1), it is possible that the sender sees vcpu->mode == IN_GUEST_MODE
but, before it sends the physical IPI, the dest CPU exits to host.
The physical IPI is then received at the host, where its handler
does nothing. In addition, assume that the dest CPU passes the checks
for pending kvm requests before the sender sets KVM_REQ_EVENT. In that
case, the dest CPU resumes L2 without evaluating nested-posted-interrupts.

B) vmx_complete_nested_posted_interrupt() is not always called
when needed. Even if the dest CPU consumed KVM_REQ_EVENT, there is a bug
where vmx_check_nested_events() can exit from L2 to L1 before calling
vmx_complete_nested_posted_interrupt(). Therefore, on the next resume of
L1 into L2, nested-posted-interrupts won't be evaluated, even though L0
resumes L2 (we may resume L2 without having KVM_REQ_EVENT set).

This commit removes nested-posted-interrupts processing from
check_nested_events() and instead makes sure to process
nested-posted-interrupts on vmentry, after interrupts are
disabled. Processing of nested-posted-interrupts is delegated
to hardware by issuing a self-IPI with the relevant notification-vector,
which will be delivered to the CPU once the CPU is in guest mode.

* Bug (A) is solved by the fact that processing of
nested-posted-interrupts is not dependent on KVM_REQ_EVENT and
happens before every vmentry to L2.
* Bug (B) is now trivially solved by processing
nested-posted-interrupts before each vmentry to L2 guest.

An alternative would have been to just call
vmx_complete_nested_posted_interrupt() at this call-site. However, we
decided to go with this approach because:
1. It would require modifying vmx_complete_nested_posted_interrupt()
to work with interrupts disabled (non-trivial).
2. We prefer to avoid software emulation of hardware behavior when
it is possible.

Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing")

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Co-authored-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/lapic.c            | 12 ++-----
 arch/x86/kvm/lapic.h            |  1 -
 arch/x86/kvm/svm.c              |  6 ++++
 arch/x86/kvm/vmx.c              | 70 ++++++++++++++++-------------------------
 arch/x86/kvm/x86.c              | 13 ++++++--
 6 files changed, 47 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 90c54d079bc1..4b8caff83dc7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -993,6 +993,7 @@ struct kvm_x86_ops {
 	void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
 	void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu, hpa_t hpa);
 	void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
+	void (*complete_nested_posted_interrupt)(struct kvm_vcpu *vcpu);
 	int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
 	int (*get_tdp_level)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 924ac8ce9d50..faa8f296d7ad 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -364,8 +364,10 @@ static u8 count_vectors(void *bitmap)
 	return count;
 }
 
-bool __kvm_apic_update_irr(u32 *pir, void *regs, int *max_irr)
+bool kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir, int *max_irr)
 {
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	void *regs = apic->regs;
 	u32 i, vec;
 	u32 pir_val, irr_val, prev_irr_val;
 	int max_updated_irr;
@@ -392,14 +394,6 @@ bool __kvm_apic_update_irr(u32 *pir, void *regs, int *max_irr)
 	return ((max_updated_irr != -1) &&
 		(max_updated_irr == *max_irr));
 }
-EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);
-
-bool kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir, int *max_irr)
-{
-	struct kvm_lapic *apic = vcpu->arch.apic;
-
-	return __kvm_apic_update_irr(pir, apic->regs, max_irr);
-}
 EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
 
 static inline int apic_search_irr(struct kvm_lapic *apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 56c36014f7b7..0ace7cc57057 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -75,7 +75,6 @@ int kvm_lapic_reg_read(struct kvm_lapic *apic, u32 offset, int len,
 bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
 			   int short_hand, unsigned int dest, int dest_mode);
 
-bool __kvm_apic_update_irr(u32 *pir, void *regs, int *max_irr);
 bool kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir, int *max_irr);
 void kvm_apic_update_ppr(struct kvm_vcpu *vcpu);
 int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index cf51471d993b..183f7aec29ca 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4454,6 +4454,10 @@ static int svm_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 	return -1;
 }
 
+static void svm_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
+{
+}
+
 /* Note: Currently only used by Hyper-V. */
 static void svm_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
@@ -5587,6 +5591,8 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
 	.hwapic_irr_update = svm_hwapic_irr_update,
 	.hwapic_isr_update = svm_hwapic_isr_update,
 	.sync_pir_to_irr = svm_sync_pir_to_irr,
+	.complete_nested_posted_interrupt =
+		svm_complete_nested_posted_interrupt,
 	.apicv_post_state_restore = avic_post_state_restore,
 
 	.set_tss_addr = svm_set_tss_addr,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index cb8c52fb0591..5f82cd9fc500 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -535,12 +535,6 @@ static bool pi_test_and_set_on(struct pi_desc *pi_desc)
 			(unsigned long *)&pi_desc->control);
 }
 
-static bool pi_test_and_clear_on(struct pi_desc *pi_desc)
-{
-	return test_and_clear_bit(POSTED_INTR_ON,
-			(unsigned long *)&pi_desc->control);
-}
-
 static int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
 {
 	return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
@@ -5044,39 +5038,6 @@ static void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu)
 	}
 }
 
-
-static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	int max_irr;
-	void *vapic_page;
-	u16 status;
-
-	if (!vmx->nested.pi_desc || !vmx->nested.pi_pending)
-		return;
-
-	vmx->nested.pi_pending = false;
-	if (!pi_test_and_clear_on(vmx->nested.pi_desc))
-		return;
-
-	max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
-	if (max_irr != 256) {
-		vapic_page = kmap(vmx->nested.virtual_apic_page);
-		__kvm_apic_update_irr(vmx->nested.pi_desc->pir,
-			vapic_page, &max_irr);
-		kunmap(vmx->nested.virtual_apic_page);
-
-		status = vmcs_read16(GUEST_INTR_STATUS);
-		if ((u8)max_irr > ((u8)status & 0xff)) {
-			status &= ~0xff;
-			status |= (u8)max_irr;
-			vmcs_write16(GUEST_INTR_STATUS, status);
-		}
-	}
-
-	nested_mark_vmcs12_pages_dirty(vcpu);
-}
-
 static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
 						     bool nested)
 {
@@ -5123,14 +5084,13 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
 
 	if (is_guest_mode(vcpu) &&
 	    vector == vmx->nested.posted_intr_nv) {
-		/* the PIR and ON have been set by L1. */
-		kvm_vcpu_trigger_posted_interrupt(vcpu, true);
 		/*
 		 * If a posted intr is not recognized by hardware,
 		 * we will accomplish it in the next vmentry.
 		 */
 		vmx->nested.pi_pending = true;
-		kvm_make_request(KVM_REQ_EVENT, vcpu);
+		/* the PIR and ON have been set by L1. */
+		kvm_vcpu_trigger_posted_interrupt(vcpu, true);
 		return 0;
 	}
 	return -1;
@@ -5162,6 +5122,24 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 		kvm_vcpu_kick(vcpu);
 }
 
+static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	WARN_ON(!vcpu->arch.apicv_active);
+
+	if (WARN_ON(!is_guest_mode(vcpu)) || !vmx->nested.pi_pending
+	    || !vmx->nested.pi_desc)
+		return;
+
+	vmx->nested.pi_pending = false;
+
+	if (pi_test_on(vmx->nested.pi_desc)) {
+		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
+			POSTED_INTR_NESTED_VECTOR);
+	}
+}
+
 /*
  * Set up the vmcs's constant host-state fields, i.e., host-state fields that
  * will not change in the lifetime of the guest.
@@ -6730,6 +6708,10 @@ static void wakeup_handler(void)
  */
 static void nested_posted_intr_handler(void)
 {
+	struct kvm_vcpu *vcpu = kvm_get_current_vcpu();
+
+	if (vcpu && is_guest_mode(vcpu))
+		to_vmx(vcpu)->nested.pi_pending = true;
 }
 
 void vmx_enable_tdp(void)
@@ -6830,6 +6812,7 @@ static __init int hardware_setup(void)
 	if (!cpu_has_vmx_apicv()) {
 		enable_apicv = 0;
 		kvm_x86_ops->sync_pir_to_irr = NULL;
+		kvm_x86_ops->complete_nested_posted_interrupt = NULL;
 	}
 
 	if (cpu_has_vmx_tsc_scaling()) {
@@ -11155,7 +11138,6 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 		return 0;
 	}
 
-	vmx_complete_nested_posted_interrupt(vcpu);
 	return 0;
 }
 
@@ -12157,6 +12139,8 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
 	.hwapic_isr_update = vmx_hwapic_isr_update,
 	.sync_pir_to_irr = vmx_sync_pir_to_irr,
 	.deliver_posted_interrupt = vmx_deliver_posted_interrupt,
+	.complete_nested_posted_interrupt =
+		vmx_complete_nested_posted_interrupt,
 
 	.set_tss_addr = vmx_set_tss_addr,
 	.get_tdp_level = get_ept_level,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fc08f2cb7aa2..fa088951afc9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6984,11 +6984,18 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	smp_mb__after_srcu_read_unlock();
 
 	/*
-	 * This handles the case where a posted interrupt was
-	 * notified with kvm_vcpu_kick.
+	 * In case guest got the posted-interrupt notification
+	 * vector while running in host, we need to make sure
+	 * it arrives to guest.
+	 * For L1 posted-interrupts, we manually sync PIR to IRR.
+	 * For L2 posted-interrupts, we send notification-vector
+	 * again by self IPI such that it will now be received in guest.
 	 */
-	if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active)
+	if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) {
 		kvm_x86_ops->sync_pir_to_irr(vcpu);
+		if (is_guest_mode(vcpu))
+			kvm_x86_ops->complete_nested_posted_interrupt(vcpu);
+	}
 
 	if (vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu)
 	    || need_resched() || signal_pending(current)) {
-- 
1.9.1


* [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt
  2017-12-24 16:12 [PATCH v3 00/11] KVM: nVMX: Fix multiple issues in nested posted-interrupts Liran Alon
                   ` (8 preceding siblings ...)
  2017-12-24 16:13 ` [PATCH v3 09/11] KVM: nVMX: Deliver missed nested-PI notification-vector via self-IPI while interrupts disabled Liran Alon
@ 2017-12-24 16:13 ` Liran Alon
  2017-12-27 11:31   ` Paolo Bonzini
  2017-12-24 16:13 ` [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending Liran Alon
  10 siblings, 1 reply; 57+ messages in thread
From: Liran Alon @ 2017-12-24 16:13 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Liam Merwick,
	Konrad Rzeszutek Wilk

If L1 doesn't intercept L2 HLT (doesn't set CPU_BASED_HLT_EXITING),
then when L2 executes HLT instruction, KVM will block vCPU from
further execution (just like what happens when L1 executes HLT).

Thus, when some CPU sends a nested posted-interrupt to a halted
L2 vCPU, vmx_deliver_nested_posted_interrupt() sees that
vcpu->mode != IN_GUEST_MODE and therefore doesn't send a physical IPI.
Because the dest vCPU is blocked by HLT, we should kick it instead.

Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing")

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/kvm/vmx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5f82cd9fc500..c2d012d9d16d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5090,7 +5090,8 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
 		 */
 		vmx->nested.pi_pending = true;
 		/* the PIR and ON have been set by L1. */
-		kvm_vcpu_trigger_posted_interrupt(vcpu, true);
+		if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
+			kvm_vcpu_kick(vcpu);
 		return 0;
 	}
 	return -1;
-- 
1.9.1


* [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2017-12-24 16:12 [PATCH v3 00/11] KVM: nVMX: Fix multiple issues in nested posted-interrupts Liran Alon
                   ` (9 preceding siblings ...)
  2017-12-24 16:13 ` [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt Liran Alon
@ 2017-12-24 16:13 ` Liran Alon
  2017-12-27 10:15   ` Paolo Bonzini
  10 siblings, 1 reply; 57+ messages in thread
From: Liran Alon @ 2017-12-24 16:13 UTC (permalink / raw)
  To: pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liran Alon, Liam Merwick

If L1 doesn't intercept L2 HLT (doesn't set CPU_BASED_HLT_EXITING),
then when L2 executes HLT instruction, KVM will block vCPU from
further execution (just like what happens when L1 executes HLT).

Consider the case where vmx_deliver_nested_posted_interrupt() sees
that vcpu->mode == IN_GUEST_MODE and therefore sends a physical IPI.
Assume that before it sends the physical IPI, the dest CPU executing
L2 executes HLT which causes the dest vCPU to be blocked.

Before the dest vCPU is blocked, a call to kvm_arch_vcpu_runnable()
is made, which calls kvm_vcpu_has_events(). In addition, after the
vCPU is blocked, every time it is interrupted (by IPI),
kvm_vcpu_check_block() is called, which also calls
kvm_arch_vcpu_runnable().
kvm_vcpu_has_events() should return true if there is a pending event
for the vCPU that should be processed and because of which the vCPU
should be unblocked.

To handle the above-mentioned case, kvm_vcpu_has_events()
should check if there is a pending nested posted-interrupt on the vCPU
that should be dispatched to the guest.

Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing")

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/svm.c              |  7 +++++++
 arch/x86/kvm/vmx.c              | 13 +++++++++++++
 arch/x86/kvm/x86.c              |  3 ++-
 4 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4b8caff83dc7..71c240bbb685 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -994,6 +994,7 @@ struct kvm_x86_ops {
 	void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu, hpa_t hpa);
 	void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
 	void (*complete_nested_posted_interrupt)(struct kvm_vcpu *vcpu);
+	bool (*cpu_has_nested_posted_interrupt)(struct kvm_vcpu *vcpu);
 	int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
 	int (*get_tdp_level)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 183f7aec29ca..a559c26d7bff 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4458,6 +4458,11 @@ static void svm_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
 {
 }
 
+static bool svm_cpu_has_nested_posted_interrupt(struct kvm_vcpu *vcpu)
+{
+	return false;
+}
+
 /* Note: Currently only used by Hyper-V. */
 static void svm_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
@@ -5593,6 +5598,8 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
 	.sync_pir_to_irr = svm_sync_pir_to_irr,
 	.complete_nested_posted_interrupt =
 		svm_complete_nested_posted_interrupt,
+	.cpu_has_nested_posted_interrupt =
+		svm_cpu_has_nested_posted_interrupt,
 	.apicv_post_state_restore = avic_post_state_restore,
 
 	.set_tss_addr = svm_set_tss_addr,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c2d012d9d16d..8b414dd5e07c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5141,6 +5141,17 @@ static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
 	}
 }
 
+static bool vmx_cpu_has_nested_posted_interrupt(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	return (vcpu->arch.apicv_active &&
+		is_guest_mode(vcpu) &&
+		vmx->nested.pi_pending &&
+		vmx->nested.pi_desc &&
+		pi_test_on(vmx->nested.pi_desc));
+}
+
 /*
  * Set up the vmcs's constant host-state fields, i.e., host-state fields that
  * will not change in the lifetime of the guest.
@@ -12142,6 +12153,8 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
 	.deliver_posted_interrupt = vmx_deliver_posted_interrupt,
 	.complete_nested_posted_interrupt =
 		vmx_complete_nested_posted_interrupt,
+	.cpu_has_nested_posted_interrupt =
+		vmx_cpu_has_nested_posted_interrupt,
 
 	.set_tss_addr = vmx_set_tss_addr,
 	.get_tdp_level = get_ept_level,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fa088951afc9..a840f2c9bd66 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8542,7 +8542,8 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 		return true;
 
 	if (kvm_arch_interrupt_allowed(vcpu) &&
-	    kvm_cpu_has_interrupt(vcpu))
+	    (kvm_cpu_has_interrupt(vcpu) ||
+	     kvm_x86_ops->cpu_has_nested_posted_interrupt(vcpu)))
 		return true;
 
 	if (kvm_hv_has_stimer_pending(vcpu))
-- 
1.9.1


* Re: [PATCH v3 01/11] KVM: x86: Optimization: Create SVM stubs for sync_pir_to_irr()
  2017-12-24 16:12 ` [PATCH v3 01/11] KVM: x86: Optimization: Create SVM stubs for sync_pir_to_irr() Liran Alon
@ 2017-12-27  9:56   ` Paolo Bonzini
  2017-12-27 10:01     ` Liran Alon
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2017-12-27  9:56 UTC (permalink / raw)
  To: Liran Alon, rkrcmar, kvm; +Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick

On 24/12/2017 17:12, Liran Alon wrote:
> -	if (kvm_x86_ops->sync_pir_to_irr && apic->vcpu->arch.apicv_active)
> +	if (apic->vcpu->arch.apicv_active)
>  		highest_irr = kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
>  	else
>  		highest_irr = apic_find_highest_irr(apic);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index eb714f1cdf7e..99c42deb742b 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -4449,6 +4449,11 @@ static void svm_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
>  {
>  }
>  
> +static int svm_sync_pir_to_irr(struct kvm_vcpu *vcpu)
> +{
> +	return -1;
> +}

Shouldn't this be

	return kvm_lapic_find_highest_irr(vcpu);

?

Paolo


* Re: [PATCH v3 01/11] KVM: x86: Optimization: Create SVM stubs for sync_pir_to_irr()
  2017-12-27  9:56   ` Paolo Bonzini
@ 2017-12-27 10:01     ` Liran Alon
  0 siblings, 0 replies; 57+ messages in thread
From: Liran Alon @ 2017-12-27 10:01 UTC (permalink / raw)
  To: Paolo Bonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick



On 27/12/17 11:56, Paolo Bonzini wrote:
> On 24/12/2017 17:12, Liran Alon wrote:
>> -	if (kvm_x86_ops->sync_pir_to_irr && apic->vcpu->arch.apicv_active)
>> +	if (apic->vcpu->arch.apicv_active)
>>   		highest_irr = kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
>>   	else
>>   		highest_irr = apic_find_highest_irr(apic);
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index eb714f1cdf7e..99c42deb742b 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -4449,6 +4449,11 @@ static void svm_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
>>   {
>>   }
>>
>> +static int svm_sync_pir_to_irr(struct kvm_vcpu *vcpu)
>> +{
>> +	return -1;
>> +}
>
> Shouldn't this be
>
> 	return kvm_lapic_find_highest_irr(vcpu);
>
> ?
>
> Paolo
>

Yes, you are correct. My bad.
This would break apic_has_interrupt_for_ppr() when apicv_active is set
on AMD SVM.

If there are more comments justifying a v4 series, I will fix it in that
series. Otherwise, could you just change this commit when applying?

Thanks,
-Liran


* Re: [PATCH v3 06/11] KVM: x86: Set current_vcpu per-cpu var before enabling interrupts at host
  2017-12-24 16:12 ` [PATCH v3 06/11] KVM: x86: Set current_vcpu per-cpu var before enabling interrupts at host Liran Alon
@ 2017-12-27 10:06   ` Paolo Bonzini
  2017-12-27 10:44     ` Liran Alon
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2017-12-27 10:06 UTC (permalink / raw)
  To: Liran Alon, rkrcmar, kvm; +Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick

On 24/12/2017 17:12, Liran Alon wrote:
>  
> +	kvm_before_handle_host_interrupts(vcpu);
>  	local_irq_enable();
> +	kvm_after_handle_host_interrupts(vcpu);
>  	preempt_enable();

This should be around the call to kvm_x86_ops->handle_external_intr, not
here.

Also, perhaps you could: 1) do the kvm_before_handle_* unconditionally
before the point where NMIs would be injected; 2) remove the
kvm_after_handle_* from vmx_complete_atomic_exit and svm_vcpu_run; 3)
only do kvm_after_handle_* after kvm_x86_ops->handle_external_intr, not
before it.

Paolo


* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2017-12-24 16:13 ` [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending Liran Alon
@ 2017-12-27 10:15   ` Paolo Bonzini
  2017-12-27 10:51     ` Liran Alon
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2017-12-27 10:15 UTC (permalink / raw)
  To: Liran Alon, rkrcmar, kvm; +Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick

On 24/12/2017 17:13, Liran Alon wrote:
> +static bool vmx_cpu_has_nested_posted_interrupt(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	return (vcpu->arch.apicv_active &&
> +		is_guest_mode(vcpu) &&
> +		vmx->nested.pi_pending &&
> +		vmx->nested.pi_desc &&
> +		pi_test_on(vmx->nested.pi_desc));
> +}
> +
>  /*
>   * Set up the vmcs's constant host-state fields, i.e., host-state fields that
>   * will not change in the lifetime of the guest.
> @@ -12142,6 +12153,8 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
>  	.deliver_posted_interrupt = vmx_deliver_posted_interrupt,
>  	.complete_nested_posted_interrupt =
>  		vmx_complete_nested_posted_interrupt,
> +	.cpu_has_nested_posted_interrupt =
> +		vmx_cpu_has_nested_posted_interrupt,
>  
>  	.set_tss_addr = vmx_set_tss_addr,
>  	.get_tdp_level = get_ept_level,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fa088951afc9..a840f2c9bd66 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8542,7 +8542,8 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
>  		return true;
>  
>  	if (kvm_arch_interrupt_allowed(vcpu) &&
> -	    kvm_cpu_has_interrupt(vcpu))
> +	    (kvm_cpu_has_interrupt(vcpu) ||
> +	     kvm_x86_ops->cpu_has_nested_posted_interrupt(vcpu)))
>  		return true;


kvm_cpu_has_interrupt ultimately calls apic_has_interrupt_for_ppr, which 
calls kvm_x86_ops->sync_pir_to_irr.

You already have

+		if (is_guest_mode(vcpu))
+			kvm_x86_ops->complete_nested_posted_interrupt(vcpu);

earlier in the series right after a call to kvm_x86_ops->sync_pir_to_irr.

So I wonder if:

1) kvm_x86_ops->complete_nested_posted_interrupt would do the job here as
well, removing the need for the new kvm_x86_ops member;

2) The call to kvm_x86_ops->complete_nested_posted_interrupt actually
applies to all callers of sync_pir_to_irr, which would remove the need for
that other kvm_x86_ops member.

I'm leaning towards applying patches 1-4, what do you think?

Paolo


* Re: [PATCH v3 06/11] KVM: x86: Set current_vcpu per-cpu var before enabling interrupts at host
  2017-12-27 10:06   ` Paolo Bonzini
@ 2017-12-27 10:44     ` Liran Alon
  0 siblings, 0 replies; 57+ messages in thread
From: Liran Alon @ 2017-12-27 10:44 UTC (permalink / raw)
  To: Paolo Bonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick



On 27/12/17 12:06, Paolo Bonzini wrote:
> On 24/12/2017 17:12, Liran Alon wrote:
>>
>> +	kvm_before_handle_host_interrupts(vcpu);
>>   	local_irq_enable();
>> +	kvm_after_handle_host_interrupts(vcpu);
>>   	preempt_enable();
>
> This should be around the call to kvm_x86_ops->handle_external_intr, not
> here.
>
> Also, perhaps you could: 1) do the kvm_before_handle_* unconditionally
> before the point where NMIs would be injected; 2) remove the
> kvm_after_handle_* from vmx_complete_atomic_exit and svm_vcpu_run; 3)
> only do kvm_after_handle_* after kvm_x86_ops->handle_external_intr, not
> before it.
>
> Paolo
>

The intention of this change was to let a later commit handle the case
where nested posted-interrupts processing is triggered by self-IPI but
the CPU immediately exited the guest after resume because of another
external interrupt (different from POSTED_INTR_NESTED_VECTOR).

In this case, the POSTED_INTR_NESTED_VECTOR interrupt will be triggered
on the host after vcpu_enter_guest() calls local_irq_enable(). Our
handler should then re-set vmx->nested.pi_pending to true so that on the
next vmentry, vmx_complete_nested_posted_interrupt() will re-trigger a self-IPI.

Therefore it is not sufficient to put calls to
kvm_{before,after}_handle_host_interrupts() only around the call to
kvm_x86_ops->handle_external_intr().

But you are correct that it makes sense to also put these calls around
that call-site as well.

So as I see it, we have 2 alternatives here:
1. Just add calls around kvm_x86_ops->handle_external_intr().
2. Go with your suggestion but with a more intuitive approach:
Change only vcpu_enter_guest() such that it:
a. Calls kvm_before_handle_host_interrupts() after setting
vcpu->mode = IN_GUEST_MODE.
b. Calls kvm_after_handle_host_interrupts() after the calls to
local_irq_enable() (one in case there is a pending KVM request, and one
after the guest was run and we exited the guest normally).

What do you think?

Regards,
-Liran


* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2017-12-27 10:15   ` Paolo Bonzini
@ 2017-12-27 10:51     ` Liran Alon
  2017-12-27 12:55       ` Paolo Bonzini
  0 siblings, 1 reply; 57+ messages in thread
From: Liran Alon @ 2017-12-27 10:51 UTC (permalink / raw)
  To: Paolo Bonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick



On 27/12/17 12:15, Paolo Bonzini wrote:
> On 24/12/2017 17:13, Liran Alon wrote:
>> +static bool vmx_cpu_has_nested_posted_interrupt(struct kvm_vcpu *vcpu)
>> +{
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +
>> +	return (vcpu->arch.apicv_active &&
>> +		is_guest_mode(vcpu) &&
>> +		vmx->nested.pi_pending &&
>> +		vmx->nested.pi_desc &&
>> +		pi_test_on(vmx->nested.pi_desc));
>> +}
>> +
>>   /*
>>    * Set up the vmcs's constant host-state fields, i.e., host-state fields that
>>    * will not change in the lifetime of the guest.
>> @@ -12142,6 +12153,8 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
>>   	.deliver_posted_interrupt = vmx_deliver_posted_interrupt,
>>   	.complete_nested_posted_interrupt =
>>   		vmx_complete_nested_posted_interrupt,
>> +	.cpu_has_nested_posted_interrupt =
>> +		vmx_cpu_has_nested_posted_interrupt,
>>
>>   	.set_tss_addr = vmx_set_tss_addr,
>>   	.get_tdp_level = get_ept_level,
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index fa088951afc9..a840f2c9bd66 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -8542,7 +8542,8 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
>>   		return true;
>>
>>   	if (kvm_arch_interrupt_allowed(vcpu) &&
>> -	    kvm_cpu_has_interrupt(vcpu))
>> +	    (kvm_cpu_has_interrupt(vcpu) ||
>> +	     kvm_x86_ops->cpu_has_nested_posted_interrupt(vcpu)))
>>   		return true;
>
>
> kvm_cpu_has_interrupt ultimately calls apic_has_interrupt_for_ppr, which
> calls kvm_x86_ops->sync_pir_to_irr.
>
> You already have
>
> +		if (is_guest_mode(vcpu))
> +			kvm_x86_ops->complete_nested_posted_interrupt(vcpu);
>
> earlier in the series right after a call to kvm_x86_ops->sync_pir_to_irr.
>
> So I wonder if:
>
> 1) kvm_x86_ops->complete_nested_posted_interrupt would do the job here as
> well, removing the need for the new kvm_x86_ops member;
>
> 2) The call to kvm_x86_ops->complete_nested_posted_interrupt actually
> applies to all callers of sync_pir_to_irr, which would remove the need for
> that other kvm_x86_ops member.

Maybe I misunderstand you, but I don't think this is true.
complete_nested_posted_interrupt() relies on being called at a very
specific call-site: right before vmentry, after interrupts are
disabled. It works by issuing a self-IPI to cause the CPU to actually
process the nested posted-interrupts.

In addition, I don't see how we can utilize the fact that 
kvm_cpu_has_interrupt() calls apic_has_interrupt_for_ppr() as 
vmx.nested.pi_desc->pir should never be synced into L0 vLAPIC IRR.
In contrast to vmx.pi_desc->pir which does after sync_pir_to_irr().
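
The distinction can be sketched in plain user-space C. Everything below 
is a mocked, illustrative stand-in (the struct layouts and the helper 
are assumptions for the sketch, not the kernel's actual definitions):

```c
#include <stdbool.h>
#include <stdint.h>

/* Mocked stand-ins for KVM's types; field names follow the discussion. */
struct pi_desc {
    uint32_t pir[8];   /* 256-bit posted-interrupt request bitmap */
    bool     on;       /* outstanding-notification (ON) bit */
};

struct vlapic {
    uint32_t irr[8];   /* 256-bit interrupt request register */
};

/* Sketch of the sync step done for vmx.pi_desc: set PIR bits are ORed
 * into the vLAPIC IRR and cleared.  The nested descriptor
 * (vmx.nested.pi_desc) never goes through this path; its PIR is meant
 * to be consumed by the CPU itself via posted-interrupt processing. */
static void sync_pir_to_irr(struct pi_desc *pi, struct vlapic *apic)
{
    if (!pi->on)
        return;
    pi->on = false;
    for (int i = 0; i < 8; i++) {
        apic->irr[i] |= pi->pir[i];
        pi->pir[i] = 0;
    }
}
```

A caller would only ever hand this the hardware descriptor, never the 
nested one, which is why kvm_cpu_has_interrupt() cannot observe nested 
posted-interrupts.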

>
> I'm leaning towards applying patches 1-4, what do you think?
>

I don't see any reason not to do so if it passes your review :)

Logically, these patches are separate from the patches we are still debating.

Regards,
-Liran

> Paolo
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt
  2017-12-24 16:13 ` [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt Liran Alon
@ 2017-12-27 11:31   ` Paolo Bonzini
  2017-12-27 12:01     ` Liran Alon
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2017-12-27 11:31 UTC (permalink / raw)
  To: Liran Alon, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick, Konrad Rzeszutek Wilk

On 24/12/2017 17:13, Liran Alon wrote:
> If L1 doesn't intercept L2 HLT (doesn't set CPU_BASED_HLT_EXITING),
> then when L2 executes HLT instruction, KVM will block vCPU from
> further execution (just like what happens when L1 executes HLT).
> 
> Thus, when some CPU sends nested-posted-interrupt to a halted
> L2 vCPU, vmx_deliver_nested_posted_interrupt() notes that
> vcpu->mode != IN_GUEST_MODE and therefore doesn't send a physical IPI.
> Because the dest vCPU is blocked by HLT, we should kick it.

In patch 9, you write "in addition, assume that dest CPU passes the
checks for pending kvm requests before sender sets KVM_REQ_EVENT".  But,
this is pretty much the same scenario that you are fixing here (except
without KVM_REQ_EVENT involvement).

So patch 10 should be placed *before patch 9* and it should do something
like this:

-		kvm_vcpu_trigger_posted_interrupt(vcpu, true);
...
		kvm_make_request(KVM_REQ_EVENT, vcpu);
+		if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
+			kvm_vcpu_kick(vcpu);

with patch 9's commit message adjusted.

Paolo


* Re: [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt
  2017-12-27 11:31   ` Paolo Bonzini
@ 2017-12-27 12:01     ` Liran Alon
  2017-12-27 12:27       ` Paolo Bonzini
  0 siblings, 1 reply; 57+ messages in thread
From: Liran Alon @ 2017-12-27 12:01 UTC (permalink / raw)
  To: Paolo Bonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick, Konrad Rzeszutek Wilk



On 27/12/17 13:31, Paolo Bonzini wrote:
> On 24/12/2017 17:13, Liran Alon wrote:
>> If L1 doesn't intercept L2 HLT (doesn't set CPU_BASED_HLT_EXITING),
>> then when L2 executes HLT instruction, KVM will block vCPU from
>> further execution (just like what happens when L1 executes HLT).
>>
>> Thus, when some CPU sends nested-posted-interrupt to a halted
>> L2 vCPU, vmx_deliver_nested_posted_interrupt() notes that
>> vcpu->mode != IN_GUEST_MODE and therefore doesn't send a physical IPI.
>> Because the dest vCPU is blocked by HLT, we should kick it.
>
> In patch 9, you write "in addition, assume that dest CPU passes the
> checks for pending kvm requests before sender sets KVM_REQ_EVENT".  But,
> this is pretty much the same scenario that you are fixing here (except
> without KVM_REQ_EVENT involvement).
>
> So patch 10 should be placed *before patch 9* and it should do something
> like this:
>
> -		kvm_vcpu_trigger_posted_interrupt(vcpu, true);
> ...
> 		kvm_make_request(KVM_REQ_EVENT, vcpu);
> +		if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
> +			kvm_vcpu_kick(vcpu);
>
> with patch 9's commit message adjusted.
>
> Paolo
>

I think the current ordering of the patch series makes more sense than reordering it.

This is because the patch you are suggesting will actually, as a 
side-effect, fix the bug that is fixed in commit "KVM: nVMX: Deliver 
missed nested-PI notification-vector via self-IPI while interrupts 
disabled". This is actually the same as the first patch I suggested for 
fixing the bug in v1 of this series:
http://lkml.kernel.org/r/1510252040-5609-1-git-send-email-liran.alon@oracle.com

After I submitted v1, Radim mentioned that even though the fix is 
correct, it is awkward and error-prone. I agreed, and therefore we wrote 
another fix using self-IPI.

Therefore, to preserve each commit fixing only one problem, I prefer to 
keep the order of commits as it is now.

Regards,
-Liran


* Re: [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt
  2017-12-27 12:01     ` Liran Alon
@ 2017-12-27 12:27       ` Paolo Bonzini
  2017-12-27 12:52         ` Liran Alon
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2017-12-27 12:27 UTC (permalink / raw)
  To: Liran Alon, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick, Konrad Rzeszutek Wilk

On 27/12/2017 13:01, Liran Alon wrote:
> I think that current patch series makes more sense than reordering them.
> 
> This is because the patch you are suggesting will actually, as a
> side-effect, fix the bug that is fixed in commit "KVM: nVMX: Deliver
> missed nested-PI notification-vector via self-IPI while interrupts
> disabled". This is actually the same as the first patch I suggested for
> fixing the bug in v1 of this series:
> http://lkml.kernel.org/r/1510252040-5609-1-git-send-email-liran.alon@oracle.com
> 
> After I submitted v1, Radim mentioned that even though the fix is
> correct, it is awkward and error-prone. I agreed and therefore we have
> written another fix using self-IPI.

It's not exactly the same.  My suggested version checks the return value
of kvm_vcpu_trigger_posted_interrupt and calls kvm_vcpu_kick.

> Therefore, to preserve each commit fixing only one problem, I prefer to
> keep the orders of commits as they are now.

It's true that it would hide the bug but it makes no sense to say you
are fixing a bug in patch 9, and then fix that bug in the next patch.

So I think this patch should definitely go first (in fact I'm tempted to
install it together with patches 1-4...) and the rest should be done as
a broader cleanup of vmx->nested.pi_pending.  I agree with Radim that as
of now it's awkward and needs some work.  I was about to reply about
that so I'll leave it for another email.

Paolo


* Re: [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt
  2017-12-27 12:27       ` Paolo Bonzini
@ 2017-12-27 12:52         ` Liran Alon
  2017-12-27 13:05           ` Paolo Bonzini
  0 siblings, 1 reply; 57+ messages in thread
From: Liran Alon @ 2017-12-27 12:52 UTC (permalink / raw)
  To: Paolo Bonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick, Konrad Rzeszutek Wilk



On 27/12/17 14:27, Paolo Bonzini wrote:
> On 27/12/2017 13:01, Liran Alon wrote:
>> I think that current patch series makes more sense than reordering them.
>>
>> This is because the patch you are suggesting will actually, as a
>> side-effect, fix the bug that is fixed in commit "KVM: nVMX: Deliver
>> missed nested-PI notification-vector via self-IPI while interrupts
>> disabled". This is actually the same as the first patch I suggested for
>> fixing the bug in v1 of this series:
>> http://lkml.kernel.org/r/1510252040-5609-1-git-send-email-liran.alon@oracle.com
>>
>> After I submitted v1, Radim mentioned that even though the fix is
>> correct, it is awkward and error-prone. I agreed and therefore we have
>> written another fix using self-IPI.
>
> It's not exactly the same.  My suggested version checks the return value
> of kvm_vcpu_trigger_posted_interrupt and calls kvm_vcpu_kick.
>
>> Therefore, to preserve each commit fixing only one problem, I prefer to
>> keep the orders of commits as they are now.
>
> It's true that it would hide the bug but it makes no sense to say you
> are fixing a bug in patch 9, and then fix that bug in the next patch.

There are 2 separate bugs here. One is fixed in patch 9 and the other in 
patches 10 & 11.

The race-condition described in the commit message of patch 9 is fixed 
in patch 9.
It is fixed because we get rid of the dependency on KVM_REQ_EVENT.
If the target vCPU thread passes the check for pending kvm requests, it 
means it is already running with interrupts disabled, and therefore the 
physical IPI of POSTED_INTR_NESTED_VECTOR will be received in the guest, 
which will process the nested posted-interrupts correctly. If the guest 
exits because of another external interrupt before the physical IPI is 
received, then on the next VMEntry the code will note that there are 
pending nested posted-interrupts and re-trigger the physical IPI of 
POSTED_INTR_NESTED_VECTOR. Therefore, nested posted-interrupts will 
eventually be processed in the guest.
This is in contrast to what happens before patch 9, where the L2 guest 
will continue running until the next time KVM_REQ_EVENT is consumed.
Therefore, the bug is indeed fixed in patch 9.

Patch 10 fixes another bug:
Consider the case where the target vCPU thread is actually blocked 
because it executed HLT and the HLT was not intercepted by L1. When some 
other L1 vCPU posts a posted-interrupt to it, it should wake the target 
vCPU (as it is halted in L2) and process the nested posted-interrupt.
Patches 10 & 11 make sure to fix this issue.
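
The fallback discussed earlier in the thread (kick the vCPU when the 
physical notification IPI cannot be delivered) can be illustrated with a 
mocked user-space sketch; all types and helpers below are illustrative 
stand-ins, not KVM's real code:

```c
#include <stdbool.h>

enum vcpu_mode { OUTSIDE_GUEST_MODE, IN_GUEST_MODE };

/* Mocked vCPU: the bool fields stand in for observable side effects. */
struct vcpu {
    enum vcpu_mode mode;
    bool pi_pending;  /* L1 sent a posted_intr_nv notification */
    bool kicked;      /* stand-in for waking a blocked (HLT-in-L2) vCPU */
    bool ipi_sent;    /* stand-in for the physical notification IPI */
};

/* Succeeds only when the target is running in guest mode, mirroring
 * the vcpu->mode check in vmx_deliver_nested_posted_interrupt(). */
static bool trigger_posted_interrupt(struct vcpu *v)
{
    if (v->mode == IN_GUEST_MODE) {
        v->ipi_sent = true;
        return true;
    }
    return false;
}

static void kvm_vcpu_kick(struct vcpu *v) { v->kicked = true; }

/* Sketch of the delivery path: record the pending notification, then
 * either deliver the physical IPI or fall back to kicking the vCPU. */
static void deliver_nested_posted_interrupt(struct vcpu *v)
{
    v->pi_pending = true;
    if (!trigger_posted_interrupt(v))
        kvm_vcpu_kick(v);
}
```

The kick is what wakes a vCPU halted in L2, so the interrupt is not lost 
when the target is outside guest mode.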

>
> So I think this patch should definitely go first (in fact I'm tempted to
> install it together with patches 1-4...) and the rest should be done as
> a broader cleanup of vmx->nested.pi_pending.  I agree with Radim that as
> of now it's awkward and needs some work.  I was about to reply about
> that so I'll leave it for another email.

Note that this series is not a clean-up of vmx->nested.pi_pending. That 
variable remains untouched. It is used as a signal that a nested 
posted-interrupt notification-vector has indeed been sent to the vCPU. 
This is required so that nested posted-interrupts are not processed when 
the vmx.nested.pi_desc->control ON bit is set but no nested 
posted-interrupt notification-vector was actually sent by L1.
This is in contrast to how KVM processes ordinary posted-interrupts, 
where it treats only the vmx.pi_desc->control ON bit as the indicator of 
whether it should process posted-interrupts.

Instead, this series just fixes a bunch of separate nVMX issues which 
all revolve around posted-interrupts.

Regards,
-Liran

>
> Paolo
>


* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2017-12-27 10:51     ` Liran Alon
@ 2017-12-27 12:55       ` Paolo Bonzini
  2017-12-27 15:15         ` Liran Alon
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2017-12-27 12:55 UTC (permalink / raw)
  To: Liran Alon, rkrcmar, kvm; +Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick

On 27/12/2017 11:51, Liran Alon wrote:
> Maybe I misunderstand you, but I don't think this is true.
> complete_nested_posted_interrupt() relies on being called at a very
> specific call-site: right before VMEntry and after interrupts are
> disabled. It works by issuing a self-IPI to cause the CPU to actually
> process the nested posted-interrupts.
>
> In addition, I don't see how we can utilize the fact that
> kvm_cpu_has_interrupt() calls apic_has_interrupt_for_ppr(), as
> vmx.nested.pi_desc->pir should never be synced into the L0 vLAPIC IRR.
> This is in contrast to vmx.pi_desc->pir, which is synced by
> sync_pir_to_irr().

You're right, the callbacks must stay.  What confused me is the
reference to pi_test_on in vmx_cpu_has_nested_posted_interrupt.  While
your patches are an improvement, there is still some confusion on what
leads to injecting a nested posted interrupt, and on the split between
"deliver" and "complete":

- checking pi_test_on should only be done when consuming the L2 PI
descriptor (which in your case is done by the processor via the
self-IPI).  So you don't need to test it before sending the self-IPI.

- vmx->nested.pi_pending should not be set in
vmx_deliver_nested_posted_interrupt after patch 9.  pi_pending should
only be set by the nested PI handler, when it is invoked outside guest
mode.  It is then consumed before doing the self-IPI as in the above sketch.

- likewise, vmx_cpu_has_nested_posted_interrupt should only check
vmx->nested.pi_pending.  I don't think it should even check
is_guest_mode(vcpu) or vmx->nested.pi_desc; compare with the handling of
KVM_REQ_NMI, it doesn't test nmi_allowed.  This leads me to the next point.

- vmx_complete_nested_posted_interrupt *must* find a valid descriptor.
Even after a vmexit, because vmx_complete_nested_posted_interrupt is the
very first thing that runs and L1 has had no time to run a vmwrite.  So
the above could become:

	if (is_guest_mode(vcpu) && vmx->nested.pi_pending) {
		vmx->nested.pi_pending = false;
		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
				    POSTED_INTR_NESTED_VECTOR);
	}

and then what about that is_guest_mode check?  When L1 sent the PI
notification, it thought that the destination VCPU was in guest mode.
Upon vmexit, L1 can expect to either see ON=0 or get the notification
interrupt itself.  So either we deliver the interrupt to L1, or we must
do L2 virtual interrupt delivery before L1 even runs.  Hence IMO the
is_guest_mode check is wrong, which unfortunately throws quite a wrench
into the self-IPI idea.

I agree with moving vmx_complete_nested_posted_interrupt out of
check_nested_events, and also with the simplification of
vmx_hwapic_irr_update.  However, in my opinion the self-IPI is actually
complicating things.  Once the handling of vmx->nested.pi_pending is
cleaned up (the nested PI interrupt handler in patch 9 is a great
improvement in that respect), we can move
vmx_complete_nested_posted_interrupt to a new KVM_REQ_NESTED_PI request
that replaces vmx->nested.pi_pending.  The GUEST_INTR_STATUS write can
be done in the vmcs12, and will then be copied by prepare_vmcs02 to the
vmcs02.

I may be wrong, but based on my experience cleaning up the bare-metal
APICv, I suspect that both the current code and this version are overly
complicated, and there is a nice and simple core waiting to be
extracted.  Let's fix the race condition in the simple way, and
separately focus on cleaning up pi_pending and everything that follows
from there.

Paolo


* Re: [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt
  2017-12-27 12:52         ` Liran Alon
@ 2017-12-27 13:05           ` Paolo Bonzini
  2017-12-27 15:33             ` Liran Alon
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2017-12-27 13:05 UTC (permalink / raw)
  To: Liran Alon, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick, Konrad Rzeszutek Wilk

On 27/12/2017 13:52, Liran Alon wrote:
> 
> The race-condition described in commit message of patch 9 is fixed
> in patch 9. It is fixed because we get rid of the dependency on
> KVM_REQ_EVENT.

Or is it fixed because, while getting rid of the dependency on
KVM_REQ_EVENT, you had to move kvm_vcpu_trigger_posted_interrupt to the
correct place:


-		/* the PIR and ON have been set by L1. */
-		kvm_vcpu_trigger_posted_interrupt(vcpu, true);
 		/*
 		 * If a posted intr is not recognized by hardware,
 		 * we will accomplish it in the next vmentry.
 		 */
 		vmx->nested.pi_pending = true;
-		kvm_make_request(KVM_REQ_EVENT, vcpu);
+		/* the PIR and ON have been set by L1. */
+		kvm_vcpu_trigger_posted_interrupt(vcpu, true);
 		return 0;

?

Getting rid of KVM_REQ_EVENT is not a requirement to fix the
vcpu->requests race.  There is a canonical way to avoid that, which is
to use kvm_vcpu_kick.

To me the main value of patch 9 is introducing the nested PI handler.
The fact that patch 9 fixes that race condition is almost a side effect
(and in fact it's not entirely fixed until patch 10, in my opinion).

Paolo

> If the target vCPU thread passes the check for pending
> kvm requests, it means it is already running with interrupts disabled
> and therefore the physical IPI of POSTED_INTR_NESTED_VECTOR will be
> received in guest which will process nested posted-interrupts
> correctly. If guest will exit because of another external-interrupt
> before the physical IPI will be received, on next VMEntry, code will
> note there are pending nested posted-interrupts and re-trigger
> physical IPI of POSTED_INTR_NESTED_VECTOR. Therefore, nested
> posted-interrupts will eventually be processed in guest. This is in
> contrast to what happens before patch 9 where L2 guest will continue
> running until next time KVM_REQ_EVENT will be consumed. Therefore,
> bug is indeed fixed in patch 9.


* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2017-12-27 12:55       ` Paolo Bonzini
@ 2017-12-27 15:15         ` Liran Alon
  2017-12-27 15:55           ` Paolo Bonzini
  0 siblings, 1 reply; 57+ messages in thread
From: Liran Alon @ 2017-12-27 15:15 UTC (permalink / raw)
  To: Paolo Bonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick



On 27/12/17 14:55, Paolo Bonzini wrote:
> On 27/12/2017 11:51, Liran Alon wrote:
>> Maybe I misunderstand you, but I don't think this is true.
>> complete_nested_posted_interrupt() relies on being called at a very
>> specific call-site: right before VMEntry and after interrupts are
>> disabled. It works by issuing a self-IPI to cause the CPU to actually
>> process the nested posted-interrupts.
>>
>> In addition, I don't see how we can utilize the fact that
>> kvm_cpu_has_interrupt() calls apic_has_interrupt_for_ppr(), as
>> vmx.nested.pi_desc->pir should never be synced into the L0 vLAPIC IRR.
>> This is in contrast to vmx.pi_desc->pir, which is synced by
>> sync_pir_to_irr().
>
> You're right, the callbacks must stay.  What confused me is the
> reference to pi_test_on in vmx_cpu_has_nested_posted_interrupt.  While
> your patches are an improvement, there is still some confusion on what
> leads to injecting a nested posted interrupt, and on the split between
> "deliver" and "complete":
>
> - checking pi_test_on should only be done when consuming the L2 PI
> descriptor (which in your case is done by the processor via the
> self-IPI).  So you don't need to test it before sending the self-IPI.

I agree we could skip that check. It is done as an optimization to 
avoid sending a self-IPI when it would do nothing because the 
vmx.nested->pi_desc.control ON bit is clear.

This optimization is even more important when this code runs in L1 
(think about triple-virtualization ^_^), to avoid a #VMExit on the 
write to the LAPIC ICR.
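
The optimization amounts to guarding the self-IPI with the ON bit. A 
mocked user-space sketch of the idea (names and layout are assumptions 
for illustration, not the actual kernel function):

```c
#include <stdbool.h>
#include <stdint.h>

/* Mocked descriptor; field names follow the discussion. */
struct pi_desc {
    uint32_t pir[8];
    bool     on;    /* outstanding-notification (ON) bit */
};

static bool pi_test_on(const struct pi_desc *pi) { return pi->on; }

static int self_ipi_count;  /* counts IPIs instead of raising them */
static void send_self_ipi(int vector) { (void)vector; self_ipi_count++; }

/* Sketch: only raise the self-IPI when ON says the descriptor has
 * something outstanding; otherwise the IPI would be a no-op, and when
 * running nested it would also cost a #VMExit on the ICR write. */
static void complete_nested_posted_interrupt(struct pi_desc *pi, int vector)
{
    if (pi_test_on(pi))
        send_self_ipi(vector);
}
```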

>
> - vmx->nested.pi_pending should not be set in
> vmx_deliver_nested_posted_interrupt after patch 9.  pi_pending should
> only be set by the nested PI handler, when it is invoked outside guest
> mode.  It is then consumed before doing the self-IPI as in the above sketch.

I disagree.
vmx->nested.pi_pending is used as a signal that L1 has indeed sent a 
vmcs12->posted_intr_nv IPI. If vmx_deliver_nested_posted_interrupt() 
sees that the target vCPU thread is OUTSIDE_GUEST_MODE, then it won't 
send a physical IPI but will instead only kick the vCPU. In this case, 
the only way the target vCPU thread knows it should trigger the 
self-IPI is the fact that pi_pending was set by 
vmx_deliver_nested_posted_interrupt().
(As in this case, the nested PI handler in the host was never invoked.)

(It cannot rely only on the vmx.nested->pi_desc.control ON bit, as that 
may be set by L1 without L1 triggering a vmcs12->posted_intr_nv IPI.)

>
> - likewise, vmx_cpu_has_nested_posted_interrupt should only check
> vmx->nested.pi_pending.  I don't think it should even check
> is_guest_mode(vcpu) or vmx->nested.pi_desc; compare with the handling of
> KVM_REQ_NMI, it doesn't test nmi_allowed.  This leads me to the next point.

There is an interesting question about what should happen when
(is_guest_mode(vcpu) && vmx->nested.pi_pending && 
!pi_test_on(vmx->nested.pi_desc)). I would expect the L2 guest not to 
be woken from its HLT, as virtual-interrupt-delivery should not see any 
pending interrupt to inject. But I would also expect that to happen 
when the PIR is empty (even if the ON bit is set), which is not checked 
in this condition...

So I think I agree, but we may have a small semantic issue here of 
waking the L2 vCPU unnecessarily. This should be a very rare case, 
though, so maybe handle it later (write a TODO comment).

>
> - vmx_complete_nested_posted_interrupt *must* find a valid descriptor.
> Even after a vmexit, because vmx_complete_nested_posted_interrupt is the
> very first thing that runs and L1 has had no time to run a vmwrite.  So

I agree. It is better to WARN in case the descriptor is not valid.

> the above could become:
>
> 	if (is_guest_mode(vcpu) && vmx->nested.pi_pending) {
> 		vmx->nested.pi_pending = false;
> 		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
> 				    POSTED_INTR_NESTED_VECTOR);
> 	}
>
> and then what about that is_guest_mode check?  When L1 sent the PI
> notification, it thought that the destination VCPU was in guest mode.
> Upon vmexit, L1 can expect to either see ON=0 or get the notification
> interrupt itself.  So either we deliver the interrupt to L1, or we must
> do L2 virtual interrupt delivery before L1 even runs.  Hence IMO the
> is_guest_mode is wrong, which unfortunately throws a wrench somewhat in
> the self-IPI idea quite a bit.
>
> I agree with moving vmx_complete_nested_posted_interrupt out of
> check_nested_events, and also with the simplification of
> vmx_hwapic_irr_update.  However, in my opinion the self-IPI is actually
> complicating things.  Once the handling of vmx->nested.pi_pending is
> cleaned up (the nested PI interrupt handler in patch 9 is a great
> improvement in that respect), we can move
> vmx_complete_nested_posted_interrupt to a new KVM_REQ_NESTED_PI request
> that replaces vmx->nested.pi_pending.  The GUEST_INTR_STATUS write can
> be done in the vmcs12, and will then be copied by prepare_vmcs02 to the
> vmcs02.
>
> I may be wrong, but based on my experience cleaning up the bare-metal
> APICv, I suspect that both the current code and this version are overly
> complicated, and there is a nice and simple core waiting to be
> extracted.  Let's fix the race condition in the simple way, and
> separately focus on cleaning up pi_pending and everything that follows
> from there.

Interesting.

I think I now follow what you mean regarding cleaning up the logic 
around pi_pending. This is how I understand it:

1. "vmx->nested.pi_pending" is a flag used to indicate "L1 has sent a 
vmcs12->posted_intr_nv IPI". That's it.

2. Currently the code is a bit weird in the sense that, instead of 
signaling the pending IPI in the virtual LAPIC IRR, we set it in a 
special variable.
If we had set it in the virtual LAPIC IRR, we could in theory behave 
very similarly to a standard CPU. At the interrupt injection point, we 
could:
(a) If the vCPU is in root-mode: just inject the pending interrupt 
normally.
(b) If the vCPU is in non-root-mode and the posted-interrupts feature 
is active, then instead of injecting the pending interrupt, we should 
simulate processing of posted-interrupts.

3. The processing of the nested posted-interrupts itself can still be 
done via the self-IPI mechanism.

4. Because of not doing (2), there is currently still an issue that L1 
doesn't receive a vmcs12->posted_intr_nv interrupt when the target vCPU 
thread has exited from L2 to L1 while pi_pending=true.

Do we agree on the above? Or am I still misunderstanding something?
If we agree, I will attempt to clean up the code to handle it as 
described above.
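
The injection-point split in point (2) can be illustrated with a mocked 
sketch; every name here is hypothetical and merely mirrors the proposed 
root-mode/non-root-mode decision, not actual KVM code:

```c
#include <stdbool.h>

enum { ROOT_MODE, NON_ROOT_MODE };

/* Mocked vCPU; the bool fields stand in for observable side effects. */
struct vcpu {
    int  mode;
    bool posted_intr_active;  /* vmcs12 enables posted interrupts */
    bool injected;            /* plain event injection happened */
    bool pi_processed;        /* posted-interrupt processing simulated */
};

/* Sketch of the proposed injection point: with the notification vector
 * recorded in the virtual LAPIC IRR, a pending vmcs12->posted_intr_nv
 * either injects normally (root mode, so L1 itself receives it) or is
 * turned into posted-interrupt processing (non-root mode, e.g. via the
 * self-IPI trick from point (3)). */
static void inject_posted_intr_nv(struct vcpu *v)
{
    if (v->mode == NON_ROOT_MODE && v->posted_intr_active)
        v->pi_processed = true;
    else
        v->injected = true;
}
```

Note how the root-mode branch also covers point (4): after an exit to 
L1, the pending vector is delivered to L1 instead of being lost.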

Regards,
-Liran

>
> Paolo
>


* Re: [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt
  2017-12-27 13:05           ` Paolo Bonzini
@ 2017-12-27 15:33             ` Liran Alon
  2017-12-27 15:54               ` Paolo Bonzini
  0 siblings, 1 reply; 57+ messages in thread
From: Liran Alon @ 2017-12-27 15:33 UTC (permalink / raw)
  To: Paolo Bonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick, Konrad Rzeszutek Wilk



On 27/12/17 15:05, Paolo Bonzini wrote:
> On 27/12/2017 13:52, Liran Alon wrote:
>>
>> The race-condition described in commit message of patch 9 is fixed
>> in patch 9. It is fixed because we get rid of the dependency on
>> KVM_REQ_EVENT.
>
> Or is it fixed because, while getting rid of the dependency on
> KVM_REQ_EVENT, you had to move kvm_vcpu_trigger_posted_interrupt to the
> correct place:
>
>
> -		/* the PIR and ON have been set by L1. */
> -		kvm_vcpu_trigger_posted_interrupt(vcpu, true);
>   		/*
>   		 * If a posted intr is not recognized by hardware,
>   		 * we will accomplish it in the next vmentry.
>   		 */
>   		vmx->nested.pi_pending = true;
> -		kvm_make_request(KVM_REQ_EVENT, vcpu);
> +		/* the PIR and ON have been set by L1. */
> +		kvm_vcpu_trigger_posted_interrupt(vcpu, true);
>   		return 0;
>
> ?
>
> Getting rid of KVM_REQ_EVENT is not a requirement to fix the
> vcpu->requests race.  There is canonical way to avoid that, which is to
> use kvm_vcpu_kick.
>
> To me the main value of patch 9 is introducing the nested PI handler.
> The fact that patch 9 fixes that race condition is almost a side effect
> (and in fact it's not entirely fixed until patch 10, in my opinion).
>

That's a good point. Haven't thought about it like that.
I now tend to agree with you.

I will re-think how to change patches 5-11 such that we will:

1. Get rid of pi_pending and instead use the virtual LAPIC IRR, and 
process the vmcs12->posted_intr_nv specially in case the vCPU is in 
non-root-mode (and posted-interrupts is active). Similar to what a real 
CPU does.

2. Re-order the patches to be similar to the following:
(a) A simple bug-fix for the race-condition issue: just changing the 
order like I did in v1 and adding a kvm_vcpu_kick() like in patch 10 of 
this series.
(b) Get rid of pi_pending and instead use a virtual LAPIC IRR bit, and 
process it specially in case the vCPU is in non-root-mode & 
posted-interrupts is active.
(c) Get rid of the software simulation of nested posted-interrupt 
processing and instead use the self-IPI trick to make the CPU process 
it for us.

What do you think?

Regards,
-Liran

> Paolo
>
>> If the target vCPU thread passes the check for pending
>> kvm requests, it means it is already running with interrupts disabled
>> and therefore the physical IPI of POSTED_INTR_NESTED_VECTOR will be
>> received in guest which will process nested posted-interrupts
>> correctly. If guest will exit because of another external-interrupt
>> before the physical IPI will be received, on next VMEntry, code will
>> note there are pending nested posted-interrupts and re-trigger
>> physical IPI of POSTED_INTR_NESTED_VECTOR. Therefore, nested
>> posted-interrupts will eventually be processed in guest. This is in
>> contrast to what happens before patch 9 where L2 guest will continue
>> running until next time KVM_REQ_EVENT will be consumed. Therefore,
>> bug is indeed fixed in patch 9.
>


* Re: [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt
  2017-12-27 15:33             ` Liran Alon
@ 2017-12-27 15:54               ` Paolo Bonzini
  2018-01-01 21:32                 ` Paolo Bonzini
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2017-12-27 15:54 UTC (permalink / raw)
  To: Liran Alon, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick, Konrad Rzeszutek Wilk

On 27/12/2017 16:33, Liran Alon wrote:
> 
> 
> 2. Re-order patches to be similar to the following:
> (a) Simple bug-fix for the race-condition issue: Just changing order
> like I did in v1 and adding a kvm_vcpu_kick() like in patch 10 of this
> series.
> (b) Get rid of pi_pending and instead use virtual LAPIC IRR bit and
> process it specially in case vCPU in non-root-mode & posted-interrupts
> is active.
> (c) Get rid of software simulation of nested posted-interrupts
> processing and instead use self-IPI trick to make CPU process it for us.
> 
> What do you think?

Sounds good, in the meanwhile I can do 2(a) myself.

Paolo


* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2017-12-27 15:15         ` Liran Alon
@ 2017-12-27 15:55           ` Paolo Bonzini
  2020-11-23 19:22             ` Oliver Upton
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2017-12-27 15:55 UTC (permalink / raw)
  To: Liran Alon, rkrcmar, kvm; +Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick

On 27/12/2017 16:15, Liran Alon wrote:
> I think I now follow what you mean regarding cleaning logic around
> pi_pending. This is how I understand it:
> 
> 1. "vmx->nested.pi_pending" is a flag used to indicate "L1 has sent a
> vmcs12->posted_intr_nv IPI". That's it.
> 
> 2. Currently code is a bit weird in the sense that instead of signal the
> pending IPI in virtual LAPIC IRR, we set it in a special variable.
> If we would have set it in virtual LAPIC IRR, we could in theory behave
> very similar to a standard CPU. At interrupt injection point, we could:
> (a) If vCPU is in root-mode: Just inject the pending interrupt normally.
> (b) If vCPU is in non-root-mode and posted-interrupts feature is active,
> then instead of injecting the pending interrupt, we should simulate
> processing of posted-interrupts.
> 
> 3. The processing of the nested posted-interrupts itself can still be
> done in self-IPI mechanism.
> 
> 4. Because not doing (2), there is still currently an issue that L1
> doesn't receive a vmcs12->posted_intr_nv interrupt when target vCPU
> thread has exited from L2 to L1 and pi_pending=true.
> 
> Do we agree on the above? Or am I still misunderstanding something?

Yes, I think we agree.

Paolo


* Re: [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt
  2017-12-27 15:54               ` Paolo Bonzini
@ 2018-01-01 21:32                 ` Paolo Bonzini
  2018-01-01 22:37                   ` Liran Alon
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2018-01-01 21:32 UTC (permalink / raw)
  To: Liran Alon, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick, Konrad Rzeszutek Wilk

On 27/12/2017 16:54, Paolo Bonzini wrote:
> (b) Get rid of pi_pending and instead use virtual LAPIC IRR bit and
> process it specially in case vCPU in non-root-mode & posted-interrupts
> is active.
> (c) Get rid of software simulation of nested posted-interrupts
> processing and instead use self-IPI trick to make CPU process it for us.
> 
> What do you think?

I guess 2(b) would be done in vmx_hwapic_irr_update?

Thanks,

Paolo


* Re: [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt
  2018-01-01 21:32                 ` Paolo Bonzini
@ 2018-01-01 22:37                   ` Liran Alon
  2018-01-02  7:25                     ` Paolo Bonzini
  0 siblings, 1 reply; 57+ messages in thread
From: Liran Alon @ 2018-01-01 22:37 UTC (permalink / raw)
  To: Paolo Bonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick, Konrad Rzeszutek Wilk



On 01/01/18 23:32, Paolo Bonzini wrote:
> On 27/12/2017 16:54, Paolo Bonzini wrote:
>> (b) Get rid of pi_pending and instead use virtual LAPIC IRR bit and
>> process it specially in case vCPU in non-root-mode & posted-interrupts
>> is active.
>> (c) Get rid of software simulation of nested posted-interrupts
>> processing and instead use self-IPI trick to make CPU process it for us.
>>
>> What do you think?
>
> I guess 2(b) would be done in vmx_hwapic_irr_update?

I'm not sure yet. I am currently busy with other tasks and therefore 
didn't have enough time to work on the new version of this series since 
receiving your comments.

I need to think about what would be the most elegant and 
easy-to-understand code for this mechanism. There are many delicate 
cases here that need to be thought through.

Regarding vmx_hwapic_irr_update(), I think it is a bit misleading to 
put the logic handling this case in there.
vmx_hwapic_irr_update() should conceptually only update the RVI, similar 
to how vmx_hwapic_isr_update() updates only the SVI.
I also think that the separation between vmx_hwapic_irr_update() & 
vmx_set_rvi() is a bit weird. I would have expected 
vmx_hwapic_irr_update() and vmx_hwapic_isr_update() to be symmetric.

Regards,
-Liran

>
> Thanks,
>
> Paolo
>


* Re: [PATCH v3 02/11] KVM: x86: Change __kvm_apic_update_irr() to also return if max IRR updated
  2017-12-24 16:12 ` [PATCH v3 02/11] KVM: x86: Change __kvm_apic_update_irr() to also return if max IRR updated Liran Alon
@ 2018-01-02  1:51   ` Quan Xu
  0 siblings, 0 replies; 57+ messages in thread
From: Quan Xu @ 2018-01-02  1:51 UTC (permalink / raw)
  To: Liran Alon, pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick



On 2017/12/25 00:12, Liran Alon wrote:
> This commit doesn't change semantics.
> It is done as a preparation for future commits.
>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
> Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
> ---
>   arch/x86/kvm/lapic.c | 23 ++++++++++++++++-------
>   arch/x86/kvm/lapic.h |  4 ++--
>   arch/x86/kvm/vmx.c   |  5 +++--
>   3 files changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 0928608750e3..924ac8ce9d50 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -364,32 +364,41 @@ static u8 count_vectors(void *bitmap)
>   	return count;
>   }
>   
> -int __kvm_apic_update_irr(u32 *pir, void *regs)
> +bool __kvm_apic_update_irr(u32 *pir, void *regs, int *max_irr)
>   {
>   	u32 i, vec;
> -	u32 pir_val, irr_val;
> -	int max_irr = -1;
> +	u32 pir_val, irr_val, prev_irr_val;
> +	int max_updated_irr;
> +
> +	max_updated_irr = -1;
   combine these 2 lines..

> +	*max_irr = -1;
>   
>   	for (i = vec = 0; i <= 7; i++, vec += 32) {
>   		pir_val = READ_ONCE(pir[i]);
>   		irr_val = *((u32 *)(regs + APIC_IRR + i * 0x10));
>   		if (pir_val) {
> +			prev_irr_val = irr_val;
>   			irr_val |= xchg(&pir[i], 0);
>   			*((u32 *)(regs + APIC_IRR + i * 0x10)) = irr_val;
> +			if (prev_irr_val != irr_val) {
> +				max_updated_irr =
> +					__fls(irr_val ^ prev_irr_val) + vec;
> +			}
>   		}
   perhaps we could use some macros instead of 7, 32, 0x10 ..

>   		if (irr_val)
> -			max_irr = __fls(irr_val) + vec;
> +			*max_irr = __fls(irr_val) + vec;
>   	}
>   
> -	return max_irr;
> +	return ((max_updated_irr != -1) &&
> +		(max_updated_irr == *max_irr));
>   }
>   EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);
>   
> -int kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
> +bool kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir, int *max_irr)
>   {
>   	struct kvm_lapic *apic = vcpu->arch.apic;
>   
> -	return __kvm_apic_update_irr(pir, apic->regs);
> +	return __kvm_apic_update_irr(pir, apic->regs, max_irr);
>   }
>   EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
>   
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 4b9935a38347..56c36014f7b7 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -75,8 +75,8 @@ int kvm_lapic_reg_read(struct kvm_lapic *apic, u32 offset, int len,
>   bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
>   			   int short_hand, unsigned int dest, int dest_mode);
>   
> -int __kvm_apic_update_irr(u32 *pir, void *regs);
> -int kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir);
> +bool __kvm_apic_update_irr(u32 *pir, void *regs, int *max_irr);
> +bool kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir, int *max_irr);
>   void kvm_apic_update_ppr(struct kvm_vcpu *vcpu);
>   int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
>   		     struct dest_map *dest_map);
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 8eba631c4dbd..325608a1ed65 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -5062,7 +5062,8 @@ static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
>   	max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
>   	if (max_irr != 256) {
>   		vapic_page = kmap(vmx->nested.virtual_apic_page);
> -		__kvm_apic_update_irr(vmx->nested.pi_desc->pir, vapic_page);
> +		__kvm_apic_update_irr(vmx->nested.pi_desc->pir,
> +			vapic_page, &max_irr);
>   		kunmap(vmx->nested.virtual_apic_page);
>   
>   		status = vmcs_read16(GUEST_INTR_STATUS);
> @@ -9040,7 +9041,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
>   		 * But on x86 this is just a compiler barrier anyway.
>   		 */
>   		smp_mb__after_atomic();
> -		max_irr = kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
> +		kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr);
>   	} else {
>   		max_irr = kvm_lapic_find_highest_irr(vcpu);
>   	}
Reviewed-by: Quan Xu <quan.xu0@gmail.com>


Quan
Alibaba Cloud


* Re: [PATCH v3 03/11] KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got posted-interrupt
  2017-12-24 16:12 ` [PATCH v3 03/11] KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got posted-interrupt Liran Alon
@ 2018-01-02  2:45   ` Quan Xu
  2018-01-02  9:57     ` Liran Alon
  2018-01-03  5:35   ` Quan Xu
  1 sibling, 1 reply; 57+ messages in thread
From: Quan Xu @ 2018-01-02  2:45 UTC (permalink / raw)
  To: Liran Alon, pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick, Konrad Rzeszutek Wilk

On 2017/12/25 00:12, Liran Alon wrote:

> In case posted-interrupt was delivered to CPU while it is in host
> (outside guest), then posted-interrupt delivery will be done by
> calling sync_pir_to_irr() at vmentry after interrupts are disabled.
>
> sync_pir_to_irr() will check vmx->pi_desc.control ON bit and if
> set, it will sync vmx->pi_desc.pir to IRR and afterwards update RVI to
> ensure virtual-interrupt-delivery will dispatch interrupt to guest.
>
> However, it is possible that L1 will receive a posted-interrupt while
> CPU runs at host and is about to enter L2. In this case, the call to
> sync_pir_to_irr() will indeed update the L1's APIC IRR but
> vcpu_enter_guest() will then just resume into L2 guest without
> re-evaluating if it should exit from L2 to L1 as a result of this
> new pending L1 event.
>
> To address this case, if sync_pir_to_irr() has a new L1 injectable
> interrupt and CPU is running L2, we force exit GUEST_MODE which will
> result in another iteration of vcpu_run() run loop which will call
> kvm_vcpu_running() which will call check_nested_events() which will
> handle the pending L1 event properly.


I agree with this solution.. However ...


> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
> Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>   arch/x86/kvm/vmx.c | 12 +++++++++++-
>   1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 325608a1ed65..d307bf26462a 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -9032,6 +9032,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   	int max_irr;
> +	bool max_irr_updated;
>   
>   	WARN_ON(!vcpu->arch.apicv_active);
>   	if (pi_test_on(&vmx->pi_desc)) {
> @@ -9041,7 +9042,16 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
>   		 * But on x86 this is just a compiler barrier anyway.
>   		 */
>   		smp_mb__after_atomic();
> -		kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr);
> +		max_irr_updated =
> +			kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr);
> +
> +		/*
> +		 * If we are running L2 and L1 has a new pending interrupt
> +		 * which can be injected, we should re-evaluate
> +		 * what should be done with this new L1 interrupt.
> +		 */
> +		if (is_guest_mode(vcpu) && max_irr_updated)
> +			kvm_vcpu_exiting_guest_mode(vcpu);


...

refer to "Virtual-Interrupt Delivery":

"""
    Vector ← RVI;
    VISR[Vector] ← 1;
    SVI ← Vector;
    VPPR ← Vector & F0H;
    VIRR[Vector] ← 0;
    IF any bits set in VIRR
       THEN RVI ← highest index of bit set in VIRR
       ELSE RVI ← 0;
    FI;
    deliver interrupt with Vector through IDT;
    cease recognition of any pending virtual interrupt;
"""



As we synced PIR to L1's APIC IRR, not only the max_irr is injectable
but so are the other vectors in L1's APIC IRR.

So what we need to check for is any new injectable L1 interrupt from PIR,
even if it is not the max_irr..

Quan
Alibaba Cloud


* Re: [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt
  2018-01-01 22:37                   ` Liran Alon
@ 2018-01-02  7:25                     ` Paolo Bonzini
  0 siblings, 0 replies; 57+ messages in thread
From: Paolo Bonzini @ 2018-01-02  7:25 UTC (permalink / raw)
  To: Liran Alon, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick, Konrad Rzeszutek Wilk

On 01/01/2018 23:37, Liran Alon wrote:
> 
> 
> On 01/01/18 23:32, Paolo Bonzini wrote:
>> On 27/12/2017 16:54, Paolo Bonzini wrote:
>>> (b) Get rid of pi_pending and instead use virtual LAPIC IRR bit and
>>> process it specially in case vCPU in non-root-mode & posted-interrupts
>>> is active.
>>> (c) Get rid of software simulation of nested posted-interrupts
>>> processing and instead use self-IPI trick to make CPU process it for us.
>>>
>>> What do you think?
>>
>> I guess 2(b) would be done in vmx_hwapic_irr_update?
> 
> I'm not sure yet. I am currently busy with other tasks and therefore
> didn't have enough time to work on the new version of this series since
> receiving your comments.
> 
> I need to think about what would be the most elegant and
> easy-to-understand code for this mechanism. There are many delicate
> cases here that need to be thought through.
> 
> Regarding vmx_hwapic_irr_update(), I think it is a bit misleading to
> put the logic handling this case in there.
> vmx_hwapic_irr_update() should conceptually only update the RVI, similar
> to how vmx_hwapic_isr_update() updates only the SVI.
> I also think that the separation between vmx_hwapic_irr_update() &
> vmx_set_rvi() is a bit weird. I would have expected
> vmx_hwapic_irr_update() and vmx_hwapic_isr_update() to be symmetric.

I'm not sure that they can be symmetric, especially since we're doing
vmx_hwapic_irr_update before all vmentries.  In fact I'm wondering if
kvm_x86_ops->hwapic_irr_update is needed at all.

Paolo


* Re: [PATCH v3 03/11] KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got posted-interrupt
  2018-01-02  2:45   ` Quan Xu
@ 2018-01-02  9:57     ` Liran Alon
  2018-01-02 11:21       ` Quan Xu
  0 siblings, 1 reply; 57+ messages in thread
From: Liran Alon @ 2018-01-02  9:57 UTC (permalink / raw)
  To: Quan Xu, pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick, Konrad Rzeszutek Wilk



On 02/01/18 04:45, Quan Xu wrote:
> On 2017/12/25 00:12, Liran Alon wrote:
>
>> In case posted-interrupt was delivered to CPU while it is in host
>> (outside guest), then posted-interrupt delivery will be done by
>> calling sync_pir_to_irr() at vmentry after interrupts are disabled.
>>
>> sync_pir_to_irr() will check vmx->pi_desc.control ON bit and if
>> set, it will sync vmx->pi_desc.pir to IRR and afterwards update RVI to
>> ensure virtual-interrupt-delivery will dispatch interrupt to guest.
>>
>> However, it is possible that L1 will receive a posted-interrupt while
>> CPU runs at host and is about to enter L2. In this case, the call to
>> sync_pir_to_irr() will indeed update the L1's APIC IRR but
>> vcpu_enter_guest() will then just resume into L2 guest without
>> re-evaluating if it should exit from L2 to L1 as a result of this
>> new pending L1 event.
>>
>> To address this case, if sync_pir_to_irr() has a new L1 injectable
>> interrupt and CPU is running L2, we force exit GUEST_MODE which will
>> result in another iteration of vcpu_run() run loop which will call
>> kvm_vcpu_running() which will call check_nested_events() which will
>> handle the pending L1 event properly.
>
>
> I agree with this solution.. However ...
>
>
>> Signed-off-by: Liran Alon <liran.alon@oracle.com>
>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
>> Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
>> Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> ---
>>   arch/x86/kvm/vmx.c | 12 +++++++++++-
>>   1 file changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 325608a1ed65..d307bf26462a 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -9032,6 +9032,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu
>> *vcpu)
>>   {
>>       struct vcpu_vmx *vmx = to_vmx(vcpu);
>>       int max_irr;
>> +    bool max_irr_updated;
>>       WARN_ON(!vcpu->arch.apicv_active);
>>       if (pi_test_on(&vmx->pi_desc)) {
>> @@ -9041,7 +9042,16 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu
>> *vcpu)
>>            * But on x86 this is just a compiler barrier anyway.
>>            */
>>           smp_mb__after_atomic();
>> -        kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr);
>> +        max_irr_updated =
>> +            kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr);
>> +
>> +        /*
>> +         * If we are running L2 and L1 has a new pending interrupt
>> +         * which can be injected, we should re-evaluate
>> +         * what should be done with this new L1 interrupt.
>> +         */
>> +        if (is_guest_mode(vcpu) && max_irr_updated)
>> +            kvm_vcpu_exiting_guest_mode(vcpu);
>
>
> ...
>
> refer to "Virtual-Interrupt Delivery":
>
> """
>     Vector ← RVI;
>     VISR[Vector] ← 1;
>     SVI ← Vector;
>     VPPR ← Vector & F0H;
>     VIRR[Vector] ← 0;
>     IF any bits set in VIRR
>        THEN RVI ← highest index of bit set in VIRR
>        ELSE RVI ← 0;
>     FI;
>     deliver interrupt with Vector through IDT;
>     cease recognition of any pending virtual interrupt;
> """
>
>
>
> As we synced PIR to L1's APIC IRR, not only the max_irr is injectable
> but so are the other vectors in L1's APIC IRR.
>
> So what we need to check for is any new injectable L1 interrupt from PIR,
> even if it is not the max_irr..

The vectors which were set in the LAPIC IRR before the sync with PIR 
were already evaluated to check if they should cause #VMExit from L2 to 
L1 (in check_nested_events()).

Therefore, in vmx_sync_pir_to_irr() we only need to check if PIR has a 
new injectable interrupt set.

I failed to understand how this relates to the VID SDM algorithm you 
provided here. Can you clarify?

Thanks,
-Liran

>
> Quan
> Alibaba Cloud


* Re: [PATCH v3 03/11] KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got posted-interrupt
  2018-01-02  9:57     ` Liran Alon
@ 2018-01-02 11:21       ` Quan Xu
  2018-01-02 11:52         ` Quan Xu
  0 siblings, 1 reply; 57+ messages in thread
From: Quan Xu @ 2018-01-02 11:21 UTC (permalink / raw)
  To: Liran Alon, pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick, Konrad Rzeszutek Wilk



On 2018/01/02 17:57, Liran Alon wrote:
>
>
> On 02/01/18 04:45, Quan Xu wrote:
>> On 2017/12/25 00:12, Liran Alon wrote:
>>
>>> In case posted-interrupt was delivered to CPU while it is in host
>>> (outside guest), then posted-interrupt delivery will be done by
>>> calling sync_pir_to_irr() at vmentry after interrupts are disabled.
>>>
>>> sync_pir_to_irr() will check vmx->pi_desc.control ON bit and if
>>> set, it will sync vmx->pi_desc.pir to IRR and afterwards update RVI to
>>> ensure virtual-interrupt-delivery will dispatch interrupt to guest.
>>>
>>> However, it is possible that L1 will receive a posted-interrupt while
>>> CPU runs at host and is about to enter L2. In this case, the call to
>>> sync_pir_to_irr() will indeed update the L1's APIC IRR but
>>> vcpu_enter_guest() will then just resume into L2 guest without
>>> re-evaluating if it should exit from L2 to L1 as a result of this
>>> new pending L1 event.
>>>
>>> To address this case, if sync_pir_to_irr() has a new L1 injectable
>>> interrupt and CPU is running L2, we force exit GUEST_MODE which will
>>> result in another iteration of vcpu_run() run loop which will call
>>> kvm_vcpu_running() which will call check_nested_events() which will
>>> handle the pending L1 event properly.
>>
>>
>> I agree with this solution.. However ...
>>
>>
>>> Signed-off-by: Liran Alon <liran.alon@oracle.com>
>>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>>> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
>>> Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
>>> Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
>>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>>> ---
>>>   arch/x86/kvm/vmx.c | 12 +++++++++++-
>>>   1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 325608a1ed65..d307bf26462a 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -9032,6 +9032,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu
>>> *vcpu)
>>>   {
>>>       struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>       int max_irr;
>>> +    bool max_irr_updated;
>>>       WARN_ON(!vcpu->arch.apicv_active);
>>>       if (pi_test_on(&vmx->pi_desc)) {
>>> @@ -9041,7 +9042,16 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu
>>> *vcpu)
>>>            * But on x86 this is just a compiler barrier anyway.
>>>            */
>>>           smp_mb__after_atomic();
>>> -        kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr);
>>> +        max_irr_updated =
>>> +            kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr);
>>> +
>>> +        /*
>>> +         * If we are running L2 and L1 has a new pending interrupt
>>> +         * which can be injected, we should re-evaluate
>>> +         * what should be done with this new L1 interrupt.
>>> +         */
>>> +        if (is_guest_mode(vcpu) && max_irr_updated)
>>> +            kvm_vcpu_exiting_guest_mode(vcpu);
>>
>>
>> ...
>>
>> refer to "Virtual-Interrupt Delivery":
>>
>> """
>>     Vector ← RVI;
>>     VISR[Vector] ← 1;
>>     SVI ← Vector;
>>     VPPR ← Vector & F0H;
>>     VIRR[Vector] ← 0;
>>     IF any bits set in VIRR
>>        THEN RVI ← highest index of bit set in VIRR
>>        ELSE RVI ← 0;
>>     FI;
>>     deliver interrupt with Vector through IDT;
>>     cease recognition of any pending virtual interrupt;
>> """
>>
>>
>>
>> As we synced PIR to L1's APIC IRR, not only the max_irr is injectable
>> but so are the other vectors in L1's APIC IRR.
>>
>> So what we need to check for is any new injectable L1 interrupt from PIR,
>> even if it is not the max_irr..
>
> The vectors which were set in the LAPIC IRR before the sync with PIR 
> were already evaluated to check if they should cause #VMExit from L2 
> to L1 (in check_nested_events()).
>
> Therefore, in vmx_sync_pir_to_irr() we only need to check if PIR has 
> a new injectable interrupt set.
>

yes,

the key point is what 'a new L1 injectable interrupt' means..

in your patch, __IIUC__ one of the conditions to exit guest mode is:
    1) max_irr_updated - 'vector of PIR' > 'vectors of previous IRR'


as the SDM algorithm says, even if:
    'vector of PIR' < 'vectors of previous IRR'

If we have synced PIR to L1's APIC IRR, the 'vector of PIR' _may_ be 
delivered to the VM..
it is a new injectable interrupt as well..



Quan
Alibaba Cloud


> I failed to understand how this relates to the VID SDM algorithm you 
> provided here. Can you clarify?
>


* Re: [PATCH v3 03/11] KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got posted-interrupt
  2018-01-02 11:21       ` Quan Xu
@ 2018-01-02 11:52         ` Quan Xu
  2018-01-02 12:20           ` Liran Alon
  0 siblings, 1 reply; 57+ messages in thread
From: Quan Xu @ 2018-01-02 11:52 UTC (permalink / raw)
  To: Liran Alon, kvm
  Cc: pbonzini, rkrcmar, jmattson, wanpeng.li, idan.brown,
	Liam Merwick, Konrad Rzeszutek Wilk



On 2018-01-02 19:21, Quan Xu wrote:
>
> yes,
>
> the key point is what 'a new L1 injectable interrupt' means..
>
> in your patch, __IIUC__ one of the conditions to exit guest mode is:
>    1) max_irr_updated - 'vector of PIR' > 'vectors of previous IRR'
>
>
> as the SDM algorithm says, even if:
>    'vector of PIR' < 'vectors of previous IRR'
>
> If we have synced PIR to L1's APIC IRR, the 'vector of PIR' _may_ be 
> delivered to the VM..
     If we have synced PIR to L1's APIC IRR, the 'vector of PIR' _may_ be 
delivered to the VM before the next VMExit..

> it is a new injectable interrupt as well..
>
>
>
> Quan
> Alibaba Cloud


* Re: [PATCH v3 03/11] KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got posted-interrupt
  2018-01-02 11:52         ` Quan Xu
@ 2018-01-02 12:20           ` Liran Alon
  2018-01-03  5:32             ` Quan Xu
  0 siblings, 1 reply; 57+ messages in thread
From: Liran Alon @ 2018-01-02 12:20 UTC (permalink / raw)
  To: Quan Xu, kvm
  Cc: pbonzini, rkrcmar, jmattson, wanpeng.li, idan.brown,
	Liam Merwick, Konrad Rzeszutek Wilk



On 02/01/18 13:52, Quan Xu wrote:
>
>
> On 2018-01-02 19:21, Quan Xu wrote:
>>
>> yes,
>>
>> the key point is what 'a new L1 injectable interrupt' means..
>>
>> in your patch, __IIUC__ one of the conditions to exit guest mode is:
>>    1) max_irr_updated - 'vector of PIR' > 'vectors of previous IRR'
>>
>>
>> as the SDM algorithm says, even if:
>>    'vector of PIR' < 'vectors of previous IRR'
>>
>> If we have synced PIR to L1's APIC IRR, the 'vector of PIR' _may_ be
>> delivered to the VM..
>      If we have synced PIR to L1's APIC IRR, the 'vector of PIR' _may_ be
> delivered to the VM before the next VMExit..
>
>> it is a new injectable interrupt as well..
>>
>>
>>
>> Quan
>> Alibaba Cloud
>

In case APICv is active (relevant to our case), __apic_accept_irq() 
handles APIC_DM_FIXED by calling deliver_posted_interrupt() which will 
set a bit in PIR instead of setting a bit in LAPIC IRR.

Therefore, in case APICv is active, the bits that are set in LAPIC IRR 
are bits which were synced from PIR to IRR by some previous call of 
vmx_sync_pir_to_irr(). (Not considering the case of some user-mode API 
setting them manually).

Therefore, the only new injectable interrupts we can encounter in 
vmx_sync_pir_to_irr() are vectors coming from the PIR.
It is also not possible for the guest to block interrupt vector X while 
still being able to accept vector Y if Y<X. Therefore, it is enough to 
just check if max vector from PIR > max vector from IRR before the sync.

It is true that VID will inject the vector in RVI to the guest if 
possible on VMEntry. RVI is set by vmx_sync_pir_to_irr() to either the 
max value from PIR or the max value from IRR. If the max value from IRR 
is greater than the max value from PIR, then the vector to be injected 
to the guest was already evaluated by check_nested_events() to determine 
whether it should cause an exit to L1, and therefore we don't need to 
call kvm_vcpu_exiting_guest_mode() again.
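
The reasoning above can be checked against a small userspace model of
the sync logic. This is only a sketch: the array-based pir/irr
parameters and the names sync_pir_to_irr()/fls_bit() are stand-ins for
the kernel's posted-interrupt descriptor, virtual-APIC page and __fls(),
not the actual KVM code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Highest set bit of a non-zero 32-bit word (stand-in for __fls()). */
static int fls_bit(uint32_t val)
{
	return 31 - __builtin_clz(val);
}

/*
 * Userspace model of __kvm_apic_update_irr(): merge the 256-bit PIR into
 * the 256-bit IRR, report the resulting max IRR vector, and return true
 * only when PIR contributed that max, i.e. when there is a *new*
 * injectable interrupt that has not yet been evaluated by
 * check_nested_events().
 */
static bool sync_pir_to_irr(uint32_t pir[8], uint32_t irr[8], int *max_irr)
{
	int max_updated_irr = -1;

	*max_irr = -1;
	for (int i = 0; i < 8; i++) {
		uint32_t prev = irr[i];

		if (pir[i]) {
			irr[i] |= pir[i];	/* sync this word of PIR into IRR */
			pir[i] = 0;		/* models the xchg() that clears PIR */
			if (irr[i] != prev)
				max_updated_irr = fls_bit(irr[i] ^ prev) + i * 32;
		}
		if (irr[i])
			*max_irr = fls_bit(irr[i]) + i * 32;
	}
	return max_updated_irr != -1 && max_updated_irr == *max_irr;
}
```

With vector 0x91 already in IRR (hence already evaluated) and PIR
bringing only vector 0x31, the model returns false and no forced exit
from guest mode is needed; only when PIR contributes a new highest
vector such as 0x99 does it return true.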

Regards,
-Liran


* Re: [PATCH v3 03/11] KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got posted-interrupt
  2018-01-02 12:20           ` Liran Alon
@ 2018-01-03  5:32             ` Quan Xu
  0 siblings, 0 replies; 57+ messages in thread
From: Quan Xu @ 2018-01-03  5:32 UTC (permalink / raw)
  To: Liran Alon, kvm
  Cc: pbonzini, rkrcmar, jmattson, wanpeng.li, idan.brown,
	Liam Merwick, Konrad Rzeszutek Wilk



On 2018/01/02 20:20, Liran Alon wrote:
>
> In case APICv is active (relevant to our case), __apic_accept_irq() 
> handles APIC_DM_FIXED by calling deliver_posted_interrupt() which will 
> set a bit in PIR instead of setting a bit in LAPIC IRR.
>
> Therefore, in case APICv is active, the bits that are set in LAPIC IRR 
> are bits which were synced from PIR to IRR by some previous call of 
> vmx_sync_pir_to_irr(). (Not considering the case of some user-mode API 
> setting them manually).
>
> Therefore, the only new injectable interrupts we can encounter in 
> vmx_sync_pir_to_irr() are vectors coming from the PIR.
> It is also not possible for the guest to block interrupt vector X 
> while still being able to accept vector Y if Y<X. Therefore, it is 
> enough to just check if max vector from PIR > max vector from IRR 
> before the sync.
>
> It is true that VID will inject the vector in RVI to the guest if 
> possible on VMEntry. RVI is set by vmx_sync_pir_to_irr() to either the 
> max value from PIR or the max value from IRR. If the max value from 
> IRR > max value from PIR, 

I agree with you on all of the above points, ...


> then the vector to be injected to the guest was already evaluated by 
> check_nested_events() to determine whether it should cause an exit to 
> L1, and therefore we don't need to call kvm_vcpu_exiting_guest_mode() again.

...


:( my bad.. I missed the VPPR..

e.g. in this case:
          'vector of PIR' < 'vector of previous IRR'

then the 'vector of PIR' also has a chance to be synced to RVI, but 
hardware does not recognize the 'vector of PIR' until the next VMEntry, 
as RVI[7:4] <= VPPR[7:4] (VPPR is left over from the last EOI 
virtualization of the 'vector of previous IRR').

even across L2 -> L0 -> L2, the L0 code helps to check it, as you 
described..

so you are right.
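
The priority-class rule invoked here can be written out as a toy
predicate. Per the SDM, a pending virtual interrupt is recognized only
when the priority class of RVI exceeds that of VPPR; the function name
below is made up for illustration and this is not KVM code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * SDM rule for evaluation of pending virtual interrupts: the CPU
 * recognizes a pending virtual interrupt only if RVI[7:4] > VPPR[7:4],
 * i.e. the priority class of the highest requested vector strictly
 * exceeds the current virtual processor-priority class.
 */
static bool virt_intr_recognized(uint8_t rvi, uint8_t vppr)
{
	return (rvi & 0xf0) > (vppr & 0xf0);
}
```

For example, with VPPR = 0x90 left over from the EOI virtualization
described above, a newly synced RVI of 0x31 (or even 0x99, which is in
the same priority class) is not recognized; only a vector in a strictly
higher class, such as 0xa1, would be.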


Quan
Alibaba Cloud


* Re: [PATCH v3 03/11] KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got posted-interrupt
  2017-12-24 16:12 ` [PATCH v3 03/11] KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got posted-interrupt Liran Alon
  2018-01-02  2:45   ` Quan Xu
@ 2018-01-03  5:35   ` Quan Xu
  1 sibling, 0 replies; 57+ messages in thread
From: Quan Xu @ 2018-01-03  5:35 UTC (permalink / raw)
  To: Liran Alon, pbonzini, rkrcmar, kvm
  Cc: jmattson, wanpeng.li, idan.brown, Liam Merwick, Konrad Rzeszutek Wilk



On 2017/12/25 00:12, Liran Alon wrote:
> In case posted-interrupt was delivered to CPU while it is in host
> (outside guest), then posted-interrupt delivery will be done by
> calling sync_pir_to_irr() at vmentry after interrupts are disabled.
>
> sync_pir_to_irr() will check vmx->pi_desc.control ON bit and if
> set, it will sync vmx->pi_desc.pir to IRR and afterwards update RVI to
> ensure virtual-interrupt-delivery will dispatch interrupt to guest.
>
> However, it is possible that L1 will receive a posted-interrupt while
> CPU runs at host and is about to enter L2. In this case, the call to
> sync_pir_to_irr() will indeed update the L1's APIC IRR but
> vcpu_enter_guest() will then just resume into L2 guest without
> re-evaluating if it should exit from L2 to L1 as a result of this
> new pending L1 event.
>
> To address this case, if sync_pir_to_irr() has a new L1 injectable
> interrupt and CPU is running L2, we force exit GUEST_MODE which will
> result in another iteration of vcpu_run() run loop which will call
> kvm_vcpu_running() which will call check_nested_events() which will
> handle the pending L1 event properly.
>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
> Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Reviewed-by: Quan Xu <quan.xu0@gmail.com>

Quan
Alibaba Cloud

> ---
>   arch/x86/kvm/vmx.c | 12 +++++++++++-
>   1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 325608a1ed65..d307bf26462a 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -9032,6 +9032,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   	int max_irr;
> +	bool max_irr_updated;
>   
>   	WARN_ON(!vcpu->arch.apicv_active);
>   	if (pi_test_on(&vmx->pi_desc)) {
> @@ -9041,7 +9042,16 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
>   		 * But on x86 this is just a compiler barrier anyway.
>   		 */
>   		smp_mb__after_atomic();
> -		kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr);
> +		max_irr_updated =
> +			kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr);
> +
> +		/*
> +		 * If we are running L2 and L1 has a new pending interrupt
> +		 * which can be injected, we should re-evaluate
> +		 * what should be done with this new L1 interrupt.
> +		 */
> +		if (is_guest_mode(vcpu) && max_irr_updated)
> +			kvm_vcpu_exiting_guest_mode(vcpu);
>   	} else {
>   		max_irr = kvm_lapic_find_highest_irr(vcpu);
>   	}


* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2017-12-27 15:55           ` Paolo Bonzini
@ 2020-11-23 19:22             ` Oliver Upton
  2020-11-23 22:42               ` Paolo Bonzini
  0 siblings, 1 reply; 57+ messages in thread
From: Oliver Upton @ 2020-11-23 19:22 UTC (permalink / raw)
  To: pbonzini
  Cc: idan.brown, jmattson, kvm, liam.merwick, wanpeng.li, seanjc,
	Oliver Upton

> On 27/12/2017 16:15, Liran Alon wrote:
>> I think I now follow what you mean regarding cleaning logic around
>> pi_pending. This is how I understand it:
>> 
>> 1. "vmx->nested.pi_pending" is a flag used to indicate "L1 has sent a
>> vmcs12->posted_intr_nv IPI". That's it.
>> 
>> 2. Currently code is a bit weird in the sense that instead of signal the
>> pending IPI in virtual LAPIC IRR, we set it in a special variable.
>> If we would have set it in virtual LAPIC IRR, we could in theory behave
>> very similar to a standard CPU. At interrupt injection point, we could:
>> (a) If vCPU is in root-mode: Just inject the pending interrupt normally.
>> (b) If vCPU is in non-root-mode and posted-interrupts feature is active,
>> then instead of injecting the pending interrupt, we should simulate
>> processing of posted-interrupts.
>> 
>> 3. The processing of the nested posted-interrupts itself can still be
>> done in self-IPI mechanism.
>> 
>> 4. Because not doing (2), there is still currently an issue that L1
>> doesn't receive a vmcs12->posted_intr_nv interrupt when target vCPU
>> thread has exited from L2 to L1 and pi_pending=true.
>> 
>> Do we agree on the above? Or am I still misunderstanding something?
>
> Yes, I think we agree.
>
> Paolo

Digging up this old thread to add discussions I had with Jim and Sean
recently.

We were looking at what was necessary in order to implement the
suggestions above (route the nested PINV through L1 IRR instead of using
the pi_pending bit), but now believe this change could never work
perfectly.

The pi_pending bit works rather well as it is only a hint to KVM that it
may owe the guest a posted-interrupt completion. However, if we were to
set the guest's nested PINV as pending in the L1 IRR it'd be challenging
to infer whether or not it should actually be injected in L1 or result
in posted-interrupt processing for L2.

A possible solution would be to install a real ISR in L0 for KVM's
nested PINV in concert with a per-CPU data structure containing a
linked-list of possible vCPUs that could've been meant to receive the
doorbell. But how would L0 know to remove a vCPU from this list? Since
posted interrupts is exitless, KVM never actually knows if a vCPU got
the intended doorbell.
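
To make the bookkeeping concrete, here is a rough userspace sketch of that
per-pCPU candidate list; every name in it is hypothetical (this is not KVM
code), and the comment marks exactly the removal problem described above:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vCPU state for the scheme sketched above. */
struct vcpu {
	bool maybe_owed_pinv;	/* vCPU might be the PINV target */
	bool l1_irq_injected;	/* set when L0 injects into L1 */
	struct vcpu *next;	/* link in the per-pCPU list */
};

/* Per-pCPU list of vCPUs that could have been the IPI's target. */
static struct vcpu *pinv_candidates;

static void add_candidate(struct vcpu *v)
{
	v->next = pinv_candidates;
	pinv_candidates = v;
}

/* L0 PINV ISR sketch: the doorbell carries no vCPU identity, so the
 * handler must treat every listed vCPU as a possible recipient. */
static void pinv_isr(void)
{
	struct vcpu *v;

	for (v = pinv_candidates; v; v = v->next)
		if (v->maybe_owed_pinv)
			v->l1_irq_injected = true;
	/*
	 * Emptying the list here is the easy direction; there is no
	 * corresponding point at which L0 learns that hardware already
	 * consumed the doorbell, which is the open problem above.
	 */
	pinv_candidates = NULL;
}
```

With two candidates on the list, the ISR has no choice but to treat both as
recipients, which is the ambiguity being discussed.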

Short of any brilliant ideas, it seems that the pi_pending bit
is probably here to stay. I have a patch to serialize it in the
{GET,SET}_NESTED_STATE ioctls (live migration is busted for nested
posted interrupts, currently) but wanted to make sure we all agree
pi_pending isn't going anywhere.

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-23 19:22             ` Oliver Upton
@ 2020-11-23 22:42               ` Paolo Bonzini
  2020-11-24  0:10                 ` Oliver Upton
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2020-11-23 22:42 UTC (permalink / raw)
  To: Oliver Upton; +Cc: idan.brown, jmattson, kvm, liam.merwick, wanpeng.li, seanjc

On 23/11/20 20:22, Oliver Upton wrote:
> The pi_pending bit works rather well as it is only a hint to KVM that it
> may owe the guest a posted-interrupt completion. However, if we were to
> set the guest's nested PINV as pending in the L1 IRR it'd be challenging
> to infer whether or not it should actually be injected in L1 or result
> in posted-interrupt processing for L2.

Stupid question: why does it matter?  The behavior when the PINV is 
delivered does not depend on the time it enters the IRR, only on the 
time that it enters ISR.  If that happens while the vCPU is in L2, it 
would trigger posted interrupt processing; if PINV moves to ISR while in 
L1, it would be delivered normally as an interrupt.

There are various special cases but they should fall in place.  For 
example, if PINV is delivered during L1 vmentry (with IF=0), it would be 
delivered at the next inject_pending_event when the VMRUN vmexit is 
processed and interrupts are unmasked.

The tricky case is when L0 tries to deliver the PINV to L1 as a posted 
interrupt, i.e. in vmx_deliver_nested_posted_interrupt.  Then the

                 if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
                         kvm_vcpu_kick(vcpu);

needs a tweak to fall back to setting the PINV in L1's IRR:

                 if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
                         /* set PINV in L1's IRR */
			kvm_vcpu_kick(vcpu);
		}

but you also have to do the same *in the PINV handler* 
sysvec_kvm_posted_intr_nested_ipi too, to handle the case where the 
L2->L0 vmexit races against sending the IPI.

What am I missing?

Paolo


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-23 22:42               ` Paolo Bonzini
@ 2020-11-24  0:10                 ` Oliver Upton
  2020-11-24  0:13                   ` Oliver Upton
  2020-11-24 11:09                   ` Paolo Bonzini
  0 siblings, 2 replies; 57+ messages in thread
From: Oliver Upton @ 2020-11-24  0:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: idan.brown, Jim Mattson, kvm list, liam.merwick, wanpeng.li,
	Sean Christopherson

On Mon, Nov 23, 2020 at 2:42 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 23/11/20 20:22, Oliver Upton wrote:
> > The pi_pending bit works rather well as it is only a hint to KVM that it
> > may owe the guest a posted-interrupt completion. However, if we were to
> > set the guest's nested PINV as pending in the L1 IRR it'd be challenging
> > to infer whether or not it should actually be injected in L1 or result
> > in posted-interrupt processing for L2.
>
> Stupid question: why does it matter?  The behavior when the PINV is
> delivered does not depend on the time it enters the IRR, only on the
> time that it enters ISR.  If that happens while the vCPU is in L2, it
> would trigger posted interrupt processing; if PINV moves to ISR while in
> L1, it would be delivered normally as an interrupt.
>
> There are various special cases but they should fall in place.  For
> example, if PINV is delivered during L1 vmentry (with IF=0), it would be
> delivered at the next inject_pending_event when the VMRUN vmexit is
> processed and interrupts are unmasked.
>
> The tricky case is when L0 tries to deliver the PINV to L1 as a posted
> interrupt, i.e. in vmx_deliver_nested_posted_interrupt.  Then the
>
>                  if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
>                          kvm_vcpu_kick(vcpu);
>
> needs a tweak to fall back to setting the PINV in L1's IRR:
>
>                  if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
>                          /* set PINV in L1's IRR */
>                         kvm_vcpu_kick(vcpu);
>                 }

Yeah, I think that's fair. Regardless, the pi_pending bit should've
only been set if the IPI was actually sent. Though I suppose

> but you also have to do the same *in the PINV handler*
> sysvec_kvm_posted_intr_nested_ipi too, to handle the case where the
> L2->L0 vmexit races against sending the IPI.

Indeed, there is a race but are we assured that the target vCPU thread
is scheduled on the target CPU when that IPI arrives?

I believe there is a 1-to-many relationship here, which is why I said
each CPU would need to maintain a linked list of possible vCPUs to
iterate and find the intended recipient. The process of removing vCPUs
from the list where we caught the IPI in L0 is quite clear, but it
doesn't seem like we could ever know to remove vCPUs from the list
when hardware catches that IPI.

If the ISR thing can be figured out then that'd be great, though it
seems infeasible because we are racing with scheduling on the target.

Could we split the difference and do something like:

        if (kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
                vmx->nested.pi_pending = true;
        } else {
                /* set PINV in L1's IRR */
                kvm_vcpu_kick(vcpu);
        }

which ensures we only set the hint when KVM might actually have
something to do. Otherwise, it'll deliver to L1 like a normal
interrupt or trigger posted-interrupt processing on nested VM-entry if
IF=0.

> What am I missing?
>
> Paolo
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-24  0:10                 ` Oliver Upton
@ 2020-11-24  0:13                   ` Oliver Upton
  2020-11-24  1:55                     ` Sean Christopherson
  2020-11-24 11:09                   ` Paolo Bonzini
  1 sibling, 1 reply; 57+ messages in thread
From: Oliver Upton @ 2020-11-24  0:13 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: idan.brown, Jim Mattson, kvm list, liam.merwick, wanpeng.li,
	Sean Christopherson

On Mon, Nov 23, 2020 at 4:10 PM Oliver Upton <oupton@google.com> wrote:
>
> On Mon, Nov 23, 2020 at 2:42 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 23/11/20 20:22, Oliver Upton wrote:
> > > The pi_pending bit works rather well as it is only a hint to KVM that it
> > > may owe the guest a posted-interrupt completion. However, if we were to
> > > set the guest's nested PINV as pending in the L1 IRR it'd be challenging
> > > to infer whether or not it should actually be injected in L1 or result
> > > in posted-interrupt processing for L2.
> >
> > Stupid question: why does it matter?  The behavior when the PINV is
> > delivered does not depend on the time it enters the IRR, only on the
> > time that it enters ISR.  If that happens while the vCPU is in L2, it
> > would trigger posted interrupt processing; if PINV moves to ISR while in
> > L1, it would be delivered normally as an interrupt.
> >
> > There are various special cases but they should fall in place.  For
> > example, if PINV is delivered during L1 vmentry (with IF=0), it would be
> > delivered at the next inject_pending_event when the VMRUN vmexit is
> > processed and interrupts are unmasked.
> >
> > The tricky case is when L0 tries to deliver the PINV to L1 as a posted
> > interrupt, i.e. in vmx_deliver_nested_posted_interrupt.  Then the
> >
> >                  if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
> >                          kvm_vcpu_kick(vcpu);
> >
> > needs a tweak to fall back to setting the PINV in L1's IRR:
> >
> >                  if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
> >                          /* set PINV in L1's IRR */
> >                         kvm_vcpu_kick(vcpu);
> >                 }
>
> Yeah, I think that's fair. Regardless, the pi_pending bit should've
> only been set if the IPI was actually sent. Though I suppose

Didn't finish my thought :-/

Though I suppose pi_pending was set unconditionally (and skipped the
IRR) since until recently KVM completely bungled handling the PINV
correctly when in the L1 IRR.

>
> > but you also have to do the same *in the PINV handler*
> > sysvec_kvm_posted_intr_nested_ipi too, to handle the case where the
> > L2->L0 vmexit races against sending the IPI.
>
> Indeed, there is a race but are we assured that the target vCPU thread
> is scheduled on the target CPU when that IPI arrives?
>
> I believe there is a 1-to-many relationship here, which is why I said
> each CPU would need to maintain a linked list of possible vCPUs to
> iterate and find the intended recipient. The process of removing vCPUs
> from the list where we caught the IPI in L0 is quite clear, but it
> doesn't seem like we could ever know to remove vCPUs from the list
> when hardware catches that IPI.
>
> If the ISR thing can be figured out then that'd be great, though it
> seems infeasible because we are racing with scheduling on the target.
>
> Could we split the difference and do something like:
>
>         if (kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
>                 vmx->nested.pi_pending = true;
>         } else {
>                 /* set PINV in L1's IRR */
>                 kvm_vcpu_kick(vcpu);
>         }
>
> which ensures we only set the hint when KVM might actually have
> something to do. Otherwise, it'll deliver to L1 like a normal
> interrupt or trigger posted-interrupt processing on nested VM-entry if
> IF=0.
>
> > What am I missing?
> >
> > Paolo
> >

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-24  0:13                   ` Oliver Upton
@ 2020-11-24  1:55                     ` Sean Christopherson
  2020-11-24  3:19                       ` Sean Christopherson
  2020-11-24 11:39                       ` Paolo Bonzini
  0 siblings, 2 replies; 57+ messages in thread
From: Sean Christopherson @ 2020-11-24  1:55 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Paolo Bonzini, idan.brown, Jim Mattson, kvm list, liam.merwick,
	wanpeng.li

On Mon, Nov 23, 2020 at 04:13:49PM -0800, Oliver Upton wrote:
> On Mon, Nov 23, 2020 at 4:10 PM Oliver Upton <oupton@google.com> wrote:
> >
> > On Mon, Nov 23, 2020 at 2:42 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > >
> > > On 23/11/20 20:22, Oliver Upton wrote:
> > > > The pi_pending bit works rather well as it is only a hint to KVM that it
> > > > may owe the guest a posted-interrupt completion. However, if we were to
> > > > set the guest's nested PINV as pending in the L1 IRR it'd be challenging
> > > > to infer whether or not it should actually be injected in L1 or result
> > > > in posted-interrupt processing for L2.
> > >
> > > Stupid question: why does it matter?  The behavior when the PINV is
> > > delivered does not depend on the time it enters the IRR, only on the
> > > time that it enters ISR.  If that happens while the vCPU is in L2, it
> > > would trigger posted interrupt processing; if PINV moves to ISR while in
> > > L1, it would be delivered normally as an interrupt.
> > >
> > > There are various special cases but they should fall in place.  For
> > > example, if PINV is delivered during L1 vmentry (with IF=0), it would be
> > > delivered at the next inject_pending_event when the VMRUN vmexit is
> > > processed and interrupts are unmasked.
> > >
> > > The tricky case is when L0 tries to deliver the PINV to L1 as a posted
> > > interrupt, i.e. in vmx_deliver_nested_posted_interrupt.  Then the
> > >
> > >                  if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
> > >                          kvm_vcpu_kick(vcpu);
> > >
> > > needs a tweak to fall back to setting the PINV in L1's IRR:
> > >
> > >                  if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
> > >                          /* set PINV in L1's IRR */
> > >                         kvm_vcpu_kick(vcpu);
> > >                 }
> >
> > Yeah, I think that's fair. Regardless, the pi_pending bit should've
> > only been set if the IPI was actually sent. Though I suppose
> 
> Didn't finish my thought :-/
> 
> Though I suppose pi_pending was set unconditionally (and skipped the
> IRR) since until recently KVM completely bungled handling the PINV
> correctly when in the L1 IRR.
> 
> >
> > > but you also have to do the same *in the PINV handler*
> > > sysvec_kvm_posted_intr_nested_ipi too, to handle the case where the
> > > L2->L0 vmexit races against sending the IPI.
> >
> > Indeed, there is a race but are we assured that the target vCPU thread
> > is scheduled on the target CPU when that IPI arrives?
> >
> > I believe there is a 1-to-many relationship here, which is why I said
> > each CPU would need to maintain a linked list of possible vCPUs to
> > iterate and find the intended recipient.

Ya, the concern is that it's theoretically possible for the PINV to arrive in L0
after a different vCPU has been loaded (or even multiple different vCPUs).      
E.g. if the sending pCPU is hit with an NMI after checking vcpu->mode, and the  
NMI runs for some absurd amount of time.  If that happens, the PINV handler     
won't know which vCPU(s) should get an IRQ injected into L1 without additional  
tracking.  KVM would need to set something like nested.pi_pending before doing  
kvm_vcpu_trigger_posted_interrupt(), i.e. nothing really changes, it just gets  
more complex.

> > The process of removing vCPUs from the list where we caught the IPI
> > in L0 is quite clear, but it doesn't seem like we could ever know to
> > remove vCPUs from the list when hardware catches that IPI.

It's probably possible by shadowing the PI descriptor, but doing so would likely
wipe out the perf benefits of nested PI.

That being said, if we don't care about strictly adhering to the spec (sorry in
advance Jim), I think we could avoid nested.pi_pending if we're ok with KVM
processing virtual interrupts that technically shouldn't happen, yet.  E.g. if
the L0 PINV handler consumes vIRR bits that were set after the last PI from L1.
KVM already does that with nested.pi_pending, so it might not be the worst idea
in the world since the existing nested PI implementation mostly works.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-24  1:55                     ` Sean Christopherson
@ 2020-11-24  3:19                       ` Sean Christopherson
  2020-11-24 11:39                       ` Paolo Bonzini
  1 sibling, 0 replies; 57+ messages in thread
From: Sean Christopherson @ 2020-11-24  3:19 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Paolo Bonzini, idan.brown, Jim Mattson, kvm list, liam.merwick,
	wanpeng.li

On Tue, Nov 24, 2020 at 01:55:15AM +0000, Sean Christopherson wrote:
> On Mon, Nov 23, 2020 at 04:13:49PM -0800, Oliver Upton wrote:
> > On Mon, Nov 23, 2020 at 4:10 PM Oliver Upton <oupton@google.com> wrote:
> > >
> > > On Mon, Nov 23, 2020 at 2:42 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > >
> > > > On 23/11/20 20:22, Oliver Upton wrote:
> > > > > The pi_pending bit works rather well as it is only a hint to KVM that it
> > > > > may owe the guest a posted-interrupt completion. However, if we were to
> > > > > set the guest's nested PINV as pending in the L1 IRR it'd be challenging
> > > > > to infer whether or not it should actually be injected in L1 or result
> > > > > in posted-interrupt processing for L2.
> > > >
> > > > Stupid question: why does it matter?  The behavior when the PINV is
> > > > delivered does not depend on the time it enters the IRR, only on the
> > > > time that it enters ISR.  If that happens while the vCPU is in L2, it
> > > > would trigger posted interrupt processing; if PINV moves to ISR while in
> > > > L1, it would be delivered normally as an interrupt.
> > > >
> > > > There are various special cases but they should fall in place.  For
> > > > example, if PINV is delivered during L1 vmentry (with IF=0), it would be
> > > > delivered at the next inject_pending_event when the VMRUN vmexit is
> > > > processed and interrupts are unmasked.
> > > >
> > > > The tricky case is when L0 tries to deliver the PINV to L1 as a posted
> > > > interrupt, i.e. in vmx_deliver_nested_posted_interrupt.  Then the
> > > >
> > > >                  if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
> > > >                          kvm_vcpu_kick(vcpu);
> > > >
> > > > needs a tweak to fall back to setting the PINV in L1's IRR:
> > > >
> > > >                  if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
> > > >                          /* set PINV in L1's IRR */
> > > >                         kvm_vcpu_kick(vcpu);
> > > >                 }
> > >
> > > Yeah, I think that's fair. Regardless, the pi_pending bit should've
> > > only been set if the IPI was actually sent. Though I suppose
> > 
> > Didn't finish my thought :-/
> > 
> > Though I suppose pi_pending was set unconditionally (and skipped the
> > IRR) since until recently KVM completely bungled handling the PINV
> > correctly when in the L1 IRR.
> > 
> > >
> > > > but you also have to do the same *in the PINV handler*
> > > > sysvec_kvm_posted_intr_nested_ipi too, to handle the case where the
> > > > L2->L0 vmexit races against sending the IPI.
> > >
> > > Indeed, there is a race but are we assured that the target vCPU thread
> > > is scheduled on the target CPU when that IPI arrives?
> > >
> > > I believe there is a 1-to-many relationship here, which is why I said
> > > each CPU would need to maintain a linked list of possible vCPUs to
> > > iterate and find the intended recipient.
> 
> Ya, the concern is that it's theoretically possible for the PINV to arrive in L0
> after a different vCPU has been loaded (or even multiple different vCPUs).      
> E.g. if the sending pCPU is hit with an NMI after checking vcpu->mode, and the  
> NMI runs for some absurd amount of time.  If that happens, the PINV handler     
> won't know which vCPU(s) should get an IRQ injected into L1 without additional  
> tracking.  KVM would need to set something like nested.pi_pending before doing  
> kvm_vcpu_trigger_posted_interrupt(), i.e. nothing really changes, it just gets  
> more complex.

If we do want to do something fancy with the L0 PINV handler, I think we could
avoid a list by having the dest vCPU busy wait before exiting guest_mode if
there is one or more PINV IPIs that are in the process of being sent, and then
wait again in vcpu_put() for the PINV to be handled.  The latter in particular
would rarely come into play, i.e. shouldn't impact performance.  That would
prevent having a somewhat unbounded set of vCPUs that may or may not have an
outstanding PINV.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-24  0:10                 ` Oliver Upton
  2020-11-24  0:13                   ` Oliver Upton
@ 2020-11-24 11:09                   ` Paolo Bonzini
  2020-12-03 21:45                     ` Jim Mattson
  1 sibling, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2020-11-24 11:09 UTC (permalink / raw)
  To: Oliver Upton
  Cc: idan.brown, Jim Mattson, kvm list, liam.merwick, wanpeng.li,
	Sean Christopherson

On 24/11/20 01:10, Oliver Upton wrote:
>> but you also have to do the same*in the PINV handler*
>> sysvec_kvm_posted_intr_nested_ipi too, to handle the case where the
>> L2->L0 vmexit races against sending the IPI.
> Indeed, there is a race but are we assured that the target vCPU thread
> is scheduled on the target CPU when that IPI arrives?

This would only happen if the source vCPU saw vcpu->mode == 
IN_GUEST_MODE for the target vCPU.  Thus there are three cases:

1) the vCPU is in non-root mode.  This is easy. :)

2) the vCPU hasn't entered the VM yet.  Then posted interrupt IPIs are 
delayed after guest entry and ensured to result in virtual interrupt 
delivery, just like case 1.

3) the vCPU has left the VM but it hasn't reached

         vcpu->mode = OUTSIDE_GUEST_MODE;
         smp_wmb();

yet.  Then the interrupt will arrive right after that moment, at:

         kvm_before_interrupt(vcpu);
         local_irq_enable();
         ++vcpu->stat.exits;
         local_irq_disable();
         kvm_after_interrupt(vcpu);

Anything else will cause kvm_vcpu_trigger_posted_interrupt(vcpu, true) 
to return false instead of sending an IPI.

Paolo


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-24  1:55                     ` Sean Christopherson
  2020-11-24  3:19                       ` Sean Christopherson
@ 2020-11-24 11:39                       ` Paolo Bonzini
  2020-11-24 21:22                         ` Sean Christopherson
  2020-12-03 22:07                         ` Jim Mattson
  1 sibling, 2 replies; 57+ messages in thread
From: Paolo Bonzini @ 2020-11-24 11:39 UTC (permalink / raw)
  To: Sean Christopherson, Oliver Upton
  Cc: idan.brown, Jim Mattson, kvm list, liam.merwick, wanpeng.li

On 24/11/20 02:55, Sean Christopherson wrote:
>>> I believe there is a 1-to-many relationship here, which is why I said
>>> each CPU would need to maintain a linked list of possible vCPUs to
>>> iterate and find the intended recipient.
> 
> Ya, the concern is that it's theoretically possible for the PINV to arrive in L0
> after a different vCPU has been loaded (or even multiple different vCPUs).
> E.g. if the sending pCPU is hit with an NMI after checking vcpu->mode, and the
> NMI runs for some absurd amount of time.  If that happens, the PINV handler
> won't know which vCPU(s) should get an IRQ injected into L1 without additional
> tracking.  KVM would need to set something like nested.pi_pending before doing
> kvm_vcpu_trigger_posted_interrupt(), i.e. nothing really changes, it just gets
> more complex.

Ah, gotcha.  What if IN_GUEST_MODE/OUTSIDE_GUEST_MODE was replaced by a 
generation count?  Then you reread vcpu->mode after sending the IPI, and 
retry if it does not match.

To guarantee atomicity, the OUTSIDE_GUEST_MODE IN_GUEST_MODE 
EXITING_GUEST_MODE READING_SHADOW_PAGE_TABLES values would remain in the 
bottom 2 bits.  That is, the vcpu->mode accesses like

	vcpu->mode = IN_GUEST_MODE;

	vcpu->mode = OUTSIDE_GUEST_MODE;

	smp_store_mb(vcpu->mode, READING_SHADOW_PAGE_TABLES);

	smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE);

	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);

becoming

	enum {
		OUTSIDE_GUEST_MODE,
		IN_GUEST_MODE,
		EXITING_GUEST_MODE,
		READING_SHADOW_PAGE_TABLES,
		GUEST_MODE_MASK = 3,
	};

	vcpu->mode = (vcpu->mode | GUEST_MODE_MASK) + 1 + IN_GUEST_MODE;

	vcpu->mode &= ~GUEST_MODE_MASK;

	smp_store_mb(vcpu->mode, vcpu->mode|READING_SHADOW_PAGE_TABLES);

	smp_store_release(&vcpu->mode, vcpu->mode & ~GUEST_MODE_MASK);

	int x = READ_ONCE(vcpu->mode);
	do {
		if ((x & GUEST_MODE_MASK) != IN_GUEST_MODE)
			return x & GUEST_MODE_MASK;
	} while (!try_cmpxchg(&vcpu->mode, &x,
			      x ^ IN_GUEST_MODE ^ EXITING_GUEST_MODE))
	return IN_GUEST_MODE;

You could still get spurious posted interrupt IPIs, but the IPI handler 
need not do anything at all and that is much better.
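
As a self-contained illustration of just the bit layout above (ignoring all
the memory-ordering details, which are the hard part), the encoding behaves
like this; the helper names are invented for the sketch:

```c
enum {
	OUTSIDE_GUEST_MODE,
	IN_GUEST_MODE,
	EXITING_GUEST_MODE,
	READING_SHADOW_PAGE_TABLES,
	GUEST_MODE_MASK = 3,
};

/* Bump the generation (upper bits) and set IN_GUEST_MODE (low 2 bits):
 * (mode | GUEST_MODE_MASK) + 1 carries into the generation bits and
 * zeroes the mode bits, then + IN_GUEST_MODE sets the new mode. */
static unsigned int mode_enter_guest(unsigned int mode)
{
	return (mode | GUEST_MODE_MASK) + 1 + IN_GUEST_MODE;
}

/* Clear the mode bits on exit while preserving the generation. */
static unsigned int mode_leave_guest(unsigned int mode)
{
	return mode & ~GUEST_MODE_MASK;
}
```

A sender that rereads vcpu->mode after the IPI can then detect a missed
entry/exit by comparing the generation (the bits above GUEST_MODE_MASK).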

> if we're ok with KVM
> processing virtual interrupts that technically shouldn't happen, yet.  E.g. if
> the L0 PINV handler consumes vIRR bits that were set after the last PI from L1.

I actually find it curious that the spec promises posted interrupt 
processing to be triggered only by the arrival of the posted interrupt 
IPI.  Why couldn't the processor in principle snoop for the address of 
the ON bit instead, similar to an MWAIT?

But even without that, I don't think the spec promises that kind of 
strict ordering with respect to what goes on in the source.  Even though 
posted interrupt processing is atomic with the acknowledgement of the 
posted interrupt IPI, the spec only promises that the PINV triggers an 
_eventual_ scan of PID.PIR when the interrupt controller delivers an 
unmasked external interrupt to the destination CPU.  You can still have 
something like

	set PID.PIR[100]
	set PID.ON
					processor starts executing a
					 very slow instruction...
	send PINV
	set PID.PIR[200]
					acknowledge PINV

and then vector 200 would be delivered before vector 100.  Of course 
with nested PI the effect would be amplified, but it's possible even on 
bare metal.
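
That interleaving can be replayed with a purely sequential model of the PID:
acknowledging the PINV snapshots-and-clears PIR, and the resulting scan
delivers the highest vector first, so vector 200 lands before vector 100 even
though it was posted after the IPI was sent. (The struct and function names
below are invented for the sketch.)

```c
#include <stdbool.h>

#define NR_VECTORS 256

/* Minimal model of a posted-interrupt descriptor. */
struct pid_model {
	bool pir[NR_VECTORS];
	bool on;
};

static int delivered[NR_VECTORS];
static int ndelivered;

/* PINV acknowledgement: clear ON, move PIR to the vIRR, and record
 * pending vectors highest-first, as the APIC prioritizes them. */
static void ack_pinv(struct pid_model *pid)
{
	int v;

	pid->on = false;
	for (v = NR_VECTORS - 1; v >= 0; v--) {
		if (pid->pir[v]) {
			pid->pir[v] = false;
			delivered[ndelivered++] = v;
		}
	}
}
```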

Paolo


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-24 11:39                       ` Paolo Bonzini
@ 2020-11-24 21:22                         ` Sean Christopherson
  2020-11-25  0:10                           ` Paolo Bonzini
  2020-12-03 22:07                         ` Jim Mattson
  1 sibling, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2020-11-24 21:22 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Oliver Upton, idan.brown, Jim Mattson, kvm list, liam.merwick,
	wanpeng.li

On Tue, Nov 24, 2020, Paolo Bonzini wrote:
> On 24/11/20 02:55, Sean Christopherson wrote:
> > > > I believe there is a 1-to-many relationship here, which is why I said
> > > > each CPU would need to maintain a linked list of possible vCPUs to
> > > > iterate and find the intended recipient.
> > 
> > Ya, the concern is that it's theoretically possible for the PINV to arrive in L0
> > after a different vCPU has been loaded (or even multiple different vCPUs).
> > E.g. if the sending pCPU is hit with an NMI after checking vcpu->mode, and the
> > NMI runs for some absurd amount of time.  If that happens, the PINV handler
> > won't know which vCPU(s) should get an IRQ injected into L1 without additional
> > tracking.  KVM would need to set something like nested.pi_pending before doing
> > kvm_vcpu_trigger_posted_interrupt(), i.e. nothing really changes, it just gets
> > more complex.
> 
> Ah, gotcha.  What if IN_GUEST_MODE/OUTSIDE_GUEST_MODE was replaced by a
> generation count?  Then you reread vcpu->mode after sending the IPI, and
> retry if it does not match.
> 
> To guarantee atomicity, the OUTSIDE_GUEST_MODE IN_GUEST_MODE
> EXITING_GUEST_MODE READING_SHADOW_PAGE_TABLES values would remain in the
> bottom 2 bits.  That is, the vcpu->mode accesses like
> 
> 	vcpu->mode = IN_GUEST_MODE;
> 
> 	vcpu->mode = OUTSIDE_GUEST_MODE;
> 
> 	smp_store_mb(vcpu->mode, READING_SHADOW_PAGE_TABLES);
> 
> 	smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE);
> 
> 	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
> 
> becoming
> 
> 	enum {
> 		OUTSIDE_GUEST_MODE,
> 		IN_GUEST_MODE,
> 		EXITING_GUEST_MODE,
> 		READING_SHADOW_PAGE_TABLES,
> 		GUEST_MODE_MASK = 3,
> 	};
> 
> 	vcpu->mode = (vcpu->mode | GUEST_MODE_MASK) + 1 + IN_GUEST_MODE;
> 
> 	vcpu->mode &= ~GUEST_MODE_MASK;
> 
> 	smp_store_mb(vcpu->mode, vcpu->mode|READING_SHADOW_PAGE_TABLES);
> 
> 	smp_store_release(&vcpu->mode, vcpu->mode & ~GUEST_MODE_MASK);
> 
> 	int x = READ_ONCE(vcpu->mode);
> 	do {
> 		if ((x & GUEST_MODE_MASK) != IN_GUEST_MODE)
> 			return x & GUEST_MODE_MASK;
> 	} while (!try_cmpxchg(&vcpu->mode, &x,
> 			      x ^ IN_GUEST_MODE ^ EXITING_GUEST_MODE))
> 	return IN_GUEST_MODE;
> 
> You could still get spurious posted interrupt IPIs, but the IPI handler need
> not do anything at all and that is much better.

This doesn't handle the case where the PINV arrives in L0 after VM-Exit but
before the vCPU clears IN_GUEST_MODE.  The sender will have seen IN_GUEST_MODE
and so won't retry the IPI, but hardware didn't process the PINV as a
posted-interrupt.  I.e. the L0 PINV handler still needs an indicator a la
nested.pi_pending to know that it should stuff an IRQ into L1's vIRR.

> > if we're ok with KVM
> > processing virtual interrupts that technically shouldn't happen, yet.  E.g. if
> > the L0 PINV handler consumes vIRR bits that were set after the last PI from L1.
> 
> I actually find it curious that the spec promises posted interrupt
> processing to be triggered only by the arrival of the posted interrupt IPI.
> Why couldn't the processor in principle snoop for the address of the ON bit
> instead, similar to an MWAIT?

It would lead to false positives and missed IRQs.  PI processing would fire on
writing the PI _cache line_, not just on writes to PI.ON.  I suspect MONITOR is
also triggered on request-for-EXCLUSIVE and not just writes, i.e. on speculative
behavior, but I forget if that's actually the case.

Regardless, a write to any part of the PI would trip the monitoring, and then
all subsequent writes would be missed, e.g. other package writes PI.IRR then
PI.ON, CPU PI processing triggers on the PI.IRR write but not PI.ON write.  The
target CPU (the running vCPU) would have to constantly rearm the monitor, but
even then there would always be a window where a write would get missed.

> But even without that, I don't think the spec promises that kind of strict
> ordering with respect to what goes on in the source.  Even though posted
> interrupt processing is atomic with the acknowledgement of the posted
> interrupt IPI, the spec only promises that the PINV triggers an _eventual_
> scan of PID.PIR when the interrupt controller delivers an unmasked external
> interrupt to the destination CPU.  You can still have something like
> 
> 	set PID.PIR[100]
> 	set PID.ON
> 					processor starts executing a
> 					 very slow instruction...
> 	send PINV
> 	set PID.PIR[200]
> 					acknowledge PINV
> 
> and then vector 200 would be delivered before vector 100.  Of course with
> nested PI the effect would be amplified, but it's possible even on bare
> metal.

Jim was concerned that L1 could poll the PID to determine whether or not
PID.PIR[200] should be seen in L2.  The whole PIR is copied to the vIRR after
PID.ON is cleared and the auto-EOI is done, and the read->clear is atomic.  So the
above sequence where PINV is acknowledged after PID.PIR[200] is legal, but
processing PIR bits that are set after the PIR is observed to be cleared would
be illegal.  E.g. if L1 did this

	set PID.PIR[100]
	set PID.ON
	send PINV
	while (PID.PIR)
	set PID.PIR[200]
	set PID.ON

then L2 should never observe vector 200.  KVM violates this because
nested.pi_pending is left set even if PINV is handled as a posted interrupt, and
KVM's processing of nested.pi_pending will see the second PID.ON and incorrectly
do PI processing in software.  This is the part that is likely impossible to
solve without shadowing the PID (which, for the record, I have zero desire to do).

It seems extremely unlikely any guest will puke on the above, as I can't
imagine a use case for setting PID.PIR + PID.ON without triggering a PINV, but
it's technically bad behavior in KVM.
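
The violation can also be replayed sequentially: because nested.pi_pending
survives a PINV that hardware already handled, KVM's software path later
consumes the second PID.ON and delivers vector 200 to L2. (All names below
are invented for the model; this is not KVM code.)

```c
#include <stdbool.h>

#define NR_VECTORS 256

struct pid_model {
	bool pir[NR_VECTORS];
	bool on;
};

static bool pi_pending;			/* models vmx->nested.pi_pending */
static int delivered_to_l2[NR_VECTORS];
static int nr_delivered;

/* Clear ON and move PIR into the (modeled) vIRR, highest-first. */
static void scan_pir(struct pid_model *pid)
{
	int v;

	pid->on = false;
	for (v = NR_VECTORS - 1; v >= 0; v--) {
		if (pid->pir[v]) {
			pid->pir[v] = false;
			delivered_to_l2[nr_delivered++] = v;
		}
	}
}

/* Hardware handles the PINV as a real posted interrupt; the stale
 * hint is left behind, which is the bug under discussion. */
static void hw_handle_pinv(struct pid_model *pid)
{
	scan_pir(pid);
	/* pi_pending is intentionally NOT cleared here. */
}

/* KVM's software processing on the next nested VM-entry. */
static void kvm_process_pi(struct pid_model *pid)
{
	if (pi_pending && pid->on)
		scan_pir(pid);
	pi_pending = false;
}
```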

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-24 21:22                         ` Sean Christopherson
@ 2020-11-25  0:10                           ` Paolo Bonzini
  2020-11-25  1:14                             ` Sean Christopherson
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2020-11-25  0:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Oliver Upton, idan.brown, Jim Mattson, kvm list, liam.merwick,
	wanpeng.li

On 24/11/20 22:22, Sean Christopherson wrote:
>> What if IN_GUEST_MODE/OUTSIDE_GUEST_MODE was replaced by a
>> generation count?  Then you reread vcpu->mode after sending the IPI, and
>> retry if it does not match.
> The sender will have seen IN_GUEST_MODE and so won't retry the IPI,
> but hardware didn't process the PINV as a posted-interrupt.
Uff, of course.

That said, it may still be a good idea to keep pi_pending as a very
short-lived flag, used only to handle this case, and maybe not even need the
generation count (late here so put up with me if it's wrong :)).  The
flag would not have to live past vmx_vcpu_run even; the vIRR[PINV] bit
would be the primary marker that a nested posted interrupt is pending.

>>> if we're ok with KVM
>>> processing virtual interrupts that technically shouldn't happen, yet.  E.g. if
>>> the L0 PINV handler consumes vIRR bits that were set after the last PI from L1.
>>
>> I actually find it curious that the spec promises posted interrupt
>> processing to be triggered only by the arrival of the posted interrupt IPI.
>> Why couldn't the processor in principle snoop for the address of the ON bit
>> instead, similar to an MWAIT?
> 
> It would lead to false positives and missed IRQs.

Not to missed IRQs---false positives on the monitor would be possible, 
but you would still have to send a posted interrupt IPI.  The weird 
promise is that the PINV interrupt is the _only_ trigger for posted 
interrupts.

>> But even without that, I don't think the spec promises that kind of strict
>> ordering with respect to what goes on in the source.  Even though posted
>> interrupt processing is atomic with the acknowledgement of the posted
>> interrupt IPI, the spec only promises that the PINV triggers an _eventual_
>> scan of PID.PIR when the interrupt controller delivers an unmasked external
>> interrupt to the destination CPU.  You can still have something like
>>
>> 	set PID.PIR[100]
>> 	set PID.ON
>> 					processor starts executing a
>> 					 very slow instruction...
>> 	send PINV
>> 	set PID.PIR[200]
>> 					acknowledge PINV
>>
>> and then vector 200 would be delivered before vector 100.  Of course with
>> nested PI the effect would be amplified, but it's possible even on bare
>> metal.
> 
> Jim was concerned that L1 could poll the PID to determine whether or not
> PID.PIR[200] should be seen in L2.  The whole PIR is copied to the vIRR after
> PID.ON is cleared and the auto-EOI is done, and the read->clear is atomic.  So the
> above sequence, where PINV is acknowledged after PID.PIR[200] is set, is legal, but
> processing PIR bits that are set after the PIR is observed to be cleared would
> be illegal.

That would be another case of the unnecessarily specific promise above.

> E.g. if L1 did this
> 
> 	set PID.PIR[100]
> 	set PID.ON
> 	send PINV
> 	while (PID.PIR)
> 	set PID.PIR[200]
> 	set PID.ON
>
> This is the part that is likely impossible to
> solve without shadowing the PID (which, for the record, I have zero desire to do).

Neither do I. :)  But technically the SDM doesn't promise reading the 
whole 256 bits at the same time.  Perhaps that's the only way it can 
work in practice due to the cache coherency protocols, but the SDM only 
promises atomicity of the read and clear of "a single PIR bit (or group 
of bits)".  So there's in principle no reason why the target CPU 
couldn't clear PID.PIR[100], and then L1 would sneak in and set 
PID.PIR[200].

Paolo

> It seems extremely unlikely any guest will puke on the above, as I can't imagine
> there's a use case for setting a PID.PIR bit + PID.ON without triggering PINV, but it's
> technically bad behavior in KVM.
> 



* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-25  0:10                           ` Paolo Bonzini
@ 2020-11-25  1:14                             ` Sean Christopherson
  2020-11-25 17:00                               ` Paolo Bonzini
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2020-11-25  1:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Oliver Upton, idan.brown, Jim Mattson, kvm list, liam.merwick,
	wanpeng.li

On Wed, Nov 25, 2020, Paolo Bonzini wrote:
> On 24/11/20 22:22, Sean Christopherson wrote:
> > > What if IN_GUEST_MODE/OUTSIDE_GUEST_MODE was replaced by a
> > > generation count?  Then you reread vcpu->mode after sending the IPI, and
> > > retry if it does not match.
> > The sender will have seen IN_GUEST_MODE and so won't retry the IPI,
> > but hardware didn't process the PINV as a posted-interrupt.
> Uff, of course.
> 
> That said, it may still be a good idea to keep pi_pending as a very
> short-lived flag, used only to handle this case, and maybe not even need the
> generation count (late here so put up with me if it's wrong :)).  The flag
> would not have to live past vmx_vcpu_run even; the vIRR[PINV] bit would be
> the primary marker that a nested posted interrupt is pending.

I 100% agree that would be ideal for nested state migration, as it would avoid
polluting the ABI with nested.pi_pending, but the downside is that it would mean
KVM could deliver a physical interrupt twice: once to L2 and once to L1.  E.g.

	while (READ_ONCE(vmx->nested.pi_pending) && PID.ON) {
		vmx->nested.pi_pending = false;
		vIRR.PINV = 1;
	}

would incorrectly set vIRR.PINV in the case where hardware handled the PI, and
that could result in L1 seeing the interrupt if a nested exit occurred before KVM
processed vIRR.PINV for L2.  Note, without PID.ON, the behavior would be really
bad as KVM would set vIRR.PINV *every* time hardware handled the PINV.

The current behavior just means KVM is more greedy with respect to processing
PID.PIR than it technically should be.  Not sure if that distinction is worth
carrying nested.pi_pending.

> > > > if we're ok with KVM
> > > > processing virtual interrupts that technically shouldn't happen, yet.  E.g. if
> > > > the L0 PINV handler consumes vIRR bits that were set after the last PI from L1.
> > > 
> > > I actually find it curious that the spec promises posted interrupt
> > > processing to be triggered only by the arrival of the posted interrupt IPI.
> > > Why couldn't the processor in principle snoop for the address of the ON bit
> > > instead, similar to an MWAIT?
> > 
> > It would lead to false positives and missed IRQs.
> 
> Not to missed IRQs---false positives on the monitor would be possible, but
> you would still have to send a posted interrupt IPI.  The weird promise is
> that the PINV interrupt is the _only_ trigger for posted interrupts.

Ah, I misunderstood the original "only".  I suspect the primary reason is that
it would cost uops to do the snoop thing and would be inefficient in practice.
The entire PID is guaranteed to be in a single cache line, and most (all?) flows
will write PIR before ON.  So snooping the PID will detect the PIR write, bounce
the cache line by reading it, burn some uops checking that ON isn't set[*], and
then do ???

[*] The pseudocode in the SDM doesn't actually state that the CPU checks PID.ON,
    it only says it clears PID.ON, i.e. assumes that PID.ON is set.  That seems
    like an SDM bug.

> > > But even without that, I don't think the spec promises that kind of strict
> > > ordering with respect to what goes on in the source.  Even though posted
> > > interrupt processing is atomic with the acknowledgement of the posted
> > > interrupt IPI, the spec only promises that the PINV triggers an _eventual_
> > > scan of PID.PIR when the interrupt controller delivers an unmasked external
> > > interrupt to the destination CPU.  You can still have something like
> > > 
> > > 	set PID.PIR[100]
> > > 	set PID.ON
> > > 					processor starts executing a
> > > 					 very slow instruction...
> > > 	send PINV
> > > 	set PID.PIR[200]
> > > 					acknowledge PINV
> > > 
> > > and then vector 200 would be delivered before vector 100.  Of course with
> > > nested PI the effect would be amplified, but it's possible even on bare
> > > metal.
> > 
> > Jim was concerned that L1 could poll the PID to determine whether or not
> > PID.PIR[200] should be seen in L2.  The whole PIR is copied to the vIRR after
> > PID.ON is cleared and the auto-EOI is done, and the read->clear is atomic.  So the
> > above sequence, where PINV is acknowledged after PID.PIR[200] is set, is legal, but
> > processing PIR bits that are set after the PIR is observed to be cleared would
> > be illegal.
> 
> That would be another case of the unnecessarily specific promise above.
> 
> > E.g. if L1 did this
> > 
> > 	set PID.PIR[100]
> > 	set PID.ON
> > 	send PINV
> > 	while (PID.PIR)
> > 	set PID.PIR[200]
> > 	set PID.ON
> > 
> > This is the part that is likely impossible to
> > solve without shadowing the PID (which, for the record, I have zero desire to do).
> 
> Neither do I. :)  But technically the SDM doesn't promise reading the whole
> 256 bits at the same time.

Hrm, the wording is poor, but my interpretation of this blurb is that the CPU
somehow has a death grip on the PID cache line while it's reading and clearing
the PIR.

  5. The logical processor performs a logical-OR of PIR into VIRR and clears PIR.
     No other agent can read or write a PIR bit (or group of bits) between the
     time it is read (to determine what to OR into VIRR) and when it is cleared.

> Perhaps that's the only way it can work in
> practice due to the cache coherency protocols, but the SDM only promises
> atomicity of the read and clear of "a single PIR bit (or group of bits)".
> So there's in principle no reason why the target CPU couldn't clear
> PID.PIR[100], and then L1 would sneak in and set PID.PIR[200].
> 
> Paolo
> 
> > It seems extremely unlikely any guest will puke on the above, as I can't imagine
> > there's a use case for setting a PID.PIR bit + PID.ON without triggering PINV, but it's
> > technically bad behavior in KVM.
> > 
> 


* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-25  1:14                             ` Sean Christopherson
@ 2020-11-25 17:00                               ` Paolo Bonzini
  2020-11-25 18:32                                 ` Sean Christopherson
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2020-11-25 17:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Oliver Upton, idan.brown, Jim Mattson, kvm list, liam.merwick,
	wanpeng.li

On 25/11/20 02:14, Sean Christopherson wrote:
>> The flag
>> would not have to live past vmx_vcpu_run even, the vIRR[PINV] bit would be
>> the primary marker that a nested posted interrupt is pending.
>
> 	while (READ_ONCE(vmx->nested.pi_pending) && PID.ON) {
> 		vmx->nested.pi_pending = false;
> 		vIRR.PINV = 1;
> 	}
> 
> would incorrectly set vIRR.PINV in the case where hardware handled the PI, and
> > that could result in L1 seeing the interrupt if a nested exit occurred before KVM
> processed vIRR.PINV for L2.  Note, without PID.ON, the behavior would be really
> bad as KVM would set vIRR.PINV *every* time hardware handled the PINV.

It doesn't have to be a while loop, since by the time we get here 
vcpu->mode is not IN_GUEST_MODE anymore.  To avoid the double PINV 
delivery, we could process the PID as in 
vmx_complete_nested_posted_interrupt in this particular case---but 
vmx_complete_nested_posted_interrupt would be moved from vmentry to 
vmexit, and the common case would use vIRR.PINV instead.  There would 
still be double processing, but it would solve the migration problem in 
a relatively elegant manner.

>> The weird promise is
>> that the PINV interrupt is the _only_ trigger for posted interrupts.
> 
> Ah, I misunderstood the original "only".  I suspect the primary reason is that
> it would cost uops to do the snoop thing and would be inefficient in practice.

Yes, I agree.  But again, the spec seems to be unnecessarily restrictive.

>>> This is the part that is likely impossible to
>>> solve without shadowing the PID (which, for the record, I have zero desire to do).
>> Neither do I. :)  But technically the SDM doesn't promise reading the whole
>> 256 bits at the same time.
>
> Hrm, the wording is poor, but my interpretation of this blurb is that the CPU
> somehow has a death grip on the PID cache line while it's reading and clearing
> the PIR.
> 
>    5. The logical processor performs a logical-OR of PIR into VIRR and clears PIR.
>       No other agent can read or write a PIR bit (or group of bits) between the
>       time it is read (to determine what to OR into VIRR) and when it is cleared.

Yeah, that's the part I interpreted as other processors possibly being 
able to see a partially updated version.  Of course in practice the 
processor will be doing everything atomically, but the more restrictive 
reading of the spec all but precludes a software implementation.

Paolo



* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-25 17:00                               ` Paolo Bonzini
@ 2020-11-25 18:32                                 ` Sean Christopherson
  2020-11-26 13:13                                   ` Paolo Bonzini
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2020-11-25 18:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Oliver Upton, Jim Mattson, kvm list, liam.merwick, wanpeng.li

-Idan to stop getting bounces.

On Wed, Nov 25, 2020, Paolo Bonzini wrote:
> On 25/11/20 02:14, Sean Christopherson wrote:
> > > The flag
> > > would not have to live past vmx_vcpu_run even, the vIRR[PINV] bit would be
> > > the primary marker that a nested posted interrupt is pending.
> > 
> > 	while (READ_ONCE(vmx->nested.pi_pending) && PID.ON) {
> > 		vmx->nested.pi_pending = false;
> > 		vIRR.PINV = 1;
> > 	}
> > 
> > would incorrectly set vIRR.PINV in the case where hardware handled the PI, and
> > that could result in L1 seeing the interrupt if a nested exit occurred before KVM
> > processed vIRR.PINV for L2.  Note, without PID.ON, the behavior would be really
> > bad as KVM would set vIRR.PINV *every* time hardware handled the PINV.
> 
> It doesn't have to be a while loop, since by the time we get here vcpu->mode
> is not IN_GUEST_MODE anymore.

Hrm, bad loop logic on my part.  I'm pretty sure the exiting vCPU needs to wait
for all senders to finish their sequence, otherwise pi_pending could be left
set, but spinning on pi_pending is wrong.  Your generation counter thing may
also work, but that made my brain hurt too much to work through the logic. :-)

Something like this?

static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
						int vector)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	if (is_guest_mode(vcpu) &&
	    vector == vmx->nested.posted_intr_nv) {
		/* Write a comment. */
		vmx->nested.pi_sending_count++;
		smp_wmb();
		if (kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
			vmx->nested.pi_pending = true;
		} else {
			<set PINV in L1 vIRR>
			kvm_make_request(KVM_REQ_EVENT, vcpu);
			kvm_vcpu_kick(vcpu);
		}
		smp_wmb();
		vmx->nested.pi_sending_count--;
		return 0;
	}
	return -1;
}

static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
	...

	/* The actual VMENTER/EXIT is in the .noinstr.text section. */
	vmx_vcpu_enter_exit(vcpu, vmx);

	...

	if (is_guest_mode(vcpu) {
		while (READ_ONCE(vmx->nested.pi_sending_count));

		vmx_complete_nested_posted_interrupt(vcpu);
	}

	...
}

> To avoid the double PINV delivery, we could process the PID as in
> vmx_complete_nested_posted_interrupt in this particular case---but
> vmx_complete_nested_posted_interrupt would be moved from vmentry to vmexit,
> and the common case would use vIRR.PINV instead.  There would still be double
> processing, but it would solve the migration problem in a relatively elegant
> manner.

I like this idea, a lot.  I'm a-ok with KVM processing more PIRs than the
SDM may or may not technically allow.

Jim, any objections?


* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-25 18:32                                 ` Sean Christopherson
@ 2020-11-26 13:13                                   ` Paolo Bonzini
  2020-11-30 19:14                                     ` Sean Christopherson
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Bonzini @ 2020-11-26 13:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Oliver Upton, Jim Mattson, kvm list, liam.merwick, wanpeng.li

On 25/11/20 19:32, Sean Christopherson wrote:
> I'm pretty sure the exiting vCPU needs to wait
> for all senders to finish their sequence, otherwise pi_pending could be left
> set, but spinning on pi_pending is wrong.

What if you set it before?

static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
						int vector)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	if (is_guest_mode(vcpu) &&
	    vector == vmx->nested.posted_intr_nv) {
		/*
		 * Set pi_pending after ON.
		 */
		smp_store_release(&vmx->nested.pi_pending, true);
		if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
			/*
			 * The guest was not running, let's try again
			 * on the next vmentry.
			 */
			<set PINV in L1 vIRR>
			kvm_make_request(KVM_REQ_EVENT, vcpu);
			kvm_vcpu_kick(vcpu);
			vmx->nested.pi_pending = false;
		}
		return 0;
	}
	return -1;
}

On top of this:

- kvm_x86_ops.hwapic_irr_update can be deleted.  It is already done 
unconditionally by vmx_sync_pir_to_irr before every vmentry.  This gives 
more freedom in changing vmx_sync_pir_to_irr and vmx_hwapic_irr_update.

- VCPU entry must check if max_irr == vmx->nested.posted_intr_nv, and if 
so send a POSTED_INTR_NESTED_VECTOR self-IPI.

Combining both (and considering that AMD doesn't do anything interesting 
in vmx_sync_pir_to_irr), I would move the whole call to 
vmx_sync_pir_to_irr from x86.c to vmx/vmx.c, so that we know that 
vmx_hwapic_irr_update is called with interrupts disabled and right 
before vmentry:

  static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
  {
	...
-	vmx_hwapic_irr_update(vcpu, max_irr);
         return max_irr;
  }

-static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu)
  {
+	int max_irr;
+
+	WARN_ON(!irqs_disabled());
+	max_irr = vmx_sync_pir_to_irr(vcpu);
         if (!is_guest_mode(vcpu))
                 vmx_set_rvi(max_irr);
+	else if (max_irr == vmx->nested.posted_intr_nv) {
+		...
+	}
  }

and in vmx_vcpu_run:

+	if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active)
+		vmx_hwapic_irr_update(vcpu);


If you agree, feel free to send this (without the else of course) as a 
separate cleanup patch immediately.

Paolo



* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-26 13:13                                   ` Paolo Bonzini
@ 2020-11-30 19:14                                     ` Sean Christopherson
  2020-11-30 19:36                                       ` Paolo Bonzini
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2020-11-30 19:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Oliver Upton, Jim Mattson, kvm list, liam.merwick, wanpeng.li

On Thu, Nov 26, 2020, Paolo Bonzini wrote:
> On 25/11/20 19:32, Sean Christopherson wrote:
> > I'm pretty sure the exiting vCPU needs to wait
> > for all senders to finish their sequence, otherwise pi_pending could be left
> > set, but spinning on pi_pending is wrong.
> 
> What if you set it before?

That doesn't help.  nested.pi_pending will be left set, with a valid vIRQ in the
PID, after vmx_vcpu_run() if kvm_vcpu_trigger_posted_interrupt() succeeds but
the PINV is delivered in the host.

Side topic, for the "wait" sequence to work, vmx_vcpu_run() would need to do
kvm_vcpu_exiting_guest_mode() prior to waiting for senders to complete their
sequence.

> 
> static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
> 						int vector)
> {
> 	struct vcpu_vmx *vmx = to_vmx(vcpu);
> 
> 	if (is_guest_mode(vcpu) &&
> 	    vector == vmx->nested.posted_intr_nv) {
> 		/*
> 		 * Set pi_pending after ON.
> 		 */
> 		smp_store_release(&vmx->nested.pi_pending, true);
> 		if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true)) {
> 			/*
> 			 * The guest was not running, let's try again
> 			 * on the next vmentry.
> 			 */
> 			<set PINV in L1 vIRR>
> 			kvm_make_request(KVM_REQ_EVENT, vcpu);
> 			kvm_vcpu_kick(vcpu);
> 			vmx->nested.pi_pending = false;
> 		}
> 		return 0;
> 	}
> 	return -1;
> }
> 
> On top of this:
> 
> - kvm_x86_ops.hwapic_irr_update can be deleted.  It is already done
> unconditionally by vmx_sync_pir_to_irr before every vmentry.  This gives
> more freedom in changing vmx_sync_pir_to_irr and vmx_hwapic_irr_update.

And it would lower the barrier to entry for understanding this code :-)

> - VCPU entry must check if max_irr == vmx->nested.posted_intr_nv, and if so
> send a POSTED_INTR_NESTED_VECTOR self-IPI.

Hmm, there's also this snippet in vmx_sync_pir_to_irr() that needs to be dealt
with.  If the new max_irr in this case is the nested PI vector, KVM will bail
from the run loop instead of continuing on. 

	/*
	 * If we are running L2 and L1 has a new pending interrupt
	 * which can be injected, we should re-evaluate
	 * what should be done with this new L1 interrupt.
	 * If L1 intercepts external-interrupts, we should
	 * exit from L2 to L1. Otherwise, interrupt should be
	 * delivered directly to L2.
	 */
	if (is_guest_mode(vcpu) && max_irr_updated) {
		if (nested_exit_on_intr(vcpu))
			kvm_vcpu_exiting_guest_mode(vcpu);
		else
			kvm_make_request(KVM_REQ_EVENT, vcpu);
	}

> Combining both (and considering that AMD doesn't do anything interesting in
> vmx_sync_pir_to_irr), I would move the whole call to vmx_sync_pir_to_irr
> from x86.c to vmx/vmx.c, so that we know that vmx_hwapic_irr_update is
> called with interrupts disabled and right before vmentry:
> 
>  static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
>  {
> 	...
> -	vmx_hwapic_irr_update(vcpu, max_irr);
>         return max_irr;
>  }
> 
> -static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
> +static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu)

I would also vote to rename this helper; not sure what to call it, but for me
the current name doesn't help understand its purpose.

>  {
> +	int max_irr;
> +
> +	WARN_ON(!irqs_disabled());
> +	max_irr = vmx_sync_pir_to_irr(vcpu);
>         if (!is_guest_mode(vcpu))
>                 vmx_set_rvi(max_irr);
> +	else if (max_irr == vmx->nested.posted_intr_nv) {
> +		...
> +	}
>  }
> 
> and in vmx_vcpu_run:
> 
> +	if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active)
> +		vmx_hwapic_irr_update(vcpu);

And also drop the direct call to vmx_sync_pir_to_irr() in the fastpath exit.

> If you agree, feel free to send this (without the else of course) as a
> separate cleanup patch immediately.

Without what "else"?


* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-30 19:14                                     ` Sean Christopherson
@ 2020-11-30 19:36                                       ` Paolo Bonzini
  0 siblings, 0 replies; 57+ messages in thread
From: Paolo Bonzini @ 2020-11-30 19:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Oliver Upton, Jim Mattson, kvm list, liam.merwick, wanpeng.li

On 30/11/20 20:14, Sean Christopherson wrote:
>> +	WARN_ON(!irqs_disabled());
>> +	max_irr = vmx_sync_pir_to_irr(vcpu);
>>          if (!is_guest_mode(vcpu))
>>                  vmx_set_rvi(max_irr);
>> +	else if (max_irr == vmx->nested.posted_intr_nv) {
>> +		...
>> +	}
>>   }
>>
>> and in vmx_vcpu_run:
>>
>> +	if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active)
>> +		vmx_hwapic_irr_update(vcpu);
> And also drop the direct call to vmx_sync_pir_to_irr() in the fastpath exit.
> 
>> If you agree, feel free to send this (without the else of course) as a
>> separate cleanup patch immediately.
> Without what "else"?

This one:

+	else if (max_irr == vmx->nested.posted_intr_nv) {
+		...

Paolo



* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-24 11:09                   ` Paolo Bonzini
@ 2020-12-03 21:45                     ` Jim Mattson
  0 siblings, 0 replies; 57+ messages in thread
From: Jim Mattson @ 2020-12-03 21:45 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Oliver Upton, Idan Brown, kvm list, liam.merwick, Wanpeng Li,
	Sean Christopherson

On Tue, Nov 24, 2020 at 3:09 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 24/11/20 01:10, Oliver Upton wrote:
> >> but you also have to do the same*in the PINV handler*
> >> sysvec_kvm_posted_intr_nested_ipi too, to handle the case where the
> >> L2->L0 vmexit races against sending the IPI.
> > Indeed, there is a race but are we assured that the target vCPU thread
> > is scheduled on the target CPU when that IPI arrives?
>
> This would only happen if the source vCPU saw vcpu->mode ==
> IN_GUEST_MODE for the target vCPU.  Thus there are three cases:
>
> 1) the vCPU is in non-root mode.  This is easy. :)
>
> 2) the vCPU hasn't entered the VM yet.  Then posted interrupt IPIs are
> delayed until after guest entry and guaranteed to result in virtual interrupt
> delivery, just like case 1.
>
> 3) the vCPU has left the VM but it hasn't reached
>
>          vcpu->mode = OUTSIDE_GUEST_MODE;
>          smp_wmb();
>
> yet.  Then the interrupt will be delivered right after that moment, at:
>
>          kvm_before_interrupt(vcpu);
>          local_irq_enable();
>          ++vcpu->stat.exits;
>          local_irq_disable();
>          kvm_after_interrupt(vcpu);
>
> Anything else will cause kvm_vcpu_trigger_posted_interrupt(vcpu, true)
> to return false instead of sending an IPI.

This logic only applies to when the IPI is *sent*. AFAIK, there is no
bound on the amount of time that can elapse before the IPI is
*received*. (In fact, virtualization depends on this.)


* Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending
  2020-11-24 11:39                       ` Paolo Bonzini
  2020-11-24 21:22                         ` Sean Christopherson
@ 2020-12-03 22:07                         ` Jim Mattson
  1 sibling, 0 replies; 57+ messages in thread
From: Jim Mattson @ 2020-12-03 22:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Oliver Upton, Idan Brown, kvm list,
	liam.merwick, Wanpeng Li

On Tue, Nov 24, 2020 at 3:39 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 24/11/20 02:55, Sean Christopherson wrote:
> >>> I believe there is a 1-to-many relationship here, which is why I said
> >>> each CPU would need to maintain a linked list of possible vCPUs to
> >>> iterate and find the intended recipient.
> >
> > Ya, the concern is that it's theoretically possible for the PINV to arrive in L0
> > after a different vCPU has been loaded (or even multiple different vCPUs).
> > E.g. if the sending pCPU is hit with an NMI after checking vcpu->mode, and the
> > NMI runs for some absurd amount of time.  If that happens, the PINV handler
> > won't know which vCPU(s) should get an IRQ injected into L1 without additional
> > tracking.  KVM would need to set something like nested.pi_pending before doing
> > kvm_vcpu_trigger_posted_interrupt(), i.e. nothing really changes, it just gets
> > more complex.
>
> Ah, gotcha.  What if IN_GUEST_MODE/OUTSIDE_GUEST_MODE was replaced by a
> generation count?  Then you reread vcpu->mode after sending the IPI, and
> retry if it does not match.

Meanwhile, the IPI that you previously sent may have triggered posted
interrupt processing for the wrong vCPU.

Consider an overcommitted scenario, where each pCPU is running two
vCPUs. L1 initializes all of the L2 PIDs so that PIR[x] and PID.ON are
set. Then, it sends the VMCS12 NV IPI to one specific L2 vCPU. When L0
processes the VM-exit for sending the IPI, the target vCPU is running
IN_GUEST_MODE, so we send the vmcs02 NV to the corresponding pCPU.
But, by the time the IPI is received on the pCPU, a different L2 vCPU
is running. Oops. It seems that we have to pin the target vCPU to the
pCPU until the IPI arrives. But since it's the vmcs02 NV, we have no
way of knowing whether or not it has arrived.

If the requisite IPI delivery delays seem too absurd to contemplate,
consider pushing L0 down into L1.

> To guarantee atomicity, the OUTSIDE_GUEST_MODE, IN_GUEST_MODE,
> EXITING_GUEST_MODE and READING_SHADOW_PAGE_TABLES values would remain in the
> bottom 2 bits.  That is, the vcpu->mode accesses like
>
>         vcpu->mode = IN_GUEST_MODE;
>
>         vcpu->mode = OUTSIDE_GUEST_MODE;
>
>         smp_store_mb(vcpu->mode, READING_SHADOW_PAGE_TABLES);
>
>         smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE);
>
>         return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
>
> becoming
>
>         enum {
>                 OUTSIDE_GUEST_MODE,
>                 IN_GUEST_MODE,
>                 EXITING_GUEST_MODE,
>                 READING_SHADOW_PAGE_TABLES,
>                 GUEST_MODE_MASK = 3,
>         };
>
>         vcpu->mode = (vcpu->mode | GUEST_MODE_MASK) + 1 + IN_GUEST_MODE;
>
>         vcpu->mode &= ~GUEST_MODE_MASK;
>
>         smp_store_mb(vcpu->mode, vcpu->mode|READING_SHADOW_PAGE_TABLES);
>
>         smp_store_release(&vcpu->mode, vcpu->mode & ~GUEST_MODE_MASK);
>
>         int x = READ_ONCE(vcpu->mode);
>         do {
>                 if ((x & GUEST_MODE_MASK) != IN_GUEST_MODE)
>                         return x & GUEST_MODE_MASK;
>         } while (!try_cmpxchg(&vcpu->mode, &x,
>                               x ^ IN_GUEST_MODE ^ EXITING_GUEST_MODE))
>         return IN_GUEST_MODE;
>
> You could still get spurious posted interrupt IPIs, but the IPI handler
> need not do anything at all and that is much better.
>
> > if we're ok with KVM
> > processing virtual interrupts that technically shouldn't happen, yet.  E.g. if
> > the L0 PINV handler consumes vIRR bits that were set after the last PI from L1.
>
> I actually find it curious that the spec promises posted interrupt
> processing to be triggered only by the arrival of the posted interrupt
> IPI.  Why couldn't the processor in principle snoop for the address of
> the ON bit instead, similar to an MWAIT?
>
> But even without that, I don't think the spec promises that kind of
> strict ordering with respect to what goes on in the source.  Even though
> posted interrupt processing is atomic with the acknowledgement of the
> posted interrupt IPI, the spec only promises that the PINV triggers an
> _eventual_ scan of PID.PIR when the interrupt controller delivers an
> unmasked external interrupt to the destination CPU.  You can still have
> something like
>
>         set PID.PIR[100]
>         set PID.ON
>                                         processor starts executing a
>                                          very slow instruction...
>         send PINV
>         set PID.PIR[200]
>                                         acknowledge PINV
>
> and then vector 200 would be delivered before vector 100.  Of course
> with nested PI the effect would be amplified, but it's possible even on
> bare metal.
>
> Paolo
>


Thread overview: 57+ messages
2017-12-24 16:12 [PATCH v3 00/11] KVM: nVMX: Fix multiple issues in nested posted-interrupts Liran Alon
2017-12-24 16:12 ` [PATCH v3 01/11] KVM: x86: Optimization: Create SVM stubs for sync_pir_to_irr() Liran Alon
2017-12-27  9:56   ` Paolo Bonzini
2017-12-27 10:01     ` Liran Alon
2017-12-24 16:12 ` [PATCH v3 02/11] KVM: x86: Change __kvm_apic_update_irr() to also return if max IRR updated Liran Alon
2018-01-02  1:51   ` Quan Xu
2017-12-24 16:12 ` [PATCH v3 03/11] KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got posted-interrupt Liran Alon
2018-01-02  2:45   ` Quan Xu
2018-01-02  9:57     ` Liran Alon
2018-01-02 11:21       ` Quan Xu
2018-01-02 11:52         ` Quan Xu
2018-01-02 12:20           ` Liran Alon
2018-01-03  5:32             ` Quan Xu
2018-01-03  5:35   ` Quan Xu
2017-12-24 16:12 ` [PATCH v3 04/11] KVM: nVMX: Fix injection to L2 when L1 don't intercept external-interrupts Liran Alon
2017-12-24 16:12 ` [PATCH v3 05/11] KVM: x86: Rename functions which saves vCPU in per-cpu var Liran Alon
2017-12-24 16:12 ` [PATCH v3 06/11] KVM: x86: Set current_vcpu per-cpu var before enabling interrupts at host Liran Alon
2017-12-27 10:06   ` Paolo Bonzini
2017-12-27 10:44     ` Liran Alon
2017-12-24 16:12 ` [PATCH v3 07/11] KVM: x86: Add util for getting current vCPU running on CPU Liran Alon
2017-12-24 16:13 ` [PATCH v3 08/11] KVM: x86: Register empty handler for POSTED_INTR_NESTED_VECTOR IPI Liran Alon
2017-12-24 16:13 ` [PATCH v3 09/11] KVM: nVMX: Deliver missed nested-PI notification-vector via self-IPI while interrupts disabled Liran Alon
2017-12-24 16:13 ` [PATCH v3 10/11] KVM: nVMX: Wake halted L2 on nested posted-interrupt Liran Alon
2017-12-27 11:31   ` Paolo Bonzini
2017-12-27 12:01     ` Liran Alon
2017-12-27 12:27       ` Paolo Bonzini
2017-12-27 12:52         ` Liran Alon
2017-12-27 13:05           ` Paolo Bonzini
2017-12-27 15:33             ` Liran Alon
2017-12-27 15:54               ` Paolo Bonzini
2018-01-01 21:32                 ` Paolo Bonzini
2018-01-01 22:37                   ` Liran Alon
2018-01-02  7:25                     ` Paolo Bonzini
2017-12-24 16:13 ` [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending Liran Alon
2017-12-27 10:15   ` Paolo Bonzini
2017-12-27 10:51     ` Liran Alon
2017-12-27 12:55       ` Paolo Bonzini
2017-12-27 15:15         ` Liran Alon
2017-12-27 15:55           ` Paolo Bonzini
2020-11-23 19:22             ` Oliver Upton
2020-11-23 22:42               ` Paolo Bonzini
2020-11-24  0:10                 ` Oliver Upton
2020-11-24  0:13                   ` Oliver Upton
2020-11-24  1:55                     ` Sean Christopherson
2020-11-24  3:19                       ` Sean Christopherson
2020-11-24 11:39                       ` Paolo Bonzini
2020-11-24 21:22                         ` Sean Christopherson
2020-11-25  0:10                           ` Paolo Bonzini
2020-11-25  1:14                             ` Sean Christopherson
2020-11-25 17:00                               ` Paolo Bonzini
2020-11-25 18:32                                 ` Sean Christopherson
2020-11-26 13:13                                   ` Paolo Bonzini
2020-11-30 19:14                                     ` Sean Christopherson
2020-11-30 19:36                                       ` Paolo Bonzini
2020-12-03 22:07                         ` Jim Mattson
2020-11-24 11:09                   ` Paolo Bonzini
2020-12-03 21:45                     ` Jim Mattson
