* [patch 0/2] VMX: configure posted interrupt descriptor when assigning device
@ 2021-05-06 18:57 Marcelo Tosatti
  2021-05-06 18:57 ` [patch 1/2] KVM: x86: add start_assignment hook to kvm_x86_ops Marcelo Tosatti
  2021-05-06 18:57 ` [patch 2/2] KVM: VMX: update vcpu posted-interrupt descriptor when assigning device Marcelo Tosatti
  0 siblings, 2 replies; 8+ messages in thread
From: Marcelo Tosatti @ 2021-05-06 18:57 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Alex Williamson, Sean Christopherson

The posted-interrupt descriptor is configured incorrectly when a device is
hotplugged into the guest while vCPUs are halted.

See patch 2 for details.




* [patch 1/2] KVM: x86: add start_assignment hook to kvm_x86_ops
  2021-05-06 18:57 [patch 0/2] VMX: configure posted interrupt descriptor when assigning device Marcelo Tosatti
@ 2021-05-06 18:57 ` Marcelo Tosatti
  2021-05-06 19:54   ` Sean Christopherson
  2021-05-06 18:57 ` [patch 2/2] KVM: VMX: update vcpu posted-interrupt descriptor when assigning device Marcelo Tosatti
  1 sibling, 1 reply; 8+ messages in thread
From: Marcelo Tosatti @ 2021-05-06 18:57 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Alex Williamson, Sean Christopherson, Marcelo Tosatti

Add a start_assignment hook to kvm_x86_ops, which is called from
kvm_arch_start_assignment() after the assigned device count is incremented.

The hook is required to update the wakeup vector of a sleeping vCPU
when a device is assigned to the guest.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: kvm/arch/x86/include/asm/kvm_host.h
===================================================================
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -1322,6 +1322,7 @@ struct kvm_x86_ops {
 
 	int (*update_pi_irte)(struct kvm *kvm, unsigned int host_irq,
 			      uint32_t guest_irq, bool set);
+	void (*start_assignment)(struct kvm *kvm, int device_count);
 	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
 	bool (*dy_apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);
 
Index: kvm/arch/x86/kvm/svm/svm.c
===================================================================
--- kvm.orig/arch/x86/kvm/svm/svm.c
+++ kvm/arch/x86/kvm/svm/svm.c
@@ -4601,6 +4601,7 @@ static struct kvm_x86_ops svm_x86_ops __
 	.deliver_posted_interrupt = svm_deliver_avic_intr,
 	.dy_apicv_has_pending_interrupt = svm_dy_apicv_has_pending_interrupt,
 	.update_pi_irte = svm_update_pi_irte,
+	.start_assignment = NULL,
 	.setup_mce = svm_setup_mce,
 
 	.smi_allowed = svm_smi_allowed,
Index: kvm/arch/x86/kvm/vmx/vmx.c
===================================================================
--- kvm.orig/arch/x86/kvm/vmx/vmx.c
+++ kvm/arch/x86/kvm/vmx/vmx.c
@@ -7732,6 +7732,7 @@ static struct kvm_x86_ops vmx_x86_ops __
 	.nested_ops = &vmx_nested_ops,
 
 	.update_pi_irte = pi_update_irte,
+	.start_assignment = NULL,
 
 #ifdef CONFIG_X86_64
 	.set_hv_timer = vmx_set_hv_timer,
Index: kvm/arch/x86/kvm/x86.c
===================================================================
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -11295,7 +11295,11 @@ bool kvm_arch_can_dequeue_async_page_pre
 
 void kvm_arch_start_assignment(struct kvm *kvm)
 {
-	atomic_inc(&kvm->arch.assigned_device_count);
+	int ret;
+
+	ret = atomic_inc_return(&kvm->arch.assigned_device_count);
+	if (kvm_x86_ops.start_assignment)
+		kvm_x86_ops.start_assignment(kvm, ret);
 }
 EXPORT_SYMBOL_GPL(kvm_arch_start_assignment);
 




* [patch 2/2] KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
  2021-05-06 18:57 [patch 0/2] VMX: configure posted interrupt descriptor when assigning device Marcelo Tosatti
  2021-05-06 18:57 ` [patch 1/2] KVM: x86: add start_assignment hook to kvm_x86_ops Marcelo Tosatti
@ 2021-05-06 18:57 ` Marcelo Tosatti
  2021-05-06 19:11   ` Marcelo Tosatti
  2021-05-06 19:21   ` [patch 2/2 V2] " Marcelo Tosatti
  1 sibling, 2 replies; 8+ messages in thread
From: Marcelo Tosatti @ 2021-05-06 18:57 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Alex Williamson, Sean Christopherson, Pei Zhang,
	Marcelo Tosatti

For VMX, when a vcpu enters HLT emulation, pi_pre_block will:

1) Add the vcpu to the per-CPU list of blocked vcpus.

2) Program the posted-interrupt descriptor "notification vector"
to POSTED_INTR_WAKEUP_VECTOR.

With interrupt remapping, a device interrupt sets the PIR bit for the
vector programmed for the device in the posted-interrupt descriptor,
test-and-sets the ON bit, and, if the ON bit was previously clear,
generates a physical interrupt for the notification vector.

This way, the target pCPU is woken by the device interrupt and in turn
wakes up the target vcpu.
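
In C-like pseudocode, the posting step looks roughly like this (a sketch of
the hardware behaviour, not kernel code; SN handling is omitted and
send_physical_ipi() is a made-up helper, while the pi_desc accessors and
fields follow posted_intr.h):

	static void post_device_interrupt(struct pi_desc *pi_desc, int vector)
	{
		/* 1) mark the device's vector pending in the PIR */
		pi_test_and_set_pir(vector, pi_desc);

		/* 2) test-and-set ON; notify only if it was previously clear */
		if (!pi_test_and_set_on(pi_desc))
			send_physical_ipi(pi_desc->ndst, pi_desc->nv);
	}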

The problem is that pi_pre_block only programs the notification vector
if kvm_arch_has_assigned_device() is true. It's possible for the
following to happen:

1) vCPU V HLTs on pCPU P; kvm_arch_has_assigned_device is false, so the
notification vector is not programmed
2) a device is assigned to the VM
3) the device interrupts vCPU V and sets the ON bit (the notification vector
was not programmed, so pCPU P remains idle)
4) vCPU 0 IPIs vCPU V (in the guest), but since the PI descriptor's ON bit
is set, kvm_vcpu_kick is skipped
5) vCPU 0 busy-spins waiting for vCPU V's response for several seconds, until
the RCU watchdog NMIs all vCPUs.

To fix this, use the start_assignment kvm_x86_ops callback to program the
notification vector when the assigned device count changes from 0 to 1.

Reported-by: Pei Zhang <pezhang@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: kvm/arch/x86/kvm/vmx/posted_intr.c
===================================================================
--- kvm.orig/arch/x86/kvm/vmx/posted_intr.c
+++ kvm/arch/x86/kvm/vmx/posted_intr.c
@@ -114,7 +114,7 @@ static void __pi_post_block(struct kvm_v
 	} while (cmpxchg64(&pi_desc->control, old.control,
 			   new.control) != old.control);
 
-	if (!WARN_ON_ONCE(vcpu->pre_pcpu == -1)) {
+	if (vcpu->pre_pcpu != -1) {
 		spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
 		list_del(&vcpu->blocked_vcpu_list);
 		spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
@@ -135,20 +135,13 @@ static void __pi_post_block(struct kvm_v
  *   this case, return 1, otherwise, return 0.
  *
  */
-int pi_pre_block(struct kvm_vcpu *vcpu)
+static int __pi_pre_block(struct kvm_vcpu *vcpu)
 {
 	unsigned int dest;
 	struct pi_desc old, new;
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 
-	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
-		!irq_remapping_cap(IRQ_POSTING_CAP)  ||
-		!kvm_vcpu_apicv_active(vcpu))
-		return 0;
-
-	WARN_ON(irqs_disabled());
-	local_irq_disable();
-	if (!WARN_ON_ONCE(vcpu->pre_pcpu != -1)) {
+	if (vcpu->pre_pcpu == -1) {
 		vcpu->pre_pcpu = vcpu->cpu;
 		spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
 		list_add_tail(&vcpu->blocked_vcpu_list,
@@ -188,12 +181,33 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
 	if (pi_test_on(pi_desc) == 1)
 		__pi_post_block(vcpu);
 
+	return (vcpu->pre_pcpu == -1);
+}
+
+int pi_pre_block(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->in_blocked_section = true;
+
+	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
+		!irq_remapping_cap(IRQ_POSTING_CAP)  ||
+		!kvm_vcpu_apicv_active(vcpu))
+		return 0;
+
+	WARN_ON(irqs_disabled());
+	local_irq_disable();
+	__pi_pre_block(vcpu);
 	local_irq_enable();
+
 	return (vcpu->pre_pcpu == -1);
 }
 
 void pi_post_block(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->in_blocked_section = false;
 	if (vcpu->pre_pcpu == -1)
 		return;
 
@@ -236,6 +250,52 @@ bool pi_has_pending_interrupt(struct kvm
 		(pi_test_sn(pi_desc) && !pi_is_pir_empty(pi_desc));
 }
 
+static void pi_update_wakeup_vector(void *data)
+{
+	struct vcpu_vmx *vmx;
+	struct kvm_vcpu *vcpu = data;
+
+	vmx = to_vmx(vcpu);
+
+	/* race with pi_post_block ? */
+	if (vcpu->pre_pcpu != -1)
+		return;
+
+	if (!vmx->in_blocked_section)
+		return;
+
+	__pi_pre_block(vcpu);
+}
+
+void vmx_pi_start_assignment(struct kvm *kvm, int device_count)
+{
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	if (!irq_remapping_cap(IRQ_POSTING_CAP))
+		return;
+
+	/* only care about first device assignment */
+	if (device_count != 1)
+		return;
+
+	/* Update wakeup vector and add vcpu to blocked_vcpu_list */
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		struct vcpu_vmx *vmx = to_vmx(vcpu);
+		int pcpu;
+
+		if (!kvm_vcpu_apicv_active(vcpu))
+			continue;
+
+		preempt_disable();
+		pcpu = vcpu->cpu;
+		if (vmx->in_blocked_section && vcpu->pre_pcpu == -1 &&
+		    pcpu != -1 && pcpu != smp_processor_id())
+			smp_call_function_single(pcpu, pi_update_wakeup_vector,
+						 vcpu, 1);
+		preempt_enable();
+	}
+}
 
 /*
  * pi_update_irte - set IRTE for Posted-Interrupts
Index: kvm/arch/x86/kvm/vmx/posted_intr.h
===================================================================
--- kvm.orig/arch/x86/kvm/vmx/posted_intr.h
+++ kvm/arch/x86/kvm/vmx/posted_intr.h
@@ -95,5 +95,6 @@ void __init pi_init_cpu(int cpu);
 bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int pi_update_irte(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq,
 		   bool set);
+void vmx_pi_start_assignment(struct kvm *kvm, int device_count);
 
 #endif /* __KVM_X86_VMX_POSTED_INTR_H */
Index: kvm/arch/x86/kvm/vmx/vmx.h
===================================================================
--- kvm.orig/arch/x86/kvm/vmx/vmx.h
+++ kvm/arch/x86/kvm/vmx/vmx.h
@@ -336,6 +336,9 @@ struct vcpu_vmx {
 		DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 		DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 	} shadow_msr_intercept;
+
+	/* true if vcpu is between pre_block and post_block */
+	bool in_blocked_section;
 };
 
 enum ept_pointers_status {




* Re: [patch 2/2] KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
  2021-05-06 18:57 ` [patch 2/2] KVM: VMX: update vcpu posted-interrupt descriptor when assigning device Marcelo Tosatti
@ 2021-05-06 19:11   ` Marcelo Tosatti
  2021-05-06 19:21   ` [patch 2/2 V2] " Marcelo Tosatti
  1 sibling, 0 replies; 8+ messages in thread
From: Marcelo Tosatti @ 2021-05-06 19:11 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Alex Williamson, Sean Christopherson, Pei Zhang

On Thu, May 06, 2021 at 03:57:34PM -0300, Marcelo Tosatti wrote:
> For VMX, when a vcpu enters HLT emulation, pi_pre_block will:
> 
> 1) Add the vcpu to the per-CPU list of blocked vcpus.
> 
> 2) Program the posted-interrupt descriptor "notification vector"
> to POSTED_INTR_WAKEUP_VECTOR.
> 
> With interrupt remapping, a device interrupt sets the PIR bit for the
> vector programmed for the device in the posted-interrupt descriptor,
> test-and-sets the ON bit, and, if the ON bit was previously clear,
> generates a physical interrupt for the notification vector.
> 
> This way, the target pCPU is woken by the device interrupt and in turn
> wakes up the target vcpu.
> 
> The problem is that pi_pre_block only programs the notification vector
> if kvm_arch_has_assigned_device() is true. It's possible for the
> following to happen:
> 
> 1) vCPU V HLTs on pCPU P; kvm_arch_has_assigned_device is false, so the
> notification vector is not programmed
> 2) a device is assigned to the VM
> 3) the device interrupts vCPU V and sets the ON bit (the notification vector
> was not programmed, so pCPU P remains idle)
> 4) vCPU 0 IPIs vCPU V (in the guest), but since the PI descriptor's ON bit
> is set, kvm_vcpu_kick is skipped
> 5) vCPU 0 busy-spins waiting for vCPU V's response for several seconds, until
> the RCU watchdog NMIs all vCPUs.
> 
> To fix this, use the start_assignment kvm_x86_ops callback to program the
> notification vector when the assigned device count changes from 0 to 1.
> 
> Reported-by: Pei Zhang <pezhang@redhat.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Argh, I forgot to set vmx_pi_start_assignment in vmx_x86_ops; will resend as v2.
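
The missing piece is just the vmx_x86_ops hookup in arch/x86/kvm/vmx/vmx.c:

	-	.start_assignment = NULL,
	+	.start_assignment = vmx_pi_start_assignment,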



* [patch 2/2 V2] KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
  2021-05-06 18:57 ` [patch 2/2] KVM: VMX: update vcpu posted-interrupt descriptor when assigning device Marcelo Tosatti
  2021-05-06 19:11   ` Marcelo Tosatti
@ 2021-05-06 19:21   ` Marcelo Tosatti
  2021-05-06 21:35     ` Sean Christopherson
  1 sibling, 1 reply; 8+ messages in thread
From: Marcelo Tosatti @ 2021-05-06 19:21 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Alex Williamson, Sean Christopherson, Pei Zhang


For VMX, when a vcpu enters HLT emulation, pi_pre_block will:

1) Add the vcpu to the per-CPU list of blocked vcpus.

2) Program the posted-interrupt descriptor "notification vector"
to POSTED_INTR_WAKEUP_VECTOR.

With interrupt remapping, a device interrupt sets the PIR bit for the
vector programmed for the device in the posted-interrupt descriptor,
test-and-sets the ON bit, and, if the ON bit was previously clear,
generates a physical interrupt for the notification vector.

This way, the target pCPU is woken by the device interrupt and in turn
wakes up the target vcpu.

The problem is that pi_pre_block only programs the notification vector
if kvm_arch_has_assigned_device() is true. It's possible for the
following to happen:

1) vCPU V HLTs on pCPU P; kvm_arch_has_assigned_device is false, so the
notification vector is not programmed
2) a device is assigned to the VM
3) the device interrupts vCPU V and sets the ON bit (the notification vector
was not programmed, so pCPU P remains idle)
4) vCPU 0 IPIs vCPU V (in the guest), but since the PI descriptor's ON bit
is set, kvm_vcpu_kick is skipped
5) vCPU 0 busy-spins waiting for vCPU V's response for several seconds, until
the RCU watchdog NMIs all vCPUs.

To fix this, use the start_assignment kvm_x86_ops callback to program the
notification vector when the assigned device count changes from 0 to 1.

Reported-by: Pei Zhang <pezhang@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

v2: add vmx_pi_start_assignment to vmx's kvm_x86_ops

Index: kvm/arch/x86/kvm/vmx/posted_intr.c
===================================================================
--- kvm.orig/arch/x86/kvm/vmx/posted_intr.c
+++ kvm/arch/x86/kvm/vmx/posted_intr.c
@@ -114,7 +114,7 @@ static void __pi_post_block(struct kvm_v
 	} while (cmpxchg64(&pi_desc->control, old.control,
 			   new.control) != old.control);
 
-	if (!WARN_ON_ONCE(vcpu->pre_pcpu == -1)) {
+	if (vcpu->pre_pcpu != -1) {
 		spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
 		list_del(&vcpu->blocked_vcpu_list);
 		spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
@@ -135,20 +135,13 @@ static void __pi_post_block(struct kvm_v
  *   this case, return 1, otherwise, return 0.
  *
  */
-int pi_pre_block(struct kvm_vcpu *vcpu)
+static int __pi_pre_block(struct kvm_vcpu *vcpu)
 {
 	unsigned int dest;
 	struct pi_desc old, new;
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 
-	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
-		!irq_remapping_cap(IRQ_POSTING_CAP)  ||
-		!kvm_vcpu_apicv_active(vcpu))
-		return 0;
-
-	WARN_ON(irqs_disabled());
-	local_irq_disable();
-	if (!WARN_ON_ONCE(vcpu->pre_pcpu != -1)) {
+	if (vcpu->pre_pcpu == -1) {
 		vcpu->pre_pcpu = vcpu->cpu;
 		spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
 		list_add_tail(&vcpu->blocked_vcpu_list,
@@ -188,12 +181,33 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
 	if (pi_test_on(pi_desc) == 1)
 		__pi_post_block(vcpu);
 
+	return (vcpu->pre_pcpu == -1);
+}
+
+int pi_pre_block(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->in_blocked_section = true;
+
+	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
+		!irq_remapping_cap(IRQ_POSTING_CAP)  ||
+		!kvm_vcpu_apicv_active(vcpu))
+		return 0;
+
+	WARN_ON(irqs_disabled());
+	local_irq_disable();
+	__pi_pre_block(vcpu);
 	local_irq_enable();
+
 	return (vcpu->pre_pcpu == -1);
 }
 
 void pi_post_block(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->in_blocked_section = false;
 	if (vcpu->pre_pcpu == -1)
 		return;
 
@@ -236,6 +250,52 @@ bool pi_has_pending_interrupt(struct kvm
 		(pi_test_sn(pi_desc) && !pi_is_pir_empty(pi_desc));
 }
 
+static void pi_update_wakeup_vector(void *data)
+{
+	struct vcpu_vmx *vmx;
+	struct kvm_vcpu *vcpu = data;
+
+	vmx = to_vmx(vcpu);
+
+	/* race with pi_post_block ? */
+	if (vcpu->pre_pcpu != -1)
+		return;
+
+	if (!vmx->in_blocked_section)
+		return;
+
+	__pi_pre_block(vcpu);
+}
+
+void vmx_pi_start_assignment(struct kvm *kvm, int device_count)
+{
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	if (!irq_remapping_cap(IRQ_POSTING_CAP))
+		return;
+
+	/* only care about first device assignment */
+	if (device_count != 1)
+		return;
+
+	/* Update wakeup vector and add vcpu to blocked_vcpu_list */
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		struct vcpu_vmx *vmx = to_vmx(vcpu);
+		int pcpu;
+
+		if (!kvm_vcpu_apicv_active(vcpu))
+			continue;
+
+		preempt_disable();
+		pcpu = vcpu->cpu;
+		if (vmx->in_blocked_section && vcpu->pre_pcpu == -1 &&
+		    pcpu != -1 && pcpu != smp_processor_id())
+			smp_call_function_single(pcpu, pi_update_wakeup_vector,
+						 vcpu, 1);
+		preempt_enable();
+	}
+}
 
 /*
  * pi_update_irte - set IRTE for Posted-Interrupts
Index: kvm/arch/x86/kvm/vmx/posted_intr.h
===================================================================
--- kvm.orig/arch/x86/kvm/vmx/posted_intr.h
+++ kvm/arch/x86/kvm/vmx/posted_intr.h
@@ -95,5 +95,6 @@ void __init pi_init_cpu(int cpu);
 bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int pi_update_irte(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq,
 		   bool set);
+void vmx_pi_start_assignment(struct kvm *kvm, int device_count);
 
 #endif /* __KVM_X86_VMX_POSTED_INTR_H */
Index: kvm/arch/x86/kvm/vmx/vmx.h
===================================================================
--- kvm.orig/arch/x86/kvm/vmx/vmx.h
+++ kvm/arch/x86/kvm/vmx/vmx.h
@@ -336,6 +336,9 @@ struct vcpu_vmx {
 		DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 		DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 	} shadow_msr_intercept;
+
+	/* true if vcpu is between pre_block and post_block */
+	bool in_blocked_section;
 };
 
 enum ept_pointers_status {
Index: kvm/arch/x86/kvm/vmx/vmx.c
===================================================================
--- kvm.orig/arch/x86/kvm/vmx/vmx.c
+++ kvm/arch/x86/kvm/vmx/vmx.c
@@ -7732,7 +7732,7 @@ static struct kvm_x86_ops vmx_x86_ops __
 	.nested_ops = &vmx_nested_ops,
 
 	.update_pi_irte = pi_update_irte,
-	.start_assignment = NULL,
+	.start_assignment = vmx_pi_start_assignment,
 
 #ifdef CONFIG_X86_64
 	.set_hv_timer = vmx_set_hv_timer,



* Re: [patch 1/2] KVM: x86: add start_assignment hook to kvm_x86_ops
  2021-05-06 18:57 ` [patch 1/2] KVM: x86: add start_assignment hook to kvm_x86_ops Marcelo Tosatti
@ 2021-05-06 19:54   ` Sean Christopherson
  0 siblings, 0 replies; 8+ messages in thread
From: Sean Christopherson @ 2021-05-06 19:54 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, Paolo Bonzini, Alex Williamson

On Thu, May 06, 2021, Marcelo Tosatti wrote:
> Index: kvm/arch/x86/kvm/x86.c
> ===================================================================
> --- kvm.orig/arch/x86/kvm/x86.c
> +++ kvm/arch/x86/kvm/x86.c
> @@ -11295,7 +11295,11 @@ bool kvm_arch_can_dequeue_async_page_pre
>  
>  void kvm_arch_start_assignment(struct kvm *kvm)
>  {
> -	atomic_inc(&kvm->arch.assigned_device_count);
> +	int ret;
> +
> +	ret = atomic_inc_return(&kvm->arch.assigned_device_count);
> +	if (kvm_x86_ops.start_assignment)
> +		kvm_x86_ops.start_assignment(kvm, ret);

This can be a static_call(), just needs an entry in kvm-x86-ops.h.
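
Something along these lines, for example (untested sketch; assumes the
KVM_X86_OP_NULL()/static_call machinery in arch/x86/include/asm/kvm-x86-ops.h):

	/* arch/x86/include/asm/kvm-x86-ops.h */
	KVM_X86_OP_NULL(start_assignment)

	/* arch/x86/kvm/x86.c */
	void kvm_arch_start_assignment(struct kvm *kvm)
	{
		int count = atomic_inc_return(&kvm->arch.assigned_device_count);

		static_call_cond(kvm_x86_start_assignment)(kvm, count);
	}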

>  }
>  EXPORT_SYMBOL_GPL(kvm_arch_start_assignment);
>  
> 
> 


* Re: [patch 2/2 V2] KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
  2021-05-06 19:21   ` [patch 2/2 V2] " Marcelo Tosatti
@ 2021-05-06 21:35     ` Sean Christopherson
  2021-05-07 12:11       ` Marcelo Tosatti
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2021-05-06 21:35 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, Paolo Bonzini, Alex Williamson, Pei Zhang

On Thu, May 06, 2021, Marcelo Tosatti wrote:
> Index: kvm/arch/x86/kvm/vmx/posted_intr.c
> ===================================================================
> --- kvm.orig/arch/x86/kvm/vmx/posted_intr.c
> +++ kvm/arch/x86/kvm/vmx/posted_intr.c
> @@ -114,7 +114,7 @@ static void __pi_post_block(struct kvm_v
>  	} while (cmpxchg64(&pi_desc->control, old.control,
>  			   new.control) != old.control);
>  
> -	if (!WARN_ON_ONCE(vcpu->pre_pcpu == -1)) {
> +	if (vcpu->pre_pcpu != -1) {
>  		spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
>  		list_del(&vcpu->blocked_vcpu_list);
>  		spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> @@ -135,20 +135,13 @@ static void __pi_post_block(struct kvm_v
>   *   this case, return 1, otherwise, return 0.
>   *
>   */
> -int pi_pre_block(struct kvm_vcpu *vcpu)
> +static int __pi_pre_block(struct kvm_vcpu *vcpu)
>  {
>  	unsigned int dest;
>  	struct pi_desc old, new;
>  	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>  
> -	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> -		!irq_remapping_cap(IRQ_POSTING_CAP)  ||
> -		!kvm_vcpu_apicv_active(vcpu))
> -		return 0;
> -
> -	WARN_ON(irqs_disabled());
> -	local_irq_disable();
> -	if (!WARN_ON_ONCE(vcpu->pre_pcpu != -1)) {
> +	if (vcpu->pre_pcpu == -1) {
>  		vcpu->pre_pcpu = vcpu->cpu;
>  		spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
>  		list_add_tail(&vcpu->blocked_vcpu_list,
> @@ -188,12 +181,33 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
>  	if (pi_test_on(pi_desc) == 1)
>  		__pi_post_block(vcpu);
>  
> +	return (vcpu->pre_pcpu == -1);

Nothing checks the return of __pi_pre_block(), this can be dropped and the
helper can be a void return.

> +}
> +
> +int pi_pre_block(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	vmx->in_blocked_section = true;
> +
> +	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> +		!irq_remapping_cap(IRQ_POSTING_CAP)  ||
> +		!kvm_vcpu_apicv_active(vcpu))

Opportunistically fix the indentation?

> +		return 0;
> +
> +	WARN_ON(irqs_disabled());
> +	local_irq_disable();
> +	__pi_pre_block(vcpu);
>  	local_irq_enable();
> +
>  	return (vcpu->pre_pcpu == -1);
>  }
>  
>  void pi_post_block(struct kvm_vcpu *vcpu)
>  {
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	vmx->in_blocked_section = false;
>  	if (vcpu->pre_pcpu == -1)
>  		return;
>  
> @@ -236,6 +250,52 @@ bool pi_has_pending_interrupt(struct kvm
>  		(pi_test_sn(pi_desc) && !pi_is_pir_empty(pi_desc));
>  }
>  
> +static void pi_update_wakeup_vector(void *data)
> +{
> +	struct vcpu_vmx *vmx;
> +	struct kvm_vcpu *vcpu = data;
> +
> +	vmx = to_vmx(vcpu);
> +
> +	/* race with pi_post_block ? */
> +	if (vcpu->pre_pcpu != -1)

This seems wrong.  The funky code in __pi_pre_block() regarding pre_pcpu muddies
the waters, but I don't think it's safe to call __pi_pre_block() from a pCPU
other than the pCPU that is associated with the vCPU.

If the vCPU is migrated after vmx_pi_start_assignment() grabs vcpu->cpu but
before the IPI arrives (to run pi_update_wakeup_vector()), then it's possible
that a different pCPU could be running __pi_pre_block() concurrently with this
code.  If that happens, both pCPUs could see "vcpu->pre_pcpu == -1" and corrupt
the list due to a double list_add_tail.

The existing code is unnecessarily confusing, but unless I'm missing something,
it's guaranteed to call pi_pre_block() from the pCPU that is associated with the
vCPU, i.e. arguably it could/should use this_cpu_ptr().  Because the existing
code grabs vcpu->cpu with IRQs disabled and is called only from KVM_RUN,
vcpu->cpu is guaranteed to match the current pCPU since vcpu->cpu will be set to
the current pCPU when the vCPU is scheduled in.

Assuming my analysis is correct (definitely not guaranteed), I'm struggling to
come up with an elegant solution.  But, do we need an elegant solution?  E.g.
can the start_assignment() hook simply kick all vCPUs with APICv active?
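
Something like this, perhaps (untested; assumes the kick is enough to get the
vCPU out of its blocked loop so that pi_pre_block() is re-evaluated with the
updated assigned-device count):

	void vmx_pi_start_assignment(struct kvm *kvm, int device_count)
	{
		struct kvm_vcpu *vcpu;
		int i;

		if (!irq_remapping_cap(IRQ_POSTING_CAP) || device_count != 1)
			return;

		kvm_for_each_vcpu(i, vcpu, kvm) {
			if (!kvm_vcpu_apicv_active(vcpu))
				continue;

			/* force the vCPU to re-evaluate its blocking state */
			kvm_vcpu_kick(vcpu);
		}
	}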

> +		return;
> +
> +	if (!vmx->in_blocked_section)
> +		return;
> +
> +	__pi_pre_block(vcpu);
> +}
> +
> +void vmx_pi_start_assignment(struct kvm *kvm, int device_count)
> +{
> +	struct kvm_vcpu *vcpu;
> +	int i;
> +
> +	if (!irq_remapping_cap(IRQ_POSTING_CAP))
> +		return;
> +
> +	/* only care about first device assignment */
> +	if (device_count != 1)
> +		return;
> +
> +	/* Update wakeup vector and add vcpu to blocked_vcpu_list */
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		struct vcpu_vmx *vmx = to_vmx(vcpu);
> +		int pcpu;
> +
> +		if (!kvm_vcpu_apicv_active(vcpu))
> +			continue;
> +
> +		preempt_disable();

Any reason not to do "cpu = get_cpu()"?  Might make sense to do that outside of
the for-loop, too.
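
I.e. something like this (untested, loop body otherwise unchanged):

	int this_cpu = get_cpu();

	kvm_for_each_vcpu(i, vcpu, kvm) {
		struct vcpu_vmx *vmx = to_vmx(vcpu);
		int pcpu = vcpu->cpu;

		if (!kvm_vcpu_apicv_active(vcpu))
			continue;

		if (vmx->in_blocked_section && vcpu->pre_pcpu == -1 &&
		    pcpu != -1 && pcpu != this_cpu)
			smp_call_function_single(pcpu, pi_update_wakeup_vector,
						 vcpu, 1);
	}
	put_cpu();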

> +		pcpu = vcpu->cpu;
> +		if (vmx->in_blocked_section && vcpu->pre_pcpu == -1 &&
> +		    pcpu != -1 && pcpu != smp_processor_id())
> +			smp_call_function_single(pcpu, pi_update_wakeup_vector,
> +						 vcpu, 1);
> +		preempt_enable();
> +	}
> +}


* Re: [patch 2/2 V2] KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
  2021-05-06 21:35     ` Sean Christopherson
@ 2021-05-07 12:11       ` Marcelo Tosatti
  0 siblings, 0 replies; 8+ messages in thread
From: Marcelo Tosatti @ 2021-05-07 12:11 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, Paolo Bonzini, Alex Williamson, Pei Zhang

Hi Sean,

On Thu, May 06, 2021 at 09:35:46PM +0000, Sean Christopherson wrote:
> On Thu, May 06, 2021, Marcelo Tosatti wrote:
> > Index: kvm/arch/x86/kvm/vmx/posted_intr.c
> > ===================================================================
> > --- kvm.orig/arch/x86/kvm/vmx/posted_intr.c
> > +++ kvm/arch/x86/kvm/vmx/posted_intr.c
> > @@ -114,7 +114,7 @@ static void __pi_post_block(struct kvm_v
> >  	} while (cmpxchg64(&pi_desc->control, old.control,
> >  			   new.control) != old.control);
> >  
> > -	if (!WARN_ON_ONCE(vcpu->pre_pcpu == -1)) {
> > +	if (vcpu->pre_pcpu != -1) {
> >  		spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> >  		list_del(&vcpu->blocked_vcpu_list);
> >  		spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> > @@ -135,20 +135,13 @@ static void __pi_post_block(struct kvm_v
> >   *   this case, return 1, otherwise, return 0.
> >   *
> >   */
> > -int pi_pre_block(struct kvm_vcpu *vcpu)
> > +static int __pi_pre_block(struct kvm_vcpu *vcpu)
> >  {
> >  	unsigned int dest;
> >  	struct pi_desc old, new;
> >  	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> >  
> > -	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> > -		!irq_remapping_cap(IRQ_POSTING_CAP)  ||
> > -		!kvm_vcpu_apicv_active(vcpu))
> > -		return 0;
> > -
> > -	WARN_ON(irqs_disabled());
> > -	local_irq_disable();
> > -	if (!WARN_ON_ONCE(vcpu->pre_pcpu != -1)) {
> > +	if (vcpu->pre_pcpu == -1) {
> >  		vcpu->pre_pcpu = vcpu->cpu;
> >  		spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
> >  		list_add_tail(&vcpu->blocked_vcpu_list,
> > @@ -188,12 +181,33 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
> >  	if (pi_test_on(pi_desc) == 1)
> >  		__pi_post_block(vcpu);
> >  
> > +	return (vcpu->pre_pcpu == -1);
> 
> Nothing checks the return of __pi_pre_block(), this can be dropped and the
> helper can be a void return.

Done.

> > +}
> > +
> > +int pi_pre_block(struct kvm_vcpu *vcpu)
> > +{
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +
> > +	vmx->in_blocked_section = true;
> > +
> > +	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
> > +		!irq_remapping_cap(IRQ_POSTING_CAP)  ||
> > +		!kvm_vcpu_apicv_active(vcpu))
> 
> Opportunistically fix the indentation?

Done.

> > +		return 0;
> > +
> > +	WARN_ON(irqs_disabled());
> > +	local_irq_disable();
> > +	__pi_pre_block(vcpu);
> >  	local_irq_enable();
> > +
> >  	return (vcpu->pre_pcpu == -1);
> >  }
> >  
> >  void pi_post_block(struct kvm_vcpu *vcpu)
> >  {
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +
> > +	vmx->in_blocked_section = false;
> >  	if (vcpu->pre_pcpu == -1)
> >  		return;
> >  
> > @@ -236,6 +250,52 @@ bool pi_has_pending_interrupt(struct kvm
> >  		(pi_test_sn(pi_desc) && !pi_is_pir_empty(pi_desc));
> >  }
> >  
> > +static void pi_update_wakeup_vector(void *data)
> > +{
> > +	struct vcpu_vmx *vmx;
> > +	struct kvm_vcpu *vcpu = data;
> > +
> > +	vmx = to_vmx(vcpu);
> > +
> > +	/* race with pi_post_block ? */
> > +	if (vcpu->pre_pcpu != -1)
> 
> This seems wrong.  The funky code in __pi_pre_block() regarding pre_pcpu muddies
> the waters, but I don't think it's safe to call __pi_pre_block() from a pCPU
> other than the pCPU that is associated with the vCPU.

From Intel's manual:

"29.6 POSTED-INTERRUPT PROCESSING

...

Use of the posted-interrupt descriptor differs from that of other
data structures that are referenced by pointers in a VMCS. There is a
general requirement that software ensure that each such data structure
is modified only when no logical processor with a current VMCS that
references it is in VMX non-root operation. That requirement does not
apply to the posted-interrupt descriptor. There is a requirement,
however, that such modifications be done using locked read-modify-write
instructions."

> If the vCPU is migrated after vmx_pi_start_assignment() grabs vcpu->cpu but
> before the IPI arrives (to run pi_update_wakeup_vector()), then it's possible
> that a different pCPU could be running __pi_pre_block() concurrently with this
> code.  If that happens, both pCPUs could see "vcpu->pre_pcpu == -1" and corrupt
> the list due to a double list_add_tail.

Good point.

> The existing code is unnecessarily confusing, but unless I'm missing something,
> it's guaranteed to call pi_pre_block() from the pCPU that is associated with the
> vCPU, i.e. arguably it could/should use this_cpu_ptr(). 

Well, the problem is that the vCPU might not exit kvm_vcpu_block(). However,
that can be fixed.


>  Because the existing
> code grabs vcpu->cpu with IRQs disabled and is called only from KVM_RUN,
> vcpu->cpu is guaranteed to match the current pCPU since vcpu->cpu will be set to
> the current pCPU when the vCPU is scheduled in.
> 
> Assuming my analysis is correct (definitely not guaranteed), I'm struggling to
> come up with an elegant solution.  But, do we need an elegant solution?  E.g.
> can the start_assignment() hook simply kick all vCPUs with APICv active?
> 
> > +		return;
> > +
> > +	if (!vmx->in_blocked_section)
> > +		return;
> > +
> > +	__pi_pre_block(vcpu);
> > +}
> > +
> > +void vmx_pi_start_assignment(struct kvm *kvm, int device_count)
> > +{
> > +	struct kvm_vcpu *vcpu;
> > +	int i;
> > +
> > +	if (!irq_remapping_cap(IRQ_POSTING_CAP))
> > +		return;
> > +
> > +	/* only care about first device assignment */
> > +	if (device_count != 1)
> > +		return;
> > +
> > +	/* Update wakeup vector and add vcpu to blocked_vcpu_list */
> > +	kvm_for_each_vcpu(i, vcpu, kvm) {
> > +		struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +		int pcpu;
> > +
> > +		if (!kvm_vcpu_apicv_active(vcpu))
> > +			continue;
> > +
> > +		preempt_disable();
> 
> Any reason not to do "cpu = get_cpu()"?  Might make sense to do that outside of
> the for-loop, too.

kvm_vcpu_kick() seems cleaner; it just needs another arch
hook to allow kvm_vcpu_block() to return.
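
Roughly, the idea being (hook name made up, untested):

	/* in kvm_vcpu_check_block(), virt/kvm/kvm_main.c */
	if (kvm_arch_vcpu_should_unblock(vcpu))	/* new, hypothetical arch hook */
		return -EINTR;

so that a kvm_vcpu_kick() from the start_assignment path makes a blocked vCPU
fall out of kvm_vcpu_block() and go through pi_pre_block() again, now with a
device assigned.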

Thanks for the review! Will resend after testing.



