* [PATCH v4 0/3] genirq/vfio: Introduce irq_update_devid() and optimize VFIO irq ops
@ 2019-08-22 15:34 Ben Luo
  2019-08-22 15:34 ` [PATCH v4 1/3] genirq: enhance error recovery code in free irq Ben Luo
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Ben Luo @ 2019-08-22 15:34 UTC (permalink / raw)
  To: tglx, alex.williamson
  Cc: linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

Currently, VFIO uses a free-then-request-irq sequence to handle interrupt
affinity setting and masking/unmasking for a VM (with device passthru
via VFIO). Sometimes these operations only change the cookie data
(dev_id) of the irqaction, or change nothing at all. The
free-then-request-irq sequence not only adds latency, but also increases
the risk of losing an interrupt, which may leave the VM hung forever
waiting for IO completion.

This patchset solves the issue as follows:
Patch 2 introduces irq_update_devid() to update only the dev_id of an irqaction
Patch 3 makes use of this function to optimize the irq operations in VFIO
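
For illustration, the difference for a caller that only needs to switch
the irqaction cookie looks roughly like this (a sketch based on the APIs
referenced above, not code taken from the patches; irq, handler, name,
old_ctx and new_ctx are placeholders):

	/* before: the target IRTE is invalid between these two calls */
	free_irq(irq, old_ctx);
	ret = request_irq(irq, handler, 0, name, new_ctx);

	/* after: only the irqaction cookie changes, the IRTE stays valid */
	ret = irq_update_devid(irq, old_ctx, new_ctx);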

changes from v3:
 - rename the new function to irq_update_devid()
 - amend commit messages and code comments
 - use disable_irq()/enable_irq() to avoid a dev_id mismatch between primary and threaded handlers
 - add more comments to vfio-pci code

changes from v2:
 - reformat to avoid quoted strings split across lines, etc.

changes from v1:
 - add Patch 1 to enhance error recovery etc. in free irq per tglx's comments
 - enhance error recovery code and debugging info in irq_update_devid
 - use __must_check in external referencing of this function
 - use EXPORT_SYMBOL_GPL for irq_update_devid
 - reformat code of patch 3 for better readability

Ben Luo (3):
  genirq: enhance error recovery code in free irq
  genirq: introduce irq_update_devid()
  vfio/pci: make use of irq_update_devid and optimize irq ops

 drivers/vfio/pci/vfio_pci_intrs.c | 112 +++++++++++++++++++++++++-------------
 include/linux/interrupt.h         |   3 +
 kernel/irq/manage.c               | 105 +++++++++++++++++++++++++++++++----
 3 files changed, 170 insertions(+), 50 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v4 1/3] genirq: enhance error recovery code in free irq
  2019-08-22 15:34 [PATCH v4 0/3] genirq/vfio: Introduce irq_update_devid() and optimize VFIO irq ops Ben Luo
@ 2019-08-22 15:34 ` Ben Luo
  2019-08-22 15:34 ` [PATCH v4 2/3] genirq: introduce irq_update_devid() Ben Luo
  2019-08-22 15:34 ` [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops Ben Luo
  2 siblings, 0 replies; 9+ messages in thread
From: Ben Luo @ 2019-08-22 15:34 UTC (permalink / raw)
  To: tglx, alex.williamson
  Cc: linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

__free_irq()/__free_percpu_irq() need to return early when called from
IRQ context, because the interrupt handler loop runs with desc->lock
dropped and dev_id can be subject to load and store tearing. Also move
the WARNs out of the locked region and print dev_id to help debugging.

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
---
 kernel/irq/manage.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f17..10ec3e9 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1690,7 +1690,10 @@ static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
 	struct irqaction *action, **action_ptr;
 	unsigned long flags;
 
-	WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
+	if (WARN(in_interrupt(),
+		 "Trying to free IRQ %d (dev_id %p) from IRQ context!\n",
+		 irq, dev_id))
+		return NULL;
 
 	mutex_lock(&desc->request_mutex);
 	chip_bus_lock(desc);
@@ -1705,10 +1708,11 @@ static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
 		action = *action_ptr;
 
 		if (!action) {
-			WARN(1, "Trying to free already-free IRQ %d\n", irq);
 			raw_spin_unlock_irqrestore(&desc->lock, flags);
 			chip_bus_sync_unlock(desc);
 			mutex_unlock(&desc->request_mutex);
+			WARN(1, "Trying to free already-free IRQ %d (dev_id %p)\n",
+			     irq, dev_id);
 			return NULL;
 		}
 
@@ -2286,7 +2290,10 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
 	struct irqaction *action;
 	unsigned long flags;
 
-	WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
+	if (WARN(in_interrupt(),
+		 "Trying to free IRQ %d (dev_id %p) from IRQ context!\n",
+		 irq, dev_id))
+		return NULL;
 
 	if (!desc)
 		return NULL;
@@ -2295,14 +2302,17 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
 
 	action = desc->action;
 	if (!action || action->percpu_dev_id != dev_id) {
-		WARN(1, "Trying to free already-free IRQ %d\n", irq);
-		goto bad;
+		raw_spin_unlock_irqrestore(&desc->lock, flags);
+		WARN(1, "Trying to free already-free IRQ (dev_id %p) %d\n",
+		     dev_id, irq);
+		return NULL;
 	}
 
 	if (!cpumask_empty(desc->percpu_enabled)) {
-		WARN(1, "percpu IRQ %d still enabled on CPU%d!\n",
-		     irq, cpumask_first(desc->percpu_enabled));
-		goto bad;
+		raw_spin_unlock_irqrestore(&desc->lock, flags);
+		WARN(1, "percpu IRQ %d (dev_id %p) still enabled on CPU%d!\n",
+		     irq, dev_id, cpumask_first(desc->percpu_enabled));
+		return NULL;
 	}
 
 	/* Found it - now remove it from the list of entries: */
@@ -2317,10 +2327,6 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
 	irq_chip_pm_put(&desc->irq_data);
 	module_put(desc->owner);
 	return action;
-
-bad:
-	raw_spin_unlock_irqrestore(&desc->lock, flags);
-	return NULL;
 }
 
 /**
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v4 2/3] genirq: introduce irq_update_devid()
  2019-08-22 15:34 [PATCH v4 0/3] genirq/vfio: Introduce irq_update_devid() and optimize VFIO irq ops Ben Luo
  2019-08-22 15:34 ` [PATCH v4 1/3] genirq: enhance error recovery code in free irq Ben Luo
@ 2019-08-22 15:34 ` Ben Luo
  2019-08-22 15:34 ` [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops Ben Luo
  2 siblings, 0 replies; 9+ messages in thread
From: Ben Luo @ 2019-08-22 15:34 UTC (permalink / raw)
  To: tglx, alex.williamson
  Cc: linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

Sometimes, only the dev_id field of irqaction needs to be changed.

E.g., a KVM VM with a device passed through via VFIO may switch its
interrupt injection path between a KVM irqfd and a userspace eventfd.
These two paths share the same interrupt number and handler for the
same MSI vector of the device; they differ only in their 'dev_id's,
which reference the contexts of different fds. Setting interrupt
affinity in such a VM is one way to trigger the path switching.

Currently, VFIO performs the path switching with a free-then-request-irq
sequence. There is a time window between free_irq() and request_irq()
during which the target IRTE is invalid. In-flight interrupts (buffered
in the hardware layer and, unfortunately, not synchronizable from
software) can therefore cause DMAR faults and, even worse, the VM may
hang waiting for IO completion.

By using irq_update_devid(), this issue is avoided since the IRTE is
not invalidated at any point in the process.
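
As a usage sketch only (the wrapper switch_irq_cookie() below is a
hypothetical helper, not part of this series; it needs
<linux/interrupt.h>), a caller could swap cookies like this, with the
fallback path being one possible way to handle failure:

	static int switch_irq_cookie(unsigned int irq, irq_handler_t handler,
				     const char *name, void *old_ctx, void *new_ctx)
	{
		/* irq_update_devid() is __must_check, handle failure explicitly */
		int ret = irq_update_devid(irq, old_ctx, new_ctx);

		if (!ret)
			return 0;

		/*
		 * Hypothetical fallback: accept the window in which the IRTE
		 * is invalid and redo the free/request sequence.
		 */
		free_irq(irq, old_ctx);
		return request_irq(irq, handler, 0, name, new_ctx);
	}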

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
---
 include/linux/interrupt.h |  3 ++
 kernel/irq/manage.c       | 75 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 78 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5b8328a..09b6a0f 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -172,6 +172,9 @@ struct irqaction {
 request_percpu_nmi(unsigned int irq, irq_handler_t handler,
 		   const char *devname, void __percpu *dev);
 
+extern int __must_check
+irq_update_devid(unsigned int irq, void *dev_id, void *new_dev_id);
+
 extern const void *free_irq(unsigned int, void *);
 extern void free_percpu_irq(unsigned int, void __percpu *);
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 10ec3e9..adb1980 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -2063,6 +2063,81 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
 EXPORT_SYMBOL(request_threaded_irq);
 
 /**
+ *	irq_update_devid - update irq dev_id to a new one
+ *
+ *	@irq:		Interrupt line to update
+ *	@dev_id:	A cookie to find the irqaction to update
+ *	@new_dev_id:	New cookie passed to the handler function
+ *
+ *	Sometimes, only the cookie data needs to be changed. Updating just
+ *	the dev_id of the irqaction, instead of freeing and re-requesting
+ *	the interrupt, not only gains some performance benefit, but also
+ *	reduces the risk of losing an interrupt.
+ *
+ *	This function won't update dev_id until any executing interrupts
+ *	for this IRQ have completed. It must not be called from interrupt
+ *	context.
+ *
+ *	On success, it returns 0. On failure, it returns a negative
+ *	error code.
+ */
+int irq_update_devid(unsigned int irq, void *dev_id, void *new_dev_id)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	struct irqaction *action, **action_ptr;
+	unsigned long flags;
+
+	if (WARN(in_interrupt(),
+		 "Trying to update IRQ %d (dev_id %p to %p) from IRQ context!\n",
+		 irq, dev_id, new_dev_id))
+		return -EPERM;
+
+	if (!desc)
+		return -EINVAL;
+
+	/*
+	 * Ensure that an interrupt in flight on another CPU which uses the
+	 * old 'dev_id' has completed because the caller can free the memory
+	 * to which it points after this function returns. Also avoid updating
+	 * 'dev_id' in the middle of a threaded interrupt, otherwise the
+	 * primary handler could end up using the old 'dev_id' while the
+	 * threaded handler uses the new one.
+	 */
+	disable_irq(irq);
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	/*
+	 * There can be multiple actions per IRQ descriptor, find the right
+	 * one based on the dev_id:
+	 */
+	action_ptr = &desc->action;
+	for (;;) {
+		action = *action_ptr;
+
+		if (!action) {
+			raw_spin_unlock_irqrestore(&desc->lock, flags);
+			enable_irq(irq);
+			WARN(1,
+			     "Trying to update already-free IRQ %d (dev_id %p to %p)\n",
+			     irq, dev_id, new_dev_id);
+			return -ENXIO;
+		}
+
+		if (action->dev_id == dev_id) {
+			action->dev_id = new_dev_id;
+			break;
+		}
+		action_ptr = &action->next;
+	}
+
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+	enable_irq(irq);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(irq_update_devid);
+
+/**
  *	request_any_context_irq - allocate an interrupt line
  *	@irq: Interrupt line to allocate
  *	@handler: Function to be called when the IRQ occurs.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops
  2019-08-22 15:34 [PATCH v4 0/3] genirq/vfio: Introduce irq_update_devid() and optimize VFIO irq ops Ben Luo
  2019-08-22 15:34 ` [PATCH v4 1/3] genirq: enhance error recovery code in free irq Ben Luo
  2019-08-22 15:34 ` [PATCH v4 2/3] genirq: introduce irq_update_devid() Ben Luo
@ 2019-08-22 15:34 ` Ben Luo
  2019-08-27 20:33   ` Alex Williamson
  2 siblings, 1 reply; 9+ messages in thread
From: Ben Luo @ 2019-08-22 15:34 UTC (permalink / raw)
  To: tglx, alex.williamson
  Cc: linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

When userspace (e.g. QEMU) triggers a switch between a KVM irqfd and a
userspace eventfd, only the dev_id of the irqaction (i.e. the "trigger"
in this patch's context) changes, yet the current code performs a full
free-then-request-irq. Likewise, setting irq affinity in the VM also
triggers a free-then-request-irq, which actually changes nothing and
only fires a producer re-registration to update the IRTE in case
posted interrupts are in use.

This patch makes use of irq_update_devid() to optimize both cases
above, which reduces the risk of losing interrupts and also cuts some
overhead.
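
For reference, the userspace operation that drives both cases is the
VFIO_DEVICE_SET_IRQS ioctl. A minimal sketch of how a user such as QEMU
points an MSI-X vector at a new eventfd is shown below; it is
illustrative only (device_fd, efd and the vector number are
placeholders), with the structure and flags from
include/uapi/linux/vfio.h:

	struct vfio_irq_set *irq_set;
	size_t argsz = sizeof(*irq_set) + sizeof(int32_t);

	irq_set = malloc(argsz);
	irq_set->argsz = argsz;
	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
	irq_set->start = 0;			/* vector to (re)target */
	irq_set->count = 1;
	*(int32_t *)&irq_set->data = efd;	/* or -1 to disable signaling */

	ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set);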

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
---
 drivers/vfio/pci/vfio_pci_intrs.c | 112 +++++++++++++++++++++++++-------------
 1 file changed, 74 insertions(+), 38 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 3fa3f72..60d3023 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
 static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 				      int vector, int fd, bool msix)
 {
+	struct eventfd_ctx *trigger = NULL;
 	struct pci_dev *pdev = vdev->pdev;
-	struct eventfd_ctx *trigger;
 	int irq, ret;
 
 	if (vector < 0 || vector >= vdev->num_ctx)
 		return -EINVAL;
 
+	if (fd >= 0) {
+		trigger = eventfd_ctx_fdget(fd);
+		if (IS_ERR(trigger))
+			return PTR_ERR(trigger);
+	}
+
 	irq = pci_irq_vector(pdev, vector);
 
+	/*
+	 * In the KVM-VFIO case, interrupts from the passthrough device are
+	 * delivered directly to the VM once the producer and consumer are
+	 * connected. If they are disconnected, interrupt handling falls back
+	 * to remap mode, where the interrupt handler uses 'trigger' to find
+	 * the right way to deliver the interrupt to the VM. So it is safe to
+	 * do irq_update_devid() before irq_bypass_unregister_producer(),
+	 * which switches interrupt handling to remap mode. To the producer
+	 * and consumer, 'trigger' is only a token used for pairing them.
+	 */
 	if (vdev->ctx[vector].trigger) {
-		free_irq(irq, vdev->ctx[vector].trigger);
-		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
-		kfree(vdev->ctx[vector].name);
-		eventfd_ctx_put(vdev->ctx[vector].trigger);
-		vdev->ctx[vector].trigger = NULL;
+		if (vdev->ctx[vector].trigger == trigger) {
+			/* switch back to remap mode */
+			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+		} else if (trigger) {
+			ret = irq_update_devid(irq,
+					       vdev->ctx[vector].trigger, trigger);
+			if (unlikely(ret)) {
+				dev_info(&pdev->dev,
+					 "update devid of %d (token %p) failed: %d\n",
+					 irq, vdev->ctx[vector].trigger, ret);
+				eventfd_ctx_put(trigger);
+				return ret;
+			}
+			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+			eventfd_ctx_put(vdev->ctx[vector].trigger);
+			vdev->ctx[vector].producer.token = trigger;
+			vdev->ctx[vector].trigger = trigger;
+		} else {
+			free_irq(irq, vdev->ctx[vector].trigger);
+			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+			kfree(vdev->ctx[vector].name);
+			eventfd_ctx_put(vdev->ctx[vector].trigger);
+			vdev->ctx[vector].trigger = NULL;
+		}
 	}
 
 	if (fd < 0)
 		return 0;
 
-	vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
-					   msix ? "x" : "", vector,
-					   pci_name(pdev));
-	if (!vdev->ctx[vector].name)
-		return -ENOMEM;
+	if (!vdev->ctx[vector].trigger) {
+		vdev->ctx[vector].name = kasprintf(GFP_KERNEL,
+						   "vfio-msi%s[%d](%s)",
+						   msix ? "x" : "", vector,
+						   pci_name(pdev));
+		if (!vdev->ctx[vector].name) {
+			eventfd_ctx_put(trigger);
+			return -ENOMEM;
+		}
 
-	trigger = eventfd_ctx_fdget(fd);
-	if (IS_ERR(trigger)) {
-		kfree(vdev->ctx[vector].name);
-		return PTR_ERR(trigger);
-	}
+		/*
+		 * The MSIx vector table resides in device memory which may be
+		 * cleared via backdoor resets. We don't allow direct access to
+		 * the vector table so even if a userspace driver attempts to
+		 * save/restore around such a reset it would be unsuccessful.
+		 * To avoid this, restore the cached value of the message prior
+		 * to enabling.
+		 */
+		if (msix) {
+			struct msi_msg msg;
 
-	/*
-	 * The MSIx vector table resides in device memory which may be cleared
-	 * via backdoor resets. We don't allow direct access to the vector
-	 * table so even if a userspace driver attempts to save/restore around
-	 * such a reset it would be unsuccessful. To avoid this, restore the
-	 * cached value of the message prior to enabling.
-	 */
-	if (msix) {
-		struct msi_msg msg;
+			get_cached_msi_msg(irq, &msg);
+			pci_write_msi_msg(irq, &msg);
+		}
 
-		get_cached_msi_msg(irq, &msg);
-		pci_write_msi_msg(irq, &msg);
-	}
+		ret = request_irq(irq, vfio_msihandler, 0,
+				  vdev->ctx[vector].name, trigger);
+		if (ret) {
+			kfree(vdev->ctx[vector].name);
+			eventfd_ctx_put(trigger);
+			return ret;
+		}
 
-	ret = request_irq(irq, vfio_msihandler, 0,
-			  vdev->ctx[vector].name, trigger);
-	if (ret) {
-		kfree(vdev->ctx[vector].name);
-		eventfd_ctx_put(trigger);
-		return ret;
+		vdev->ctx[vector].producer.token = trigger;
+		vdev->ctx[vector].producer.irq = irq;
+		vdev->ctx[vector].trigger = trigger;
 	}
 
-	vdev->ctx[vector].producer.token = trigger;
-	vdev->ctx[vector].producer.irq = irq;
+	/* set up the bypass connection and get the IRTE updated */
 	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
 	if (unlikely(ret))
 		dev_info(&pdev->dev,
 		"irq bypass producer (token %p) registration fails: %d\n",
 		vdev->ctx[vector].producer.token, ret);
 
-	vdev->ctx[vector].trigger = trigger;
-
 	return 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops
  2019-08-22 15:34 ` [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops Ben Luo
@ 2019-08-27 20:33   ` Alex Williamson
  2019-08-28 10:08     ` Ben Luo
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2019-08-27 20:33 UTC (permalink / raw)
  To: Ben Luo; +Cc: tglx, linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

On Thu, 22 Aug 2019 23:34:43 +0800
Ben Luo <luoben@linux.alibaba.com> wrote:

> When userspace (e.g. qemu) triggers a switch between KVM
> irqfd and userspace eventfd, only dev_id of irq action
> (i.e. the "trigger" in this patch's context) will be
> changed, but a free-then-request-irq action is taken in
> current code. And, irq affinity setting in VM will also
> trigger a free-then-request-irq action, which actually
> changes nothing, but only fires a producer re-registration
> to update irte in case that posted-interrupt is in use.
> 
> This patch makes use of irq_update_devid() and optimize
> both cases above, which reduces the risk of losing interrupt
> and also cuts some overhead.
> 
> Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c | 112 +++++++++++++++++++++++++-------------
>  1 file changed, 74 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 3fa3f72..60d3023 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
>  static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
>  				      int vector, int fd, bool msix)
>  {
> +	struct eventfd_ctx *trigger = NULL;
>  	struct pci_dev *pdev = vdev->pdev;
> -	struct eventfd_ctx *trigger;
>  	int irq, ret;
>  
>  	if (vector < 0 || vector >= vdev->num_ctx)
>  		return -EINVAL;
>  
> +	if (fd >= 0) {
> +		trigger = eventfd_ctx_fdget(fd);
> +		if (IS_ERR(trigger))
> +			return PTR_ERR(trigger);
> +	}

I think this is a user visible change.  Previously the vector is
disabled first, then if an error occurs re-enabling, we return an errno
with the vector disabled.  Here we instead fail the ioctl and leave the
state as if it had never happened.  For instance with QEMU, if they
were trying to change from KVM to userspace signaling and entered this
condition, previously the interrupt would signal to neither eventfd, now
it would continue signaling to KVM.  If QEMU's intent was to emulate
vector masking, this could induce unhandled interrupts in the guest.
Maybe we need a tear-down on fault here to maintain that behavior, or
do you see some justification for the change?

> +
>  	irq = pci_irq_vector(pdev, vector);
>  
> +	/*
> +	 * For KVM-VFIO case, interrupt from passthrough device will be directly
> +	 * delivered to VM after producer and consumer connected successfully.
> +	 * If producer and consumer are disconnected, this interrupt process
> +	 * will fall back to remap mode, where interrupt handler uses 'trigger'
> +	 * to find the right way to deliver the interrupt to VM. So, it is safe
> +	 * to do irq_update_devid() before irq_bypass_unregister_producer() which
> +	 * switches interrupt process to remap mode. To producer and consumer,
> +	 * 'trigger' is only a token used for pairing them togather.
> +	 */
>  	if (vdev->ctx[vector].trigger) {
> -		free_irq(irq, vdev->ctx[vector].trigger);
> -		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> -		kfree(vdev->ctx[vector].name);
> -		eventfd_ctx_put(vdev->ctx[vector].trigger);
> -		vdev->ctx[vector].trigger = NULL;
> +		if (vdev->ctx[vector].trigger == trigger) {
> +			/* switch back to remap mode */
> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);

I think we leak the fd context we acquired above in this case.

Why do we do anything in this case, couldn't we just 'put' the extra ctx
and return 0 here?

> +		} else if (trigger) {
> +			ret = irq_update_devid(irq,
> +					       vdev->ctx[vector].trigger, trigger);
> +			if (unlikely(ret)) {
> +				dev_info(&pdev->dev,
> +					 "update devid of %d (token %p) failed: %d\n",
> +					 irq, vdev->ctx[vector].trigger, ret);
> +				eventfd_ctx_put(trigger);
> +				return ret;
> +			}
> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);

Can you explain this ordering, I would have expected that we'd
unregister the bypass before we updated the devid.  Thanks,

Alex

> +			eventfd_ctx_put(vdev->ctx[vector].trigger);
> +			vdev->ctx[vector].producer.token = trigger;
> +			vdev->ctx[vector].trigger = trigger;
> +		} else {
> +			free_irq(irq, vdev->ctx[vector].trigger);
> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> +			kfree(vdev->ctx[vector].name);
> +			eventfd_ctx_put(vdev->ctx[vector].trigger);
> +			vdev->ctx[vector].trigger = NULL;
> +		}
>  	}
>  
>  	if (fd < 0)
>  		return 0;
>  
> -	vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
> -					   msix ? "x" : "", vector,
> -					   pci_name(pdev));
> -	if (!vdev->ctx[vector].name)
> -		return -ENOMEM;
> +	if (!vdev->ctx[vector].trigger) {
> +		vdev->ctx[vector].name = kasprintf(GFP_KERNEL,
> +						   "vfio-msi%s[%d](%s)",
> +						   msix ? "x" : "", vector,
> +						   pci_name(pdev));
> +		if (!vdev->ctx[vector].name) {
> +			eventfd_ctx_put(trigger);
> +			return -ENOMEM;
> +		}
>  
> -	trigger = eventfd_ctx_fdget(fd);
> -	if (IS_ERR(trigger)) {
> -		kfree(vdev->ctx[vector].name);
> -		return PTR_ERR(trigger);
> -	}
> +		/*
> +		 * The MSIx vector table resides in device memory which may be
> +		 * cleared via backdoor resets. We don't allow direct access to
> +		 * the vector table so even if a userspace driver attempts to
> +		 * save/restore around such a reset it would be unsuccessful.
> +		 * To avoid this, restore the cached value of the message prior
> +		 * to enabling.
> +		 */
> +		if (msix) {
> +			struct msi_msg msg;
>  
> -	/*
> -	 * The MSIx vector table resides in device memory which may be cleared
> -	 * via backdoor resets. We don't allow direct access to the vector
> -	 * table so even if a userspace driver attempts to save/restore around
> -	 * such a reset it would be unsuccessful. To avoid this, restore the
> -	 * cached value of the message prior to enabling.
> -	 */
> -	if (msix) {
> -		struct msi_msg msg;
> +			get_cached_msi_msg(irq, &msg);
> +			pci_write_msi_msg(irq, &msg);
> +		}
>  
> -		get_cached_msi_msg(irq, &msg);
> -		pci_write_msi_msg(irq, &msg);
> -	}
> +		ret = request_irq(irq, vfio_msihandler, 0,
> +				  vdev->ctx[vector].name, trigger);
> +		if (ret) {
> +			kfree(vdev->ctx[vector].name);
> +			eventfd_ctx_put(trigger);
> +			return ret;
> +		}
>  
> -	ret = request_irq(irq, vfio_msihandler, 0,
> -			  vdev->ctx[vector].name, trigger);
> -	if (ret) {
> -		kfree(vdev->ctx[vector].name);
> -		eventfd_ctx_put(trigger);
> -		return ret;
> +		vdev->ctx[vector].producer.token = trigger;
> +		vdev->ctx[vector].producer.irq = irq;
> +		vdev->ctx[vector].trigger = trigger;
>  	}
>  
> -	vdev->ctx[vector].producer.token = trigger;
> -	vdev->ctx[vector].producer.irq = irq;
> +	/* setup bypass connection and make irte updated */
>  	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
>  	if (unlikely(ret))
>  		dev_info(&pdev->dev,
>  		"irq bypass producer (token %p) registration fails: %d\n",
>  		vdev->ctx[vector].producer.token, ret);
>  
> -	vdev->ctx[vector].trigger = trigger;
> -
>  	return 0;
>  }
>  


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops
  2019-08-27 20:33   ` Alex Williamson
@ 2019-08-28 10:08     ` Ben Luo
  2019-08-28 17:23       ` Alex Williamson
  0 siblings, 1 reply; 9+ messages in thread
From: Ben Luo @ 2019-08-28 10:08 UTC (permalink / raw)
  To: Alex Williamson
  Cc: tglx, linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng


On 2019/8/28 4:33 AM, Alex Williamson wrote:
> On Thu, 22 Aug 2019 23:34:43 +0800
> Ben Luo <luoben@linux.alibaba.com> wrote:
>
>> When userspace (e.g. qemu) triggers a switch between KVM
>> irqfd and userspace eventfd, only dev_id of irq action
>> (i.e. the "trigger" in this patch's context) will be
>> changed, but a free-then-request-irq action is taken in
>> current code. And, irq affinity setting in VM will also
>> trigger a free-then-request-irq action, which actually
>> changes nothing, but only fires a producer re-registration
>> to update irte in case that posted-interrupt is in use.
>>
>> This patch makes use of irq_update_devid() and optimize
>> both cases above, which reduces the risk of losing interrupt
>> and also cuts some overhead.
>>
>> Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
>> ---
>>   drivers/vfio/pci/vfio_pci_intrs.c | 112 +++++++++++++++++++++++++-------------
>>   1 file changed, 74 insertions(+), 38 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
>> index 3fa3f72..60d3023 100644
>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
>> @@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
>>   static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
>>   				      int vector, int fd, bool msix)
>>   {
>> +	struct eventfd_ctx *trigger = NULL;
>>   	struct pci_dev *pdev = vdev->pdev;
>> -	struct eventfd_ctx *trigger;
>>   	int irq, ret;
>>   
>>   	if (vector < 0 || vector >= vdev->num_ctx)
>>   		return -EINVAL;
>>   
>> +	if (fd >= 0) {
>> +		trigger = eventfd_ctx_fdget(fd);
>> +		if (IS_ERR(trigger))
>> +			return PTR_ERR(trigger);
>> +	}
> I think this is a user visible change.  Previously the vector is
> disabled first, then if an error occurs re-enabling, we return an errno
> with the vector disabled.  Here we instead fail the ioctl and leave the
> state as if it had never happened.  For instance with QEMU, if they
> were trying to change from KVM to userspace signaling and entered this
> condition, previously the interrupt would signal to neither eventfd, now
> it would continue signaling to KVM. If QEMU's intent was to emulate
> vector masking, this could induce unhandled interrupts in the guest.
> Maybe we need a tear-down on fault here to maintain that behavior, or
> do you see some justification for the change?

Thanks for your comments, this reminds me to think more about the
effects on users.

After reviewing the related code in QEMU and VFIO, I think there may be
a problem in the current behavior for the signal-path-changing case.
QEMU has neither recovery nor retry code in case the ioctl with the
'VFIO_DEVICE_SET_IRQS' command fails, so if the old signal path has
already been disabled when setting up the new path fails, the
corresponding vector may be disabled forever. Following is an example
from QEMU's vfio_msix_vector_do_use():

         ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
         g_free(irq_set);
         if (ret) {
             error_report("vfio: failed to modify vector, %d", ret);
         }

I think the signal path in place before the change is still working at
this point, and the caller should keep it working if the change fails,
so that at least we still have the old path instead of no path.

For the vector-masking case, 'fd' should be -1, and the interrupt will
be freed just as before this patch.

>
>> +
>>   	irq = pci_irq_vector(pdev, vector);
>>   
>> +	/*
>> +	 * For KVM-VFIO case, interrupt from passthrough device will be directly
>> +	 * delivered to VM after producer and consumer connected successfully.
>> +	 * If producer and consumer are disconnected, this interrupt process
>> +	 * will fall back to remap mode, where interrupt handler uses 'trigger'
>> +	 * to find the right way to deliver the interrupt to VM. So, it is safe
>> +	 * to do irq_update_devid() before irq_bypass_unregister_producer() which
>> +	 * switches interrupt process to remap mode. To producer and consumer,
>> +	 * 'trigger' is only a token used for pairing them togather.
>> +	 */
>>   	if (vdev->ctx[vector].trigger) {
>> -		free_irq(irq, vdev->ctx[vector].trigger);
>> -		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>> -		kfree(vdev->ctx[vector].name);
>> -		eventfd_ctx_put(vdev->ctx[vector].trigger);
>> -		vdev->ctx[vector].trigger = NULL;
>> +		if (vdev->ctx[vector].trigger == trigger) {
>> +			/* switch back to remap mode */
>> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> I think we leak the fd context we acquired above in this case.
Thanks for pointing it out, I will fix this in next version.
>
> Why do we do anything in this case, couldn't we just 'put' the extra ctx
> and return 0 here?

Sorry for the confusion; I do this for a reason, and I will add some
more comments like this:

The unregistration here is for the re-registration later, which
triggers the reconnection of the irq_bypass producer and consumer,
which in turn calls vmx_update_pi_irte() to update the IRTE if posted
interrupts are in use (vmx_update_pi_irte() modifies the IRTE based on
information retrieved from KVM). Whether the producer token changed or
not, irq_bypass_register_producer() is a way (seemingly the only way)
for VFIO to update the IRTE for posted interrupts. The IRTE is used by
the IOMMU directly to find the target CPU for an interrupt posted to
the VM, since the hypervisor is bypassed.
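
Put as code, the "bounce" described above is the sequence already
present in the patch (sketch only; the callback named below is how it
resolves on Intel with posted interrupts, per the description above):

	/*
	 * Re-registering the producer makes the irq_bypass core reconnect it
	 * with the KVM consumer, which in turn refreshes the IRTE (via
	 * vmx_update_pi_irte() on Intel when posted interrupts are in use).
	 */
	irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);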
>
>> +		} else if (trigger) {
>> +			ret = irq_update_devid(irq,
>> +					       vdev->ctx[vector].trigger, trigger);
>> +			if (unlikely(ret)) {
>> +				dev_info(&pdev->dev,
>> +					 "update devid of %d (token %p) failed: %d\n",
>> +					 irq, vdev->ctx[vector].trigger, ret);
>> +				eventfd_ctx_put(trigger);
>> +				return ret;
>> +			}
>> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> Can you explain this ordering, I would have expected that we'd
> unregister the bypass before we updated the devid.  Thanks,
>
> Alex
Actually, I have explained this in the comment above this whole control
block. I think it is safe, and better, to update the devid before
irq_bypass_unregister_producer(), which switches interrupt handling
from posted mode to remap mode. So, if the update fails, the posted
interrupt can still work.

Anyway, to the producer and consumer, 'trigger' is only a token used
for pairing them together.

Thanks,

     Ben


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops
  2019-08-28 10:08     ` Ben Luo
@ 2019-08-28 17:23       ` Alex Williamson
  2019-08-29  5:40         ` Ben Luo
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2019-08-28 17:23 UTC (permalink / raw)
  To: Ben Luo; +Cc: tglx, linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

On Wed, 28 Aug 2019 18:08:02 +0800
Ben Luo <luoben@linux.alibaba.com> wrote:

> On 2019/8/28 4:33 AM, Alex Williamson wrote:
> > On Thu, 22 Aug 2019 23:34:43 +0800
> > Ben Luo <luoben@linux.alibaba.com> wrote:
> >  
> >> When userspace (e.g. qemu) triggers a switch between KVM
> >> irqfd and userspace eventfd, only dev_id of irq action
> >> (i.e. the "trigger" in this patch's context) will be
> >> changed, but a free-then-request-irq action is taken in
> >> current code. And, irq affinity setting in VM will also
> >> trigger a free-then-request-irq action, which actually
> >> changes nothing, but only fires a producer re-registration
> >> to update irte in case that posted-interrupt is in use.
> >>
> >> This patch makes use of irq_update_devid() and optimize
> >> both cases above, which reduces the risk of losing interrupt
> >> and also cuts some overhead.
> >>
> >> Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
> >> ---
> >>   drivers/vfio/pci/vfio_pci_intrs.c | 112 +++++++++++++++++++++++++-------------
> >>   1 file changed, 74 insertions(+), 38 deletions(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> >> index 3fa3f72..60d3023 100644
> >> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> >> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> >> @@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
> >>   static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
> >>   				      int vector, int fd, bool msix)
> >>   {
> >> +	struct eventfd_ctx *trigger = NULL;
> >>   	struct pci_dev *pdev = vdev->pdev;
> >> -	struct eventfd_ctx *trigger;
> >>   	int irq, ret;
> >>   
> >>   	if (vector < 0 || vector >= vdev->num_ctx)
> >>   		return -EINVAL;
> >>   
> >> +	if (fd >= 0) {
> >> +		trigger = eventfd_ctx_fdget(fd);
> >> +		if (IS_ERR(trigger))
> >> +			return PTR_ERR(trigger);
> >> +	}  
> > I think this is a user visible change.  Previously the vector is
> > disabled first, then if an error occurs re-enabling, we return an errno
> > with the vector disabled.  Here we instead fail the ioctl and leave the
> > state as if it had never happened.  For instance with QEMU, if they
> > were trying to change from KVM to userspace signaling and entered this
> > condition, previously the interrupt would signal to neither eventfd, now
> > it would continue signaling to KVM. If QEMU's intent was to emulate
> > vector masking, this could induce unhandled interrupts in the guest.
> > Maybe we need a tear-down on fault here to maintain that behavior, or
> > do you see some justification for the change?  
> 
> Thanks for your comments, this reminds me to think more about the 
> effects to users.
> 
> After I reviewed the related code in Qemu and VFIO, I think maybe there 
> is a problem in current behavior
> for the signal path changing case. Qemu has neither recovery nor retry 
> code in case that ioctl with
> 'VFIO_DEVICE_SET_IRQS' command fails, so if the old signal path has been 
> disabled on fault of setting
> up new path, the corresponding vector may be disabled forever. Following 
> is an example from qemu's
> vfio_msix_vector_do_use():
> 
>          ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
>          g_free(irq_set);
>          if (ret) {
>              error_report("vfio: failed to modify vector, %d", ret);
>          }
> 
> I think the singal path before changing should be still working at this 
> moment and the caller should keep it
> working if the changing fails, so that at least we still have the old 
> path instead of no path.
> 
> For masking vector case, the 'fd' should be -1, and the interrupt will 
> be freed as before this patch.

QEMU doesn't really have an opportunity to signal an error to the
guest, we're emulating the hardware masking of MSI and MSI-X.  The
guest is simply trying to write a mask bit in the vector, there's no
provision in the PCI spec that setting this bit can fail.  The current
behavior is that the vector is disabled on error.  We can argue whether
that's the optimal behavior, but it's the existing behavior and
changing it would require an evaluation of all existing users.

> >> +
> >>   	irq = pci_irq_vector(pdev, vector);
> >>   
> >> +	/*
> >> +	 * For KVM-VFIO case, interrupt from passthrough device will be directly
> >> +	 * delivered to VM after producer and consumer connected successfully.
> >> +	 * If producer and consumer are disconnected, this interrupt process
> >> +	 * will fall back to remap mode, where interrupt handler uses 'trigger'
> >> +	 * to find the right way to deliver the interrupt to VM. So, it is safe
> >> +	 * to do irq_update_devid() before irq_bypass_unregister_producer() which
> >> +	 * switches interrupt process to remap mode. To producer and consumer,
> >> +	 * 'trigger' is only a token used for pairing them togather.
> >> +	 */
> >>   	if (vdev->ctx[vector].trigger) {
> >> -		free_irq(irq, vdev->ctx[vector].trigger);
> >> -		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> >> -		kfree(vdev->ctx[vector].name);
> >> -		eventfd_ctx_put(vdev->ctx[vector].trigger);
> >> -		vdev->ctx[vector].trigger = NULL;
> >> +		if (vdev->ctx[vector].trigger == trigger) {
> >> +			/* switch back to remap mode */
> >> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);  
> > I think we leak the fd context we acquired above in this case.  
> Thanks for pointing it out, I will fix this in next version.
> >
> > Why do we do anything in this case, couldn't we just 'put' the extra ctx
> > and return 0 here?  
> 
> Sorry for confusing and I do this for a reason,  I will add some more 
> comments like this:
> 
> Unregistration here is for re-resigtraion later, which will trigger the 
> reconnection of irq_bypass producer
> and consumer, which in turn calls vmx_update_pi_irte() to update IRTE if 
> posted interrupt is in use.
> (vmx_update_pi_irte() will modify IRTE based on the information 
> retrieved from KVM.)
> Whether producer token changed or not, irq_bypass_register_producer() is 
> a way (seems the only way) to
> update IRTE by VFIO for posted interrupt. The IRTE will be used by IOMMU 
> directly to find the target CPU
> for an interrupt posted to VM, since hypervisor is bypassed.

This is only explaining what the bypass de-registration and
re-registration does, not why we need to perform those actions here.
If the trigger is the same as that already attached to this vector, why
is the IRTE changing?  Seems this ought to be a no-op for this vector.
 
> >> +		} else if (trigger) {
> >> +			ret = irq_update_devid(irq,
> >> +					       vdev->ctx[vector].trigger, trigger);
> >> +			if (unlikely(ret)) {
> >> +				dev_info(&pdev->dev,
> >> +					 "update devid of %d (token %p) failed: %d\n",
> >> +					 irq, vdev->ctx[vector].trigger, ret);
> >> +				eventfd_ctx_put(trigger);
> >> +				return ret;
> >> +			}
> >> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);  
> > Can you explain this ordering, I would have expected that we'd
> > unregister the bypass before we updated the devid.  Thanks,
> >
> > Alex  
> Actually, I have explained this in comments above this whole control 
> block. I think it is safe and better to
> update devid before irq_bypass_unregister_producer() which switches 
> interrupt process from posted mode
> to remap mode. So, if update fails, the posted interrupt can still work.
> 
> Anyway, to producer and consumer,  'trigger' is only a token used for 
> pairing them togather.

The bypass is not a guaranteed mechanism, it's an opportunistic
accelerator.  If the devid update fails, what state are we left with?
The irq action may not work, but the bypass does; maybe; maybe not all
the time?  This seems to fall into the same consistency in userspace
behavior issue identified above.  The user ABI is that the vector is
disabled if an error is returned.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops
  2019-08-28 17:23       ` Alex Williamson
@ 2019-08-29  5:40         ` Ben Luo
  2019-08-29 13:48           ` Alex Williamson
  0 siblings, 1 reply; 9+ messages in thread
From: Ben Luo @ 2019-08-29  5:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: tglx, linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng


On 2019/8/29 1:23 AM, Alex Williamson wrote:
> On Wed, 28 Aug 2019 18:08:02 +0800
> Ben Luo <luoben@linux.alibaba.com> wrote:
>
>> On 2019/8/28 4:33 AM, Alex Williamson wrote:
>>> On Thu, 22 Aug 2019 23:34:43 +0800
>>> Ben Luo <luoben@linux.alibaba.com> wrote:
>>>   
>>>> When userspace (e.g. qemu) triggers a switch between KVM
>>>> irqfd and userspace eventfd, only dev_id of irq action
>>>> (i.e. the "trigger" in this patch's context) will be
>>>> changed, but a free-then-request-irq action is taken in
>>>> current code. And, irq affinity setting in VM will also
>>>> trigger a free-then-request-irq action, which actually
>>>> changes nothing, but only fires a producer re-registration
>>>> to update irte in case that posted-interrupt is in use.
>>>>
>>>> This patch makes use of irq_update_devid() and optimize
>>>> both cases above, which reduces the risk of losing interrupt
>>>> and also cuts some overhead.
>>>>
>>>> Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
>>>> ---
>>>>    drivers/vfio/pci/vfio_pci_intrs.c | 112 +++++++++++++++++++++++++-------------
>>>>    1 file changed, 74 insertions(+), 38 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
>>>> index 3fa3f72..60d3023 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
>>>> @@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
>>>>    static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
>>>>    				      int vector, int fd, bool msix)
>>>>    {
>>>> +	struct eventfd_ctx *trigger = NULL;
>>>>    	struct pci_dev *pdev = vdev->pdev;
>>>> -	struct eventfd_ctx *trigger;
>>>>    	int irq, ret;
>>>>    
>>>>    	if (vector < 0 || vector >= vdev->num_ctx)
>>>>    		return -EINVAL;
>>>>    
>>>> +	if (fd >= 0) {
>>>> +		trigger = eventfd_ctx_fdget(fd);
>>>> +		if (IS_ERR(trigger))
>>>> +			return PTR_ERR(trigger);
>>>> +	}
>>> I think this is a user visible change.  Previously the vector is
>>> disabled first, then if an error occurs re-enabling, we return an errno
>>> with the vector disabled.  Here we instead fail the ioctl and leave the
>>> state as if it had never happened.  For instance with QEMU, if they
>>> were trying to change from KVM to userspace signaling and entered this
>>> condition, previously the interrupt would signal to neither eventfd, now
>>> it would continue signaling to KVM. If QEMU's intent was to emulate
>>> vector masking, this could induce unhandled interrupts in the guest.
>>> Maybe we need a tear-down on fault here to maintain that behavior, or
>>> do you see some justification for the change?
>> Thanks for your comments, this reminds me to think more about the
>> effects to users.
>>
>> After I reviewed the related code in Qemu and VFIO, I think maybe there
>> is a problem in current behavior
>> for the signal path changing case. Qemu has neither recovery nor retry
>> code in case that ioctl with
>> 'VFIO_DEVICE_SET_IRQS' command fails, so if the old signal path has been
>> disabled on fault of setting
>> up new path, the corresponding vector may be disabled forever. Following
>> is an example from qemu's
>> vfio_msix_vector_do_use():
>>
>>           ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
>>           g_free(irq_set);
>>           if (ret) {
>>               error_report("vfio: failed to modify vector, %d", ret);
>>           }
>>
>> I think the singal path before changing should be still working at this
>> moment and the caller should keep it
>> working if the changing fails, so that at least we still have the old
>> path instead of no path.
>>
>> For masking vector case, the 'fd' should be -1, and the interrupt will
>> be freed as before this patch.
> QEMU doesn't really have an opportunity to signal an error to the
> guest, we're emulating the hardware masking of MSI and MSI-X.  The
> guest is simply trying to write a mask bit in the vector, there's no
> provision in the PCI spec that setting this bit can fail.  The current
> behavior is that the vector is disabled on error.  We can argue whether
> that's the optimal behavior, but it's the existing behavior and
> changing it would require and evaluation of all existing users.

I totally agree with you that masking of MSI and MSI-X should follow
the current behavior, and my code does follow this behavior: when
'fd' == -1, the interrupt will surely be disabled.

There is another case where 'fd' is not -1, meaning userspace just
wants to change the signal path; here we do have a chance of
encountering an error on eventfd_ctx_fdget(fd). So, I think it is
better to keep the old path working in this case.

But, as you said, this may break the expectation of existing users, so
I will make it tear down on fault in the next version.

>
>>>> +
>>>>    	irq = pci_irq_vector(pdev, vector);
>>>>    
>>>> +	/*
>>>> +	 * For KVM-VFIO case, interrupt from passthrough device will be directly
>>>> +	 * delivered to VM after producer and consumer connected successfully.
>>>> +	 * If producer and consumer are disconnected, this interrupt process
>>>> +	 * will fall back to remap mode, where interrupt handler uses 'trigger'
>>>> +	 * to find the right way to deliver the interrupt to VM. So, it is safe
>>>> +	 * to do irq_update_devid() before irq_bypass_unregister_producer() which
>>>> +	 * switches interrupt process to remap mode. To producer and consumer,
>>>> +	 * 'trigger' is only a token used for pairing them togather.
>>>> +	 */
>>>>    	if (vdev->ctx[vector].trigger) {
>>>> -		free_irq(irq, vdev->ctx[vector].trigger);
>>>> -		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>>> -		kfree(vdev->ctx[vector].name);
>>>> -		eventfd_ctx_put(vdev->ctx[vector].trigger);
>>>> -		vdev->ctx[vector].trigger = NULL;
>>>> +		if (vdev->ctx[vector].trigger == trigger) {
>>>> +			/* switch back to remap mode */
>>>> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>> I think we leak the fd context we acquired above in this case.
>> Thanks for pointing it out, I will fix this in next version.
>>> Why do we do anything in this case, couldn't we just 'put' the extra ctx
>>> and return 0 here?
>> Sorry for confusing and I do this for a reason,  I will add some more
>> comments like this:
>>
>> Unregistration here is for re-resigtraion later, which will trigger the
>> reconnection of irq_bypass producer
>> and consumer, which in turn calls vmx_update_pi_irte() to update IRTE if
>> posted interrupt is in use.
>> (vmx_update_pi_irte() will modify IRTE based on the information
>> retrieved from KVM.)
>> Whether producer token changed or not, irq_bypass_register_producer() is
>> a way (seems the only way) to
>> update IRTE by VFIO for posted interrupt. The IRTE will be used by IOMMU
>> directly to find the target CPU
>> for an interrupt posted to VM, since hypervisor is bypassed.
> This is only explaining what the bypass de-registration and
> re-registration does, not why we need to perform those actions here.
> If the trigger is the same as that already attached to this vector, why
> is the IRTE changing?  Seems this ought to be a no-op for this vector.

Sorry, I think it cannot be a no-op here. Setting interrupt affinity in
the guest also triggers a call to this function, and when posted
interrupts are in use the IRTE is modified by vmx_update_pi_irte() with
the new affinity information retrieved from KVM's data structures, not
from the 'trigger'.

>   
>>>> +		} else if (trigger) {
>>>> +			ret = irq_update_devid(irq,
>>>> +					       vdev->ctx[vector].trigger, trigger);
>>>> +			if (unlikely(ret)) {
>>>> +				dev_info(&pdev->dev,
>>>> +					 "update devid of %d (token %p) failed: %d\n",
>>>> +					 irq, vdev->ctx[vector].trigger, ret);
>>>> +				eventfd_ctx_put(trigger);
>>>> +				return ret;
>>>> +			}
>>>> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>> Can you explain this ordering, I would have expected that we'd
>>> unregister the bypass before we updated the devid.  Thanks,
>>>
>>> Alex
>> Actually, I have explained this in comments above this whole control
>> block. I think it is safe and better to
>> update devid before irq_bypass_unregister_producer() which switches
>> interrupt process from posted mode
>> to remap mode. So, if update fails, the posted interrupt can still work.
>>
>> Anyway, to producer and consumer,  'trigger' is only a token used for
>> pairing them togather.
> The bypass is not a guaranteed mechanism, it's an opportunistic
> accelerator.  If the devid update fails, what state are we left with?
> The irq action may not work, but the bypass does; maybe; maybe not all
> the time?  This seems to fall into the same consistency in userspace
> behavior issue identified above.  The user ABI is that the vector is
> disabled if an error is returned.  Thanks,
>
> Alex

Thanks, now I see it is better to unregister the bypass before updating
the devid; I will change the ordering in the next version.

Ben


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops
  2019-08-29  5:40         ` Ben Luo
@ 2019-08-29 13:48           ` Alex Williamson
  0 siblings, 0 replies; 9+ messages in thread
From: Alex Williamson @ 2019-08-29 13:48 UTC (permalink / raw)
  To: Ben Luo; +Cc: tglx, linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

On Thu, 29 Aug 2019 13:40:59 +0800
Ben Luo <luoben@linux.alibaba.com> wrote:

> On 2019/8/29 1:23 AM, Alex Williamson wrote:
> > On Wed, 28 Aug 2019 18:08:02 +0800
> > Ben Luo <luoben@linux.alibaba.com> wrote:
> >  
> >> On 2019/8/28 4:33 AM, Alex Williamson wrote:
> >>> On Thu, 22 Aug 2019 23:34:43 +0800
> >>> Ben Luo <luoben@linux.alibaba.com> wrote:
> >>>     
> >>>> When userspace (e.g. qemu) triggers a switch between KVM
> >>>> irqfd and userspace eventfd, only dev_id of irq action
> >>>> (i.e. the "trigger" in this patch's context) will be
> >>>> changed, but a free-then-request-irq action is taken in
> >>>> current code. And, irq affinity setting in VM will also
> >>>> trigger a free-then-request-irq action, which actually
> >>>> changes nothing, but only fires a producer re-registration
> >>>> to update irte in case that posted-interrupt is in use.
> >>>>
> >>>> This patch makes use of irq_update_devid() and optimize
> >>>> both cases above, which reduces the risk of losing interrupt
> >>>> and also cuts some overhead.
> >>>>
> >>>> Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
> >>>> ---
> >>>>    drivers/vfio/pci/vfio_pci_intrs.c | 112 +++++++++++++++++++++++++-------------
> >>>>    1 file changed, 74 insertions(+), 38 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> >>>> index 3fa3f72..60d3023 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> >>>> @@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
> >>>>    static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
> >>>>    				      int vector, int fd, bool msix)
> >>>>    {
> >>>> +	struct eventfd_ctx *trigger = NULL;
> >>>>    	struct pci_dev *pdev = vdev->pdev;
> >>>> -	struct eventfd_ctx *trigger;
> >>>>    	int irq, ret;
> >>>>    
> >>>>    	if (vector < 0 || vector >= vdev->num_ctx)
> >>>>    		return -EINVAL;
> >>>>    
> >>>> +	if (fd >= 0) {
> >>>> +		trigger = eventfd_ctx_fdget(fd);
> >>>> +		if (IS_ERR(trigger))
> >>>> +			return PTR_ERR(trigger);
> >>>> +	}  
> >>> I think this is a user visible change.  Previously the vector is
> >>> disabled first, then if an error occurs re-enabling, we return an errno
> >>> with the vector disabled.  Here we instead fail the ioctl and leave the
> >>> state as if it had never happened.  For instance with QEMU, if they
> >>> were trying to change from KVM to userspace signaling and entered this
> >>> condition, previously the interrupt would signal to neither eventfd, now
> >>> it would continue signaling to KVM. If QEMU's intent was to emulate
> >>> vector masking, this could induce unhandled interrupts in the guest.
> >>> Maybe we need a tear-down on fault here to maintain that behavior, or
> >>> do you see some justification for the change?  
> >> Thanks for your comments, this reminds me to think more about the
> >> effects to users.
> >>
> >> After I reviewed the related code in Qemu and VFIO, I think maybe there
> >> is a problem in current behavior
> >> for the signal path changing case. Qemu has neither recovery nor retry
> >> code in case that ioctl with
> >> 'VFIO_DEVICE_SET_IRQS' command fails, so if the old signal path has been
> >> disabled on fault of setting
> >> up new path, the corresponding vector may be disabled forever. Following
> >> is an example from qemu's
> >> vfio_msix_vector_do_use():
> >>
> >>           ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> >>           g_free(irq_set);
> >>           if (ret) {
> >>               error_report("vfio: failed to modify vector, %d", ret);
> >>           }
> >>
> >> I think the singal path before changing should be still working at this
> >> moment and the caller should keep it
> >> working if the changing fails, so that at least we still have the old
> >> path instead of no path.
> >>
> >> For masking vector case, the 'fd' should be -1, and the interrupt will
> >> be freed as before this patch.  
> > QEMU doesn't really have an opportunity to signal an error to the
> > guest, we're emulating the hardware masking of MSI and MSI-X.  The
> > guest is simply trying to write a mask bit in the vector, there's no
> > provision in the PCI spec that setting this bit can fail.  The current
> > behavior is that the vector is disabled on error.  We can argue whether
> > that's the optimal behavior, but it's the existing behavior and
> > changing it would require and evaluation of all existing users.  
> 
> I totally agree with you that masking of MSI and MSI-X should follow 
> current behavior, and my code does follow this behavior when 'fd' == -1, 
> the interrupt will surely be disabled.
> 
> There is another case that 'fd' is not -1, means userspace just want to 
> change the singal path, this time we do have a chance of encountering 
> error on eventfd_ctx_fdget(fd). So, I think it is better to keep the old 
> path working in this case.
> 
> But, as you said this may break the expectation of existing users, I 
> make it tear-down on fault in next version.
> 
> >  
> >>>> +
> >>>>    	irq = pci_irq_vector(pdev, vector);
> >>>>    
> >>>> +	/*
> >>>> +	 * For KVM-VFIO case, interrupt from passthrough device will be directly
> >>>> +	 * delivered to VM after producer and consumer connected successfully.
> >>>> +	 * If producer and consumer are disconnected, this interrupt process
> >>>> +	 * will fall back to remap mode, where interrupt handler uses 'trigger'
> >>>> +	 * to find the right way to deliver the interrupt to VM. So, it is safe
> >>>> +	 * to do irq_update_devid() before irq_bypass_unregister_producer() which
> >>>> +	 * switches interrupt process to remap mode. To producer and consumer,
> >>>> +	 * 'trigger' is only a token used for pairing them togather.
> >>>> +	 */
> >>>>    	if (vdev->ctx[vector].trigger) {
> >>>> -		free_irq(irq, vdev->ctx[vector].trigger);
> >>>> -		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> >>>> -		kfree(vdev->ctx[vector].name);
> >>>> -		eventfd_ctx_put(vdev->ctx[vector].trigger);
> >>>> -		vdev->ctx[vector].trigger = NULL;
> >>>> +		if (vdev->ctx[vector].trigger == trigger) {
> >>>> +			/* switch back to remap mode */
> >>>> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);  
> >>> I think we leak the fd context we acquired above in this case.  
> >> Thanks for pointing it out, I will fix this in next version.  
> >>> Why do we do anything in this case, couldn't we just 'put' the extra ctx
> >>> and return 0 here?  
> >> Sorry for confusing and I do this for a reason,  I will add some more
> >> comments like this:
> >>
> >> Unregistration here is for re-resigtraion later, which will trigger the
> >> reconnection of irq_bypass producer
> >> and consumer, which in turn calls vmx_update_pi_irte() to update IRTE if
> >> posted interrupt is in use.
> >> (vmx_update_pi_irte() will modify IRTE based on the information
> >> retrieved from KVM.)
> >> Whether producer token changed or not, irq_bypass_register_producer() is
> >> a way (seems the only way) to
> >> update IRTE by VFIO for posted interrupt. The IRTE will be used by IOMMU
> >> directly to find the target CPU
> >> for an interrupt posted to VM, since hypervisor is bypassed.  
> > This is only explaining what the bypass de-registration and
> > re-registration does, not why we need to perform those actions here.
> > If the trigger is the same as that already attached to this vector, why
> > is the IRTE changing?  Seems this ought to be a no-op for this vector.  
> 
> Sorry, I think it cannot be a no-op here. The interrupt affinity setting 
> in guest also triggers the calling of this function and IRTE will be 
> modified with new affinity information retrieved from KVM's data 
> structure by vmx_update_pi_irte() when posted interrupt is in use, not 
> from the 'trigger'.

Aha, this is the key piece of information I was missing.  Please add a
comment that even if the trigger is unchanged we need to bounce the
bypass to allow affinity changes in the guest to be realized.  Thanks,

Alex

> >>>> +		} else if (trigger) {
> >>>> +			ret = irq_update_devid(irq,
> >>>> +					       vdev->ctx[vector].trigger, trigger);
> >>>> +			if (unlikely(ret)) {
> >>>> +				dev_info(&pdev->dev,
> >>>> +					 "update devid of %d (token %p) failed: %d\n",
> >>>> +					 irq, vdev->ctx[vector].trigger, ret);
> >>>> +				eventfd_ctx_put(trigger);
> >>>> +				return ret;
> >>>> +			}
> >>>> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);  
> >>> Can you explain this ordering, I would have expected that we'd
> >>> unregister the bypass before we updated the devid.  Thanks,
> >>>
> >>> Alex  
> >> Actually, I have explained this in comments above this whole control
> >> block. I think it is safe and better to
> >> update devid before irq_bypass_unregister_producer() which switches
> >> interrupt process from posted mode
> >> to remap mode. So, if update fails, the posted interrupt can still work.
> >>
> >> Anyway, to producer and consumer,  'trigger' is only a token used for
> >> pairing them togather.  
> > The bypass is not a guaranteed mechanism, it's an opportunistic
> > accelerator.  If the devid update fails, what state are we left with?
> > The irq action may not work, but the bypass does; maybe; maybe not all
> > the time?  This seems to fall into the same consistency in userspace
> > behavior issue identified above.  The user ABI is that the vector is
> > disabled if an error is returned.  Thanks,
> >
> > Alex  
> 
> Thanks, now I see it is better to unregister the bypass before update 
> the devid, will change ordering in next version.
> 
> Ben
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-08-29 13:48 UTC | newest]

Thread overview: 9+ messages
2019-08-22 15:34 [PATCH v4 0/3] genirq/vfio: Introduce irq_update_devid() and optimize VFIO irq ops Ben Luo
2019-08-22 15:34 ` [PATCH v4 1/3] genirq: enhance error recovery code in free irq Ben Luo
2019-08-22 15:34 ` [PATCH v4 2/3] genirq: introduce irq_update_devid() Ben Luo
2019-08-22 15:34 ` [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops Ben Luo
2019-08-27 20:33   ` Alex Williamson
2019-08-28 10:08     ` Ben Luo
2019-08-28 17:23       ` Alex Williamson
2019-08-29  5:40         ` Ben Luo
2019-08-29 13:48           ` Alex Williamson
