* [PATCH v5 0/3] genirq/vfio: Introduce irq_update_devid() and optimize VFIO irq ops
@ 2019-08-30  8:41 Ben Luo
  2019-08-30  8:41 ` [PATCH v5 1/3] genirq: enhance error recovery code in free irq Ben Luo
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Ben Luo @ 2019-08-30  8:41 UTC (permalink / raw)
  To: tglx, alex.williamson; +Cc: linux-kernel, tao.ma, gerry, nanhai.zou

Currently, VFIO uses a free-then-request-irq sequence to handle interrupt
affinity setting and masking/unmasking for a VM (with device passthrough
via VFIO). Often this only changes the cookie data of the irqaction, or
even changes nothing at all. The free-then-request-irq sequence not only
adds latency, but also increases the risk of losing an interrupt, which
may leave the VM hung forever waiting for IO completion.

This patchset solves the issue as follows (a minimal before/after sketch
follows the list):
Patch 2 introduces irq_update_devid() to update only the dev_id of an irqaction
Patch 3 makes use of this function to optimize the irq operations in VFIO
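
A rough sketch of the two call patterns (illustration only, not taken
verbatim from the patches; irq, handler, name, old_cookie and new_cookie
are placeholder names), assuming a driver that merely needs to swap the
interrupt cookie:

	/* old path: the IRTE is torn down and rebuilt, leaving a window
	 * in which in-flight interrupts hit an invalid entry */
	free_irq(irq, old_cookie);
	ret = request_irq(irq, handler, 0, name, new_cookie);

	/* new path: only the irqaction cookie is swapped, the IRTE and
	 * the registered handler stay valid the whole time */
	ret = irq_update_devid(irq, old_cookie, new_cookie);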

changes from v4:
 - Patch 3: follow the previous behavior of disabling the interrupt on the error path
 - Patch 3: do irqbypass registration before updating or freeing the interrupt
 - Patch 3: add more comments

changes from v3:
 - Patch 2: rename the new function to irq_update_devid()
 - Patch 2: use disable_irq() to avoid a twist with threaded interrupts
 - ALL: amend commit messages and code comments

changes from v2:
 - reformat to avoid quoted strings split across lines, etc.

changes from v1:
 - add Patch 1 to enhance error recovery etc. in free irq per tglx's comments
 - enhance error recovery code and debugging info in irq_update_devid
 - use __must_check on the external declaration of this function
 - use EXPORT_SYMBOL_GPL for irq_update_devid
 - reformat code of patch 3 for better readability

Ben Luo (3):
  genirq: enhance error recovery code in free irq
  genirq: introduce irq_update_devid()
  vfio/pci: make use of irq_update_devid and optimize irq ops

 drivers/vfio/pci/vfio_pci_intrs.c | 124 ++++++++++++++++++++++++++------------
 include/linux/interrupt.h         |   3 +
 kernel/irq/manage.c               | 105 ++++++++++++++++++++++++++++----
 3 files changed, 183 insertions(+), 49 deletions(-)

-- 
1.8.3.1



* [PATCH v5 1/3] genirq: enhance error recovery code in free irq
  2019-08-30  8:41 [PATCH v5 0/3] genirq/vfio: Introduce irq_update_devid() and optimize VFIO irq ops Ben Luo
@ 2019-08-30  8:41 ` Ben Luo
  2019-08-30  8:41 ` [PATCH v5 2/3] genirq: introduce irq_update_devid() Ben Luo
  2019-08-30  8:42 ` [PATCH v5 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops Ben Luo
  2 siblings, 0 replies; 6+ messages in thread
From: Ben Luo @ 2019-08-30  8:41 UTC (permalink / raw)
  To: tglx, alex.williamson; +Cc: linux-kernel, tao.ma, gerry, nanhai.zou

__free_irq()/__free_percpu_irq() need to return immediately when called
from IRQ context, because the interrupt handler loop runs with desc->lock
dropped and dev_id can be subject to load and store tearing. Also move
the WARNs out of the locked region and print dev_id to aid debugging.

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
---
 kernel/irq/manage.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f17..10ec3e9 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1690,7 +1690,10 @@ static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
 	struct irqaction *action, **action_ptr;
 	unsigned long flags;
 
-	WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
+	if (WARN(in_interrupt(),
+		 "Trying to free IRQ %d (dev_id %p) from IRQ context!\n",
+		 irq, dev_id))
+		return NULL;
 
 	mutex_lock(&desc->request_mutex);
 	chip_bus_lock(desc);
@@ -1705,10 +1708,11 @@ static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
 		action = *action_ptr;
 
 		if (!action) {
-			WARN(1, "Trying to free already-free IRQ %d\n", irq);
 			raw_spin_unlock_irqrestore(&desc->lock, flags);
 			chip_bus_sync_unlock(desc);
 			mutex_unlock(&desc->request_mutex);
+			WARN(1, "Trying to free already-free IRQ %d (dev_id %p)\n",
+			     irq, dev_id);
 			return NULL;
 		}
 
@@ -2286,7 +2290,10 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
 	struct irqaction *action;
 	unsigned long flags;
 
-	WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
+	if (WARN(in_interrupt(),
+		 "Trying to free IRQ %d (dev_id %p) from IRQ context!\n",
+		 irq, dev_id))
+		return NULL;
 
 	if (!desc)
 		return NULL;
@@ -2295,14 +2302,17 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
 
 	action = desc->action;
 	if (!action || action->percpu_dev_id != dev_id) {
-		WARN(1, "Trying to free already-free IRQ %d\n", irq);
-		goto bad;
+		raw_spin_unlock_irqrestore(&desc->lock, flags);
+		WARN(1, "Trying to free already-free IRQ (dev_id %p) %d\n",
+		     dev_id, irq);
+		return NULL;
 	}
 
 	if (!cpumask_empty(desc->percpu_enabled)) {
-		WARN(1, "percpu IRQ %d still enabled on CPU%d!\n",
-		     irq, cpumask_first(desc->percpu_enabled));
-		goto bad;
+		raw_spin_unlock_irqrestore(&desc->lock, flags);
+		WARN(1, "percpu IRQ %d (dev_id %p) still enabled on CPU%d!\n",
+		     irq, dev_id, cpumask_first(desc->percpu_enabled));
+		return NULL;
 	}
 
 	/* Found it - now remove it from the list of entries: */
@@ -2317,10 +2327,6 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
 	irq_chip_pm_put(&desc->irq_data);
 	module_put(desc->owner);
 	return action;
-
-bad:
-	raw_spin_unlock_irqrestore(&desc->lock, flags);
-	return NULL;
 }
 
 /**
-- 
1.8.3.1



* [PATCH v5 2/3] genirq: introduce irq_update_devid()
  2019-08-30  8:41 [PATCH v5 0/3] genirq/vfio: Introduce irq_update_devid() and optimize VFIO irq ops Ben Luo
  2019-08-30  8:41 ` [PATCH v5 1/3] genirq: enhance error recovery code in free irq Ben Luo
@ 2019-08-30  8:41 ` Ben Luo
  2019-08-30  8:42 ` [PATCH v5 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops Ben Luo
  2 siblings, 0 replies; 6+ messages in thread
From: Ben Luo @ 2019-08-30  8:41 UTC (permalink / raw)
  To: tglx, alex.williamson; +Cc: linux-kernel, tao.ma, gerry, nanhai.zou

Sometimes, only the dev_id field of irqaction needs to be changed.

For example, a KVM VM with device passthrough via VFIO may switch the
interrupt injection path between a KVM irqfd and a userspace eventfd.
The two paths share the same interrupt number and handler for a given
msi vector of the device; they differ only in the 'dev_id' referencing
the respective fd's context. Setting interrupt affinity inside such a VM
is one way to trigger the path switch.

Currently, VFIO uses a free-then-request-irq sequence for the path switch.
There is a time window between free_irq() and request_irq() in which the
target IRTE is invalid, so in-flight interrupts (buffered in the hardware
layer and, unfortunately, impossible to synchronize against in software)
can cause DMAR faults; even worse, the VM may hang waiting for IO
completion.

By using irq_update_devid(), this issue is avoided, since the IRTE is
never invalidated during the switch (a simplified usage sketch follows).
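
As an illustration only (simplified from the VFIO usage in patch 3;
old_trigger and new_trigger are hypothetical names for the two eventfd
contexts), the path switch then becomes roughly:

	ret = irq_update_devid(irq, old_trigger, new_trigger);
	if (ret) {
		/* fall back to tearing the interrupt down, as patch 3
		 * does on its error path */
		free_irq(irq, old_trigger);
		return ret;
	}
	/* success: new_trigger now receives the events; the IRTE was
	 * never invalidated during the switch */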

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
---
 include/linux/interrupt.h |  3 ++
 kernel/irq/manage.c       | 75 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 78 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5b8328a..09b6a0f 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -172,6 +172,9 @@ struct irqaction {
 request_percpu_nmi(unsigned int irq, irq_handler_t handler,
 		   const char *devname, void __percpu *dev);
 
+extern int __must_check
+irq_update_devid(unsigned int irq, void *dev_id, void *new_dev_id);
+
 extern const void *free_irq(unsigned int, void *);
 extern void free_percpu_irq(unsigned int, void __percpu *);
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 10ec3e9..adb1980 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -2063,6 +2063,81 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
 EXPORT_SYMBOL(request_threaded_irq);
 
 /**
+ *	irq_update_devid - update irq dev_id to a new one
+ *
+ *	@irq:		Interrupt line to update
+ *	@dev_id:	A cookie to find the irqaction to update
+ *	@new_dev_id:	New cookie passed to the handler function
+ *
+ *	Sometimes, only the cookie data needs to be changed. Compared with
+ *	freeing and re-requesting the interrupt, updating only the dev_id of
+ *	the irqaction not only gains some performance benefit, but also
+ *	reduces the risk of losing an interrupt.
+ *
+ *	This function won't update dev_id until any executing interrupts
+ *	for this IRQ have completed. This function must not be called
+ *	from interrupt context.
+ *
+ *	On failure, it returns a negative error number. On success,
+ *	it returns 0.
+ */
+int irq_update_devid(unsigned int irq, void *dev_id, void *new_dev_id)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	struct irqaction *action, **action_ptr;
+	unsigned long flags;
+
+	if (WARN(in_interrupt(),
+		 "Trying to update IRQ %d (dev_id %p to %p) from IRQ context!\n",
+		 irq, dev_id, new_dev_id))
+		return -EPERM;
+
+	if (!desc)
+		return -EINVAL;
+
+	/*
+	 * Ensure that an interrupt in flight on another CPU which uses the
+	 * old 'dev_id' has completed because the caller can free the memory
+	 * to which it points after this function returns. Also avoid
+	 * updating 'dev_id' in the middle of a threaded interrupt, which
+	 * could otherwise leave the primary handler using the old 'dev_id'
+	 * while the threaded handler is already using the new one.
+	 */
+	disable_irq(irq);
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	/*
+	 * There can be multiple actions per IRQ descriptor, find the right
+	 * one based on the dev_id:
+	 */
+	action_ptr = &desc->action;
+	for (;;) {
+		action = *action_ptr;
+
+		if (!action) {
+			raw_spin_unlock_irqrestore(&desc->lock, flags);
+			enable_irq(irq);
+			WARN(1,
+			     "Trying to update already-free IRQ %d (dev_id %p to %p)\n",
+			     irq, dev_id, new_dev_id);
+			return -ENXIO;
+		}
+
+		if (action->dev_id == dev_id) {
+			action->dev_id = new_dev_id;
+			break;
+		}
+		action_ptr = &action->next;
+	}
+
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+	enable_irq(irq);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(irq_update_devid);
+
+/**
  *	request_any_context_irq - allocate an interrupt line
  *	@irq: Interrupt line to allocate
  *	@handler: Function to be called when the IRQ occurs.
-- 
1.8.3.1



* [PATCH v5 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops
  2019-08-30  8:41 [PATCH v5 0/3] genirq/vfio: Introduce irq_update_devid() and optimize VFIO irq ops Ben Luo
  2019-08-30  8:41 ` [PATCH v5 1/3] genirq: enhance error recovery code in free irq Ben Luo
  2019-08-30  8:41 ` [PATCH v5 2/3] genirq: introduce irq_update_devid() Ben Luo
@ 2019-08-30  8:42 ` Ben Luo
  2019-08-30 20:06   ` Alex Williamson
  2 siblings, 1 reply; 6+ messages in thread
From: Ben Luo @ 2019-08-30  8:42 UTC (permalink / raw)
  To: tglx, alex.williamson; +Cc: linux-kernel, tao.ma, gerry, nanhai.zou

When userspace (e.g. qemu) triggers a switch between a KVM
irqfd and a userspace eventfd, only the dev_id of the irqaction
(i.e. the "trigger" in this patch's context) changes, yet the
current code takes a free-then-request-irq action. Likewise,
setting irq affinity in the VM also triggers a
free-then-request-irq action, which actually changes nothing
and only needs to bounce the irqbypass registration in case
posted interrupts are in use.

This patch makes use of irq_update_devid() and optimizes both
cases above, which reduces the risk of losing an interrupt and
also cuts some overhead (a rough sketch of the resulting flow
follows).
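
A rough per-vector sketch of the resulting flow (illustration only,
simplified and not verbatim from the diff below; the comments stand in
for the full error handling):

	if (vdev->ctx[vector].trigger) {
		/* always bounce the bypass registration so guest
		 * affinity changes take effect */
		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);

		if (trigger == vdev->ctx[vector].trigger)
			; /* same trigger: drop the duplicate reference */
		else if (trigger && !IS_ERR(trigger))
			; /* new valid trigger: irq_update_devid(), or
			   * disable the interrupt on error */
		else
			; /* no/invalid trigger: free_irq() to disable */
	}
	/* then request the irq if it is not set up yet, and re-register
	 * the irqbypass producer */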

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
---
 drivers/vfio/pci/vfio_pci_intrs.c | 124 ++++++++++++++++++++++++++------------
 1 file changed, 87 insertions(+), 37 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 3fa3f72..d3a93d7 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -284,70 +284,120 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
 static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 				      int vector, int fd, bool msix)
 {
+	struct eventfd_ctx *trigger = NULL;
 	struct pci_dev *pdev = vdev->pdev;
-	struct eventfd_ctx *trigger;
 	int irq, ret;
 
 	if (vector < 0 || vector >= vdev->num_ctx)
 		return -EINVAL;
 
+	if (fd >= 0) {
+		trigger = eventfd_ctx_fdget(fd);
+		if (IS_ERR(trigger)) {
+			/* oops, going to disable this interrupt */
+			dev_info(&pdev->dev,
+				 "get ctx error on bad fd: %d for vector:%d\n",
+				 fd, vector);
+		}
+	}
+
 	irq = pci_irq_vector(pdev, vector);
 
+	/*
+	 * 'trigger' is NULL or invalid, disable the interrupt
+	 * 'trigger' is same as before, only bounce the bypass registration
+	 * 'trigger' is a new invalid one, update it to irqaction and other
+	 * data structures referencing to the old one; fallback to disable
+	 * the interrupt on error
+	 */
 	if (vdev->ctx[vector].trigger) {
-		free_irq(irq, vdev->ctx[vector].trigger);
+		/*
+		 * even if the trigger is unchanged we need to bounce the
+		 * interrupt bypass connection to allow affinity changes in
+		 * the guest to be realized.
+		 */
 		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
-		kfree(vdev->ctx[vector].name);
-		eventfd_ctx_put(vdev->ctx[vector].trigger);
-		vdev->ctx[vector].trigger = NULL;
+
+		if (vdev->ctx[vector].trigger == trigger) {
+			/* avoid duplicated referencing to the same trigger */
+			eventfd_ctx_put(trigger);
+
+		} else if (trigger && !IS_ERR(trigger)) {
+			ret = irq_update_devid(irq,
+					       vdev->ctx[vector].trigger, trigger);
+			if (unlikely(ret)) {
+				dev_info(&pdev->dev,
+					 "update devid of %d (token %p) failed: %d\n",
+					 irq, vdev->ctx[vector].trigger, ret);
+				eventfd_ctx_put(trigger);
+				free_irq(irq, vdev->ctx[vector].trigger);
+				kfree(vdev->ctx[vector].name);
+				eventfd_ctx_put(vdev->ctx[vector].trigger);
+				vdev->ctx[vector].trigger = NULL;
+				return ret;
+			}
+			eventfd_ctx_put(vdev->ctx[vector].trigger);
+			vdev->ctx[vector].producer.token = trigger;
+			vdev->ctx[vector].trigger = trigger;
+
+		} else {
+			free_irq(irq, vdev->ctx[vector].trigger);
+			kfree(vdev->ctx[vector].name);
+			eventfd_ctx_put(vdev->ctx[vector].trigger);
+			vdev->ctx[vector].trigger = NULL;
+		}
 	}
 
 	if (fd < 0)
 		return 0;
+	else if (IS_ERR(trigger))
+		return PTR_ERR(trigger);
 
-	vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
-					   msix ? "x" : "", vector,
-					   pci_name(pdev));
-	if (!vdev->ctx[vector].name)
-		return -ENOMEM;
+	if (!vdev->ctx[vector].trigger) {
+		vdev->ctx[vector].name = kasprintf(GFP_KERNEL,
+						   "vfio-msi%s[%d](%s)",
+						   msix ? "x" : "", vector,
+						   pci_name(pdev));
+		if (!vdev->ctx[vector].name) {
+			eventfd_ctx_put(trigger);
+			return -ENOMEM;
+		}
 
-	trigger = eventfd_ctx_fdget(fd);
-	if (IS_ERR(trigger)) {
-		kfree(vdev->ctx[vector].name);
-		return PTR_ERR(trigger);
-	}
+		/*
+		 * The MSIx vector table resides in device memory which may be
+		 * cleared via backdoor resets. We don't allow direct access to
+		 * the vector table so even if a userspace driver attempts to
+		 * save/restore around such a reset it would be unsuccessful.
+		 * To avoid this, restore the cached value of the message prior
+		 * to enabling.
+		 */
+		if (msix) {
+			struct msi_msg msg;
 
-	/*
-	 * The MSIx vector table resides in device memory which may be cleared
-	 * via backdoor resets. We don't allow direct access to the vector
-	 * table so even if a userspace driver attempts to save/restore around
-	 * such a reset it would be unsuccessful. To avoid this, restore the
-	 * cached value of the message prior to enabling.
-	 */
-	if (msix) {
-		struct msi_msg msg;
+			get_cached_msi_msg(irq, &msg);
+			pci_write_msi_msg(irq, &msg);
+		}
 
-		get_cached_msi_msg(irq, &msg);
-		pci_write_msi_msg(irq, &msg);
-	}
+		ret = request_irq(irq, vfio_msihandler, 0,
+				  vdev->ctx[vector].name, trigger);
+		if (ret) {
+			kfree(vdev->ctx[vector].name);
+			eventfd_ctx_put(trigger);
+			return ret;
+		}
 
-	ret = request_irq(irq, vfio_msihandler, 0,
-			  vdev->ctx[vector].name, trigger);
-	if (ret) {
-		kfree(vdev->ctx[vector].name);
-		eventfd_ctx_put(trigger);
-		return ret;
+		vdev->ctx[vector].producer.token = trigger;
+		vdev->ctx[vector].producer.irq = irq;
+		vdev->ctx[vector].trigger = trigger;
 	}
 
-	vdev->ctx[vector].producer.token = trigger;
-	vdev->ctx[vector].producer.irq = irq;
+	/* setup bypass connection and make irte updated */
 	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
 	if (unlikely(ret))
 		dev_info(&pdev->dev,
 		"irq bypass producer (token %p) registration fails: %d\n",
 		vdev->ctx[vector].producer.token, ret);
 
-	vdev->ctx[vector].trigger = trigger;
-
 	return 0;
 }
 
-- 
1.8.3.1



* Re: [PATCH v5 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops
  2019-08-30  8:42 ` [PATCH v5 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops Ben Luo
@ 2019-08-30 20:06   ` Alex Williamson
  2019-08-31  4:38     ` Ben Luo
  0 siblings, 1 reply; 6+ messages in thread
From: Alex Williamson @ 2019-08-30 20:06 UTC (permalink / raw)
  To: Ben Luo; +Cc: tglx, linux-kernel, tao.ma, gerry, nanhai.zou

On Fri, 30 Aug 2019 16:42:06 +0800
Ben Luo <luoben@linux.alibaba.com> wrote:

> When userspace (e.g. qemu) triggers a switch between KVM
> irqfd and userspace eventfd, only dev_id of irqaction
> (i.e. the "trigger" in this patch's context) will be
> changed, but a free-then-request-irq action is taken in
> current code. And, irq affinity setting in VM will also
> trigger a free-then-request-irq action, which actually
> changes nothing, but only need to bounce the irqbypass
> registraion in case that posted-interrupt is in use.
> 
> This patch makes use of irq_update_devid() and optimize
> both cases above, which reduces the risk of losing interrupt
> and also cuts some overhead.
> 
> Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c | 124 ++++++++++++++++++++++++++------------
>  1 file changed, 87 insertions(+), 37 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 3fa3f72..d3a93d7 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -284,70 +284,120 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
>  static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
>  				      int vector, int fd, bool msix)
>  {
> +	struct eventfd_ctx *trigger = NULL;
>  	struct pci_dev *pdev = vdev->pdev;
> -	struct eventfd_ctx *trigger;
>  	int irq, ret;
>  
>  	if (vector < 0 || vector >= vdev->num_ctx)
>  		return -EINVAL;
>  
> +	if (fd >= 0) {
> +		trigger = eventfd_ctx_fdget(fd);
> +		if (IS_ERR(trigger)) {
> +			/* oops, going to disable this interrupt */
> +			dev_info(&pdev->dev,
> +				 "get ctx error on bad fd: %d for vector:%d\n",
> +				 fd, vector);

I think a user could trigger this maliciously as a denial of service by
simply providing a bogus file descriptor.  The user is informed of the
error by the return value, why do we need to spam the logs?

> +		}
> +	}
> +
>  	irq = pci_irq_vector(pdev, vector);
>  
> +	/*
> +	 * 'trigger' is NULL or invalid, disable the interrupt
> +	 * 'trigger' is same as before, only bounce the bypass registration
> +	 * 'trigger' is a new invalid one, update it to irqaction and other

s/invalid/valid/

> +	 * data structures referencing to the old one; fallback to disable
> +	 * the interrupt on error
> +	 */
>  	if (vdev->ctx[vector].trigger) {
> -		free_irq(irq, vdev->ctx[vector].trigger);
> +		/*
> +		 * even if the trigger is unchanged we need to bounce the
> +		 * interrupt bypass connection to allow affinity changes in
> +		 * the guest to be realized.
> +		 */
>  		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> -		kfree(vdev->ctx[vector].name);
> -		eventfd_ctx_put(vdev->ctx[vector].trigger);
> -		vdev->ctx[vector].trigger = NULL;
> +
> +		if (vdev->ctx[vector].trigger == trigger) {
> +			/* avoid duplicated referencing to the same trigger */
> +			eventfd_ctx_put(trigger);
> +
> +		} else if (trigger && !IS_ERR(trigger)) {
> +			ret = irq_update_devid(irq,
> +					       vdev->ctx[vector].trigger, trigger);
> +			if (unlikely(ret)) {
> +				dev_info(&pdev->dev,
> +					 "update devid of %d (token %p) failed: %d\n",
> +					 irq, vdev->ctx[vector].trigger, ret);
> +				eventfd_ctx_put(trigger);
> +				free_irq(irq, vdev->ctx[vector].trigger);
> +				kfree(vdev->ctx[vector].name);
> +				eventfd_ctx_put(vdev->ctx[vector].trigger);
> +				vdev->ctx[vector].trigger = NULL;
> +				return ret;
> +			}
> +			eventfd_ctx_put(vdev->ctx[vector].trigger);
> +			vdev->ctx[vector].producer.token = trigger;
> +			vdev->ctx[vector].trigger = trigger;
> +
> +		} else {
> +			free_irq(irq, vdev->ctx[vector].trigger);
> +			kfree(vdev->ctx[vector].name);
> +			eventfd_ctx_put(vdev->ctx[vector].trigger);
> +			vdev->ctx[vector].trigger = NULL;
> +		}
>  	}
>  
>  	if (fd < 0)
>  		return 0;
> +	else if (IS_ERR(trigger))
> +		return PTR_ERR(trigger);
>  
> -	vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
> -					   msix ? "x" : "", vector,
> -					   pci_name(pdev));
> -	if (!vdev->ctx[vector].name)
> -		return -ENOMEM;
> +	if (!vdev->ctx[vector].trigger) {
> +		vdev->ctx[vector].name = kasprintf(GFP_KERNEL,
> +						   "vfio-msi%s[%d](%s)",
> +						   msix ? "x" : "", vector,
> +						   pci_name(pdev));
> +		if (!vdev->ctx[vector].name) {
> +			eventfd_ctx_put(trigger);
> +			return -ENOMEM;
> +		}
>  
> -	trigger = eventfd_ctx_fdget(fd);
> -	if (IS_ERR(trigger)) {
> -		kfree(vdev->ctx[vector].name);
> -		return PTR_ERR(trigger);
> -	}
> +		/*
> +		 * The MSIx vector table resides in device memory which may be
> +		 * cleared via backdoor resets. We don't allow direct access to
> +		 * the vector table so even if a userspace driver attempts to
> +		 * save/restore around such a reset it would be unsuccessful.
> +		 * To avoid this, restore the cached value of the message prior
> +		 * to enabling.
> +		 */
> +		if (msix) {
> +			struct msi_msg msg;
>  
> -	/*
> -	 * The MSIx vector table resides in device memory which may be cleared
> -	 * via backdoor resets. We don't allow direct access to the vector
> -	 * table so even if a userspace driver attempts to save/restore around
> -	 * such a reset it would be unsuccessful. To avoid this, restore the
> -	 * cached value of the message prior to enabling.
> -	 */
> -	if (msix) {
> -		struct msi_msg msg;
> +			get_cached_msi_msg(irq, &msg);
> +			pci_write_msi_msg(irq, &msg);
> +		}
>  
> -		get_cached_msi_msg(irq, &msg);
> -		pci_write_msi_msg(irq, &msg);
> -	}
> +		ret = request_irq(irq, vfio_msihandler, 0,
> +				  vdev->ctx[vector].name, trigger);
> +		if (ret) {
> +			kfree(vdev->ctx[vector].name);
> +			eventfd_ctx_put(trigger);
> +			return ret;
> +		}
>  
> -	ret = request_irq(irq, vfio_msihandler, 0,
> -			  vdev->ctx[vector].name, trigger);
> -	if (ret) {
> -		kfree(vdev->ctx[vector].name);
> -		eventfd_ctx_put(trigger);
> -		return ret;
> +		vdev->ctx[vector].producer.token = trigger;
> +		vdev->ctx[vector].producer.irq = irq;
> +		vdev->ctx[vector].trigger = trigger;
>  	}
>  
> -	vdev->ctx[vector].producer.token = trigger;
> -	vdev->ctx[vector].producer.irq = irq;
> +	/* setup bypass connection and make irte updated */
>  	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
>  	if (unlikely(ret))
>  		dev_info(&pdev->dev,
>  		"irq bypass producer (token %p) registration fails: %d\n",
>  		vdev->ctx[vector].producer.token, ret);
>  
> -	vdev->ctx[vector].trigger = trigger;
> -
>  	return 0;
>  }
>  



* Re: [PATCH v5 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops
  2019-08-30 20:06   ` Alex Williamson
@ 2019-08-31  4:38     ` Ben Luo
  0 siblings, 0 replies; 6+ messages in thread
From: Ben Luo @ 2019-08-31  4:38 UTC (permalink / raw)
  To: Alex Williamson; +Cc: tglx, linux-kernel, tao.ma, gerry, nanhai.zou


On 2019/8/31 4:06 AM, Alex Williamson wrote:
> On Fri, 30 Aug 2019 16:42:06 +0800
> Ben Luo <luoben@linux.alibaba.com> wrote:
>
>> When userspace (e.g. qemu) triggers a switch between KVM
>> irqfd and userspace eventfd, only dev_id of irqaction
>> (i.e. the "trigger" in this patch's context) will be
>> changed, but a free-then-request-irq action is taken in
>> current code. And, irq affinity setting in VM will also
>> trigger a free-then-request-irq action, which actually
>> changes nothing, but only need to bounce the irqbypass
>> registraion in case that posted-interrupt is in use.
>>
>> This patch makes use of irq_update_devid() and optimize
>> both cases above, which reduces the risk of losing interrupt
>> and also cuts some overhead.
>>
>> Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
>> ---
>>   drivers/vfio/pci/vfio_pci_intrs.c | 124 ++++++++++++++++++++++++++------------
>>   1 file changed, 87 insertions(+), 37 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
>> index 3fa3f72..d3a93d7 100644
>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
>> @@ -284,70 +284,120 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
>>   static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
>>   				      int vector, int fd, bool msix)
>>   {
>> +	struct eventfd_ctx *trigger = NULL;
>>   	struct pci_dev *pdev = vdev->pdev;
>> -	struct eventfd_ctx *trigger;
>>   	int irq, ret;
>>   
>>   	if (vector < 0 || vector >= vdev->num_ctx)
>>   		return -EINVAL;
>>   
>> +	if (fd >= 0) {
>> +		trigger = eventfd_ctx_fdget(fd);
>> +		if (IS_ERR(trigger)) {
>> +			/* oops, going to disable this interrupt */
>> +			dev_info(&pdev->dev,
>> +				 "get ctx error on bad fd: %d for vector:%d\n",
>> +				 fd, vector);
> I think a user could trigger this maliciously as a denial of service by
> simply providing a bogus file descriptor.  The user is informed of the
> error by the return value, why do we need to spam the logs?
Ah, you are right, I will remove this log in next version
>> +		}
>> +	}
>> +
>>   	irq = pci_irq_vector(pdev, vector);
>>   
>> +	/*
>> +	 * 'trigger' is NULL or invalid, disable the interrupt
>> +	 * 'trigger' is same as before, only bounce the bypass registration
>> +	 * 'trigger' is a new invalid one, update it to irqaction and other
> s/invalid/valid/
sorry for typo :)
>> +	 * data structures referencing to the old one; fallback to disable
>> +	 * the interrupt on error
>> +	 */
>>   	if (vdev->ctx[vector].trigger) {
>> -		free_irq(irq, vdev->ctx[vector].trigger);
>> +		/*
>> +		 * even if the trigger is unchanged we need to bounce the
>> +		 * interrupt bypass connection to allow affinity changes in
>> +		 * the guest to be realized.
>> +		 */
>>   		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>> -		kfree(vdev->ctx[vector].name);
>> -		eventfd_ctx_put(vdev->ctx[vector].trigger);
>> -		vdev->ctx[vector].trigger = NULL;
>> +
>> +		if (vdev->ctx[vector].trigger == trigger) {
>> +			/* avoid duplicated referencing to the same trigger */
>> +			eventfd_ctx_put(trigger);
>> +
>> +		} else if (trigger && !IS_ERR(trigger)) {
>> +			ret = irq_update_devid(irq,
>> +					       vdev->ctx[vector].trigger, trigger);
>> +			if (unlikely(ret)) {
>> +				dev_info(&pdev->dev,
>> +					 "update devid of %d (token %p) failed: %d\n",
>> +					 irq, vdev->ctx[vector].trigger, ret);
>> +				eventfd_ctx_put(trigger);
>> +				free_irq(irq, vdev->ctx[vector].trigger);
>> +				kfree(vdev->ctx[vector].name);
>> +				eventfd_ctx_put(vdev->ctx[vector].trigger);
>> +				vdev->ctx[vector].trigger = NULL;
>> +				return ret;
>> +			}
>> +			eventfd_ctx_put(vdev->ctx[vector].trigger);
>> +			vdev->ctx[vector].producer.token = trigger;
>> +			vdev->ctx[vector].trigger = trigger;
>> +
>> +		} else {
>> +			free_irq(irq, vdev->ctx[vector].trigger);
>> +			kfree(vdev->ctx[vector].name);
>> +			eventfd_ctx_put(vdev->ctx[vector].trigger);
>> +			vdev->ctx[vector].trigger = NULL;
>> +		}
>>   	}
>>   
>>   	if (fd < 0)
>>   		return 0;
>> +	else if (IS_ERR(trigger))
>> +		return PTR_ERR(trigger);
>>   
>> -	vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
>> -					   msix ? "x" : "", vector,
>> -					   pci_name(pdev));
>> -	if (!vdev->ctx[vector].name)
>> -		return -ENOMEM;
>> +	if (!vdev->ctx[vector].trigger) {
>> +		vdev->ctx[vector].name = kasprintf(GFP_KERNEL,
>> +						   "vfio-msi%s[%d](%s)",
>> +						   msix ? "x" : "", vector,
>> +						   pci_name(pdev));
>> +		if (!vdev->ctx[vector].name) {
>> +			eventfd_ctx_put(trigger);
>> +			return -ENOMEM;
>> +		}
>>   
>> -	trigger = eventfd_ctx_fdget(fd);
>> -	if (IS_ERR(trigger)) {
>> -		kfree(vdev->ctx[vector].name);
>> -		return PTR_ERR(trigger);
>> -	}
>> +		/*
>> +		 * The MSIx vector table resides in device memory which may be
>> +		 * cleared via backdoor resets. We don't allow direct access to
>> +		 * the vector table so even if a userspace driver attempts to
>> +		 * save/restore around such a reset it would be unsuccessful.
>> +		 * To avoid this, restore the cached value of the message prior
>> +		 * to enabling.
>> +		 */
>> +		if (msix) {
>> +			struct msi_msg msg;
>>   
>> -	/*
>> -	 * The MSIx vector table resides in device memory which may be cleared
>> -	 * via backdoor resets. We don't allow direct access to the vector
>> -	 * table so even if a userspace driver attempts to save/restore around
>> -	 * such a reset it would be unsuccessful. To avoid this, restore the
>> -	 * cached value of the message prior to enabling.
>> -	 */
>> -	if (msix) {
>> -		struct msi_msg msg;
>> +			get_cached_msi_msg(irq, &msg);
>> +			pci_write_msi_msg(irq, &msg);
>> +		}
>>   
>> -		get_cached_msi_msg(irq, &msg);
>> -		pci_write_msi_msg(irq, &msg);
>> -	}
>> +		ret = request_irq(irq, vfio_msihandler, 0,
>> +				  vdev->ctx[vector].name, trigger);
>> +		if (ret) {
>> +			kfree(vdev->ctx[vector].name);
>> +			eventfd_ctx_put(trigger);
>> +			return ret;
>> +		}
>>   
>> -	ret = request_irq(irq, vfio_msihandler, 0,
>> -			  vdev->ctx[vector].name, trigger);
>> -	if (ret) {
>> -		kfree(vdev->ctx[vector].name);
>> -		eventfd_ctx_put(trigger);
>> -		return ret;
>> +		vdev->ctx[vector].producer.token = trigger;
>> +		vdev->ctx[vector].producer.irq = irq;
>> +		vdev->ctx[vector].trigger = trigger;
>>   	}
>>   
>> -	vdev->ctx[vector].producer.token = trigger;
>> -	vdev->ctx[vector].producer.irq = irq;
>> +	/* setup bypass connection and make irte updated */
>>   	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
>>   	if (unlikely(ret))
>>   		dev_info(&pdev->dev,
>>   		"irq bypass producer (token %p) registration fails: %d\n",
>>   		vdev->ctx[vector].producer.token, ret);
>>   
>> -	vdev->ctx[vector].trigger = trigger;
>> -
>>   	return 0;
>>   }
>>   

