linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops
@ 2019-08-15 13:02 Ben Luo
  2019-08-15 13:02 ` [PATCH v3 1/3] genirq: enhance error recovery code in free irq Ben Luo
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Ben Luo @ 2019-08-15 13:02 UTC (permalink / raw)
  To: tglx, alex.williamson
  Cc: linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

Currently, VFIO takes a lot of free-then-request-irq actions whenever
a VM (with device passthru via VFIO) sets irq affinity or mask/unmask
irq. Those actions only change the cookie data of irqaction or even
change nothing. The free-then-request-irq not only adds more latency,
but also increases the risk of losing interrupt, which may lead to a
VM hung forever in waiting for IO completion

This patchset solved the issue by:
Patch 2 introduces update_irq_devid to only update dev_id of irqaction
Patch 3 make use of update_irq_devid and optimize irq operations in VFIO

changes from v2:
 - reformat to avoid quoted string split across lines and etc.

changes from v1:
 - add Patch 1 to enhance error recovery etc. in free irq per tglx's comments
 - enhance error recovery code and debugging info in update_irq_devid
 - use __must_check in external referencing of update_irq_devid
 - use EXPORT_SYMBOL_GPL for update_irq_devid
 - reformat code of patch 3 for better readability

Ben Luo (3):
  genirq: enhance error recovery code in free irq
  genirq: introduce update_irq_devid()
  vfio_pci: make use of update_irq_devid and optimize irq ops

 drivers/vfio/pci/vfio_pci_intrs.c | 101 +++++++++++++++++++++-------------
 include/linux/interrupt.h         |   3 ++
 kernel/irq/manage.c               | 110 +++++++++++++++++++++++++++++++++-----
 3 files changed, 164 insertions(+), 50 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v3 1/3] genirq: enhance error recovery code in free irq
  2019-08-15 13:02 [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops Ben Luo
@ 2019-08-15 13:02 ` Ben Luo
  2019-08-15 14:20   ` Thomas Gleixner
  2019-08-15 13:03 ` [PATCH v3 2/3] genirq: introduce update_irq_devid() Ben Luo
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Ben Luo @ 2019-08-15 13:02 UTC (permalink / raw)
  To: tglx, alex.williamson
  Cc: linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

Per Thomas Gleixner's comments:
1) free_irq/free_percpu_irq returns if called from IRQ context
2) move WARN out of the locked region and print out dev_id

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
---
 kernel/irq/manage.c | 32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f17..6f9b20f 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1690,7 +1690,11 @@ static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
 	struct irqaction *action, **action_ptr;
 	unsigned long flags;
 
-	WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
+	if (in_interrupt()) {
+		WARN(1, "Trying to free IRQ %d (dev_id %p) from IRQ context!\n",
+		     irq, dev_id);
+		return NULL;
+	}
 
 	mutex_lock(&desc->request_mutex);
 	chip_bus_lock(desc);
@@ -1705,10 +1709,11 @@ static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
 		action = *action_ptr;
 
 		if (!action) {
-			WARN(1, "Trying to free already-free IRQ %d\n", irq);
 			raw_spin_unlock_irqrestore(&desc->lock, flags);
 			chip_bus_sync_unlock(desc);
 			mutex_unlock(&desc->request_mutex);
+			WARN(1, "Trying to free already-free IRQ %d (dev_id %p)\n",
+			     irq, dev_id);
 			return NULL;
 		}
 
@@ -2286,7 +2291,11 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
 	struct irqaction *action;
 	unsigned long flags;
 
-	WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
+	if (in_interrupt()) {
+		WARN(1, "Trying to free IRQ %d (dev_id %p) from IRQ context!\n",
+		     irq, dev_id);
+		return NULL;
+	}
 
 	if (!desc)
 		return NULL;
@@ -2295,14 +2304,17 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
 
 	action = desc->action;
 	if (!action || action->percpu_dev_id != dev_id) {
-		WARN(1, "Trying to free already-free IRQ %d\n", irq);
-		goto bad;
+		raw_spin_unlock_irqrestore(&desc->lock, flags);
+		WARN(1, "Trying to free already-free IRQ (dev_id %p) %d\n",
+		     irq, dev_id);
+		return NULL;
 	}
 
 	if (!cpumask_empty(desc->percpu_enabled)) {
-		WARN(1, "percpu IRQ %d still enabled on CPU%d!\n",
-		     irq, cpumask_first(desc->percpu_enabled));
-		goto bad;
+		raw_spin_unlock_irqrestore(&desc->lock, flags);
+		WARN(1, "percpu IRQ %d (dev_id %p) still enabled on CPU%d!\n",
+		     irq, dev_id, cpumask_first(desc->percpu_enabled));
+		return NULL;
 	}
 
 	/* Found it - now remove it from the list of entries: */
@@ -2317,10 +2329,6 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
 	irq_chip_pm_put(&desc->irq_data);
 	module_put(desc->owner);
 	return action;
-
-bad:
-	raw_spin_unlock_irqrestore(&desc->lock, flags);
-	return NULL;
 }
 
 /**
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 2/3] genirq: introduce update_irq_devid()
  2019-08-15 13:02 [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops Ben Luo
  2019-08-15 13:02 ` [PATCH v3 1/3] genirq: enhance error recovery code in free irq Ben Luo
@ 2019-08-15 13:03 ` Ben Luo
  2019-08-15 14:58   ` Thomas Gleixner
  2019-08-15 13:03 ` [PATCH v3 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops Ben Luo
  2019-08-19 20:51 ` [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO " Alex Williamson
  3 siblings, 1 reply; 15+ messages in thread
From: Ben Luo @ 2019-08-15 13:03 UTC (permalink / raw)
  To: tglx, alex.williamson
  Cc: linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

Sometimes, only the dev_id field of irqaction need to be changed.
E.g. KVM VM with device passthru via VFIO may switch irq injection
path between KVM irqfd and userspace eventfd. These two paths
share the same irq num and handler for the same vector of a device,
only with different dev_id referencing to different fds' contexts.

So, instead of free/request irq, only update dev_id of irqaction.
This narrows the gap for setting up new irq (and irte, if has)
and also gains some performance benefit.

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
Reviewed-by: Liu Jiang <gerry@linux.alibaba.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/interrupt.h |  3 ++
 kernel/irq/manage.c       | 78 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 81 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5b8328a..6060c5a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -172,6 +172,9 @@ struct irqaction {
 request_percpu_nmi(unsigned int irq, irq_handler_t handler,
 		   const char *devname, void __percpu *dev);
 
+extern int __must_check
+update_irq_devid(unsigned int irq, void *dev_id, void *new_dev_id);
+
 extern const void *free_irq(unsigned int, void *);
 extern void free_percpu_irq(unsigned int, void __percpu *);
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 6f9b20f..a76ef61 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -2064,6 +2064,84 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
 EXPORT_SYMBOL(request_threaded_irq);
 
 /**
+ *	update_irq_devid - update irq dev_id to a new one
+ *	@irq: Interrupt line to update
+ *	@dev_id: A cookie to find the irqaction to update
+ *	@new_dev_id: New cookie passed to the handler function
+ *
+ *	Sometimes, only the cookie data need to be changed.
+ *	Instead of free/request irq, only update dev_id here
+ *	Not only to gain some performance benefit, but also
+ *	reduce the risk of losing interrupt.
+ *
+ *	E.g. irq affinity setting in a VM with VFIO passthru
+ *	device is carried out in a free-then-request-irq way
+ *	since lack of this very function. The free_irq()
+ *	eventually triggers deactivation of IR domain, which
+ *	will cleanup IRTE. There is a gap before request_irq()
+ *	finally setup the IRTE again. In this gap, a in-flight
+ *	irq buffering in hardware layer may trigger DMAR fault
+ *	and be lost.
+ *
+ *	On failure, it returns a negative value. On success,
+ *	it returns 0
+ */
+int update_irq_devid(unsigned int irq, void *dev_id, void *new_dev_id)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	struct irqaction *action, **action_ptr;
+	unsigned long flags;
+
+	if (in_interrupt()) {
+		WARN(1, "Trying to update IRQ %d (dev_id %p to %p) from IRQ context!\n",
+		     irq, dev_id, new_dev_id);
+		return -EPERM;
+	}
+
+	if (!desc)
+		return -EINVAL;
+
+	chip_bus_lock(desc);
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	/*
+	 * There can be multiple actions per IRQ descriptor, find the right
+	 * one based on the dev_id:
+	 */
+	action_ptr = &desc->action;
+	for (;;) {
+		action = *action_ptr;
+
+		if (!action) {
+			raw_spin_unlock_irqrestore(&desc->lock, flags);
+			chip_bus_sync_unlock(desc);
+			WARN(1, "Trying to update already-free IRQ %d (dev_id %p to %p)\n",
+			     irq, dev_id, new_dev_id);
+			return -ENXIO;
+		}
+
+		if (action->dev_id == dev_id) {
+			action->dev_id = new_dev_id;
+			break;
+		}
+		action_ptr = &action->next;
+	}
+
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+	chip_bus_sync_unlock(desc);
+
+	/*
+	 * Make sure it's not being used on another CPU:
+	 * There is a risk of UAF for old *dev_id, if it is
+	 * freed in a short time after this func returns
+	 */
+	synchronize_irq(irq);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(update_irq_devid);
+
+/**
  *	request_any_context_irq - allocate an interrupt line
  *	@irq: Interrupt line to allocate
  *	@handler: Function to be called when the IRQ occurs.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops
  2019-08-15 13:02 [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops Ben Luo
  2019-08-15 13:02 ` [PATCH v3 1/3] genirq: enhance error recovery code in free irq Ben Luo
  2019-08-15 13:03 ` [PATCH v3 2/3] genirq: introduce update_irq_devid() Ben Luo
@ 2019-08-15 13:03 ` Ben Luo
  2019-08-15 16:45   ` Thomas Gleixner
  2019-08-19 20:51 ` [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO " Alex Williamson
  3 siblings, 1 reply; 15+ messages in thread
From: Ben Luo @ 2019-08-15 13:03 UTC (permalink / raw)
  To: tglx, alex.williamson
  Cc: linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

When userspace (e.g. qemu) triggers a switch between KVM
irqfd and userspace eventfd, only dev_id of irq action
(i.e. the "trigger" in this patch's context) will be
changed, but a free-then-request-irq action is taken in
current code. And, irq affinity setting in VM will also
trigger a free-then-request-irq action, which actually
changes nothing, but only fires a producer re-registration
to update irte in case that posted-interrupt is in use.

This patch makes use of update_irq_devid() and optimize
both cases above, which reduces the risk of losing interrupt
and also cuts some overhead.

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
Reviewed-by: Liu Jiang <gerry@linux.alibaba.com>
Reviewed-by: Zou Nanhai <nanhai.zou@linux.alibaba.com>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 drivers/vfio/pci/vfio_pci_intrs.c | 101 ++++++++++++++++++++++++--------------
 1 file changed, 63 insertions(+), 38 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 3fa3f72..b2654ba 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -284,70 +284,95 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
 static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 				      int vector, int fd, bool msix)
 {
+	struct eventfd_ctx *trigger = NULL;
 	struct pci_dev *pdev = vdev->pdev;
-	struct eventfd_ctx *trigger;
 	int irq, ret;
 
 	if (vector < 0 || vector >= vdev->num_ctx)
 		return -EINVAL;
 
+	if (fd >= 0) {
+		trigger = eventfd_ctx_fdget(fd);
+		if (IS_ERR(trigger))
+			return PTR_ERR(trigger);
+	}
+
 	irq = pci_irq_vector(pdev, vector);
 
 	if (vdev->ctx[vector].trigger) {
-		free_irq(irq, vdev->ctx[vector].trigger);
-		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
-		kfree(vdev->ctx[vector].name);
-		eventfd_ctx_put(vdev->ctx[vector].trigger);
-		vdev->ctx[vector].trigger = NULL;
+		if (vdev->ctx[vector].trigger == trigger) {
+			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+		} else if (trigger) {
+			ret = update_irq_devid(irq,
+					vdev->ctx[vector].trigger, trigger);
+			if (unlikely(ret)) {
+				dev_info(&pdev->dev,
+					 "update_irq_devid %d (token %p) fails: %d\n",
+					 irq, vdev->ctx[vector].trigger, ret);
+				eventfd_ctx_put(trigger);
+				return ret;
+			}
+			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+			eventfd_ctx_put(vdev->ctx[vector].trigger);
+			vdev->ctx[vector].producer.token = trigger;
+			vdev->ctx[vector].trigger = trigger;
+		} else {
+			free_irq(irq, vdev->ctx[vector].trigger);
+			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+			kfree(vdev->ctx[vector].name);
+			eventfd_ctx_put(vdev->ctx[vector].trigger);
+			vdev->ctx[vector].trigger = NULL;
+		}
 	}
 
 	if (fd < 0)
 		return 0;
 
-	vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
-					   msix ? "x" : "", vector,
-					   pci_name(pdev));
-	if (!vdev->ctx[vector].name)
-		return -ENOMEM;
+	if (!vdev->ctx[vector].trigger) {
+		vdev->ctx[vector].name = kasprintf(GFP_KERNEL,
+						   "vfio-msi%s[%d](%s)",
+						   msix ? "x" : "", vector,
+						   pci_name(pdev));
+		if (!vdev->ctx[vector].name) {
+			eventfd_ctx_put(trigger);
+			return -ENOMEM;
+		}
 
-	trigger = eventfd_ctx_fdget(fd);
-	if (IS_ERR(trigger)) {
-		kfree(vdev->ctx[vector].name);
-		return PTR_ERR(trigger);
-	}
+		/*
+		 * The MSIx vector table resides in device memory which may be
+		 * cleared via backdoor resets. We don't allow direct access to
+		 * the vector table so even if a userspace driver attempts to
+		 * save/restore around such a reset it would be unsuccessful.
+		 * To avoid this, restore the cached value of the message prior
+		 * to enabling.
+		 */
+		if (msix) {
+			struct msi_msg msg;
 
-	/*
-	 * The MSIx vector table resides in device memory which may be cleared
-	 * via backdoor resets. We don't allow direct access to the vector
-	 * table so even if a userspace driver attempts to save/restore around
-	 * such a reset it would be unsuccessful. To avoid this, restore the
-	 * cached value of the message prior to enabling.
-	 */
-	if (msix) {
-		struct msi_msg msg;
+			get_cached_msi_msg(irq, &msg);
+			pci_write_msi_msg(irq, &msg);
+		}
 
-		get_cached_msi_msg(irq, &msg);
-		pci_write_msi_msg(irq, &msg);
-	}
+		ret = request_irq(irq, vfio_msihandler, 0,
+				  vdev->ctx[vector].name, trigger);
+		if (ret) {
+			kfree(vdev->ctx[vector].name);
+			eventfd_ctx_put(trigger);
+			return ret;
+		}
 
-	ret = request_irq(irq, vfio_msihandler, 0,
-			  vdev->ctx[vector].name, trigger);
-	if (ret) {
-		kfree(vdev->ctx[vector].name);
-		eventfd_ctx_put(trigger);
-		return ret;
+		vdev->ctx[vector].producer.token = trigger;
+		vdev->ctx[vector].producer.irq = irq;
+		vdev->ctx[vector].trigger = trigger;
 	}
 
-	vdev->ctx[vector].producer.token = trigger;
-	vdev->ctx[vector].producer.irq = irq;
+	/* always update irte for posted mode */
 	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
 	if (unlikely(ret))
 		dev_info(&pdev->dev,
 		"irq bypass producer (token %p) registration fails: %d\n",
 		vdev->ctx[vector].producer.token, ret);
 
-	vdev->ctx[vector].trigger = trigger;
-
 	return 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/3] genirq: enhance error recovery code in free irq
  2019-08-15 13:02 ` [PATCH v3 1/3] genirq: enhance error recovery code in free irq Ben Luo
@ 2019-08-15 14:20   ` Thomas Gleixner
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Gleixner @ 2019-08-15 14:20 UTC (permalink / raw)
  To: Ben Luo
  Cc: alex.williamson, linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

On Thu, 15 Aug 2019, Ben Luo wrote:

> Per Thomas Gleixner's comments:
> 1) free_irq/free_percpu_irq returns if called from IRQ context
> 2) move WARN out of the locked region and print out dev_id

Please do not describe WHAT the patch is doing, please describe WHY.
 
> Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
> ---
>  kernel/irq/manage.c | 32 ++++++++++++++++++++------------
>  1 file changed, 20 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index e8f7f17..6f9b20f 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1690,7 +1690,11 @@ static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
>  	struct irqaction *action, **action_ptr;
>  	unsigned long flags;
>  
> -	WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
> +	if (in_interrupt()) {
> +		WARN(1, "Trying to free IRQ %d (dev_id %p) from IRQ context!\n",
> +		     irq, dev_id);
> +		return NULL;
> +	}

	if (WARN(in_interrupt(), ."Tr...."))
		return NULL;

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 2/3] genirq: introduce update_irq_devid()
  2019-08-15 13:03 ` [PATCH v3 2/3] genirq: introduce update_irq_devid() Ben Luo
@ 2019-08-15 14:58   ` Thomas Gleixner
  2019-08-19  5:31     ` luoben
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2019-08-15 14:58 UTC (permalink / raw)
  To: Ben Luo
  Cc: alex.williamson, linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

Ben,

On Thu, 15 Aug 2019, Ben Luo wrote:

> Sometimes, only the dev_id field of irqaction need to be changed.
> E.g. KVM VM with device passthru via VFIO may switch irq injection
> path between KVM irqfd and userspace eventfd. These two paths
> share the same irq num and handler for the same vector of a device,

s/irq num/interrupt number/

Changelogs are text and should not contain cryptic abbreviations.

> only with different dev_id referencing to different fds' contexts.
> 
> So, instead of free/request irq, only update dev_id of irqaction.

Please write functions like this: function_name() so they can be easily
identified in the text as such.

> This narrows the gap for setting up new irq (and irte, if has)

What does that mean: "narrows the gap"

What's the gap and why is it only made smaller and not closed?

> and also gains some performance benefit.
> 
> Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
> Reviewed-by: Liu Jiang <gerry@linux.alibaba.com>

I prefer to see the 'reviewed-by' as a reply on LKML rather than coming
from some internal process.

> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

While I reviewed the previous version, I surely did not give a
'Reviewed-by' tag. That tag means that the person did review the patch and
did not find an issue. I surely found issues, right?

> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 6f9b20f..a76ef61 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -2064,6 +2064,84 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
>  EXPORT_SYMBOL(request_threaded_irq);
>  
>  /**
> + *	update_irq_devid - update irq dev_id to a new one

Can you please name this: irq_update_devid(). We try to have a consistent
name space for new functions.

> + *	@irq: Interrupt line to update
> + *	@dev_id: A cookie to find the irqaction to update
> + *	@new_dev_id: New cookie passed to the handler function

Can you please arrange these in tabular fashion:

 *	@irq:		Interrupt line to update
 *	@dev_id:	A cookie to find the irqaction to update
 *	@new_dev_id:	New cookie passed to the handler function

> + *	Sometimes, only the cookie data need to be changed.
> + *	Instead of free/request irq, only update dev_id here
> + *	Not only to gain some performance benefit, but also
> + *	reduce the risk of losing interrupt.
> + *
> + *	E.g. irq affinity setting in a VM with VFIO passthru

Again. Please write it out 'interrupt' and everything else.

> + *	device is carried out in a free-then-request-irq way
> + *	since lack of this very function. The free_irq()

That does not make sense. The function is there for such a use case. So
immediately when the VFIO change is merged this comment is stale and bogus.

> + *	eventually triggers deactivation of IR domain, which
> + *	will cleanup IRTE. There is a gap before request_irq()
> + *	finally setup the IRTE again. In this gap, a in-flight
> + *	irq buffering in hardware layer may trigger DMAR fault
> + *	and be lost.

Exactly this information wants to be in the changelog.

> + *
> + *	On failure, it returns a negative value. On success,
> + *	it returns 0
> + */
> +int update_irq_devid(unsigned int irq, void *dev_id, void *new_dev_id)
> +{
> +	struct irq_desc *desc = irq_to_desc(irq);
> +	struct irqaction *action, **action_ptr;
> +	unsigned long flags;
> +
> +	if (in_interrupt()) {
> +		WARN(1, "Trying to update IRQ %d (dev_id %p to %p) from IRQ context!\n",
> +		     irq, dev_id, new_dev_id);
> +		return -EPERM;
> +	}

  	if (WARN(....)

> +
> +	if (!desc)
> +		return -EINVAL;
> +
> +	chip_bus_lock(desc);

This does not need to take bus lock. The action pointer is sufficiently
protected by desc->lock.

> +	raw_spin_lock_irqsave(&desc->lock, flags);
> +
> +	/*
> +	 * There can be multiple actions per IRQ descriptor, find the right
> +	 * one based on the dev_id:
> +	 */
> +	action_ptr = &desc->action;
> +	for (;;) {
> +		action = *action_ptr;
> +
> +		if (!action) {
> +			raw_spin_unlock_irqrestore(&desc->lock, flags);
> +			chip_bus_sync_unlock(desc);
> +			WARN(1, "Trying to update already-free IRQ %d (dev_id %p to %p)\n",
> +			     irq, dev_id, new_dev_id);
> +			return -ENXIO;
> +		}
> +
> +		if (action->dev_id == dev_id) {
> +			action->dev_id = new_dev_id;
> +			break;
> +		}
> +		action_ptr = &action->next;
> +	}
> +
> +	raw_spin_unlock_irqrestore(&desc->lock, flags);
> +	chip_bus_sync_unlock(desc);
> +
> +	/*
> +	 * Make sure it's not being used on another CPU:
> +	 * There is a risk of UAF for old *dev_id, if it is
> +	 * freed in a short time after this func returns

function please. Also it does not matter whether the time is short or
not. The point is:

     	 Ensure that an interrupt in flight on another CPU which uses the
     	 old 'dev_id' has completed because the caller can free the memory
	 to which it points after this function returns.

But this has another twist:

    CPU0				CPU1

    interrupt
    	primary_handler(old_dev_id)
	   do_stuff_on(old_dev_id)
	   return WAKE_THREAD;		update_dev_id()
        wakeup_thread();
					  action->dev_id = new_dev_id;
    irq_thread()
        secondary_handler(new_dev_id)
	
That's broken and synchronize_irq() does not protect against it.

> +	 */
> +	synchronize_irq(irq);

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops
  2019-08-15 13:03 ` [PATCH v3 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops Ben Luo
@ 2019-08-15 16:45   ` Thomas Gleixner
       [not found]     ` <7a3606ad-8fa6-45d5-b5a4-ee3f07893a25@linux.alibaba.com>
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2019-08-15 16:45 UTC (permalink / raw)
  To: Ben Luo
  Cc: alex.williamson, linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

On Thu, 15 Aug 2019, Ben Luo wrote:
>  	if (vdev->ctx[vector].trigger) {
> -		free_irq(irq, vdev->ctx[vector].trigger);
> -		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> -		kfree(vdev->ctx[vector].name);
> -		eventfd_ctx_put(vdev->ctx[vector].trigger);
> -		vdev->ctx[vector].trigger = NULL;
> +		if (vdev->ctx[vector].trigger == trigger) {
> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);

What's the point of unregistering the producer, just to register it again below?

> +		} else if (trigger) {
> +			ret = update_irq_devid(irq,
> +					vdev->ctx[vector].trigger, trigger);
> +			if (unlikely(ret)) {
> +				dev_info(&pdev->dev,
> +					 "update_irq_devid %d (token %p) fails: %d\n",

s/fails/failed/

> +					 irq, vdev->ctx[vector].trigger, ret);
> +				eventfd_ctx_put(trigger);
> +				return ret;
> +			}

This lacks any form of comment why this is correct. dev_id is updated and
the producer with the old token is still registered.

> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);

Now it's unregistered.

> +			eventfd_ctx_put(vdev->ctx[vector].trigger);
> +			vdev->ctx[vector].producer.token = trigger;

The token is updated and then it's newly registered below.

> +			vdev->ctx[vector].trigger = trigger;
> -	vdev->ctx[vector].producer.token = trigger;
> -	vdev->ctx[vector].producer.irq = irq;
> +	/* always update irte for posted mode */
>  	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
>  	if (unlikely(ret))
>  		dev_info(&pdev->dev,
>  		"irq bypass producer (token %p) registration fails: %d\n",
>  		vdev->ctx[vector].producer.token, ret);

I know this code already existed, but again this looks inconsistent. If the
registration fails then a subsequent update will try to unregister a not
registered producer. Does not make any sense to me.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 2/3] genirq: introduce update_irq_devid()
  2019-08-15 14:58   ` Thomas Gleixner
@ 2019-08-19  5:31     ` luoben
  2019-08-19  8:18       ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: luoben @ 2019-08-19  5:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: alex.williamson, linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng


在 2019/8/15 下午10:58, Thomas Gleixner 写道:
> Ben,
>
> On Thu, 15 Aug 2019, Ben Luo wrote:
>
>> Sometimes, only the dev_id field of irqaction need to be changed.
>> E.g. KVM VM with device passthru via VFIO may switch irq injection
>> path between KVM irqfd and userspace eventfd. These two paths
>> share the same irq num and handler for the same vector of a device,
> s/irq num/interrupt number/
>
> Changelogs are text and should not contain cryptic abbreviations.
>
>> only with different dev_id referencing to different fds' contexts.
>>
>> So, instead of free/request irq, only update dev_id of irqaction.
> Please write functions like this: function_name() so they can be easily
> identified in the text as such.
>
>> This narrows the gap for setting up new irq (and irte, if has)
> What does that mean: "narrows the gap"
>
> What's the gap and why is it only made smaller and not closed?

Sorry for confusing. The so called 'gap' is a time window between 
free_irq() and request_irq().

For example, interrupt affinity setting in a VM (with VFIO passthrough 
device) will trigger VFIO to

do free-then-setup interrupt as below. After free_irq(), the target IRTE 
is invalidated, an in-flight

interrupt (buffering in hardware layer which cannot be synchronized in 
software) may cause a

DMAR fault before IRTE being setup again with exactly the same value as 
this entry was before cleared

[1566032533719954] modify_irte with index:28 irte_hi:0000000000000000 
irte_lo:0000000000000000
  0xffffffff81465520 : modify_irte+0x0/0x160 [kernel]
...
  0xffffffff810d2a89 : free_irq+0x39/0x90 [kernel]
  0xffffffffa02891e0 : vfio_msi_set_vector_signal+0x80/0x280 [vfio_pci]
  0x0 (inexact)

[1566032533719982] DMAR FAULT begin

[1566032533719971] modify_irte with index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
  0xffffffff81465520 : modify_irte+0x0/0x160 [kernel]
...
  0xffffffff810d3d2b : request_threaded_irq+0x10b/0x1a0 [kernel]
  0xffffffffa02892d6 : vfio_msi_set_vector_signal+0x176/0x280 [vfio_pci]
  0x0 (inexact)

By using a new function who only updates devid, this gap can actually be 
closed since

this new function won't modify irte.

When posted interrupt is in use, the target IRTE does need to be 
modified because this IRTE

will be used by IOMMU to find the target CPU since hypervisor is 
bypassed. That is also the

reason why still call irq_bypass_register_producer() although the token 
of producer has not

been changed. irq_bypass_register_producer() will modify irte if posted 
interrupt is in use:

[1566032533720007] index:28 irte_hi:000000010004a601 
irte_lo:adb54bc000b98001
  0xffffffff81465520 : modify_irte+0x0/0x160 [kernel]
...
  0xffffffffa0338386 : vfio_msi_set_vector_signal+0x1b6/0x2c0 [vfio_pci]

>
>> and also gains some performance benefit.
>>
>> Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
>> Reviewed-by: Liu Jiang <gerry@linux.alibaba.com>
> I prefer to see the 'reviewed-by' as a reply on LKML rather than coming
> from some internal process.
>
>> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> While I reviewed the previous version, I surely did not give a
> 'Reviewed-by' tag. That tag means that the person did review the patch and
> did not find an issue. I surely found issues, right?
Sorry for that,I will amend the commit messages
>> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
>> index 6f9b20f..a76ef61 100644
>> --- a/kernel/irq/manage.c
>> +++ b/kernel/irq/manage.c
>> @@ -2064,6 +2064,84 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
>>   EXPORT_SYMBOL(request_threaded_irq);
>>   
>>   /**
>> + *	update_irq_devid - update irq dev_id to a new one
> Can you please name this: irq_update_devid(). We try to have a consistent
> name space for new functions.
ok, will name it to irq_update_devid() in next version.
>
>> + *	@irq: Interrupt line to update
>> + *	@dev_id: A cookie to find the irqaction to update
>> + *	@new_dev_id: New cookie passed to the handler function
> Can you please arrange these in tabular fashion:
>
>   *	@irq:		Interrupt line to update
>   *	@dev_id:	A cookie to find the irqaction to update
>   *	@new_dev_id:	New cookie passed to the handler function
>
>> + *	Sometimes, only the cookie data need to be changed.
>> + *	Instead of free/request irq, only update dev_id here
>> + *	Not only to gain some performance benefit, but also
>> + *	reduce the risk of losing interrupt.
>> + *
>> + *	E.g. irq affinity setting in a VM with VFIO passthru
> Again. Please write it out 'interrupt' and everything else.
>
>> + *	device is carried out in a free-then-request-irq way
>> + *	since lack of this very function. The free_irq()
> That does not make sense. The function is there for such a use case. So
> immediately when the VFIO change is merged this comment is stale and bogus.
>
>> + *	eventually triggers deactivation of IR domain, which
>> + *	will cleanup IRTE. There is a gap before request_irq()
>> + *	finally setup the IRTE again. In this gap, a in-flight
>> + *	irq buffering in hardware layer may trigger DMAR fault
>> + *	and be lost.
> Exactly this information wants to be in the changelog.
>
>> + *
>> + *	On failure, it returns a negative value. On success,
>> + *	it returns 0
>> + */
>> +int update_irq_devid(unsigned int irq, void *dev_id, void *new_dev_id)
>> +{
>> +	struct irq_desc *desc = irq_to_desc(irq);
>> +	struct irqaction *action, **action_ptr;
>> +	unsigned long flags;
>> +
>> +	if (in_interrupt()) {
>> +		WARN(1, "Trying to update IRQ %d (dev_id %p to %p) from IRQ context!\n",
>> +		     irq, dev_id, new_dev_id);
>> +		return -EPERM;
>> +	}
>    	if (WARN(....)
>
>> +
>> +	if (!desc)
>> +		return -EINVAL;
>> +
>> +	chip_bus_lock(desc);
> This does not need to take bus lock. The action pointer is sufficiently
> protected by desc->lock.
>
>> +	raw_spin_lock_irqsave(&desc->lock, flags);
>> +
>> +	/*
>> +	 * There can be multiple actions per IRQ descriptor, find the right
>> +	 * one based on the dev_id:
>> +	 */
>> +	action_ptr = &desc->action;
>> +	for (;;) {
>> +		action = *action_ptr;
>> +
>> +		if (!action) {
>> +			raw_spin_unlock_irqrestore(&desc->lock, flags);
>> +			chip_bus_sync_unlock(desc);
>> +			WARN(1, "Trying to update already-free IRQ %d (dev_id %p to %p)\n",
>> +			     irq, dev_id, new_dev_id);
>> +			return -ENXIO;
>> +		}
>> +
>> +		if (action->dev_id == dev_id) {
>> +			action->dev_id = new_dev_id;
>> +			break;
>> +		}
>> +		action_ptr = &action->next;
>> +	}
>> +
>> +	raw_spin_unlock_irqrestore(&desc->lock, flags);
>> +	chip_bus_sync_unlock(desc);
>> +
>> +	/*
>> +	 * Make sure it's not being used on another CPU:
>> +	 * There is a risk of UAF for old *dev_id, if it is
>> +	 * freed in a short time after this func returns
> function please. Also it does not matter whether the time is short or
> not. The point is:
>
>       	 Ensure that an interrupt in flight on another CPU which uses the
>       	 old 'dev_id' has completed because the caller can free the memory
> 	 to which it points after this function returns.
>
> But this has another twist:
>
>      CPU0				CPU1
>
>      interrupt
>      	primary_handler(old_dev_id)
> 	   do_stuff_on(old_dev_id)
> 	   return WAKE_THREAD;		update_dev_id()
>          wakeup_thread();
> 					  action->dev_id = new_dev_id;
>      irq_thread()
>          secondary_handler(new_dev_id)
> 	
> That's broken and synchronize_irq() does not protect against it.

Thanks to point it out, I will change to the following in next version, 
is that ok?

...

     /*
      * Ensure that an interrupt in flight on another CPU which uses the
      * old 'dev_id' has completed because the caller can free the memory
      * to which it points after this function returns. And also void to
      * update 'dev_id' in the middle of a threaded interrupt process, it
      * can lead to a twist that primary handler uses old 'dev_id' but new
      * 'dev_id' is used by secondary handler.
      */
     disable_irq(irq);
     raw_spin_lock_irqsave(&desc->lock, flags);

     /*
      * There can be multiple actions per IRQ descriptor, find the right
      * one based on the dev_id:
      */
     action_ptr = &desc->action;
     for (;;) {
         action = *action_ptr;

         if (!action) {
             raw_spin_unlock_irqrestore(&desc->lock, flags);
             enable_irq(irq);
             WARN(1, "Trying to update already-free IRQ %d "
                     "(dev_id %p to %p)\n", irq, dev_id, new_dev_id);
             return -ENXIO;
         }

         if (action->dev_id == dev_id) {
             action->dev_id = new_dev_id;
             break;
         }
         action_ptr = &action->next;
     }

     raw_spin_unlock_irqrestore(&desc->lock, flags);
     enable_irq(irq);

     return 0;

>
>> +	 */
>> +	synchronize_irq(irq);
> Thanks,
>
> 	tglx

I will also amend the changelog and function comments etc. in next version


Thanks,

             Ben


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 2/3] genirq: introduce update_irq_devid()
  2019-08-19  5:31     ` luoben
@ 2019-08-19  8:18       ` Thomas Gleixner
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Gleixner @ 2019-08-19  8:18 UTC (permalink / raw)
  To: luoben
  Cc: alex.williamson, linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

[-- Attachment #1: Type: text/plain, Size: 1906 bytes --]

Ben,

On Mon, 19 Aug 2019, luoben wrote:
> 在 2019/8/15 下午10:58, Thomas Gleixner 写道:
> > > This narrows the gap for setting up new irq (and irte, if has)
> > What does that mean: "narrows the gap"
> > 
> > What's the gap and why is it only made smaller and not closed?
> 
> Sorry for confusing. The so called 'gap' is a time window between free_irq()
> and request_irq().

And exactly this information wants to be in the changelog.
 
> > function please. Also it does not matter whether the time is short or
> > not. The point is:
> > 
> >       	 Ensure that an interrupt in flight on another CPU which uses
> > the
> >       	 old 'dev_id' has completed because the caller can free the
> > memory
> > 	 to which it points after this function returns.
> > 
> > But this has another twist:
> > 
> >      CPU0				CPU1
> > 
> >      interrupt
> >      	primary_handler(old_dev_id)
> > 	   do_stuff_on(old_dev_id)
> > 	   return WAKE_THREAD;		update_dev_id()
> >          wakeup_thread();
> > 					  action->dev_id = new_dev_id;
> >      irq_thread()
> >          secondary_handler(new_dev_id)
> > 	
> > That's broken and synchronize_irq() does not protect against it.
> 
> Thanks to point it out, I will change to the following in next version, is
> that ok?
> 
> ...
> 
>     /*

  ^^^
Please use a mail client which does not insert random wierd characters.

>      * Ensure that an interrupt in flight on another CPU which uses the
>      * old 'dev_id' has completed because the caller can free the memory
>      * to which it points after this function returns. And also void to

s/void/avoid/

>      * update 'dev_id' in the middle of a threaded interrupt process, it
>      * can lead to a twist that primary handler uses old 'dev_id' but new
>      * 'dev_id' is used by secondary handler.
>      */
>     disable_irq(irq);

Yes, that works.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops
  2019-08-15 13:02 [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops Ben Luo
                   ` (2 preceding siblings ...)
  2019-08-15 13:03 ` [PATCH v3 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops Ben Luo
@ 2019-08-19 20:51 ` Alex Williamson
  2019-08-20  4:03   ` luoben
  3 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2019-08-19 20:51 UTC (permalink / raw)
  To: Ben Luo; +Cc: tglx, linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

On Thu, 15 Aug 2019 21:02:58 +0800
Ben Luo <luoben@linux.alibaba.com> wrote:

> Currently, VFIO takes a lot of free-then-request-irq actions whenever
> a VM (with device passthru via VFIO) sets irq affinity or mask/unmask
> irq. Those actions only change the cookie data of irqaction or even
> change nothing. The free-then-request-irq not only adds more latency,
> but also increases the risk of losing interrupt, which may lead to a
> VM hung forever in waiting for IO completion

What guest environment is generating this?  Typically I don't see that
Windows or Linux guests bounce the interrupt configuration much.
Thanks,

Alex

> 
> This patchset solved the issue by:
> Patch 2 introduces update_irq_devid to only update dev_id of irqaction
> Patch 3 make use of update_irq_devid and optimize irq operations in VFIO
> 
> changes from v2:
>  - reformat to avoid quoted string split across lines and etc.
> 
> changes from v1:
>  - add Patch 1 to enhance error recovery etc. in free irq per tglx's comments
>  - enhance error recovery code and debugging info in update_irq_devid
>  - use __must_check in external referencing of update_irq_devid
>  - use EXPORT_SYMBOL_GPL for update_irq_devid
>  - reformat code of patch 3 for better readability
> 
> Ben Luo (3):
>   genirq: enhance error recovery code in free irq
>   genirq: introduce update_irq_devid()
>   vfio_pci: make use of update_irq_devid and optimize irq ops
> 
>  drivers/vfio/pci/vfio_pci_intrs.c | 101 +++++++++++++++++++++-------------
>  include/linux/interrupt.h         |   3 ++
>  kernel/irq/manage.c               | 110 +++++++++++++++++++++++++++++++++-----
>  3 files changed, 164 insertions(+), 50 deletions(-)
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops
  2019-08-19 20:51 ` [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO " Alex Williamson
@ 2019-08-20  4:03   ` luoben
  2019-08-20 15:22     ` Alex Williamson
  0 siblings, 1 reply; 15+ messages in thread
From: luoben @ 2019-08-20  4:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: tglx, linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng


在 2019/8/20 上午4:51, Alex Williamson 写道:
> On Thu, 15 Aug 2019 21:02:58 +0800
> Ben Luo <luoben@linux.alibaba.com> wrote:
>
>> Currently, VFIO takes a lot of free-then-request-irq actions whenever
>> a VM (with device passthru via VFIO) sets irq affinity or mask/unmask
>> irq. Those actions only change the cookie data of irqaction or even
>> change nothing. The free-then-request-irq not only adds more latency,
>> but also increases the risk of losing interrupt, which may lead to a
>> VM hung forever in waiting for IO completion
> What guest environment is generating this?  Typically I don't see that
> Windows or Linux guests bounce the interrupt configuration much.
> Thanks,
>
> Alex

By tracing centos5u8 on host, I found it keep masking and unmasking 
interrupt like this:

[1566032533709879] index:28 irte_hi:000000010004a601 
irte_lo:adb54bc000b98001
[1566032533711242] index:28 irte_hi:0000000000000000 
irte_lo:0000000000000000
[1566032533711258] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533711269] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533711291] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533711321] index:28 irte_hi:0000000000000000 
irte_lo:0000000000000000
[1566032533711340] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533711361] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533711376] index:28 irte_hi:000000010004a601 
irte_lo:adb54bc000b98001
[1566032533713368] index:28 irte_hi:0000000000000000 
irte_lo:0000000000000000
[1566032533713385] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533713396] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533713416] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533713447] index:28 irte_hi:0000000000000000 
irte_lo:0000000000000000
[1566032533713464] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533713485] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533713499] index:28 irte_hi:000000010004a601 
irte_lo:adb54bc000b98001
[1566032533718855] index:28 irte_hi:0000000000000000 
irte_lo:0000000000000000
[1566032533718871] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533718882] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533718902] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533718932] index:28 irte_hi:0000000000000000 
irte_lo:0000000000000000
[1566032533718949] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533718969] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533718984] index:28 irte_hi:000000010004a601 
irte_lo:adb54bc000b98001
[1566032533719873] index:28 irte_hi:0000000000000000 
irte_lo:0000000000000000
[1566032533719889] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533719900] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533719921] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533719954] index:28 irte_hi:0000000000000000 
irte_lo:0000000000000000
[1566032533719971] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533719992] index:28 irte_hi:000000000004a601 
irte_lo:00003fff00ac002d
[1566032533720007] index:28 irte_hi:000000010004a601 
irte_lo:adb54bc000b98001

"[1566032533720007]" is timestamp in μs, so centos5u8 tiggers 30+ irte 
modification within 10ms

Thanks,

     Ben

>> This patchset solved the issue by:
>> Patch 2 introduces update_irq_devid to only update dev_id of irqaction
>> Patch 3 make use of update_irq_devid and optimize irq operations in VFIO
>>
>> changes from v2:
>>   - reformat to avoid quoted string split across lines and etc.
>>
>> changes from v1:
>>   - add Patch 1 to enhance error recovery etc. in free irq per tglx's comments
>>   - enhance error recovery code and debugging info in update_irq_devid
>>   - use __must_check in external referencing of update_irq_devid
>>   - use EXPORT_SYMBOL_GPL for update_irq_devid
>>   - reformat code of patch 3 for better readability
>>
>> Ben Luo (3):
>>    genirq: enhance error recovery code in free irq
>>    genirq: introduce update_irq_devid()
>>    vfio_pci: make use of update_irq_devid and optimize irq ops
>>
>>   drivers/vfio/pci/vfio_pci_intrs.c | 101 +++++++++++++++++++++-------------
>>   include/linux/interrupt.h         |   3 ++
>>   kernel/irq/manage.c               | 110 +++++++++++++++++++++++++++++++++-----
>>   3 files changed, 164 insertions(+), 50 deletions(-)
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops
  2019-08-20  4:03   ` luoben
@ 2019-08-20 15:22     ` Alex Williamson
  2019-08-20 15:51       ` luoben
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2019-08-20 15:22 UTC (permalink / raw)
  To: luoben; +Cc: tglx, linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng

On Tue, 20 Aug 2019 12:03:50 +0800
luoben <luoben@linux.alibaba.com> wrote:

> 在 2019/8/20 上午4:51, Alex Williamson 写道:
> > On Thu, 15 Aug 2019 21:02:58 +0800
> > Ben Luo <luoben@linux.alibaba.com> wrote:
> >  
> >> Currently, VFIO takes a lot of free-then-request-irq actions whenever
> >> a VM (with device passthru via VFIO) sets irq affinity or mask/unmask
> >> irq. Those actions only change the cookie data of irqaction or even
> >> change nothing. The free-then-request-irq not only adds more latency,
> >> but also increases the risk of losing interrupt, which may lead to a
> >> VM hung forever in waiting for IO completion  
> > What guest environment is generating this?  Typically I don't see that
> > Windows or Linux guests bounce the interrupt configuration much.
> > Thanks,
> >
> > Alex  
> 
> By tracing centos5u8 on host, I found it keep masking and unmasking 
> interrupt like this:
> 
> [1566032533709879] index:28 irte_hi:000000010004a601 
> irte_lo:adb54bc000b98001
> [1566032533711242] index:28 irte_hi:0000000000000000 
> irte_lo:0000000000000000
> [1566032533711258] index:28 irte_hi:000000000004a601 
> irte_lo:00003fff00ac002d
> [1566032533711269] index:28 irte_hi:000000000004a601 
> irte_lo:00003fff00ac002d
[snip] 
> "[1566032533720007]" is timestamp in μs, so centos5u8 tiggers 30+ irte 
> modification within 10ms

Ok, that matches my understanding that only very old guests behave in
this manner.  It's a curious case to optimize as RHEL5 is in extended
life-cycle support, with regular maintenance releases ending 2+ years
ago.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops
       [not found]     ` <7a3606ad-8fa6-45d5-b5a4-ee3f07893a25@linux.alibaba.com>
@ 2019-08-20 15:27       ` Liu, Jiang
  2019-08-22 13:41         ` luoben
  0 siblings, 1 reply; 15+ messages in thread
From: Liu, Jiang @ 2019-08-20 15:27 UTC (permalink / raw)
  To: luoben
  Cc: Thomas Gleixner, alex.williamson, linux-kernel, tao.ma,
	nanhai.zou, linyunsheng



> On Aug 20, 2019, at 11:24 PM, luoben <luoben@linux.alibaba.com> wrote:
> 
> 
> 
> 在 2019/8/16 上午12:45, Thomas Gleixner 写道:
>> On Thu, 15 Aug 2019, Ben Luo wrote:
>> 
>>>  	if (vdev->ctx[vector].trigger) {
>>> -		free_irq(irq, vdev->ctx[vector].trigger);
>>> -		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>> -		kfree(vdev->ctx[vector].name);
>>> -		eventfd_ctx_put(vdev->ctx[vector].trigger);
>>> -		vdev->ctx[vector].trigger = NULL;
>>> +		if (vdev->ctx[vector].trigger == trigger) {
>>> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>> 
>> What's the point of unregistering the producer, just to register it again below?
> Whether producer token changed or not, irq_bypass_register_producer() is a way (seems the
> 
> only way) to update IRTE by VFIO for posted interrupt.
> 
> When posted interrupt is in use, the target IRTE will be used by IOMMU directly to find the
> 
> target CPU for an interrupt posted to VM, since hypervisor is bypassed.
> 
> irq_bypass_register_producer() will modify IRTE based on the information retrieved from KVM,
> 
> 
> 0xffffffff8150a920 : modify_irte+0x0/0x180 [kernel]
> 0xffffffff8150ab94 : intel_ir_set_vcpu_affinity+0xf4/0x150 [kernel]
> 0xffffffff81125f3c : irq_set_vcpu_affinity+0x5c/0xa0 [kernel]
> 0xffffffffa0550818 : vmx_update_pi_irte+0x178/0x290 [kvm_intel]    // get pi_desc etc. from KVM
> 0xffffffffa052b789 : kvm_arch_irq_bypass_add_producer+0x39/0x50 [kvm_intel] (inexact)
> 0xffffffffa024a50b : __connect+0x7b/0xa0 [kvm] (inexact)
> 0xffffffffa024a6a8 : irq_bypass_register_producer+0x108/0x140 [kvm] (inexact)
> 0xffffffffa0338386 : vfio_msi_set_vector_signal+0x1b6/0x2c0 [vfio_pci] (inexact)
>> 
>>> +		} else if (trigger) {
>>> +			ret = update_irq_devid(irq,
>>> +					vdev->ctx[vector].trigger, trigger);
>>> +			if (unlikely(ret)) {
>>> +				dev_info(&pdev->dev,
>>> +					 "update_irq_devid %d (token %p) fails: %d\n",
>>> 
>> s/fails/failed/
>> 
>> 
>>> +					 irq, vdev->ctx[vector].trigger, ret);
>>> +				eventfd_ctx_put(trigger);
>>> +				return ret;
>>> +			}
>>> 
>> This lacks any form of comment why this is correct. dev_id is updated and
>> the producer with the old token is still registered.
>> 
> ok, I will add comment like below:
> 
> For KVM-VFIO case, once producer and consumer connected successfully, interrupt from passthrough
> 
> device will be directly delivered to VM instead of triggering interrupt process in HOST. If producer and
> 
> consumer are disconnected, this interrupt process will fall back to remap mode, the handler vfio_msihandler()
> 
> registered in corresponding irqaction will use the dev_id to find the right way to deliver the interrupt to VM.
> 
> So, it is safe to update dev_id before unregistration of irq-bypass producer, i.e. switch back from posted
> 
> mode to remap mode, since only in remap mode the 'dev_id' will be used by interrupt handler. To producer
> 
> and consumer, dev_id is only a token for pairing them togather.
>> 
>>> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>> 
>> Now it's unregistered.
>> 
>> 
>>> +			eventfd_ctx_put(vdev->ctx[vector].trigger);
>>> +			vdev->ctx[vector].producer.token = trigger;
>>> 
>> The token is updated and then it's newly registered below.
>> 
>> 
>>> +			vdev->ctx[vector].trigger = trigger;
>>> -	vdev->ctx[vector].producer.token = trigger;
>>> -	vdev->ctx[vector].producer.irq = irq;
>>> +	/* always update irte for posted mode */
>>>  	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
>>>  	if (unlikely(ret))
>>>  		dev_info(&pdev->dev,
>>>  		"irq bypass producer (token %p) registration fails: %d\n",
>>>  		vdev->ctx[vector].producer.token, ret);
>>> 
>> I know this code already existed, but again this looks inconsistent. If the
>> registration fails then a subsequent update will try to unregister a not
>> registered producer. Does not make any sense to me.
>> 
> irq_bypass_register_producer() also fails on duplicated registration, so maybe an unconditional try to
> 
> unregistration is a easy way to keep consistent. 
> 
> Maybe it's better to change these function names to irq_bypass_try_register_producer() and
> 
> irq_bypass_try_unregister_producer()  :)
> 
> 
Hi Ben,
	The point is that we shouldn’t blindly try to register again  if we fails to unregister a posted interrupt producer. By this way, the fast path (posted interrupt) may get disabled, but it’s safer than blindly ignoring the failure of ungistration.
Thanks,
Gerry 
> 
> Thanks,
> 
>     Ben


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops
  2019-08-20 15:22     ` Alex Williamson
@ 2019-08-20 15:51       ` luoben
  0 siblings, 0 replies; 15+ messages in thread
From: luoben @ 2019-08-20 15:51 UTC (permalink / raw)
  To: Alex Williamson
  Cc: tglx, linux-kernel, tao.ma, gerry, nanhai.zou, linyunsheng


在 2019/8/20 下午11:22, Alex Williamson 写道:
> On Tue, 20 Aug 2019 12:03:50 +0800
> luoben <luoben@linux.alibaba.com> wrote:
>
>> 在 2019/8/20 上午4:51, Alex Williamson 写道:
>>> On Thu, 15 Aug 2019 21:02:58 +0800
>>> Ben Luo <luoben@linux.alibaba.com> wrote:
>>>   
>>>> Currently, VFIO takes a lot of free-then-request-irq actions whenever
>>>> a VM (with device passthru via VFIO) sets irq affinity or mask/unmask
>>>> irq. Those actions only change the cookie data of irqaction or even
>>>> change nothing. The free-then-request-irq not only adds more latency,
>>>> but also increases the risk of losing interrupt, which may lead to a
>>>> VM hung forever in waiting for IO completion
>>> What guest environment is generating this?  Typically I don't see that
>>> Windows or Linux guests bounce the interrupt configuration much.
>>> Thanks,
>>>
>>> Alex
>> By tracing centos5u8 on host, I found it keep masking and unmasking
>> interrupt like this:
>>
>> [1566032533709879] index:28 irte_hi:000000010004a601
>> irte_lo:adb54bc000b98001
>> [1566032533711242] index:28 irte_hi:0000000000000000
>> irte_lo:0000000000000000
>> [1566032533711258] index:28 irte_hi:000000000004a601
>> irte_lo:00003fff00ac002d
>> [1566032533711269] index:28 irte_hi:000000000004a601
>> irte_lo:00003fff00ac002d
> [snip]
>> "[1566032533720007]" is timestamp in μs, so centos5u8 tiggers 30+ irte
>> modification within 10ms
> Ok, that matches my understanding that only very old guests behave in
> this manner.  It's a curious case to optimize as RHEL5 is in extended
> life-cycle support, with regular maintenance releases ending 2+ years
> ago.  Thanks,
>
> Alex

But repeatedly set interrupt affinity in a new version guest can also 
cause the problem.


Thanks,

     Ben


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops
  2019-08-20 15:27       ` Liu, Jiang
@ 2019-08-22 13:41         ` luoben
  0 siblings, 0 replies; 15+ messages in thread
From: luoben @ 2019-08-22 13:41 UTC (permalink / raw)
  To: Liu, Jiang
  Cc: Thomas Gleixner, alex.williamson, linux-kernel, tao.ma,
	nanhai.zou, linyunsheng


在 2019/8/20 下午11:27, Liu, Jiang 写道:
>
>> On Aug 20, 2019, at 11:24 PM, luoben <luoben@linux.alibaba.com> wrote:
>>
>>
>>
>> 在 2019/8/16 上午12:45, Thomas Gleixner 写道:
>>> On Thu, 15 Aug 2019, Ben Luo wrote:
>>>
>>>>   	if (vdev->ctx[vector].trigger) {
>>>> -		free_irq(irq, vdev->ctx[vector].trigger);
>>>> -		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>>> -		kfree(vdev->ctx[vector].name);
>>>> -		eventfd_ctx_put(vdev->ctx[vector].trigger);
>>>> -		vdev->ctx[vector].trigger = NULL;
>>>> +		if (vdev->ctx[vector].trigger == trigger) {
>>>> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>>>
>>> What's the point of unregistering the producer, just to register it again below?
>> Whether producer token changed or not, irq_bypass_register_producer() is a way (seems the
>>
>> only way) to update IRTE by VFIO for posted interrupt.
>>
>> When posted interrupt is in use, the target IRTE will be used by IOMMU directly to find the
>>
>> target CPU for an interrupt posted to VM, since hypervisor is bypassed.
>>
>> irq_bypass_register_producer() will modify IRTE based on the information retrieved from KVM,
>>
>>
>> 0xffffffff8150a920 : modify_irte+0x0/0x180 [kernel]
>> 0xffffffff8150ab94 : intel_ir_set_vcpu_affinity+0xf4/0x150 [kernel]
>> 0xffffffff81125f3c : irq_set_vcpu_affinity+0x5c/0xa0 [kernel]
>> 0xffffffffa0550818 : vmx_update_pi_irte+0x178/0x290 [kvm_intel]    // get pi_desc etc. from KVM
>> 0xffffffffa052b789 : kvm_arch_irq_bypass_add_producer+0x39/0x50 [kvm_intel] (inexact)
>> 0xffffffffa024a50b : __connect+0x7b/0xa0 [kvm] (inexact)
>> 0xffffffffa024a6a8 : irq_bypass_register_producer+0x108/0x140 [kvm] (inexact)
>> 0xffffffffa0338386 : vfio_msi_set_vector_signal+0x1b6/0x2c0 [vfio_pci] (inexact)
>>>> +		} else if (trigger) {
>>>> +			ret = update_irq_devid(irq,
>>>> +					vdev->ctx[vector].trigger, trigger);
>>>> +			if (unlikely(ret)) {
>>>> +				dev_info(&pdev->dev,
>>>> +					 "update_irq_devid %d (token %p) fails: %d\n",
>>>>
>>> s/fails/failed/
>>>
>>>
>>>> +					 irq, vdev->ctx[vector].trigger, ret);
>>>> +				eventfd_ctx_put(trigger);
>>>> +				return ret;
>>>> +			}
>>>>
>>> This lacks any form of comment why this is correct. dev_id is updated and
>>> the producer with the old token is still registered.
>>>
>> ok, I will add comment like below:
>>
>> For KVM-VFIO case, once producer and consumer connected successfully, interrupt from passthrough
>>
>> device will be directly delivered to VM instead of triggering interrupt process in HOST. If producer and
>>
>> consumer are disconnected, this interrupt process will fall back to remap mode, the handler vfio_msihandler()
>>
>> registered in corresponding irqaction will use the dev_id to find the right way to deliver the interrupt to VM.
>>
>> So, it is safe to update dev_id before unregistration of irq-bypass producer, i.e. switch back from posted
>>
>> mode to remap mode, since only in remap mode the 'dev_id' will be used by interrupt handler. To producer
>>
>> and consumer, dev_id is only a token for pairing them togather.
>>>> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>>>
>>> Now it's unregistered.
>>>
>>>
>>>> +			eventfd_ctx_put(vdev->ctx[vector].trigger);
>>>> +			vdev->ctx[vector].producer.token = trigger;
>>>>
>>> The token is updated and then it's newly registered below.
>>>
>>>
>>>> +			vdev->ctx[vector].trigger = trigger;
>>>> -	vdev->ctx[vector].producer.token = trigger;
>>>> -	vdev->ctx[vector].producer.irq = irq;
>>>> +	/* always update irte for posted mode */
>>>>   	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
>>>>   	if (unlikely(ret))
>>>>   		dev_info(&pdev->dev,
>>>>   		"irq bypass producer (token %p) registration fails: %d\n",
>>>>   		vdev->ctx[vector].producer.token, ret);
>>>>
>>> I know this code already existed, but again this looks inconsistent. If the
>>> registration fails then a subsequent update will try to unregister a not
>>> registered producer. Does not make any sense to me.
>>>
>> irq_bypass_register_producer() also fails on duplicated registration, so maybe an unconditional try to
>>
>> unregistration is a easy way to keep consistent.
>>
>> Maybe it's better to change these function names to irq_bypass_try_register_producer() and
>>
>> irq_bypass_try_unregister_producer()  :)
>>
>>
> Hi Ben,
> 	The point is that we shouldn’t blindly try to register again  if we fails to unregister a posted interrupt producer. By this way, the fast path (posted interrupt) may get disabled, but it’s safer than blindly ignoring the failure of ungistration.
> Thanks,
> Gerry

This may need another patchset to enhance the irq_bypass 
register/unregister functions first, at least to return some error code 
when irq_bypass_try_unregister_producer() fails. I would like to have a 
try later.

Thanks,

     Ben


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-08-22 13:41 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-15 13:02 [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO irq ops Ben Luo
2019-08-15 13:02 ` [PATCH v3 1/3] genirq: enhance error recovery code in free irq Ben Luo
2019-08-15 14:20   ` Thomas Gleixner
2019-08-15 13:03 ` [PATCH v3 2/3] genirq: introduce update_irq_devid() Ben Luo
2019-08-15 14:58   ` Thomas Gleixner
2019-08-19  5:31     ` luoben
2019-08-19  8:18       ` Thomas Gleixner
2019-08-15 13:03 ` [PATCH v3 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops Ben Luo
2019-08-15 16:45   ` Thomas Gleixner
     [not found]     ` <7a3606ad-8fa6-45d5-b5a4-ee3f07893a25@linux.alibaba.com>
2019-08-20 15:27       ` Liu, Jiang
2019-08-22 13:41         ` luoben
2019-08-19 20:51 ` [PATCH v3 0/3] genirq/vfio: Introduce update_irq_devid and optimize VFIO " Alex Williamson
2019-08-20  4:03   ` luoben
2019-08-20 15:22     ` Alex Williamson
2019-08-20 15:51       ` luoben

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).