* [PATCH 0/2] introduce update_irq_devid and optimize VFIO irq ops
@ 2019-08-08 12:07 Ben Luo
  2019-08-08 12:07 ` [PATCH 1/2] genirq: introduce update_irq_devid() Ben Luo
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Ben Luo @ 2019-08-08 12:07 UTC (permalink / raw)
  To: tglx, alex.williamson, linux-kernel; +Cc: tao.ma, gerry

Currently, VFIO performs a free-then-request-irq cycle whenever a VM
(with a device passed through via VFIO) sets irq affinity or masks/unmasks
an irq. Those actions only change the cookie data of the irqaction, or even
change nothing. The free-then-request-irq cycle not only adds latency,
but also increases the risk of losing an interrupt, which may leave the
VM hung forever waiting for IO completion.

This patchset solves the issue as follows:
Patch 1 introduces update_irq_devid() to update only the dev_id of an irqaction
Patch 2 makes use of update_irq_devid() and optimizes irq operations in VFIO
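
As an illustration of the difference, here is a minimal sketch of the two
flows (hypothetical caller code, not part of the patches; handler, name,
old_ctx and new_ctx are placeholders):

	/*
	 * Current flow: the irq is torn down and set up again just to
	 * swap the cookie; the vector is unbound in between, and an
	 * interrupt arriving in that window may be lost.
	 */
	free_irq(irq, old_ctx);
	ret = request_irq(irq, handler, 0, name, new_ctx);

	/*
	 * With patch 1: the irqaction stays installed and only the
	 * cookie handed to the handler is replaced.
	 */
	ret = update_irq_devid(irq, old_ctx, new_ctx);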

Ben Luo (2):
  genirq: introduce update_irq_devid()
  vfio_pci: make use of update_irq_devid and optimize irq ops

 drivers/vfio/pci/vfio_pci_intrs.c | 100 +++++++++++++++++++++++---------------
 include/linux/interrupt.h         |   3 ++
 kernel/irq/manage.c               |  74 ++++++++++++++++++++++++++++
 3 files changed, 139 insertions(+), 38 deletions(-)

-- 
1.8.3.1



* [PATCH 1/2] genirq: introduce update_irq_devid()
  2019-08-08 12:07 [PATCH 0/2] introduce update_irq_devid and optimize VFIO irq ops Ben Luo
@ 2019-08-08 12:07 ` Ben Luo
  2019-08-08 19:56   ` Thomas Gleixner
  2019-08-08 12:07 ` [PATCH 2/2] vfio_pci: make use of update_irq_devid and optimize irq ops Ben Luo
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Ben Luo @ 2019-08-08 12:07 UTC (permalink / raw)
  To: tglx, alex.williamson, linux-kernel; +Cc: tao.ma, gerry

Sometimes, only the dev_id field of an irqaction needs to be changed.
E.g. a KVM VM with a device passed through via VFIO may switch its irq
injection path between KVM irqfd and userspace eventfd. These two paths
share the same irq number and handler for the same vector of a device;
they differ only in the dev_id referencing the respective fd's context.

So, instead of freeing and re-requesting the irq, only update the dev_id
of the irqaction. This narrows the window in which a new irq (and IRTE,
if any) has to be set up, and also gains some performance benefit.
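
A minimal caller-side sketch of the intended use (hypothetical driver
code, error handling trimmed; vec->trigger is a made-up per-vector field
holding the current eventfd context):

	struct eventfd_ctx *new = eventfd_ctx_fdget(fd);

	if (IS_ERR(new))
		return PTR_ERR(new);

	/* same irq number, same handler, only the cookie changes */
	ret = update_irq_devid(irq, vec->trigger, new);
	if (ret) {
		eventfd_ctx_put(new);
		return ret;
	}

	/*
	 * update_irq_devid() has already done synchronize_irq(), so the
	 * handler can no longer observe the old context here.
	 */
	eventfd_ctx_put(vec->trigger);
	vec->trigger = new;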

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
Reviewed-by: Liu Jiang <gerry@linux.alibaba.com>
---
 include/linux/interrupt.h |  3 ++
 kernel/irq/manage.c       | 74 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5b8328a..2838113 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -172,6 +172,9 @@ struct irqaction {
 request_percpu_nmi(unsigned int irq, irq_handler_t handler,
 		   const char *devname, void __percpu *dev);
 
+extern int update_irq_devid(unsigned int irq, void *dev_id,
+		void *new_dev_id);
+
 extern const void *free_irq(unsigned int, void *);
 extern void free_percpu_irq(unsigned int, void __percpu *);
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f17..18dc10d 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -2059,6 +2059,80 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
 EXPORT_SYMBOL(request_threaded_irq);
 
 /**
+ *	update_irq_devid - update irq dev_id to a new one
+ *	@irq: Interrupt line to update
+ *	@dev_id: A cookie to find the irqaction to update
+ *	@new_dev_id: New cookie passed to the handler function
+ *
+ *	Sometimes, only the cookie data needs to be changed.
+ *	Instead of freeing and re-requesting the irq, only update
+ *	dev_id here. This not only gains some performance benefit
+ *	but also reduces the risk of losing an interrupt.
+ *
+ *	E.g. irq affinity setting in a VM with a VFIO passthru
+ *	device is carried out in a free-then-request-irq way
+ *	for lack of this very function. The free_irq()
+ *	eventually triggers deactivation of the IR domain,
+ *	which cleans up the IRTE. There is a gap before
+ *	request_irq() finally sets up the IRTE again. In this
+ *	gap, an in-flight irq buffered in the hardware layer
+ *	may trigger a DMAR fault and be lost.
+ *
+ *	Returns 0 on success and a negative error code on
+ *	failure.
+ */
+int update_irq_devid(unsigned int irq, void *dev_id, void *new_dev_id)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	struct irqaction *action, **action_ptr;
+	unsigned long flags;
+
+	WARN(in_interrupt(),
+			"Trying to update IRQ %d from IRQ context!\n", irq);
+
+	if (!desc)
+		return -EINVAL;
+
+	chip_bus_lock(desc);
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	/*
+	 * There can be multiple actions per IRQ descriptor, find the right
+	 * one based on the dev_id:
+	 */
+	action_ptr = &desc->action;
+	for (;;) {
+		action = *action_ptr;
+
+		if (!action) {
+			WARN(1, "Trying to update already-free IRQ %d\n", irq);
+			raw_spin_unlock_irqrestore(&desc->lock, flags);
+			chip_bus_sync_unlock(desc);
+			return -ENXIO;
+		}
+
+		if (action->dev_id == dev_id) {
+			action->dev_id = new_dev_id;
+			break;
+		}
+		action_ptr = &action->next;
+	}
+
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+	chip_bus_sync_unlock(desc);
+
+	/*
+	 * Make sure it's not being used on another CPU:
+	 * there is a risk of use-after-free for the old *dev_id
+	 * if it is freed shortly after this function returns.
+	 */
+	synchronize_irq(irq);
+
+	return 0;
+}
+EXPORT_SYMBOL(update_irq_devid);
+
+/**
  *	request_any_context_irq - allocate an interrupt line
  *	@irq: Interrupt line to allocate
  *	@handler: Function to be called when the IRQ occurs.
-- 
1.8.3.1



* [PATCH 2/2] vfio_pci: make use of update_irq_devid and optimize irq ops
  2019-08-08 12:07 [PATCH 0/2] introduce update_irq_devid and optimize VFIO irq ops Ben Luo
  2019-08-08 12:07 ` [PATCH 1/2] genirq: introduce update_irq_devid() Ben Luo
@ 2019-08-08 12:07 ` Ben Luo
  2019-08-12  5:44   ` Yunsheng Lin
  2019-08-12  8:52 ` [PATCH v2 0/3] introduce update_irq_devid and optimize VFIO " Ben Luo
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Ben Luo @ 2019-08-08 12:07 UTC (permalink / raw)
  To: tglx, alex.williamson, linux-kernel; +Cc: tao.ma, gerry

When userspace (e.g. qemu) triggers a switch between KVM irqfd and
userspace eventfd, only the dev_id of the irq action (i.e. the
"trigger" in this patch's context) changes, yet the current code
performs a free-then-request-irq cycle. Likewise, setting irq
affinity in the VM also triggers a free-then-request-irq cycle,
which actually changes nothing but fires a producer re-registration
to update the IRTE in case posted interrupts are in use.

This patch makes use of update_irq_devid() and optimizes both cases
above, which reduces the risk of losing an interrupt and also cuts
some overhead.
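
For reference, the reworked control flow of vfio_msi_set_vector_signal()
boils down to the following (simplified sketch of the diff below, error
handling omitted):

	if (vdev->ctx[vector].trigger) {
		if (trigger == vdev->ctx[vector].trigger) {
			/* same eventfd (e.g. affinity change): just refresh the producer */
			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
		} else if (trigger) {
			/* irqfd <-> eventfd switch: swap the cookie in place */
			update_irq_devid(irq, vdev->ctx[vector].trigger, trigger);
			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
			eventfd_ctx_put(vdev->ctx[vector].trigger);
			vdev->ctx[vector].producer.token = trigger;
			vdev->ctx[vector].trigger = trigger;
		} else {
			/* fd < 0: full teardown, as before */
			free_irq(irq, vdev->ctx[vector].trigger);
			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
			kfree(vdev->ctx[vector].name);
			eventfd_ctx_put(vdev->ctx[vector].trigger);
			vdev->ctx[vector].trigger = NULL;
		}
	}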

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
Reviewed-by: Liu Jiang <gerry@linux.alibaba.com>
---
 drivers/vfio/pci/vfio_pci_intrs.c | 100 +++++++++++++++++++++++---------------
 1 file changed, 62 insertions(+), 38 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 3fa3f72..1323dc8 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -285,69 +285,93 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 				      int vector, int fd, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
-	struct eventfd_ctx *trigger;
+	struct eventfd_ctx *trigger = NULL;
 	int irq, ret;
 
 	if (vector < 0 || vector >= vdev->num_ctx)
 		return -EINVAL;
 
+	if (fd >= 0) {
+		trigger = eventfd_ctx_fdget(fd);
+		if (IS_ERR(trigger))
+			return PTR_ERR(trigger);
+	}
+
 	irq = pci_irq_vector(pdev, vector);
 
 	if (vdev->ctx[vector].trigger) {
-		free_irq(irq, vdev->ctx[vector].trigger);
-		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
-		kfree(vdev->ctx[vector].name);
-		eventfd_ctx_put(vdev->ctx[vector].trigger);
-		vdev->ctx[vector].trigger = NULL;
+		if (vdev->ctx[vector].trigger != trigger) {
+			if (trigger) {
+				ret = update_irq_devid(irq,
+						vdev->ctx[vector].trigger, trigger);
+				if (unlikely(ret)) {
+					dev_info(&pdev->dev,
+							"update_irq_devid %d (token %p) fails: %d\n",
+							irq, vdev->ctx[vector].trigger, ret);
+					eventfd_ctx_put(trigger);
+					return ret;
+				}
+				irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+				eventfd_ctx_put(vdev->ctx[vector].trigger);
+				vdev->ctx[vector].producer.token = trigger;
+				vdev->ctx[vector].trigger = trigger;
+			} else {
+				free_irq(irq, vdev->ctx[vector].trigger);
+				irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+				kfree(vdev->ctx[vector].name);
+				eventfd_ctx_put(vdev->ctx[vector].trigger);
+				vdev->ctx[vector].trigger = NULL;
+			}
+		} else
+			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
 	}
 
 	if (fd < 0)
 		return 0;
 
-	vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
-					   msix ? "x" : "", vector,
-					   pci_name(pdev));
-	if (!vdev->ctx[vector].name)
-		return -ENOMEM;
+	if (vdev->ctx[vector].trigger == NULL) {
+		vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
+						   msix ? "x" : "", vector,
+						   pci_name(pdev));
+		if (!vdev->ctx[vector].name) {
+			eventfd_ctx_put(trigger);
+			return -ENOMEM;
+		}
 
-	trigger = eventfd_ctx_fdget(fd);
-	if (IS_ERR(trigger)) {
-		kfree(vdev->ctx[vector].name);
-		return PTR_ERR(trigger);
-	}
+		/*
+		 * The MSIx vector table resides in device memory which may be cleared
+		 * via backdoor resets. We don't allow direct access to the vector
+		 * table so even if a userspace driver attempts to save/restore around
+		 * such a reset it would be unsuccessful. To avoid this, restore the
+		 * cached value of the message prior to enabling.
+		 */
+		if (msix) {
+			struct msi_msg msg;
 
-	/*
-	 * The MSIx vector table resides in device memory which may be cleared
-	 * via backdoor resets. We don't allow direct access to the vector
-	 * table so even if a userspace driver attempts to save/restore around
-	 * such a reset it would be unsuccessful. To avoid this, restore the
-	 * cached value of the message prior to enabling.
-	 */
-	if (msix) {
-		struct msi_msg msg;
+			get_cached_msi_msg(irq, &msg);
+			pci_write_msi_msg(irq, &msg);
+		}
 
-		get_cached_msi_msg(irq, &msg);
-		pci_write_msi_msg(irq, &msg);
-	}
+		ret = request_irq(irq, vfio_msihandler, 0,
+				  vdev->ctx[vector].name, trigger);
+		if (ret) {
+			kfree(vdev->ctx[vector].name);
+			eventfd_ctx_put(trigger);
+			return ret;
+		}
 
-	ret = request_irq(irq, vfio_msihandler, 0,
-			  vdev->ctx[vector].name, trigger);
-	if (ret) {
-		kfree(vdev->ctx[vector].name);
-		eventfd_ctx_put(trigger);
-		return ret;
+		vdev->ctx[vector].producer.token = trigger;
+		vdev->ctx[vector].producer.irq = irq;
+		vdev->ctx[vector].trigger = trigger;
 	}
 
-	vdev->ctx[vector].producer.token = trigger;
-	vdev->ctx[vector].producer.irq = irq;
+	/* always update irte for posted mode */
 	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
 	if (unlikely(ret))
 		dev_info(&pdev->dev,
 		"irq bypass producer (token %p) registration fails: %d\n",
 		vdev->ctx[vector].producer.token, ret);
 
-	vdev->ctx[vector].trigger = trigger;
-
 	return 0;
 }
 
-- 
1.8.3.1



* Re: [PATCH 1/2] genirq: introduce update_irq_devid()
  2019-08-08 12:07 ` [PATCH 1/2] genirq: introduce update_irq_devid() Ben Luo
@ 2019-08-08 19:56   ` Thomas Gleixner
  2019-08-12  2:59     ` luoben
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2019-08-08 19:56 UTC (permalink / raw)
  To: Ben Luo; +Cc: alex.williamson, linux-kernel, tao.ma, gerry

On Thu, 8 Aug 2019, Ben Luo wrote:
> +int update_irq_devid(unsigned int irq, void *dev_id, void *new_dev_id)
> +{
> +	struct irq_desc *desc = irq_to_desc(irq);
> +	struct irqaction *action, **action_ptr;
> +	unsigned long flags;
> +
> +	WARN(in_interrupt(),
> +			"Trying to update IRQ %d from IRQ context!\n", irq);

This is broken. The function needs to return on that condition. Actually it
cannot even be called from non-preemptible code.

What's worse is that if the interrupt in question is handled concurrently,
then it will either see the old or the new dev_id and because the interrupt
handler loop runs with desc->lock dropped even more crap can happen because
dev_id can be subject to load and store tearing.

Staring at that, I see that there is the same issue in setup_irq() and
free_irq(). It's actually worse there. I'll have a look.
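
To illustrate the tearing concern (a sketch only, not the actual genirq
code): with plain accesses the compiler may split the pointer store or
load, so the handler path, which runs with desc->lock dropped, could
observe a half-written value. Single-copy-atomic accessors avoid the
tearing, though the handler may still see either the old or the new
cookie:

	/* updater side, under desc->lock */
	WRITE_ONCE(action->dev_id, new_dev_id);

	/* handler invocation side, desc->lock not held */
	res = action->handler(irq, READ_ONCE(action->dev_id));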

> +	/*
> +	 * There can be multiple actions per IRQ descriptor, find the right
> +	 * one based on the dev_id:
> +	 */
> +	action_ptr = &desc->action;
> +	for (;;) {
> +		action = *action_ptr;
> +
> +		if (!action) {
> +			WARN(1, "Trying to update already-free IRQ %d\n", irq);

That's wrong in two aspects:

       1) The warn should be outside of the locked region.

       2) Just having the irq number is not useful for debugging either
       	  when the interrupt is shared.

> +			raw_spin_unlock_irqrestore(&desc->lock, flags);
> +			chip_bus_sync_unlock(desc);
> +			return -ENXIO;
> +		}
> +
> +		if (action->dev_id == dev_id) {
> +			action->dev_id = new_dev_id;
> +			break;
> +		}
> +		action_ptr = &action->next;
> +	}
> +
> +	raw_spin_unlock_irqrestore(&desc->lock, flags);
> +	chip_bus_sync_unlock(desc);
> +
> +	/*
> +	 * Make sure it's not being used on another CPU:
> +	 * There is a risk of UAF for old *dev_id, if it is
> +	 * freed in a short time after this func returns
> +	 */
> +	synchronize_irq(irq);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(update_irq_devid);

EXPORT_SYMBOL_GPL() please.

Thanks,

	tglx


* Re: [PATCH 1/2] genirq: introduce update_irq_devid()
  2019-08-08 19:56   ` Thomas Gleixner
@ 2019-08-12  2:59     ` luoben
  0 siblings, 0 replies; 11+ messages in thread
From: luoben @ 2019-08-12  2:59 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: alex.williamson, linux-kernel, tao.ma, gerry


On 2019/8/9 3:56 AM, Thomas Gleixner wrote:
> On Thu, 8 Aug 2019, Ben Luo wrote:
>> +int update_irq_devid(unsigned int irq, void *dev_id, void *new_dev_id)
>> +{
>> +	struct irq_desc *desc = irq_to_desc(irq);
>> +	struct irqaction *action, **action_ptr;
>> +	unsigned long flags;
>> +
>> +	WARN(in_interrupt(),
>> +			"Trying to update IRQ %d from IRQ context!\n", irq);
> This is broken. The function needs to return on that condition. Actually it
> cannot even be called from non-preemptible code.
>
> What's worse is that if the interrupt in question is handled concurrently,
> then it will either see the old or the new dev_id and because the interrupt
> handler loop runs with desc->lock dropped even more crap can happen because
> dev_id can be subject to load and store tearing.
>
> Staring at that, I see that there is the same issue in setup_irq() and
> free_irq(). It's actually worse there. I'll have a look.

ok,  will return with an error code in v2 in this case

>
>> +	/*
>> +	 * There can be multiple actions per IRQ descriptor, find the right
>> +	 * one based on the dev_id:
>> +	 */
>> +	action_ptr = &desc->action;
>> +	for (;;) {
>> +		action = *action_ptr;
>> +
>> +		if (!action) {
>> +			WARN(1, "Trying to update already-free IRQ %d\n", irq);
> That's wrong in two aspects:
>
>         1) The warn should be outside of the locked region.
>
>         2) Just having the irq number is not useful for debugging either
>         	  when the interrupt is shared.
will take care in v2
>> +			raw_spin_unlock_irqrestore(&desc->lock, flags);
>> +			chip_bus_sync_unlock(desc);
>> +			return -ENXIO;
>> +		}
>> +
>> +		if (action->dev_id == dev_id) {
>> +			action->dev_id = new_dev_id;
>> +			break;
>> +		}
>> +		action_ptr = &action->next;
>> +	}
>> +
>> +	raw_spin_unlock_irqrestore(&desc->lock, flags);
>> +	chip_bus_sync_unlock(desc);
>> +
>> +	/*
>> +	 * Make sure it's not being used on another CPU:
>> +	 * There is a risk of UAF for old *dev_id, if it is
>> +	 * freed in a short time after this func returns
>> +	 */
>> +	synchronize_irq(irq);
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(update_irq_devid);
> EXPORT_SYMBOL_GPL() please.
thanks, will use EXPORT_SYMBOL_GPL in v2
>
> Thanks,
>
> 	tglx

Thanks,

      Ben



* Re: [PATCH 2/2] vfio_pci: make use of update_irq_devid and optimize irq ops
  2019-08-08 12:07 ` [PATCH 2/2] vfio_pci: make use of update_irq_devid and optimize irq ops Ben Luo
@ 2019-08-12  5:44   ` Yunsheng Lin
  2019-08-12  6:37     ` luoben
  0 siblings, 1 reply; 11+ messages in thread
From: Yunsheng Lin @ 2019-08-12  5:44 UTC (permalink / raw)
  To: Ben Luo, tglx, alex.williamson, linux-kernel; +Cc: tao.ma, gerry

On 2019/8/8 20:07, Ben Luo wrote:
> When userspace (e.g. qemu) triggers a switch between KVM
> irqfd and userspace eventfd, only dev_id of irq action
> (i.e. the "trigger" in this patch's context) will be
> changed, but a free-then-request-irq action is taken in
> current code. And, irq affinity setting in VM will also
> trigger a free-then-request-irq action, which actully
> changes nothing, but only fires a producer re-registration
> to update irte in case that posted-interrupt is in use.
> 
> This patch makes use of update_irq_devid() and optimize
> both cases above, which reduces the risk of losing interrupt
> and also cuts some overhead.
> 
> Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
> Reviewed-by: Liu Jiang <gerry@linux.alibaba.com>
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c | 100 +++++++++++++++++++++++---------------
>  1 file changed, 62 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 3fa3f72..1323dc8 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -285,69 +285,93 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
>  				      int vector, int fd, bool msix)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
> -	struct eventfd_ctx *trigger;
> +	struct eventfd_ctx *trigger = NULL;

struct eventfd_ctx *trigger = NULL;
struct pci_dev *pdev = vdev->pdev;

to maintain reverse christmas tree?

>  	int irq, ret;
>  
>  	if (vector < 0 || vector >= vdev->num_ctx)
>  		return -EINVAL;
>  
> +	if (fd >= 0) {
> +		trigger = eventfd_ctx_fdget(fd);
> +		if (IS_ERR(trigger))
> +			return PTR_ERR(trigger);

It seems vdev->ctx[vector].trigger is freed even if fd < 0 before
this patch. If it returns here, vdev->ctx[vector].trigger is not freed?

> +	}
> +
>  	irq = pci_irq_vector(pdev, vector);
>  
>  	if (vdev->ctx[vector].trigger) {
> -		free_irq(irq, vdev->ctx[vector].trigger);
> -		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> -		kfree(vdev->ctx[vector].name);
> -		eventfd_ctx_put(vdev->ctx[vector].trigger);
> -		vdev->ctx[vector].trigger = NULL;
> +		if (vdev->ctx[vector].trigger != trigger) {
> +			if (trigger) {
> +				ret = update_irq_devid(irq,
> +						vdev->ctx[vector].trigger, trigger);
> +				if (unlikely(ret)) {
> +					dev_info(&pdev->dev,
> +							"update_irq_devid %d (token %p) fails: %d\n",
> +							irq, vdev->ctx[vector].trigger, ret);
> +					eventfd_ctx_put(trigger);
> +					return ret;
> +				}
> +				irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> +				eventfd_ctx_put(vdev->ctx[vector].trigger);
> +				vdev->ctx[vector].producer.token = trigger;
> +				vdev->ctx[vector].trigger = trigger;
> +			} else {
> +				free_irq(irq, vdev->ctx[vector].trigger);
> +				irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> +				kfree(vdev->ctx[vector].name);
> +				eventfd_ctx_put(vdev->ctx[vector].trigger);
> +				vdev->ctx[vector].trigger = NULL;
> +			}
> +		} else
> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>  	}

Maybe adjust it a little to reduce the indentation and improve readability?

	if (vdev->ctx[vector].trigger && vdev->ctx[vector].trigger == trigger) {
		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
	} else if (vdev->ctx[vector].trigger && !trigger) {
		free_irq(irq, vdev->ctx[vector].trigger);
		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
		kfree(vdev->ctx[vector].name);
		eventfd_ctx_put(vdev->ctx[vector].trigger);
		vdev->ctx[vector].trigger = NULL;
	} else if (vdev->ctx[vector].trigger) {
		ret = update_irq_devid(irq, vdev->ctx[vector].trigger, trigger);
		if (unlikely(ret)) {
			dev_info(&pdev->dev,
				 "update_irq_devid %d (token %p) fails: %d\n",
				 irq, vdev->ctx[vector].trigger, ret);
			eventfd_ctx_put(trigger);
			return ret;
		}
		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
		eventfd_ctx_put(vdev->ctx[vector].trigger);
		vdev->ctx[vector].producer.token = trigger;
		vdev->ctx[vector].trigger = trigger;
	}


>  
>  	if (fd < 0)
>  		return 0;
>  
> -	vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
> -					   msix ? "x" : "", vector,
> -					   pci_name(pdev));
> -	if (!vdev->ctx[vector].name)
> -		return -ENOMEM;
> +	if (vdev->ctx[vector].trigger == NULL) {

It may be more common to use the following check for NULL:
if (!vdev->ctx[vector].trigger)


> +		vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
> +						   msix ? "x" : "", vector,
> +						   pci_name(pdev));
> +		if (!vdev->ctx[vector].name) {
> +			eventfd_ctx_put(trigger);
> +			return -ENOMEM;
> +		}
>  
> -	trigger = eventfd_ctx_fdget(fd);
> -	if (IS_ERR(trigger)) {
> -		kfree(vdev->ctx[vector].name);
> -		return PTR_ERR(trigger);
> -	}
> +		/*
> +		 * The MSIx vector table resides in device memory which may be cleared
> +		 * via backdoor resets. We don't allow direct access to the vector
> +		 * table so even if a userspace driver attempts to save/restore around
> +		 * such a reset it would be unsuccessful. To avoid this, restore the
> +		 * cached value of the message prior to enabling.
> +		 */
> +		if (msix) {
> +			struct msi_msg msg;
>  
> -	/*
> -	 * The MSIx vector table resides in device memory which may be cleared
> -	 * via backdoor resets. We don't allow direct access to the vector
> -	 * table so even if a userspace driver attempts to save/restore around
> -	 * such a reset it would be unsuccessful. To avoid this, restore the
> -	 * cached value of the message prior to enabling.
> -	 */
> -	if (msix) {
> -		struct msi_msg msg;
> +			get_cached_msi_msg(irq, &msg);
> +			pci_write_msi_msg(irq, &msg);
> +		}
>  
> -		get_cached_msi_msg(irq, &msg);
> -		pci_write_msi_msg(irq, &msg);
> -	}
> +		ret = request_irq(irq, vfio_msihandler, 0,
> +				  vdev->ctx[vector].name, trigger);
> +		if (ret) {
> +			kfree(vdev->ctx[vector].name);
> +			eventfd_ctx_put(trigger);
> +			return ret;
> +		}
>  
> -	ret = request_irq(irq, vfio_msihandler, 0,
> -			  vdev->ctx[vector].name, trigger);
> -	if (ret) {
> -		kfree(vdev->ctx[vector].name);
> -		eventfd_ctx_put(trigger);
> -		return ret;
> +		vdev->ctx[vector].producer.token = trigger;
> +		vdev->ctx[vector].producer.irq = irq;
> +		vdev->ctx[vector].trigger = trigger;
>  	}
>  
> -	vdev->ctx[vector].producer.token = trigger;
> -	vdev->ctx[vector].producer.irq = irq;
> +	/* always update irte for posted mode */
>  	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
>  	if (unlikely(ret))
>  		dev_info(&pdev->dev,
>  		"irq bypass producer (token %p) registration fails: %d\n",
>  		vdev->ctx[vector].producer.token, ret);
>  
> -	vdev->ctx[vector].trigger = trigger;
> -
>  	return 0;
>  }
>  
> 



* Re: [PATCH 2/2] vfio_pci: make use of update_irq_devid and optimize irq ops
  2019-08-12  5:44   ` Yunsheng Lin
@ 2019-08-12  6:37     ` luoben
  0 siblings, 0 replies; 11+ messages in thread
From: luoben @ 2019-08-12  6:37 UTC (permalink / raw)
  To: Yunsheng Lin, tglx, alex.williamson, linux-kernel; +Cc: tao.ma, gerry


On 2019/8/12 1:44 PM, Yunsheng Lin wrote:
> On 2019/8/8 20:07, Ben Luo wrote:
>> When userspace (e.g. qemu) triggers a switch between KVM
>> irqfd and userspace eventfd, only dev_id of irq action
>> (i.e. the "trigger" in this patch's context) will be
>> changed, but a free-then-request-irq action is taken in
>> current code. And, irq affinity setting in VM will also
>> trigger a free-then-request-irq action, which actully
>> changes nothing, but only fires a producer re-registration
>> to update irte in case that posted-interrupt is in use.
>>
>> This patch makes use of update_irq_devid() and optimize
>> both cases above, which reduces the risk of losing interrupt
>> and also cuts some overhead.
>>
>> Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
>> Reviewed-by: Liu Jiang <gerry@linux.alibaba.com>
>> ---
>>   drivers/vfio/pci/vfio_pci_intrs.c | 100 +++++++++++++++++++++++---------------
>>   1 file changed, 62 insertions(+), 38 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
>> index 3fa3f72..1323dc8 100644
>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
>> @@ -285,69 +285,93 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
>>   				      int vector, int fd, bool msix)
>>   {
>>   	struct pci_dev *pdev = vdev->pdev;
>> -	struct eventfd_ctx *trigger;
>> +	struct eventfd_ctx *trigger = NULL;
> struct eventfd_ctx *trigger = NULL;
> struct pci_dev *pdev = vdev->pdev;
>
> to maintain reverse christmas tree?
ok,  keep reverse christmas tree in v2
>
>>   	int irq, ret;
>>   
>>   	if (vector < 0 || vector >= vdev->num_ctx)
>>   		return -EINVAL;
>>   
>> +	if (fd >= 0) {
>> +		trigger = eventfd_ctx_fdget(fd);
>> +		if (IS_ERR(trigger))
>> +			return PTR_ERR(trigger);
> It seems vdev->ctx[vector].trigger is freed even if  fd < 0 before
> this patch. If it return here, vdev->ctx[vector].trigger is not free?

If fd < 0, it won't enter here.

If fd >= 0, I think it's better to return and leave everything as it was,
and let the caller deal with the bad fd case and disable MSI to free the
resources if it wants.

>
>> +	}
>> +
>>   	irq = pci_irq_vector(pdev, vector);
>>   
>>   	if (vdev->ctx[vector].trigger) {
>> -		free_irq(irq, vdev->ctx[vector].trigger);
>> -		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>> -		kfree(vdev->ctx[vector].name);
>> -		eventfd_ctx_put(vdev->ctx[vector].trigger);
>> -		vdev->ctx[vector].trigger = NULL;
>> +		if (vdev->ctx[vector].trigger != trigger) {
>> +			if (trigger) {
>> +				ret = update_irq_devid(irq,
>> +						vdev->ctx[vector].trigger, trigger);
>> +				if (unlikely(ret)) {
>> +					dev_info(&pdev->dev,
>> +							"update_irq_devid %d (token %p) fails: %d\n",
>> +							irq, vdev->ctx[vector].trigger, ret);
>> +					eventfd_ctx_put(trigger);
>> +					return ret;
>> +				}
>> +				irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>> +				eventfd_ctx_put(vdev->ctx[vector].trigger);
>> +				vdev->ctx[vector].producer.token = trigger;
>> +				vdev->ctx[vector].trigger = trigger;
>> +			} else {
>> +				free_irq(irq, vdev->ctx[vector].trigger);
>> +				irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>> +				kfree(vdev->ctx[vector].name);
>> +				eventfd_ctx_put(vdev->ctx[vector].trigger);
>> +				vdev->ctx[vector].trigger = NULL;
>> +			}
>> +		} else
>> +			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>>   	}
> Maybe adjust it a litte to reduce indent and to improve readability?
>
> 	if (vdev->ctx[vector].trigger && vdev->ctx[vector].trigger == trigger) {
> 		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> 	} else if (vdev->ctx[vector].trigger && !trigger) {
> 		free_irq(irq, vdev->ctx[vector].trigger);
> 		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> 		kfree(vdev->ctx[vector].name);
> 		eventfd_ctx_put(vdev->ctx[vector].trigger);
> 		vdev->ctx[vector].trigger = NULL;
> 	} else if (vdev->ctx[vector].trigger) {
> 		ret = update_irq_devid(irq, vdev->ctx[vector].trigger, trigger);
> 		if (unlikely(ret)) {
> 			dev_info(&pdev->dev,
> 				 "update_irq_devid %d (token %p) fails: %d\n",
> 				 irq, vdev->ctx[vector].trigger, ret);
> 				 eventfd_ctx_put(trigger);
> 				 return ret;
> 		}
> 		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
> 		eventfd_ctx_put(vdev->ctx[vector].trigger);
> 		vdev->ctx[vector].producer.token = trigger;
> 		vdev->ctx[vector].trigger = trigger;
> 	}
>
I will reformat this chunk in v2
>>   
>>   	if (fd < 0)
>>   		return 0;
>>   
>> -	vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
>> -					   msix ? "x" : "", vector,
>> -					   pci_name(pdev));
>> -	if (!vdev->ctx[vector].name)
>> -		return -ENOMEM;
>> +	if (vdev->ctx[vector].trigger == NULL) {
> It may be common to use the below check to do NULL checking:
> If (!vdev->ctx[vector].trigger)
ok, make it this way in v2
>
>> +		vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
>> +						   msix ? "x" : "", vector,
>> +						   pci_name(pdev));
>> +		if (!vdev->ctx[vector].name) {
>> +			eventfd_ctx_put(trigger);
>> +			return -ENOMEM;
>> +		}
>>   
>> -	trigger = eventfd_ctx_fdget(fd);
>> -	if (IS_ERR(trigger)) {
>> -		kfree(vdev->ctx[vector].name);
>> -		return PTR_ERR(trigger);
>> -	}
>> +		/*
>> +		 * The MSIx vector table resides in device memory which may be cleared
>> +		 * via backdoor resets. We don't allow direct access to the vector
>> +		 * table so even if a userspace driver attempts to save/restore around
>> +		 * such a reset it would be unsuccessful. To avoid this, restore the
>> +		 * cached value of the message prior to enabling.
>> +		 */
>> +		if (msix) {
>> +			struct msi_msg msg;
>>   
>> -	/*
>> -	 * The MSIx vector table resides in device memory which may be cleared
>> -	 * via backdoor resets. We don't allow direct access to the vector
>> -	 * table so even if a userspace driver attempts to save/restore around
>> -	 * such a reset it would be unsuccessful. To avoid this, restore the
>> -	 * cached value of the message prior to enabling.
>> -	 */
>> -	if (msix) {
>> -		struct msi_msg msg;
>> +			get_cached_msi_msg(irq, &msg);
>> +			pci_write_msi_msg(irq, &msg);
>> +		}
>>   
>> -		get_cached_msi_msg(irq, &msg);
>> -		pci_write_msi_msg(irq, &msg);
>> -	}
>> +		ret = request_irq(irq, vfio_msihandler, 0,
>> +				  vdev->ctx[vector].name, trigger);
>> +		if (ret) {
>> +			kfree(vdev->ctx[vector].name);
>> +			eventfd_ctx_put(trigger);
>> +			return ret;
>> +		}
>>   
>> -	ret = request_irq(irq, vfio_msihandler, 0,
>> -			  vdev->ctx[vector].name, trigger);
>> -	if (ret) {
>> -		kfree(vdev->ctx[vector].name);
>> -		eventfd_ctx_put(trigger);
>> -		return ret;
>> +		vdev->ctx[vector].producer.token = trigger;
>> +		vdev->ctx[vector].producer.irq = irq;
>> +		vdev->ctx[vector].trigger = trigger;
>>   	}
>>   
>> -	vdev->ctx[vector].producer.token = trigger;
>> -	vdev->ctx[vector].producer.irq = irq;
>> +	/* always update irte for posted mode */
>>   	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
>>   	if (unlikely(ret))
>>   		dev_info(&pdev->dev,
>>   		"irq bypass producer (token %p) registration fails: %d\n",
>>   		vdev->ctx[vector].producer.token, ret);
>>   
>> -	vdev->ctx[vector].trigger = trigger;
>> -
>>   	return 0;
>>   }
>>   

Thanks,

     Ben



* [PATCH v2 0/3] introduce update_irq_devid and optimize VFIO irq ops
  2019-08-08 12:07 [PATCH 0/2] introduce update_irq_devid and optimize VFIO irq ops Ben Luo
  2019-08-08 12:07 ` [PATCH 1/2] genirq: introduce update_irq_devid() Ben Luo
  2019-08-08 12:07 ` [PATCH 2/2] vfio_pci: make use of update_irq_devid and optimize irq ops Ben Luo
@ 2019-08-12  8:52 ` Ben Luo
  2019-08-12  8:52 ` [PATCH v2 1/3] genirq: enhance error recovery code in free irq Ben Luo
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Ben Luo @ 2019-08-12  8:52 UTC (permalink / raw)
  To: tglx, alex.williamson, linux-kernel
  Cc: tao.ma, gerry, nanhai.zou, linyunsheng

Currently, VFIO performs a free-then-request-irq cycle whenever a VM
(with a device passed through via VFIO) sets irq affinity or masks/unmasks
an irq. Those actions only change the cookie data of the irqaction, or even
change nothing. The free-then-request-irq cycle not only adds latency,
but also increases the risk of losing an interrupt, which may leave the
VM hung forever waiting for IO completion.

This patchset solves the issue as follows:
Patch 2 introduces update_irq_devid() to update only the dev_id of an irqaction
Patch 3 makes use of update_irq_devid() and optimizes irq operations in VFIO

Changes from v1:
 - add Patch 1 to enhance error recovery etc. in free irq, per tglx's comments
 - enhance error recovery code and debugging info in update_irq_devid
 - mark the external declaration of update_irq_devid with __must_check
 - use EXPORT_SYMBOL_GPL for update_irq_devid
 - reformat the code of patch 3 for better readability

Ben Luo (3):
  genirq: enhance error recovery code in free irq
  genirq: introduce update_irq_devid()
  vfio_pci: make use of update_irq_devid and optimize irq ops

 drivers/vfio/pci/vfio_pci_intrs.c |  99 +++++++++++++++++++++-------------
 include/linux/interrupt.h         |   3 ++
 kernel/irq/manage.c               | 111 +++++++++++++++++++++++++++++++++-----
 3 files changed, 163 insertions(+), 50 deletions(-)

-- 
1.8.3.1



* [PATCH v2 1/3] genirq: enhance error recovery code in free irq
  2019-08-08 12:07 [PATCH 0/2] introduce update_irq_devid and optimize VFIO irq ops Ben Luo
                   ` (2 preceding siblings ...)
  2019-08-12  8:52 ` [PATCH v2 0/3] introduce update_irq_devid and optimize VFIO " Ben Luo
@ 2019-08-12  8:52 ` Ben Luo
  2019-08-12  8:52 ` [PATCH v2 2/3] genirq: introduce update_irq_devid() Ben Luo
  2019-08-12  8:52 ` [PATCH v2 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops Ben Luo
  5 siblings, 0 replies; 11+ messages in thread
From: Ben Luo @ 2019-08-12  8:52 UTC (permalink / raw)
  To: tglx, alex.williamson, linux-kernel
  Cc: tao.ma, gerry, nanhai.zou, linyunsheng

Per Thomas Gleixner's comments:
1) free_irq/free_percpu_irq now return early (without freeing) when
   called from IRQ context
2) move the WARN out of the locked region and also print out dev_id

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
---
 kernel/irq/manage.c | 32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f17..b97ee5e 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1690,7 +1690,11 @@ static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
 	struct irqaction *action, **action_ptr;
 	unsigned long flags;
 
-	WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
+	if (in_interrupt()) {
+		WARN(1, "Trying to free IRQ %d (dev_id %p) from IRQ context!\n",
+				irq, dev_id);
+		return NULL;
+	}
 
 	mutex_lock(&desc->request_mutex);
 	chip_bus_lock(desc);
@@ -1705,10 +1709,11 @@ static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
 		action = *action_ptr;
 
 		if (!action) {
-			WARN(1, "Trying to free already-free IRQ %d\n", irq);
 			raw_spin_unlock_irqrestore(&desc->lock, flags);
 			chip_bus_sync_unlock(desc);
 			mutex_unlock(&desc->request_mutex);
+			WARN(1, "Trying to free already-free IRQ %d (dev_id %p)\n",
+					irq, dev_id);
 			return NULL;
 		}
 
@@ -2286,7 +2291,11 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
 	struct irqaction *action;
 	unsigned long flags;
 
-	WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
+	if (in_interrupt()) {
+		WARN(1, "Trying to free IRQ %d (dev_id %p) from IRQ context!\n",
+				irq, dev_id);
+		return NULL;
+	}
 
 	if (!desc)
 		return NULL;
@@ -2295,14 +2304,17 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
 
 	action = desc->action;
 	if (!action || action->percpu_dev_id != dev_id) {
-		WARN(1, "Trying to free already-free IRQ %d\n", irq);
-		goto bad;
+		raw_spin_unlock_irqrestore(&desc->lock, flags);
+		WARN(1, "Trying to free already-free IRQ (dev_id %p) %d\n",
+				irq, dev_id);
+		return NULL;
 	}
 
 	if (!cpumask_empty(desc->percpu_enabled)) {
-		WARN(1, "percpu IRQ %d still enabled on CPU%d!\n",
-		     irq, cpumask_first(desc->percpu_enabled));
-		goto bad;
+		raw_spin_unlock_irqrestore(&desc->lock, flags);
+		WARN(1, "percpu IRQ %d (dev_id %p) still enabled on CPU%d!\n",
+		     irq, dev_id, cpumask_first(desc->percpu_enabled));
+		return NULL;
 	}
 
 	/* Found it - now remove it from the list of entries: */
@@ -2317,10 +2329,6 @@ static struct irqaction *__free_percpu_irq(unsigned int irq, void __percpu *dev_
 	irq_chip_pm_put(&desc->irq_data);
 	module_put(desc->owner);
 	return action;
-
-bad:
-	raw_spin_unlock_irqrestore(&desc->lock, flags);
-	return NULL;
 }
 
 /**
-- 
1.8.3.1



* [PATCH v2 2/3] genirq: introduce update_irq_devid()
  2019-08-08 12:07 [PATCH 0/2] introduce update_irq_devid and optimize VFIO irq ops Ben Luo
                   ` (3 preceding siblings ...)
  2019-08-12  8:52 ` [PATCH v2 1/3] genirq: enhance error recovery code in free irq Ben Luo
@ 2019-08-12  8:52 ` Ben Luo
  2019-08-12  8:52 ` [PATCH v2 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops Ben Luo
  5 siblings, 0 replies; 11+ messages in thread
From: Ben Luo @ 2019-08-12  8:52 UTC (permalink / raw)
  To: tglx, alex.williamson, linux-kernel
  Cc: tao.ma, gerry, nanhai.zou, linyunsheng

Sometimes, only the dev_id field of an irqaction needs to be changed.
E.g. a KVM VM with a device passed through via VFIO may switch its irq
injection path between KVM irqfd and userspace eventfd. These two paths
share the same irq number and handler for the same vector of a device;
they differ only in the dev_id referencing the respective fd's context.

So, instead of freeing and re-requesting the irq, only update the dev_id
of the irqaction. This narrows the window in which a new irq (and IRTE,
if any) has to be set up, and also gains some performance benefit.

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
Reviewed-by: Liu Jiang <gerry@linux.alibaba.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/interrupt.h |  3 ++
 kernel/irq/manage.c       | 79 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5b8328a..6060c5a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -172,6 +172,9 @@ struct irqaction {
 request_percpu_nmi(unsigned int irq, irq_handler_t handler,
 		   const char *devname, void __percpu *dev);
 
+extern int __must_check
+update_irq_devid(unsigned int irq, void *dev_id, void *new_dev_id);
+
 extern const void *free_irq(unsigned int, void *);
 extern void free_percpu_irq(unsigned int, void __percpu *);
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index b97ee5e..4c34a23 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -2064,6 +2064,85 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
 EXPORT_SYMBOL(request_threaded_irq);
 
 /**
+ *	update_irq_devid - update irq dev_id to a new one
+ *	@irq: Interrupt line to update
+ *	@dev_id: A cookie to find the irqaction to update
+ *	@new_dev_id: New cookie passed to the handler function
+ *
+ *	Sometimes, only the cookie data needs to be changed.
+ *	Instead of freeing and re-requesting the irq, only update
+ *	dev_id here. This not only gains some performance benefit
+ *	but also reduces the risk of losing an interrupt.
+ *
+ *	E.g. irq affinity setting in a VM with a VFIO passthru
+ *	device is carried out in a free-then-request-irq way
+ *	for lack of this very function. The free_irq()
+ *	eventually triggers deactivation of the IR domain,
+ *	which cleans up the IRTE. There is a gap before
+ *	request_irq() finally sets up the IRTE again. In this
+ *	gap, an in-flight irq buffered in the hardware layer
+ *	may trigger a DMAR fault and be lost.
+ *
+ *	Returns 0 on success and a negative error code on
+ *	failure.
+ */
+int update_irq_devid(unsigned int irq, void *dev_id, void *new_dev_id)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	struct irqaction *action, **action_ptr;
+	unsigned long flags;
+
+	if (in_interrupt()) {
+		WARN(1, "Trying to update IRQ %d (dev_id %p to %p) \
+				from IRQ context!\n", irq, dev_id, new_dev_id);
+		return -EPERM;
+	}
+
+	if (!desc)
+		return -EINVAL;
+
+	chip_bus_lock(desc);
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	/*
+	 * There can be multiple actions per IRQ descriptor, find the right
+	 * one based on the dev_id:
+	 */
+	action_ptr = &desc->action;
+	for (;;) {
+		action = *action_ptr;
+
+		if (!action) {
+			raw_spin_unlock_irqrestore(&desc->lock, flags);
+			chip_bus_sync_unlock(desc);
+			WARN(1, "Trying to update already-free IRQ %d \
+					(dev_id %p to %p)\n",
+					irq, dev_id, new_dev_id);
+			return -ENXIO;
+		}
+
+		if (action->dev_id == dev_id) {
+			action->dev_id = new_dev_id;
+			break;
+		}
+		action_ptr = &action->next;
+	}
+
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+	chip_bus_sync_unlock(desc);
+
+	/*
+	 * Make sure it's not being used on another CPU:
+	 * there is a risk of use-after-free for the old *dev_id
+	 * if it is freed shortly after this function returns.
+	 */
+	synchronize_irq(irq);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(update_irq_devid);
+
+/**
  *	request_any_context_irq - allocate an interrupt line
  *	@irq: Interrupt line to allocate
  *	@handler: Function to be called when the IRQ occurs.
-- 
1.8.3.1



* [PATCH v2 3/3] vfio_pci: make use of update_irq_devid and optimize irq ops
  2019-08-08 12:07 [PATCH 0/2] introduce update_irq_devid and optimize VFIO irq ops Ben Luo
                   ` (4 preceding siblings ...)
  2019-08-12  8:52 ` [PATCH v2 2/3] genirq: introduce update_irq_devid() Ben Luo
@ 2019-08-12  8:52 ` Ben Luo
  5 siblings, 0 replies; 11+ messages in thread
From: Ben Luo @ 2019-08-12  8:52 UTC (permalink / raw)
  To: tglx, alex.williamson, linux-kernel
  Cc: tao.ma, gerry, nanhai.zou, linyunsheng

When userspace (e.g. qemu) triggers a switch between KVM irqfd and
userspace eventfd, only the dev_id of the irq action (i.e. the
"trigger" in this patch's context) changes, yet the current code
performs a free-then-request-irq cycle. Likewise, setting irq
affinity in the VM also triggers a free-then-request-irq cycle,
which actually changes nothing but fires a producer re-registration
to update the IRTE in case posted interrupts are in use.

This patch makes use of update_irq_devid() and optimizes both cases
above, which reduces the risk of losing an interrupt and also cuts
some overhead.

Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
Reviewed-by: Liu Jiang <gerry@linux.alibaba.com>
Reviewed-by: Zou Nanhai <nanhai.zou@linux.alibaba.com>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 drivers/vfio/pci/vfio_pci_intrs.c | 99 ++++++++++++++++++++++++---------------
 1 file changed, 61 insertions(+), 38 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 3fa3f72..541153d 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -284,70 +284,93 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
 static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 				      int vector, int fd, bool msix)
 {
+	struct eventfd_ctx *trigger = NULL;
 	struct pci_dev *pdev = vdev->pdev;
-	struct eventfd_ctx *trigger;
 	int irq, ret;
 
 	if (vector < 0 || vector >= vdev->num_ctx)
 		return -EINVAL;
 
+	if (fd >= 0) {
+		trigger = eventfd_ctx_fdget(fd);
+		if (IS_ERR(trigger))
+			return PTR_ERR(trigger);
+	}
+
 	irq = pci_irq_vector(pdev, vector);
 
 	if (vdev->ctx[vector].trigger) {
-		free_irq(irq, vdev->ctx[vector].trigger);
-		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
-		kfree(vdev->ctx[vector].name);
-		eventfd_ctx_put(vdev->ctx[vector].trigger);
-		vdev->ctx[vector].trigger = NULL;
+		if (vdev->ctx[vector].trigger == trigger) {
+			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+		} else if (trigger) {
+			ret = update_irq_devid(irq,
+					vdev->ctx[vector].trigger, trigger);
+			if (unlikely(ret)) {
+				dev_info(&pdev->dev,
+						"update_irq_devid %d (token %p) fails: %d\n",
+						irq, vdev->ctx[vector].trigger, ret);
+				eventfd_ctx_put(trigger);
+				return ret;
+			}
+			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+			eventfd_ctx_put(vdev->ctx[vector].trigger);
+			vdev->ctx[vector].producer.token = trigger;
+			vdev->ctx[vector].trigger = trigger;
+		} else {
+			free_irq(irq, vdev->ctx[vector].trigger);
+			irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+			kfree(vdev->ctx[vector].name);
+			eventfd_ctx_put(vdev->ctx[vector].trigger);
+			vdev->ctx[vector].trigger = NULL;
+		}
 	}
 
 	if (fd < 0)
 		return 0;
 
-	vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
-					   msix ? "x" : "", vector,
-					   pci_name(pdev));
-	if (!vdev->ctx[vector].name)
-		return -ENOMEM;
+	if (!vdev->ctx[vector].trigger) {
+		vdev->ctx[vector].name = kasprintf(GFP_KERNEL, "vfio-msi%s[%d](%s)",
+						   msix ? "x" : "", vector,
+						   pci_name(pdev));
+		if (!vdev->ctx[vector].name) {
+			eventfd_ctx_put(trigger);
+			return -ENOMEM;
+		}
 
-	trigger = eventfd_ctx_fdget(fd);
-	if (IS_ERR(trigger)) {
-		kfree(vdev->ctx[vector].name);
-		return PTR_ERR(trigger);
-	}
+		/*
+		 * The MSIx vector table resides in device memory which may be cleared
+		 * via backdoor resets. We don't allow direct access to the vector
+		 * table so even if a userspace driver attempts to save/restore around
+		 * such a reset it would be unsuccessful. To avoid this, restore the
+		 * cached value of the message prior to enabling.
+		 */
+		if (msix) {
+			struct msi_msg msg;
 
-	/*
-	 * The MSIx vector table resides in device memory which may be cleared
-	 * via backdoor resets. We don't allow direct access to the vector
-	 * table so even if a userspace driver attempts to save/restore around
-	 * such a reset it would be unsuccessful. To avoid this, restore the
-	 * cached value of the message prior to enabling.
-	 */
-	if (msix) {
-		struct msi_msg msg;
+			get_cached_msi_msg(irq, &msg);
+			pci_write_msi_msg(irq, &msg);
+		}
 
-		get_cached_msi_msg(irq, &msg);
-		pci_write_msi_msg(irq, &msg);
-	}
+		ret = request_irq(irq, vfio_msihandler, 0,
+				  vdev->ctx[vector].name, trigger);
+		if (ret) {
+			kfree(vdev->ctx[vector].name);
+			eventfd_ctx_put(trigger);
+			return ret;
+		}
 
-	ret = request_irq(irq, vfio_msihandler, 0,
-			  vdev->ctx[vector].name, trigger);
-	if (ret) {
-		kfree(vdev->ctx[vector].name);
-		eventfd_ctx_put(trigger);
-		return ret;
+		vdev->ctx[vector].producer.token = trigger;
+		vdev->ctx[vector].producer.irq = irq;
+		vdev->ctx[vector].trigger = trigger;
 	}
 
-	vdev->ctx[vector].producer.token = trigger;
-	vdev->ctx[vector].producer.irq = irq;
+	/* always update irte for posted mode */
 	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
 	if (unlikely(ret))
 		dev_info(&pdev->dev,
 		"irq bypass producer (token %p) registration fails: %d\n",
 		vdev->ctx[vector].producer.token, ret);
 
-	vdev->ctx[vector].trigger = trigger;
-
 	return 0;
 }
 
-- 
1.8.3.1


