kvm.vger.kernel.org archive mirror
* [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts
@ 2023-03-15 20:59 Reinette Chatre
  2023-03-15 20:59 ` [RFC PATCH 1/8] vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable Reinette Chatre
                   ` (8 more replies)
  0 siblings, 9 replies; 20+ messages in thread
From: Reinette Chatre @ 2023-03-15 20:59 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

== Cover letter ==

Qemu allocates interrupts incrementally at the time the guest unmasks an
interrupt, for example each time a Linux guest runs request_irq().

Dynamic allocation of MSI-X interrupts was not possible until v6.2 [1].
This prompted Qemu, when allocating a new interrupt, to first release all
previously allocated interrupts (which includes disabling MSI-X) and then
re-allocate all interrupts, including the new one.
Please see [2] for a detailed discussion about this issue.

Releasing and re-allocating interrupts may be acceptable if all
interrupts are unmasked during device initialization. If unmasking of
interrupts occurs at runtime, this may result in lost interrupts.
For example, consider an accelerator device with multiple work queues,
each work queue having a dedicated interrupt. A work queue can be
enabled at any time with its associated interrupt unmasked while other
work queues are already active. Having all interrupts released and MSI-X
disabled to enable the new work queue will impact active work queues.

This series builds on the recent interrupt sub-system core changes
that added support for dynamic MSI-X allocation after initial MSI-X
enabling.
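
For reference, a rough sketch of how a driver can use the interfaces
added by [1] (an illustration only, not code from this series; pdev,
the vector index, and the error handling are arbitrary):

	struct msi_map map;
	int ret;

	/* MSI-X is first enabled with an initial set of vectors. */
	ret = pci_alloc_irq_vectors(pdev, 1, 4, PCI_IRQ_MSIX);
	if (ret < 0)
		return ret;

	/* Later, with MSI-X still enabled, add one vector at index 5. */
	map = pci_msix_alloc_irq_at(pdev, 5, NULL);
	if (map.index < 0)
		return map.index;	/* negative errno on failure */
	/* map.virq is the Linux IRQ number to pass to request_irq(). */

	/* The vector can later be released without disabling MSI-X. */
	pci_msix_free_irq(pdev, map);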

Add support for dynamic MSI-X allocation to vfio-pci. A flag
indicating lack of support for dynamic allocation already exists:
VFIO_IRQ_INFO_NORESIZE, and it has always been set for MSI and MSI-X.
With support for dynamic MSI-X the flag is cleared for MSI-X, enabling
Qemu to modify its behavior.
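
For completeness, a minimal user space sketch (not Qemu code; device_fd
is assumed to be an open vfio device file descriptor, error handling
trimmed) of how the flag can be queried:

	struct vfio_irq_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_MSIX_IRQ_INDEX,
	};

	if (ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &info))
		return -errno;

	bool dynamic = !(info.flags & VFIO_IRQ_INFO_NORESIZE);
	/*
	 * dynamic == true: vectors can be added one at a time while
	 * MSI-X stays enabled. Otherwise MSI-X must be disabled and
	 * re-enabled to grow the vector count.
	 */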

== Why is this an RFC? ==

vfio support for dynamic MSI-X needs to work with existing user space
as well as upcoming user space that takes advantage of this feature.
I would appreciate guidance on the expectations and requirements
surrounding error handling when considering existing user space.

For example, consider the following scenario:
Start: A passthrough device supports 10 MSI-X interrupts
	(0 to 9), and existing Qemu has allocated interrupts 0 to 4.

Scenario: Qemu (hypothetically) attempts to associate a new action to
	interrupts 0 to 7. Early checking of this range in
	vfio_set_irqs_validate_and_prepare() will pass since it
	is a valid range for the device. What happens after the
	early checking is considered next:

Current behavior (before this series): Since the provided range, 0 - 7,
	exceeds the allocated range, no action will be taken on existing
	allocated interrupts 0 - 4 and the Qemu request will fail without
	making any state changes.

New behavior (with this series): Since vfio supports dynamic MSI-X, new
	interrupts will be allocated for vectors 5, 6, and 7. Please note
	that this changes the behavior encountered by unmodified Qemu: new
	interrupts are allocated instead of an error being returned. Even more,
	since the range provided by Qemu spans 0 - 7, a failure during
	allocation of 5, 6, or 7 will result in release of the entire range.

This series aims to maintain existing error behavior for MSI (please see
"vfio/pci: Remove interrupt context counter") but I would appreciate your
guidance on whether existing error behavior should be maintained for MSI-X
and how to do so if it is a requirement.

Any feedback is appreciated.

Reinette

[1] commit 34026364df8e ("PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X")
[2] https://lore.kernel.org/kvm/MWHPR11MB188603D0D809C1079F5817DC8C099@MWHPR11MB1886.namprd11.prod.outlook.com/#t

Reinette Chatre (8):
  vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable
  vfio/pci: Remove negative check on unsigned vector
  vfio/pci: Prepare for dynamic interrupt context storage
  vfio/pci: Use xarray for interrupt context storage
  vfio/pci: Remove interrupt context counter
  vfio/pci: Move to single error path
  vfio/pci: Support dynamic MSI-X
  vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X

 drivers/vfio/pci/vfio_pci_core.c  |   3 +-
 drivers/vfio/pci/vfio_pci_intrs.c | 376 ++++++++++++++++++++++--------
 include/linux/vfio_pci_core.h     |   3 +-
 include/uapi/linux/vfio.h         |   3 +
 4 files changed, 286 insertions(+), 99 deletions(-)


base-commit: eeac8ede17557680855031c6f305ece2378af326
-- 
2.34.1



* [RFC PATCH 1/8] vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable
  2023-03-15 20:59 [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
@ 2023-03-15 20:59 ` Reinette Chatre
  2023-03-15 20:59 ` [RFC PATCH 2/8] vfio/pci: Remove negative check on unsigned vector Reinette Chatre
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Reinette Chatre @ 2023-03-15 20:59 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

vfio_msi_disable() releases all previously allocated state
associated with each interrupt before disabling MSI/MSI-X.

vfio_msi_disable() iterates twice over the interrupt state:
first directly with a for loop to do virqfd cleanup, followed
by another for loop within vfio_msi_set_block() that releases
the interrupt and its state using vfio_msi_set_vector_signal().

Simplify interrupt cleanup by iterating over allocated interrupts
once.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 drivers/vfio/pci/vfio_pci_intrs.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index bffb0741518b..6a9c6a143cc3 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -426,10 +426,9 @@ static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 	for (i = 0; i < vdev->num_ctx; i++) {
 		vfio_virqfd_disable(&vdev->ctx[i].unmask);
 		vfio_virqfd_disable(&vdev->ctx[i].mask);
+		vfio_msi_set_vector_signal(vdev, i, -1, msix);
 	}
 
-	vfio_msi_set_block(vdev, 0, vdev->num_ctx, NULL, msix);
-
 	cmd = vfio_pci_memory_lock_and_enable(vdev);
 	pci_free_irq_vectors(pdev);
 	vfio_pci_memory_unlock_and_restore(vdev, cmd);
-- 
2.34.1



* [RFC PATCH 2/8] vfio/pci: Remove negative check on unsigned vector
  2023-03-15 20:59 [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
  2023-03-15 20:59 ` [RFC PATCH 1/8] vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable Reinette Chatre
@ 2023-03-15 20:59 ` Reinette Chatre
  2023-03-15 20:59 ` [RFC PATCH 3/8] vfio/pci: Prepare for dynamic interrupt context storage Reinette Chatre
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Reinette Chatre @ 2023-03-15 20:59 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

User space provides the vector as an unsigned int that is checked
early for validity (vfio_set_irqs_validate_and_prepare()).

A later negative check of the provided vector is not necessary.

Remove the negative check and ensure that the vector is
consistently typed as an unsigned int.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 drivers/vfio/pci/vfio_pci_intrs.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 6a9c6a143cc3..3f64ccdce69f 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -317,14 +317,14 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
 }
 
 static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
-				      int vector, int fd, bool msix)
+				      unsigned int vector, int fd, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	struct eventfd_ctx *trigger;
 	int irq, ret;
 	u16 cmd;
 
-	if (vector < 0 || vector >= vdev->num_ctx)
+	if (vector >= vdev->num_ctx)
 		return -EINVAL;
 
 	irq = pci_irq_vector(pdev, vector);
@@ -399,7 +399,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
 			      unsigned count, int32_t *fds, bool msix)
 {
-	int i, j, ret = 0;
+	int i, ret = 0;
+	unsigned int j;
 
 	if (start >= vdev->num_ctx || start + count > vdev->num_ctx)
 		return -EINVAL;
@@ -420,7 +421,7 @@ static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
 static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
-	int i;
+	unsigned int i;
 	u16 cmd;
 
 	for (i = 0; i < vdev->num_ctx; i++) {
@@ -542,7 +543,7 @@ static int vfio_pci_set_msi_trigger(struct vfio_pci_core_device *vdev,
 				    unsigned index, unsigned start,
 				    unsigned count, uint32_t flags, void *data)
 {
-	int i;
+	unsigned int i;
 	bool msix = (index == VFIO_PCI_MSIX_IRQ_INDEX) ? true : false;
 
 	if (irq_is(vdev, index) && !count && (flags & VFIO_IRQ_SET_DATA_NONE)) {
-- 
2.34.1



* [RFC PATCH 3/8] vfio/pci: Prepare for dynamic interrupt context storage
  2023-03-15 20:59 [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
  2023-03-15 20:59 ` [RFC PATCH 1/8] vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable Reinette Chatre
  2023-03-15 20:59 ` [RFC PATCH 2/8] vfio/pci: Remove negative check on unsigned vector Reinette Chatre
@ 2023-03-15 20:59 ` Reinette Chatre
  2023-03-15 20:59 ` [RFC PATCH 4/8] vfio/pci: Use xarray for " Reinette Chatre
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Reinette Chatre @ 2023-03-15 20:59 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

Interrupt context storage is statically allocated at the time
interrupts are enabled. Following allocation, the interrupt
context is managed by directly accessing the elements of the
array using the vector as index.

It is possible to allocate additional MSI-X vectors after
MSI-X has been enabled. Dynamic storage of interrupt context
is needed to support adding new MSI-X vectors after initial
enabling.

Replace direct access of array elements with pointers to the
array elements. Doing so reduces the impact of moving to a new data
structure. Move interactions with the array into helpers to largely
contain the changes needed to transition to a dynamic data
structure.

No functional change intended.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 drivers/vfio/pci/vfio_pci_intrs.c | 206 ++++++++++++++++++++----------
 1 file changed, 140 insertions(+), 66 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 3f64ccdce69f..ece371ebea00 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -48,6 +48,31 @@ static bool is_irq_none(struct vfio_pci_core_device *vdev)
 		 vdev->irq_type == VFIO_PCI_MSIX_IRQ_INDEX);
 }
 
+static
+struct vfio_pci_irq_ctx *vfio_irq_ctx_get(struct vfio_pci_core_device *vdev,
+					  unsigned long index)
+{
+	if (index >= vdev->num_ctx)
+		return NULL;
+	return &vdev->ctx[index];
+}
+
+static void vfio_irq_ctx_free_all(struct vfio_pci_core_device *vdev)
+{
+	kfree(vdev->ctx);
+}
+
+static int vfio_irq_ctx_alloc_num(struct vfio_pci_core_device *vdev,
+				  unsigned long num)
+{
+	vdev->ctx = kcalloc(num, sizeof(struct vfio_pci_irq_ctx),
+			    GFP_KERNEL_ACCOUNT);
+	if (!vdev->ctx)
+		return -ENOMEM;
+
+	return 0;
+}
+
 /*
  * INTx
  */
@@ -55,17 +80,28 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
 {
 	struct vfio_pci_core_device *vdev = opaque;
 
-	if (likely(is_intx(vdev) && !vdev->virq_disabled))
-		eventfd_signal(vdev->ctx[0].trigger, 1);
+	if (likely(is_intx(vdev) && !vdev->virq_disabled)) {
+		struct vfio_pci_irq_ctx *ctx;
+
+		ctx = vfio_irq_ctx_get(vdev, 0);
+		if (!ctx)
+			return;
+		eventfd_signal(ctx->trigger, 1);
+	}
 }
 
 /* Returns true if the INTx vfio_pci_irq_ctx.masked value is changed. */
 bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_irq_ctx *ctx;
 	unsigned long flags;
 	bool masked_changed = false;
 
+	ctx = vfio_irq_ctx_get(vdev, 0);
+	if (!ctx)
+		return masked_changed;
+
 	spin_lock_irqsave(&vdev->irqlock, flags);
 
 	/*
@@ -77,7 +113,7 @@ bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 	if (unlikely(!is_intx(vdev))) {
 		if (vdev->pci_2_3)
 			pci_intx(pdev, 0);
-	} else if (!vdev->ctx[0].masked) {
+	} else if (!ctx->masked) {
 		/*
 		 * Can't use check_and_mask here because we always want to
 		 * mask, not just when something is pending.
@@ -87,7 +123,7 @@ bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 		else
 			disable_irq_nosync(pdev->irq);
 
-		vdev->ctx[0].masked = true;
+		ctx->masked = true;
 		masked_changed = true;
 	}
 
@@ -105,9 +141,14 @@ static int vfio_pci_intx_unmask_handler(void *opaque, void *unused)
 {
 	struct vfio_pci_core_device *vdev = opaque;
 	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_irq_ctx *ctx;
 	unsigned long flags;
 	int ret = 0;
 
+	ctx = vfio_irq_ctx_get(vdev, 0);
+	if (!ctx)
+		return ret;
+
 	spin_lock_irqsave(&vdev->irqlock, flags);
 
 	/*
@@ -117,7 +158,7 @@ static int vfio_pci_intx_unmask_handler(void *opaque, void *unused)
 	if (unlikely(!is_intx(vdev))) {
 		if (vdev->pci_2_3)
 			pci_intx(pdev, 1);
-	} else if (vdev->ctx[0].masked && !vdev->virq_disabled) {
+	} else if (ctx->masked && !vdev->virq_disabled) {
 		/*
 		 * A pending interrupt here would immediately trigger,
 		 * but we can avoid that overhead by just re-sending
@@ -129,7 +170,7 @@ static int vfio_pci_intx_unmask_handler(void *opaque, void *unused)
 		} else
 			enable_irq(pdev->irq);
 
-		vdev->ctx[0].masked = (ret > 0);
+		ctx->masked = (ret > 0);
 	}
 
 	spin_unlock_irqrestore(&vdev->irqlock, flags);
@@ -146,18 +187,23 @@ void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev)
 static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
 {
 	struct vfio_pci_core_device *vdev = dev_id;
+	struct vfio_pci_irq_ctx *ctx;
 	unsigned long flags;
 	int ret = IRQ_NONE;
 
+	ctx = vfio_irq_ctx_get(vdev, 0);
+	if (!ctx)
+		return ret;
+
 	spin_lock_irqsave(&vdev->irqlock, flags);
 
 	if (!vdev->pci_2_3) {
 		disable_irq_nosync(vdev->pdev->irq);
-		vdev->ctx[0].masked = true;
+		ctx->masked = true;
 		ret = IRQ_HANDLED;
-	} else if (!vdev->ctx[0].masked &&  /* may be shared */
+	} else if (!ctx->masked &&  /* may be shared */
 		   pci_check_and_mask_intx(vdev->pdev)) {
-		vdev->ctx[0].masked = true;
+		ctx->masked = true;
 		ret = IRQ_HANDLED;
 	}
 
@@ -171,15 +217,24 @@ static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
 
 static int vfio_intx_enable(struct vfio_pci_core_device *vdev)
 {
+	struct vfio_pci_irq_ctx *ctx;
+	int ret;
+
 	if (!is_irq_none(vdev))
 		return -EINVAL;
 
 	if (!vdev->pdev->irq)
 		return -ENODEV;
 
-	vdev->ctx = kzalloc(sizeof(struct vfio_pci_irq_ctx), GFP_KERNEL_ACCOUNT);
-	if (!vdev->ctx)
-		return -ENOMEM;
+	ret = vfio_irq_ctx_alloc_num(vdev, 1);
+	if (ret)
+		return ret;
+
+	ctx = vfio_irq_ctx_get(vdev, 0);
+	if (!ctx) {
+		vfio_irq_ctx_free_all(vdev);
+		return -EINVAL;
+	}
 
 	vdev->num_ctx = 1;
 
@@ -189,9 +244,9 @@ static int vfio_intx_enable(struct vfio_pci_core_device *vdev)
 	 * here, non-PCI-2.3 devices will have to wait until the
 	 * interrupt is enabled.
 	 */
-	vdev->ctx[0].masked = vdev->virq_disabled;
+	ctx->masked = vdev->virq_disabled;
 	if (vdev->pci_2_3)
-		pci_intx(vdev->pdev, !vdev->ctx[0].masked);
+		pci_intx(vdev->pdev, !ctx->masked);
 
 	vdev->irq_type = VFIO_PCI_INTX_IRQ_INDEX;
 
@@ -202,41 +257,46 @@ static int vfio_intx_set_signal(struct vfio_pci_core_device *vdev, int fd)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	unsigned long irqflags = IRQF_SHARED;
+	struct vfio_pci_irq_ctx *ctx;
 	struct eventfd_ctx *trigger;
 	unsigned long flags;
 	int ret;
 
-	if (vdev->ctx[0].trigger) {
+	ctx = vfio_irq_ctx_get(vdev, 0);
+	if (!ctx)
+		return -EINVAL;
+
+	if (ctx->trigger) {
 		free_irq(pdev->irq, vdev);
-		kfree(vdev->ctx[0].name);
-		eventfd_ctx_put(vdev->ctx[0].trigger);
-		vdev->ctx[0].trigger = NULL;
+		kfree(ctx->name);
+		eventfd_ctx_put(ctx->trigger);
+		ctx->trigger = NULL;
 	}
 
 	if (fd < 0) /* Disable only */
 		return 0;
 
-	vdev->ctx[0].name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-intx(%s)",
-				      pci_name(pdev));
-	if (!vdev->ctx[0].name)
+	ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-intx(%s)",
+			      pci_name(pdev));
+	if (!ctx->name)
 		return -ENOMEM;
 
 	trigger = eventfd_ctx_fdget(fd);
 	if (IS_ERR(trigger)) {
-		kfree(vdev->ctx[0].name);
+		kfree(ctx->name);
 		return PTR_ERR(trigger);
 	}
 
-	vdev->ctx[0].trigger = trigger;
+	ctx->trigger = trigger;
 
 	if (!vdev->pci_2_3)
 		irqflags = 0;
 
 	ret = request_irq(pdev->irq, vfio_intx_handler,
-			  irqflags, vdev->ctx[0].name, vdev);
+			  irqflags, ctx->name, vdev);
 	if (ret) {
-		vdev->ctx[0].trigger = NULL;
-		kfree(vdev->ctx[0].name);
+		ctx->trigger = NULL;
+		kfree(ctx->name);
 		eventfd_ctx_put(trigger);
 		return ret;
 	}
@@ -246,7 +306,7 @@ static int vfio_intx_set_signal(struct vfio_pci_core_device *vdev, int fd)
 	 * disable_irq won't.
 	 */
 	spin_lock_irqsave(&vdev->irqlock, flags);
-	if (!vdev->pci_2_3 && vdev->ctx[0].masked)
+	if (!vdev->pci_2_3 && ctx->masked)
 		disable_irq_nosync(pdev->irq);
 	spin_unlock_irqrestore(&vdev->irqlock, flags);
 
@@ -255,12 +315,17 @@ static int vfio_intx_set_signal(struct vfio_pci_core_device *vdev, int fd)
 
 static void vfio_intx_disable(struct vfio_pci_core_device *vdev)
 {
-	vfio_virqfd_disable(&vdev->ctx[0].unmask);
-	vfio_virqfd_disable(&vdev->ctx[0].mask);
+	struct vfio_pci_irq_ctx *ctx;
+
+	ctx = vfio_irq_ctx_get(vdev, 0);
+	if (ctx) {
+		vfio_virqfd_disable(&ctx->unmask);
+		vfio_virqfd_disable(&ctx->mask);
+	}
 	vfio_intx_set_signal(vdev, -1);
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
 	vdev->num_ctx = 0;
-	kfree(vdev->ctx);
+	vfio_irq_ctx_free_all(vdev);
 }
 
 /*
@@ -284,10 +349,9 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
 	if (!is_irq_none(vdev))
 		return -EINVAL;
 
-	vdev->ctx = kcalloc(nvec, sizeof(struct vfio_pci_irq_ctx),
-			    GFP_KERNEL_ACCOUNT);
-	if (!vdev->ctx)
-		return -ENOMEM;
+	ret = vfio_irq_ctx_alloc_num(vdev, nvec);
+	if (ret)
+		return ret;
 
 	/* return the number of supported vectors if we can't get all: */
 	cmd = vfio_pci_memory_lock_and_enable(vdev);
@@ -296,7 +360,7 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
 		if (ret > 0)
 			pci_free_irq_vectors(pdev);
 		vfio_pci_memory_unlock_and_restore(vdev, cmd);
-		kfree(vdev->ctx);
+		vfio_irq_ctx_free_all(vdev);
 		return ret;
 	}
 	vfio_pci_memory_unlock_and_restore(vdev, cmd);
@@ -320,6 +384,7 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 				      unsigned int vector, int fd, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_irq_ctx *ctx;
 	struct eventfd_ctx *trigger;
 	int irq, ret;
 	u16 cmd;
@@ -327,33 +392,33 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 	if (vector >= vdev->num_ctx)
 		return -EINVAL;
 
+	ctx = vfio_irq_ctx_get(vdev, vector);
+	if (!ctx)
+		return -EINVAL;
 	irq = pci_irq_vector(pdev, vector);
 
-	if (vdev->ctx[vector].trigger) {
-		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+	if (ctx->trigger) {
+		irq_bypass_unregister_producer(&ctx->producer);
 
 		cmd = vfio_pci_memory_lock_and_enable(vdev);
-		free_irq(irq, vdev->ctx[vector].trigger);
+		free_irq(irq, ctx->trigger);
 		vfio_pci_memory_unlock_and_restore(vdev, cmd);
-
-		kfree(vdev->ctx[vector].name);
-		eventfd_ctx_put(vdev->ctx[vector].trigger);
-		vdev->ctx[vector].trigger = NULL;
+		kfree(ctx->name);
+		eventfd_ctx_put(ctx->trigger);
+		ctx->trigger = NULL;
 	}
 
 	if (fd < 0)
 		return 0;
 
-	vdev->ctx[vector].name = kasprintf(GFP_KERNEL_ACCOUNT,
-					   "vfio-msi%s[%d](%s)",
-					   msix ? "x" : "", vector,
-					   pci_name(pdev));
-	if (!vdev->ctx[vector].name)
+	ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-msi%s[%d](%s)",
+			      msix ? "x" : "", vector, pci_name(pdev));
+	if (!ctx->name)
 		return -ENOMEM;
 
 	trigger = eventfd_ctx_fdget(fd);
 	if (IS_ERR(trigger)) {
-		kfree(vdev->ctx[vector].name);
+		kfree(ctx->name);
 		return PTR_ERR(trigger);
 	}
 
@@ -372,26 +437,25 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 		pci_write_msi_msg(irq, &msg);
 	}
 
-	ret = request_irq(irq, vfio_msihandler, 0,
-			  vdev->ctx[vector].name, trigger);
+	ret = request_irq(irq, vfio_msihandler, 0, ctx->name, trigger);
 	vfio_pci_memory_unlock_and_restore(vdev, cmd);
 	if (ret) {
-		kfree(vdev->ctx[vector].name);
+		kfree(ctx->name);
 		eventfd_ctx_put(trigger);
 		return ret;
 	}
 
-	vdev->ctx[vector].producer.token = trigger;
-	vdev->ctx[vector].producer.irq = irq;
-	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
+	ctx->producer.token = trigger;
+	ctx->producer.irq = irq;
+	ret = irq_bypass_register_producer(&ctx->producer);
 	if (unlikely(ret)) {
 		dev_info(&pdev->dev,
 		"irq bypass producer (token %p) registration fails: %d\n",
-		vdev->ctx[vector].producer.token, ret);
+		ctx->producer.token, ret);
 
-		vdev->ctx[vector].producer.token = NULL;
+		ctx->producer.token = NULL;
 	}
-	vdev->ctx[vector].trigger = trigger;
+	ctx->trigger = trigger;
 
 	return 0;
 }
@@ -421,13 +485,17 @@ static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
 static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_irq_ctx *ctx;
 	unsigned int i;
 	u16 cmd;
 
 	for (i = 0; i < vdev->num_ctx; i++) {
-		vfio_virqfd_disable(&vdev->ctx[i].unmask);
-		vfio_virqfd_disable(&vdev->ctx[i].mask);
-		vfio_msi_set_vector_signal(vdev, i, -1, msix);
+		ctx = vfio_irq_ctx_get(vdev, i);
+		if (ctx) {
+			vfio_virqfd_disable(&ctx->unmask);
+			vfio_virqfd_disable(&ctx->mask);
+			vfio_msi_set_vector_signal(vdev, i, -1, msix);
+		}
 	}
 
 	cmd = vfio_pci_memory_lock_and_enable(vdev);
@@ -443,7 +511,7 @@ static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
 	vdev->num_ctx = 0;
-	kfree(vdev->ctx);
+	vfio_irq_ctx_free_all(vdev);
 }
 
 /*
@@ -463,14 +531,18 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_core_device *vdev,
 		if (unmask)
 			vfio_pci_intx_unmask(vdev);
 	} else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		struct vfio_pci_irq_ctx *ctx = vfio_irq_ctx_get(vdev, 0);
 		int32_t fd = *(int32_t *)data;
+
+		if (!ctx)
+			return -EINVAL;
 		if (fd >= 0)
 			return vfio_virqfd_enable((void *) vdev,
 						  vfio_pci_intx_unmask_handler,
 						  vfio_send_intx_eventfd, NULL,
-						  &vdev->ctx[0].unmask, fd);
+						  &ctx->unmask, fd);
 
-		vfio_virqfd_disable(&vdev->ctx[0].unmask);
+		vfio_virqfd_disable(&ctx->unmask);
 	}
 
 	return 0;
@@ -543,6 +615,7 @@ static int vfio_pci_set_msi_trigger(struct vfio_pci_core_device *vdev,
 				    unsigned index, unsigned start,
 				    unsigned count, uint32_t flags, void *data)
 {
+	struct vfio_pci_irq_ctx *ctx;
 	unsigned int i;
 	bool msix = (index == VFIO_PCI_MSIX_IRQ_INDEX) ? true : false;
 
@@ -577,14 +650,15 @@ static int vfio_pci_set_msi_trigger(struct vfio_pci_core_device *vdev,
 		return -EINVAL;
 
 	for (i = start; i < start + count; i++) {
-		if (!vdev->ctx[i].trigger)
+		ctx = vfio_irq_ctx_get(vdev, i);
+		if (!ctx || !ctx->trigger)
 			continue;
 		if (flags & VFIO_IRQ_SET_DATA_NONE) {
-			eventfd_signal(vdev->ctx[i].trigger, 1);
+			eventfd_signal(ctx->trigger, 1);
 		} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
 			uint8_t *bools = data;
 			if (bools[i - start])
-				eventfd_signal(vdev->ctx[i].trigger, 1);
+				eventfd_signal(ctx->trigger, 1);
 		}
 	}
 	return 0;
-- 
2.34.1



* [RFC PATCH 4/8] vfio/pci: Use xarray for interrupt context storage
  2023-03-15 20:59 [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
                   ` (2 preceding siblings ...)
  2023-03-15 20:59 ` [RFC PATCH 3/8] vfio/pci: Prepare for dynamic interrupt context storage Reinette Chatre
@ 2023-03-15 20:59 ` Reinette Chatre
  2023-03-15 20:59 ` [RFC PATCH 5/8] vfio/pci: Remove interrupt context counter Reinette Chatre
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Reinette Chatre @ 2023-03-15 20:59 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

Interrupt context is statically allocated at the time interrupts
are enabled. Following allocation, the context is managed by
directly accessing the elements of the array using the vector
as index. The storage is released when interrupts are disabled.

It is possible to dynamically allocate a single MSI-X index
after MSI-X has been enabled. Dynamic storage of interrupt context
is needed to support this. Replace the interrupt context array with an
xarray (similar to what the core uses as the store for MSI descriptors)
that can support the dynamic expansion while maintaining the
convention of using the vector as the index.

Use the new data storage to allocate all elements at once and free all
elements together. Dynamic usage follows in a later patch.

Create the helpers with the understanding that, after initial MSI-X
enabling, it is only possible to allocate a single MSI-X index at
a time. The only time multiple MSI-X indices are allocated is during
initial MSI-X enabling, where a failure results in no allocations.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c  |  1 +
 drivers/vfio/pci/vfio_pci_intrs.c | 63 +++++++++++++++++++++++--------
 include/linux/vfio_pci_core.h     |  2 +-
 3 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..ae0e161c7fc9 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2102,6 +2102,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
 	INIT_LIST_HEAD(&vdev->vma_list);
 	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
 	init_rwsem(&vdev->memory_lock);
+	xa_init(&vdev->ctx);
 
 	return 0;
 }
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index ece371ebea00..bfcf5cb6b435 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -52,25 +52,59 @@ static
 struct vfio_pci_irq_ctx *vfio_irq_ctx_get(struct vfio_pci_core_device *vdev,
 					  unsigned long index)
 {
-	if (index >= vdev->num_ctx)
-		return NULL;
-	return &vdev->ctx[index];
+	return xa_load(&vdev->ctx, index);
 }
 
 static void vfio_irq_ctx_free_all(struct vfio_pci_core_device *vdev)
 {
-	kfree(vdev->ctx);
+	struct vfio_pci_irq_ctx *ctx;
+	unsigned long index;
+
+	xa_for_each(&vdev->ctx, index, ctx) {
+		xa_erase(&vdev->ctx, index);
+		kfree(ctx);
+	}
 }
 
+static int vfio_irq_ctx_alloc_single(struct vfio_pci_core_device *vdev,
+				     unsigned long index)
+{
+	struct vfio_pci_irq_ctx *ctx;
+	int ret;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
+	if (!ctx)
+		return -ENOMEM;
+
+	ret = xa_insert(&vdev->ctx, index, ctx, GFP_KERNEL_ACCOUNT);
+	if (ret) {
+		kfree(ctx);
+		return ret;
+	}
+
+	return 0;
+}
+
+/* Only called during initial interrupt enabling. Never after.  */
 static int vfio_irq_ctx_alloc_num(struct vfio_pci_core_device *vdev,
 				  unsigned long num)
 {
-	vdev->ctx = kcalloc(num, sizeof(struct vfio_pci_irq_ctx),
-			    GFP_KERNEL_ACCOUNT);
-	if (!vdev->ctx)
-		return -ENOMEM;
+	unsigned long index;
+	int ret;
+
+	WARN_ON(!xa_empty(&vdev->ctx));
+
+	for (index = 0; index < num; index++) {
+		ret = vfio_irq_ctx_alloc_single(vdev, index);
+		if (ret)
+			goto err;
+	}
 
 	return 0;
+
+err:
+	vfio_irq_ctx_free_all(vdev);
+	return ret;
 }
 
 /*
@@ -486,16 +520,13 @@ static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	struct vfio_pci_irq_ctx *ctx;
-	unsigned int i;
+	unsigned long i;
 	u16 cmd;
 
-	for (i = 0; i < vdev->num_ctx; i++) {
-		ctx = vfio_irq_ctx_get(vdev, i);
-		if (ctx) {
-			vfio_virqfd_disable(&ctx->unmask);
-			vfio_virqfd_disable(&ctx->mask);
-			vfio_msi_set_vector_signal(vdev, i, -1, msix);
-		}
+	xa_for_each(&vdev->ctx, i, ctx) {
+		vfio_virqfd_disable(&ctx->unmask);
+		vfio_virqfd_disable(&ctx->mask);
+		vfio_msi_set_vector_signal(vdev, i, -1, msix);
 	}
 
 	cmd = vfio_pci_memory_lock_and_enable(vdev);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 367fd79226a3..61d7873a3973 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -59,7 +59,7 @@ struct vfio_pci_core_device {
 	struct perm_bits	*msi_perm;
 	spinlock_t		irqlock;
 	struct mutex		igate;
-	struct vfio_pci_irq_ctx	*ctx;
+	struct xarray		ctx;
 	int			num_ctx;
 	int			irq_type;
 	int			num_regions;
-- 
2.34.1



* [RFC PATCH 5/8] vfio/pci: Remove interrupt context counter
  2023-03-15 20:59 [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
                   ` (3 preceding siblings ...)
  2023-03-15 20:59 ` [RFC PATCH 4/8] vfio/pci: Use xarray for " Reinette Chatre
@ 2023-03-15 20:59 ` Reinette Chatre
  2023-03-15 20:59 ` [RFC PATCH 6/8] vfio/pci: Move to single error path Reinette Chatre
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Reinette Chatre @ 2023-03-15 20:59 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

struct vfio_pci_core_device::num_ctx counts how many interrupt
contexts have been allocated. When all interrupt contexts are
allocated simultaneously num_ctx provides the upper bound of all
vectors that can be used as indices into the interrupt context
array.

With the upcoming support for dynamic MSI-X the number of
interrupt contexts does not necessarily span the range of allocated
interrupts. Consequently, num_ctx is no longer a trusted upper bound
for valid indices.

Stop using num_ctx to determine if a provided vector is valid, use
the existence of interrupt context directly.

Introduce a helper that ensures a provided interrupt range is
allocated before any user requested action is taken. This maintains
existing behavior (early exit without modifications) when user
space attempts to modify a range of vectors that includes unallocated
interrupts.

The checks that ensure that user space provides a range of vectors
that is valid for the device are untouched.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---

Existing behavior on error paths is not maintained for MSI-X
when adding support for dynamic MSI-X. Please see maintainer
comments associated with "vfio/pci: Support dynamic MSI-X".

 drivers/vfio/pci/vfio_pci_intrs.c | 30 ++++++++++++++++++++----------
 include/linux/vfio_pci_core.h     |  1 -
 2 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index bfcf5cb6b435..187a1ba34a16 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -107,6 +107,21 @@ static int vfio_irq_ctx_alloc_num(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+static bool vfio_irq_ctx_range_allocated(struct vfio_pci_core_device *vdev,
+					 unsigned int start, unsigned int count)
+{
+	struct vfio_pci_irq_ctx *ctx;
+	unsigned int i;
+
+	for (i = start; i < start + count; i++) {
+		ctx = xa_load(&vdev->ctx, i);
+		if (!ctx)
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * INTx
  */
@@ -270,8 +285,6 @@ static int vfio_intx_enable(struct vfio_pci_core_device *vdev)
 		return -EINVAL;
 	}
 
-	vdev->num_ctx = 1;
-
 	/*
 	 * If the virtual interrupt is masked, restore it.  Devices
 	 * supporting DisINTx can be masked at the hardware level
@@ -358,7 +371,6 @@ static void vfio_intx_disable(struct vfio_pci_core_device *vdev)
 	}
 	vfio_intx_set_signal(vdev, -1);
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
-	vdev->num_ctx = 0;
 	vfio_irq_ctx_free_all(vdev);
 }
 
@@ -399,7 +411,6 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
 	}
 	vfio_pci_memory_unlock_and_restore(vdev, cmd);
 
-	vdev->num_ctx = nvec;
 	vdev->irq_type = msix ? VFIO_PCI_MSIX_IRQ_INDEX :
 				VFIO_PCI_MSI_IRQ_INDEX;
 
@@ -423,9 +434,6 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 	int irq, ret;
 	u16 cmd;
 
-	if (vector >= vdev->num_ctx)
-		return -EINVAL;
-
 	ctx = vfio_irq_ctx_get(vdev, vector);
 	if (!ctx)
 		return -EINVAL;
@@ -500,7 +508,7 @@ static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
 	int i, ret = 0;
 	unsigned int j;
 
-	if (start >= vdev->num_ctx || start + count > vdev->num_ctx)
+	if (!vfio_irq_ctx_range_allocated(vdev, start, count))
 		return -EINVAL;
 
 	for (i = 0, j = start; i < count && !ret; i++, j++) {
@@ -541,7 +549,6 @@ static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 		pci_intx(pdev, 0);
 
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
-	vdev->num_ctx = 0;
 	vfio_irq_ctx_free_all(vdev);
 }
 
@@ -677,7 +684,10 @@ static int vfio_pci_set_msi_trigger(struct vfio_pci_core_device *vdev,
 		return ret;
 	}
 
-	if (!irq_is(vdev, index) || start + count > vdev->num_ctx)
+	if (!irq_is(vdev, index))
+		return -EINVAL;
+
+	if (!vfio_irq_ctx_range_allocated(vdev, start, count))
 		return -EINVAL;
 
 	for (i = start; i < start + count; i++) {
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 61d7873a3973..148fd1ae6c1c 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -60,7 +60,6 @@ struct vfio_pci_core_device {
 	spinlock_t		irqlock;
 	struct mutex		igate;
 	struct xarray		ctx;
-	int			num_ctx;
 	int			irq_type;
 	int			num_regions;
 	struct vfio_pci_region	*region;
-- 
2.34.1



* [RFC PATCH 6/8] vfio/pci: Move to single error path
  2023-03-15 20:59 [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
                   ` (4 preceding siblings ...)
  2023-03-15 20:59 ` [RFC PATCH 5/8] vfio/pci: Remove interrupt context counter Reinette Chatre
@ 2023-03-15 20:59 ` Reinette Chatre
  2023-03-15 20:59 ` [RFC PATCH 7/8] vfio/pci: Support dynamic MSI-X Reinette Chatre
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Reinette Chatre @ 2023-03-15 20:59 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

The creation and release of interrupt context involves several
steps that can fail. Cleanup after failure is done when the error
is encountered, resulting in some repetitive code.

Support for dynamic MSI-X will introduce more steps during
interrupt context creation and release.

Transition to a centralized exit path in preparation for dynamic
MSI-X in order to eliminate duplicate error handling code. Ensure
that no remaining state refers to freed memory.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 drivers/vfio/pci/vfio_pci_intrs.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 187a1ba34a16..b375a12885ba 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -460,8 +460,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 
 	trigger = eventfd_ctx_fdget(fd);
 	if (IS_ERR(trigger)) {
-		kfree(ctx->name);
-		return PTR_ERR(trigger);
+		ret = PTR_ERR(trigger);
+		goto out_free_name;
 	}
 
 	/*
@@ -481,11 +481,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 
 	ret = request_irq(irq, vfio_msihandler, 0, ctx->name, trigger);
 	vfio_pci_memory_unlock_and_restore(vdev, cmd);
-	if (ret) {
-		kfree(ctx->name);
-		eventfd_ctx_put(trigger);
-		return ret;
-	}
+	if (ret)
+		goto out_put_eventfd_ctx;
 
 	ctx->producer.token = trigger;
 	ctx->producer.irq = irq;
@@ -500,6 +497,13 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 	ctx->trigger = trigger;
 
 	return 0;
+
+out_put_eventfd_ctx:
+	eventfd_ctx_put(trigger);
+out_free_name:
+	kfree(ctx->name);
+	ctx->name = NULL;
+	return ret;
 }
 
 static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
-- 
2.34.1



* [RFC PATCH 7/8] vfio/pci: Support dynamic MSI-X
  2023-03-15 20:59 [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
                   ` (5 preceding siblings ...)
  2023-03-15 20:59 ` [RFC PATCH 6/8] vfio/pci: Move to single error path Reinette Chatre
@ 2023-03-15 20:59 ` Reinette Chatre
  2023-03-17 21:58   ` Alex Williamson
  2023-03-15 20:59 ` [RFC PATCH 8/8] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X Reinette Chatre
  2023-03-16 21:56 ` [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Alex Williamson
  8 siblings, 1 reply; 20+ messages in thread
From: Reinette Chatre @ 2023-03-15 20:59 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

The recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq()
enable an individual MSI-X index to be allocated and freed after
MSI-X has been enabled.

Support dynamic MSI-X by keeping the association between an allocated
interrupt and its vfio interrupt context. Allocate a new context together
with the new interrupt if no interrupt context exists for an MSI-X
interrupt. Similarly, release an interrupt together with its context.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---

Guidance is appreciated on expectations regarding maintaining
existing error behavior. An earlier patch introduced the
vfio_irq_ctx_range_allocated() helper to maintain existing error
behavior. Now, this helper needs to be disabled for MSI-X. User
space that does not want to dynamically allocate MSI-X interrupts,
but provides an invalid range when setting a new ACTION, will now
obtain new interrupts, or new failures (potentially including the
freeing of existing interrupts) if the allocation of the new
interrupts fails.

 drivers/vfio/pci/vfio_pci_intrs.c | 101 ++++++++++++++++++++++++------
 1 file changed, 83 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index b375a12885ba..954a70575802 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -55,6 +55,18 @@ struct vfio_pci_irq_ctx *vfio_irq_ctx_get(struct vfio_pci_core_device *vdev,
 	return xa_load(&vdev->ctx, index);
 }
 
+static void vfio_irq_ctx_free(struct vfio_pci_core_device *vdev,
+			      unsigned long index)
+{
+	struct vfio_pci_irq_ctx *ctx;
+
+	ctx = xa_load(&vdev->ctx, index);
+	if (ctx) {
+		xa_erase(&vdev->ctx, index);
+		kfree(ctx);
+	}
+}
+
 static void vfio_irq_ctx_free_all(struct vfio_pci_core_device *vdev)
 {
 	struct vfio_pci_irq_ctx *ctx;
@@ -430,33 +442,63 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 {
 	struct pci_dev *pdev = vdev->pdev;
 	struct vfio_pci_irq_ctx *ctx;
+	struct msi_map msix_map = {};
 	struct eventfd_ctx *trigger;
+	bool new_ctx = false;
 	int irq, ret;
 	u16 cmd;
 
 	ctx = vfio_irq_ctx_get(vdev, vector);
-	if (!ctx)
+	/* Only MSI-X allows dynamic allocation. */
+	if (!msix && !ctx)
 		return -EINVAL;
+
 	irq = pci_irq_vector(pdev, vector);
+	/* Context and interrupt are always allocated together. */
+	WARN_ON((ctx && irq == -EINVAL) || (!ctx && irq != -EINVAL));
 
-	if (ctx->trigger) {
+	if (ctx && ctx->trigger) {
 		irq_bypass_unregister_producer(&ctx->producer);
 
 		cmd = vfio_pci_memory_lock_and_enable(vdev);
 		free_irq(irq, ctx->trigger);
+		if (msix) {
+			msix_map.index = vector;
+			msix_map.virq = irq;
+			pci_msix_free_irq(pdev, msix_map);
+			irq = -EINVAL;
+		}
 		vfio_pci_memory_unlock_and_restore(vdev, cmd);
 		kfree(ctx->name);
 		eventfd_ctx_put(ctx->trigger);
 		ctx->trigger = NULL;
+		if (msix) {
+			vfio_irq_ctx_free(vdev, vector);
+			ctx = NULL;
+		}
 	}
 
 	if (fd < 0)
 		return 0;
 
+	if (!ctx) {
+		ret = vfio_irq_ctx_alloc_single(vdev, vector);
+		if (ret)
+			return ret;
+		ctx = vfio_irq_ctx_get(vdev, vector);
+		if (!ctx) {
+			ret = -EINVAL;
+			goto out_free_ctx;
+		}
+		new_ctx = true;
+	}
+
 	ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-msi%s[%d](%s)",
 			      msix ? "x" : "", vector, pci_name(pdev));
-	if (!ctx->name)
-		return -ENOMEM;
+	if (!ctx->name) {
+		ret = -ENOMEM;
+		goto out_free_ctx;
+	}
 
 	trigger = eventfd_ctx_fdget(fd);
 	if (IS_ERR(trigger)) {
@@ -464,25 +506,38 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 		goto out_free_name;
 	}
 
-	/*
-	 * The MSIx vector table resides in device memory which may be cleared
-	 * via backdoor resets. We don't allow direct access to the vector
-	 * table so even if a userspace driver attempts to save/restore around
-	 * such a reset it would be unsuccessful. To avoid this, restore the
-	 * cached value of the message prior to enabling.
-	 */
 	cmd = vfio_pci_memory_lock_and_enable(vdev);
 	if (msix) {
-		struct msi_msg msg;
-
-		get_cached_msi_msg(irq, &msg);
-		pci_write_msi_msg(irq, &msg);
+		if (irq == -EINVAL) {
+			msix_map = pci_msix_alloc_irq_at(pdev, vector, NULL);
+			if (msix_map.index < 0) {
+				vfio_pci_memory_unlock_and_restore(vdev, cmd);
+				ret = msix_map.index;
+				goto out_put_eventfd_ctx;
+			}
+			irq = msix_map.virq;
+		} else {
+			/*
+			 * The MSIx vector table resides in device memory which
+			 * may be cleared via backdoor resets. We don't allow
+			 * direct access to the vector table so even if a
+			 * userspace driver attempts to save/restore around
+			 * such a reset it would be unsuccessful. To avoid
+			 * this, restore the cached value of the message prior
+			 * to enabling.
+			 */
+			struct msi_msg msg;
+
+			get_cached_msi_msg(irq, &msg);
+			pci_write_msi_msg(irq, &msg);
+		}
 	}
 
 	ret = request_irq(irq, vfio_msihandler, 0, ctx->name, trigger);
-	vfio_pci_memory_unlock_and_restore(vdev, cmd);
 	if (ret)
-		goto out_put_eventfd_ctx;
+		goto out_free_irq_locked;
+
+	vfio_pci_memory_unlock_and_restore(vdev, cmd);
 
 	ctx->producer.token = trigger;
 	ctx->producer.irq = irq;
@@ -498,11 +553,21 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 
 	return 0;
 
+out_free_irq_locked:
+	if (msix && new_ctx) {
+		msix_map.index = vector;
+		msix_map.virq = irq;
+		pci_msix_free_irq(pdev, msix_map);
+	}
+	vfio_pci_memory_unlock_and_restore(vdev, cmd);
 out_put_eventfd_ctx:
 	eventfd_ctx_put(trigger);
 out_free_name:
 	kfree(ctx->name);
 	ctx->name = NULL;
+out_free_ctx:
+	if (msix && new_ctx)
+		vfio_irq_ctx_free(vdev, vector);
 	return ret;
 }
 
@@ -512,7 +577,7 @@ static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
 	int i, ret = 0;
 	unsigned int j;
 
-	if (!vfio_irq_ctx_range_allocated(vdev, start, count))
+	if (!msix && !vfio_irq_ctx_range_allocated(vdev, start, count))
 		return -EINVAL;
 
 	for (i = 0, j = start; i < count && !ret; i++, j++) {
-- 
2.34.1



* [RFC PATCH 8/8] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X
  2023-03-15 20:59 [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
                   ` (6 preceding siblings ...)
  2023-03-15 20:59 ` [RFC PATCH 7/8] vfio/pci: Support dynamic MSI-X Reinette Chatre
@ 2023-03-15 20:59 ` Reinette Chatre
  2023-03-17 21:05   ` Alex Williamson
  2023-03-16 21:56 ` [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Alex Williamson
  8 siblings, 1 reply; 20+ messages in thread
From: Reinette Chatre @ 2023-03-15 20:59 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

Dynamic MSI-X is now supported. Clear VFIO_IRQ_INFO_NORESIZE
for MSI-X to provide guidance to user space.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 2 +-
 include/uapi/linux/vfio.h        | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ae0e161c7fc9..1d071ee212a7 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1111,7 +1111,7 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
 	if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
 		info.flags |=
 			(VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED);
-	else
+	else if (info.index != VFIO_PCI_MSIX_IRQ_INDEX)
 		info.flags |= VFIO_IRQ_INFO_NORESIZE;
 
 	return copy_to_user(arg, &info, minsz) ? -EFAULT : 0;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0552e8dcf0cb..1a36134cae5c 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -511,6 +511,9 @@ struct vfio_region_info_cap_nvlink2_lnkspd {
  * then add and unmask vectors, it's up to userspace to make the decision
  * whether to allocate the maximum supported number of vectors or tear
  * down setup and incrementally increase the vectors as each is enabled.
+ * Absence of the NORESIZE flag indicates that vectors can be enabled
+ * and disabled dynamically without impacting other vectors within the
+ * index.
  */
 struct vfio_irq_info {
 	__u32	argsz;
-- 
2.34.1



* Re: [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts
  2023-03-15 20:59 [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
                   ` (7 preceding siblings ...)
  2023-03-15 20:59 ` [RFC PATCH 8/8] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X Reinette Chatre
@ 2023-03-16 21:56 ` Alex Williamson
  2023-03-16 23:38   ` Reinette Chatre
  8 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2023-03-16 21:56 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, tglx, darwi,
	kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu, tom.zanussi,
	linux-kernel

On Wed, 15 Mar 2023 13:59:20 -0700
Reinette Chatre <reinette.chatre@intel.com> wrote:

> == Cover letter ==
> 
> Qemu allocates interrupts incrementally at the time the guest unmasks an
> interrupt, for example each time a Linux guest runs request_irq().
> 
> Dynamic allocation of MSI-X interrupts was not possible until v6.2 [1].
> This prompted Qemu to, when allocating a new interrupt, first release all
> previously allocated interrupts (including disable of MSI-X) followed
> by re-allocation of all interrupts that includes the new interrupt.
> Please see [2] for a detailed discussion about this issue.
> 
> Releasing and re-allocating interrupts may be acceptable if all
> interrupts are unmasked during device initialization. If unmasking of
> interrupts occur during runtime this may result in lost interrupts.
> For example, consider an accelerator device with multiple work queues,
> each work queue having a dedicated interrupt. A work queue can be
> enabled at any time with its associated interrupt unmasked while other
> work queues are already active. Having all interrupts released and MSI-X
> disabled to enable the new work queue will impact active work queues.
> 
> This series builds on the recent interrupt sub-system core changes
> that added support for dynamic MSI-X allocation after initial MSI-X
> enabling.
> 
> Add support for dynamic MSI-X allocation to vfio-pci. A flag
> indicating lack of support for dynamic allocation already exist:
> VFIO_IRQ_INFO_NORESIZE and has always been set for MSI and MSI-X. With
> support for dynamic MSI-X the flag is cleared for MSI-X, enabling
> Qemu to modify its behavior.
> 
> == Why is this an RFC ? ==
> 
> vfio support for dynamic MSI-X needs to work with existing user space
> as well as upcoming user space that takes advantage of this feature.
> I would appreciate guidance on the expectations and requirements
> surrounding error handling when considering existing user space.
> 
> For example, consider the following scenario:
> Start: Consider a passthrough device that supports 10 MSI-X interrupts
> 	(0 to 9) and existing Qemu allocated interrupts 0 to 4.
> 
> Scenario: Qemu (hypothetically) attempts to associate a new action to
> 	interrupts 0 to 7. Early checking of this range in
> 	vfio_set_irqs_validate_and_prepare() will pass since it
> 	is a valid range for the device. What happens after the
> 	early checking is considered next:
> 
> Current behavior (before this series): Since the provided range, 0 - 7,
> 	exceeds the allocated range, no action will be taken on existing
> 	allocated interrupts 0 - 4 and the Qemu request will fail without
> 	making any state changes.

I must be missing something, it was described correctly earlier that
QEMU will disable MSI-X and re-enable with a new vector count.  Not
only does QEMU not really have a way to fail this change, pretty much
nothing would work if we did.

What happens in this case is that the QEMU vfio-pci driver gets a
vector_use callback for one of these new vectors {5,6,7},
vfio_msix_vector_do_use() checks that's greater than we have enabled,
disables MSI-X, and re-enables it for the new vector count.  Worst case
we'll do that 3 times if the vectors are presented in ascending order.

> New behavior (with this series): Since vfio supports dynamic MSI-X new
> 	interrupts will be allocated for vectors 5, 6, and 7. Please note
> 	that this changes the behavior encountered by unmodified Qemu: new
> 	interrupts are allocated instead of returning an error. Even more,
> 	since the range provided by Qemu spans 0 - 7, a failure during
> 	allocation of 5, 6, or 7 will result in release of entire range.

As above, QEMU doesn't currently return an error here, nor is there
actually any means to return an error.  MSI-X is not paravirtualized
and the PCI spec definition of MSI-X does not define that a driver
needs to check whether the device accepted unmasking a vector.  All we
can do in QEMU if the host fails to configure the device as directed is
print an error message as a breadcrumb to help debug why the device
stopped working.  In practice, I don't think I've ever seen this.

So really the difference should be transparent to the guest.  The risk
of an error allocating vectors on the host is the same, the only
significant difference should be the amount of disruption at the device
that we can better approximate the request of the guest.
 
> This series aims to maintain existing error behavior for MSI (please see
> "vfio/pci: Remove interrupt context counter") but I would appreciate your
> guidance on whether existing error behavior should be maintained for MSI-X
> and how to do so if it is a requirement.

Based on above, there really can never be an error if we expect the
device to work, so I think there's a misread of the current status.
Dynamic MSI-X support should simply reduce the disruption and chance of
lost interrupts at the device, but the points where we risk that the
host cannot provide the configuration we need are the same.  Thanks,

Alex



* Re: [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts
  2023-03-16 21:56 ` [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Alex Williamson
@ 2023-03-16 23:38   ` Reinette Chatre
  2023-03-16 23:52     ` Tian, Kevin
  0 siblings, 1 reply; 20+ messages in thread
From: Reinette Chatre @ 2023-03-16 23:38 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, tglx, darwi,
	kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu, tom.zanussi,
	linux-kernel

Hi Alex,

On 3/16/2023 2:56 PM, Alex Williamson wrote:
> On Wed, 15 Mar 2023 13:59:20 -0700 Reinette Chatre
> <reinette.chatre@intel.com> wrote:
> 

...

>> == Why is this an RFC ? ==
>> 
>> vfio support for dynamic MSI-X needs to work with existing user
>> space as well as upcoming user space that takes advantage of this
>> feature. I would appreciate guidance on the expectations and
>> requirements surrounding error handling when considering existing
>> user space.
>> 
>> For example, consider the following scenario: Start: Consider a
>> passthrough device that supports 10 MSI-X interrupts (0 to 9) and
>> existing Qemu allocated interrupts 0 to 4.
>> 
>> Scenario: Qemu (hypothetically) attempts to associate a new action
>> to interrupts 0 to 7. Early checking of this range in 
>> vfio_set_irqs_validate_and_prepare() will pass since it is a valid
>> range for the device. What happens after the early checking is
>> considered next:
>> 
>> Current behavior (before this series): Since the provided range, 0
>> - 7, exceeds the allocated range, no action will be taken on
>> existing allocated interrupts 0 - 4 and the Qemu request will fail
>> without making any state changes.
> 
> I must be missing something, it was described correctly earlier that 
> QEMU will disable MSI-X and re-enable with a new vector count.  Not 
> only does QEMU not really have a way to fail this change, pretty
> much nothing would work if we did.

Thank you very much for confirming Qemu behavior. 

One of my goals is to ensure that these kernel changes do not break
existing user space in any way. The only area I could find where
existing user space may encounter new behavior is if user space makes
the (pre dynamic MSI-X) mistake of attempting to associate triggers
to interrupts that have not been allocated. Thank you for confirming
that this is not a valid scenario for Qemu.

> What happens in this case is that the QEMU vfio-pci driver gets a 
> vector_use callback for one of these new vectors {5,6,7}, 
> vfio_msix_vector_do_use() checks that's greater than we have
> enabled, disables MSI-X, and re-enables it for the new vector count.
> Worst case we'll do that 3 times if the vectors are presented in
> ascending order.

Indeed. While testing this work I modified Qemu's vfio_msix_vector_do_use() to
consider VFIO_IRQ_INFO_NORESIZE. I saw the vector_use callback arrive for
each new vector, and having Qemu simply associate a new action with the
new vector (by calling vfio_set_irq_signaling()) instead of disabling MSI-X
and then re-enabling all vectors worked well with the changes presented
here. A single interrupt was dynamically allocated and enabled without
impacting others.
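
Roughly, the experiment was along these lines (a simplified sketch only,
not the actual Qemu change; "noresize" stands for a cached copy of
VFIO_IRQ_INFO_NORESIZE and error handling is omitted):

	/* In vfio_msix_vector_do_use(), when vector 'nr' is first used: */
	if (!noresize) {
		/* Dynamic MSI-X: wire up only the newly unmasked vector. */
		vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_MSIX_IRQ_INDEX,
				       nr, VFIO_IRQ_SET_ACTION_TRIGGER,
				       event_notifier_get_fd(&vector->interrupt),
				       &err);
	} else {
		/* Existing path: disable MSI-X, re-enable with new count. */
		vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_MSIX_IRQ_INDEX);
		vfio_enable_vectors(vdev, true);
	}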

> Based on above, there really can never be an error if we expect the 
> device to work, so I think there's a misread of the current status. 
> Dynamic MSI-X support should simply reduce the disruption and chance
> of lost interrupts at the device, but the points where we risk that
> the host cannot provide the configuration we need are the same.

Thank you very much, Alex. In this case, please consider this series a
submission for inclusion. I'd be happy to resubmit without the "RFC"
prefix if that is preferred.

Reinette

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts
  2023-03-16 23:38   ` Reinette Chatre
@ 2023-03-16 23:52     ` Tian, Kevin
  2023-03-17 16:48       ` Reinette Chatre
  0 siblings, 1 reply; 20+ messages in thread
From: Tian, Kevin @ 2023-03-16 23:52 UTC (permalink / raw)
  To: Chatre, Reinette, Alex Williamson
  Cc: jgg, yishaih, shameerali.kolothum.thodi, tglx, darwi, kvm, Jiang,
	Dave, Liu, Jing2, Raj, Ashok, Yu, Fenghua, tom.zanussi,
	linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Friday, March 17, 2023 7:38 AM
> 
> > Based on above, there really can never be an error if we expect the
> > device to work, so I think there's a misread of the current status.
> > Dynamic MSI-X support should simply reduce the disruption and chance
> > of lost interrupts at the device, but the points where we risk that
> > the host cannot provide the configuration we need are the same.
> 
> Thank you very much, Alex. In this case, please consider this series a
> submission for inclusion. I'd be happy to resubmit without the "RFC"
> prefix if that is preferred.
> 

With that, do we still want to keep the error behavior for MSI?

If not, patch 5 can be simplified, e.g. there is no need for
vfio_irq_ctx_range_allocated(), and the MSI/MSI-X error behaviors become
consistent.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts
  2023-03-16 23:52     ` Tian, Kevin
@ 2023-03-17 16:48       ` Reinette Chatre
  0 siblings, 0 replies; 20+ messages in thread
From: Reinette Chatre @ 2023-03-17 16:48 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson
  Cc: jgg, yishaih, shameerali.kolothum.thodi, tglx, darwi, kvm, Jiang,
	Dave, Liu, Jing2, Raj, Ashok, Yu, Fenghua, tom.zanussi,
	linux-kernel

Hi Kevin,

On 3/16/2023 4:52 PM, Tian, Kevin wrote:
>> From: Chatre, Reinette <reinette.chatre@intel.com>
>> Sent: Friday, March 17, 2023 7:38 AM
>>
>>> Based on above, there really can never be an error if we expect the
>>> device to work, so I think there's a misread of the current status.
>>> Dynamic MSI-X support should simply reduce the disruption and chance
>>> of lost interrupts at the device, but the points where we risk that
>>> the host cannot provide the configuration we need are the same.
>>
>> Thank you very much, Alex. In this case, please consider this series a
>> submission for inclusion. I'd be happy to resubmit without the "RFC"
>> prefix if that is preferred.
>>
> 
> With that, do we still want to keep the error behavior for MSI?
> 
> If not, patch 5 can be simplified, e.g. there is no need for
> vfio_irq_ctx_range_allocated(), and the MSI/MSI-X error behaviors become
> consistent.

Thank you. Yes, if I understand correctly MSI and MSI-X error handling
can become consistent. I'll modify patch 5 to remove
vfio_irq_ctx_range_allocated().
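
In other words, something like the following (a sketch only; the
declarations and the error unwind are taken from my reading of the
existing code and may differ in the final patch):

static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
			      unsigned count, int32_t *fds, bool msix)
{
	unsigned int j;
	int i, ret = 0;

	/*
	 * No vfio_irq_ctx_range_allocated() check: per-vector validation
	 * (and, for MSI-X, dynamic allocation) happens in
	 * vfio_msi_set_vector_signal().
	 */
	for (i = 0, j = start; i < count && !ret; i++, j++) {
		int fd = fds ? fds[i] : -1;

		ret = vfio_msi_set_vector_signal(vdev, j, fd, msix);
	}

	if (ret) {
		for (i = start; i < j; i++)
			vfio_msi_set_vector_signal(vdev, i, -1, msix);
	}

	return ret;
}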

Reinette

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 8/8] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X
  2023-03-15 20:59 ` [RFC PATCH 8/8] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X Reinette Chatre
@ 2023-03-17 21:05   ` Alex Williamson
  2023-03-17 22:21     ` Reinette Chatre
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2023-03-17 21:05 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, tglx, darwi,
	kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu, tom.zanussi,
	linux-kernel

On Wed, 15 Mar 2023 13:59:28 -0700
Reinette Chatre <reinette.chatre@intel.com> wrote:

> Dynamic MSI-X is supported. Clear VFIO_IRQ_INFO_NORESIZE
> to provide guidance to user space.
> 
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 2 +-
>  include/uapi/linux/vfio.h        | 3 +++
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index ae0e161c7fc9..1d071ee212a7 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1111,7 +1111,7 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
>  	if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
>  		info.flags |=
>  			(VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED);
> -	else
> +	else if (info.index != VFIO_PCI_MSIX_IRQ_INDEX)
>  		info.flags |= VFIO_IRQ_INFO_NORESIZE;
>  

I think we need to check pci_msix_can_alloc_dyn(), right?  Thanks,

Alex

>  	return copy_to_user(arg, &info, minsz) ? -EFAULT : 0;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 0552e8dcf0cb..1a36134cae5c 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -511,6 +511,9 @@ struct vfio_region_info_cap_nvlink2_lnkspd {
>   * then add and unmask vectors, it's up to userspace to make the decision
>   * whether to allocate the maximum supported number of vectors or tear
>   * down setup and incrementally increase the vectors as each is enabled.
> + * Absence of the NORESIZE flag indicates that vectors can be enabled
> + * and disabled dynamically without impacting other vectors within the
> + * index.
>   */
>  struct vfio_irq_info {
>  	__u32	argsz;


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 7/8] vfio/pci: Support dynamic MSI-x
  2023-03-15 20:59 ` [RFC PATCH 7/8] vfio/pci: Support dynamic MSI-x Reinette Chatre
@ 2023-03-17 21:58   ` Alex Williamson
  2023-03-17 22:54     ` Reinette Chatre
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2023-03-17 21:58 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, tglx, darwi,
	kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu, tom.zanussi,
	linux-kernel

On Wed, 15 Mar 2023 13:59:27 -0700
Reinette Chatre <reinette.chatre@intel.com> wrote:

> The recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq()
> enable an individual MSI-X index to be allocated and freed after
> MSI-X enabling.
> 
> Support dynamic MSI-X by keeping the association between the allocated
> interrupt and the vfio interrupt context. Allocate a new context together
> with the new interrupt if no interrupt context exists for an MSI-X
> interrupt. Similarly, release an interrupt together with its context.
> 
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
> 
> Guidance is appreciated on expectations regarding maintaining
> existing error behavior. An earlier patch introduced the
> vfio_irq_ctx_range_allocated() helper to maintain existing error
> behavior. Now, this helper needs to be disabled for MSI-X. User
> space that does not want to dynamically allocate MSI-X interrupts,
> but provides an invalid range when setting a new ACTION, will now
> obtain new interrupts, or new failures (potentially including freeing
> of existing interrupts) if allocation of the new interrupts fails.
> 
>  drivers/vfio/pci/vfio_pci_intrs.c | 101 ++++++++++++++++++++++++------
>  1 file changed, 83 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index b375a12885ba..954a70575802 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -55,6 +55,18 @@ struct vfio_pci_irq_ctx *vfio_irq_ctx_get(struct vfio_pci_core_device *vdev,
>  	return xa_load(&vdev->ctx, index);
>  }
>  
> +static void vfio_irq_ctx_free(struct vfio_pci_core_device *vdev,
> +			      unsigned long index)
> +{
> +	struct vfio_pci_irq_ctx *ctx;
> +
> +	ctx = xa_load(&vdev->ctx, index);
> +	if (ctx) {
> +		xa_erase(&vdev->ctx, index);
> +		kfree(ctx);
> +	}
> +}

The only places calling this have a known valid ctx, so it seems
redundant that we xa_load it again.  Should ctx be a function arg to
reduce this to simply xa_erase() + kfree()?

> +
>  static void vfio_irq_ctx_free_all(struct vfio_pci_core_device *vdev)
>  {
>  	struct vfio_pci_irq_ctx *ctx;
> @@ -430,33 +442,63 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
>  {
>  	struct pci_dev *pdev = vdev->pdev;
>  	struct vfio_pci_irq_ctx *ctx;
> +	struct msi_map msix_map = {};
>  	struct eventfd_ctx *trigger;
> +	bool new_ctx;
>  	int irq, ret;
>  	u16 cmd;
>  
>  	ctx = vfio_irq_ctx_get(vdev, vector);
> -	if (!ctx)
> +	/* Only MSI-X allows dynamic allocation. */
> +	if (!msix && !ctx)
>  		return -EINVAL;
> +
>  	irq = pci_irq_vector(pdev, vector);
> +	/* Context and interrupt are always allocated together. */
> +	WARN_ON((ctx && irq == -EINVAL) || (!ctx && irq != -EINVAL));
>  
> -	if (ctx->trigger) {
> +	if (ctx && ctx->trigger) {
>  		irq_bypass_unregister_producer(&ctx->producer);
>  
>  		cmd = vfio_pci_memory_lock_and_enable(vdev);
>  		free_irq(irq, ctx->trigger);
> +		if (msix) {
> +			msix_map.index = vector;
> +			msix_map.virq = irq;
> +			pci_msix_free_irq(pdev, msix_map);
> +			irq = -EINVAL;
> +		}
>  		vfio_pci_memory_unlock_and_restore(vdev, cmd);
>  		kfree(ctx->name);
>  		eventfd_ctx_put(ctx->trigger);
>  		ctx->trigger = NULL;
> +		if (msix) {
> +			vfio_irq_ctx_free(vdev, vector);
> +			ctx = NULL;
> +		}
>  	}
>  
>  	if (fd < 0)
>  		return 0;
>  
> +	if (!ctx) {
> +		ret = vfio_irq_ctx_alloc_single(vdev, vector);
> +		if (ret)
> +			return ret;
> +		ctx = vfio_irq_ctx_get(vdev, vector);

This suggests vfio_irq_ctx_alloc_single() should return ctx.

> +		if (!ctx) {
> +			ret = -EINVAL;
> +			goto out_free_ctx;
> +		}
> +		new_ctx = true;
> +	}
> +
>  	ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-msi%s[%d](%s)",
>  			      msix ? "x" : "", vector, pci_name(pdev));
> -	if (!ctx->name)
> -		return -ENOMEM;
> +	if (!ctx->name) {
> +		ret = -ENOMEM;
> +		goto out_free_ctx;
> +	}
>  
>  	trigger = eventfd_ctx_fdget(fd);
>  	if (IS_ERR(trigger)) {
> @@ -464,25 +506,38 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
>  		goto out_free_name;
>  	}
>  
> -	/*
> -	 * The MSIx vector table resides in device memory which may be cleared
> -	 * via backdoor resets. We don't allow direct access to the vector
> -	 * table so even if a userspace driver attempts to save/restore around
> -	 * such a reset it would be unsuccessful. To avoid this, restore the
> -	 * cached value of the message prior to enabling.
> -	 */
>  	cmd = vfio_pci_memory_lock_and_enable(vdev);
>  	if (msix) {
> -		struct msi_msg msg;
> -
> -		get_cached_msi_msg(irq, &msg);
> -		pci_write_msi_msg(irq, &msg);
> +		if (irq == -EINVAL) {
> +			msix_map = pci_msix_alloc_irq_at(pdev, vector, NULL);

It looks to me like we need to support MSI-X with both NORESIZE
behavior and dynamic allocation based on pci_msix_can_alloc_dyn().
It's not entirely clear to me where this is and isn't supported, but
the existence of the test helper suggests we can't assume support.


> +			if (msix_map.index < 0) {
> +				vfio_pci_memory_unlock_and_restore(vdev, cmd);
> +				ret = msix_map.index;
> +				goto out_put_eventfd_ctx;
> +			}
> +			irq = msix_map.virq;
> +		} else {
> +			/*
> +			 * The MSIx vector table resides in device memory which
> +			 * may be cleared via backdoor resets. We don't allow
> +			 * direct access to the vector table so even if a
> +			 * userspace driver attempts to save/restore around
> +			 * such a reset it would be unsuccessful. To avoid
> +			 * this, restore the cached value of the message prior
> +			 * to enabling.
> +			 */
> +			struct msi_msg msg;
> +
> +			get_cached_msi_msg(irq, &msg);
> +			pci_write_msi_msg(irq, &msg);
> +		}

I don't follow when this latter branch is ever taken in the new flow.
It's stated earlier that ctx and irq are coupled, and I believe so is
trigger.  So if we had a previous ctx and irq (and trigger), we removed
it and irq is now always -EINVAL here.  Thanks,

Alex

>  	}
>  
>  	ret = request_irq(irq, vfio_msihandler, 0, ctx->name, trigger);
> -	vfio_pci_memory_unlock_and_restore(vdev, cmd);
>  	if (ret)
> -		goto out_put_eventfd_ctx;
> +		goto out_free_irq_locked;
> +
> +	vfio_pci_memory_unlock_and_restore(vdev, cmd);
>  
>  	ctx->producer.token = trigger;
>  	ctx->producer.irq = irq;
> @@ -498,11 +553,21 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
>  
>  	return 0;
>  
> +out_free_irq_locked:
> +	if (msix && new_ctx) {
> +		msix_map.index = vector;
> +		msix_map.virq = irq;
> +		pci_msix_free_irq(pdev, msix_map);
> +	}
> +	vfio_pci_memory_unlock_and_restore(vdev, cmd);
>  out_put_eventfd_ctx:
>  	eventfd_ctx_put(trigger);
>  out_free_name:
>  	kfree(ctx->name);
>  	ctx->name = NULL;
> +out_free_ctx:
> +	if (msix && new_ctx)
> +		vfio_irq_ctx_free(vdev, vector);
>  	return ret;
>  }
>  
> @@ -512,7 +577,7 @@ static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
>  	int i, ret = 0;
>  	unsigned int j;
>  
> -	if (!vfio_irq_ctx_range_allocated(vdev, start, count))
> +	if (!msix && !vfio_irq_ctx_range_allocated(vdev, start, count))
>  		return -EINVAL;
>  
>  	for (i = 0, j = start; i < count && !ret; i++, j++) {


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 8/8] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X
  2023-03-17 21:05   ` Alex Williamson
@ 2023-03-17 22:21     ` Reinette Chatre
  2023-03-17 23:01       ` Alex Williamson
  0 siblings, 1 reply; 20+ messages in thread
From: Reinette Chatre @ 2023-03-17 22:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, tglx, darwi,
	kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu, tom.zanussi,
	linux-kernel

Hi Alex,

On 3/17/2023 2:05 PM, Alex Williamson wrote:
> On Wed, 15 Mar 2023 13:59:28 -0700
> Reinette Chatre <reinette.chatre@intel.com> wrote:
> 
>> Dynamic MSI-X is supported. Clear VFIO_IRQ_INFO_NORESIZE
>> to provide guidance to user space.
>>
>> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_core.c | 2 +-
>>  include/uapi/linux/vfio.h        | 3 +++
>>  2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index ae0e161c7fc9..1d071ee212a7 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -1111,7 +1111,7 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
>>  	if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
>>  		info.flags |=
>>  			(VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED);
>> -	else
>> +	else if (info.index != VFIO_PCI_MSIX_IRQ_INDEX)
>>  		info.flags |= VFIO_IRQ_INFO_NORESIZE;
>>  
> 
> I think we need to check pci_msix_can_alloc_dyn(), right?  Thanks,

Can pci_msix_can_alloc_dyn() ever return false?

I cannot see how pci_msix_can_alloc_dyn() can return false when
considering the VFIO PCI MSI-X flow:

vfio_msi_enable(..., ..., msix == true) 
  pci_alloc_irq_vectors(..., ..., ..., flag == PCI_IRQ_MSIX) 
    pci_alloc_irq_vectors_affinity() 
      __pci_enable_msix_range() 
        pci_setup_msix_device_domain() 
          pci_create_device_domain(..., &pci_msix_template, ...)

The template used above, pci_msix_template, has MSI_FLAG_PCI_MSIX_ALLOC_DYN
hardcoded. This is the flag that pci_msix_can_alloc_dyn() tests for.

If indeed pci_msix_can_alloc_dyn() can return false in the VFIO PCI MSI-X
usage then this series needs to be reworked to continue supporting
existing allocation mechanism as well as dynamic MSI-X allocation. Which
allocation mechanism to use would be depend on pci_msix_can_alloc_dyn().

Alternatively, if you agree that VFIO PCI MSI-X can expect
pci_msix_can_alloc_dyn() to always return true then I should perhaps
add a test in vfio_msi_enable() that confirms this is the case. This would
cause vfio_msi_enable() to fail if dynamic MSI-X is not possible, perhaps even
complain loudly with a WARN.
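
Concretely, something as small as the following is what I have in mind
(a sketch only; the exact error code and the placement after
pci_alloc_irq_vectors() succeeds are open):

	/* Sketch for vfio_msi_enable(): refuse MSI-X without dynamic alloc. */
	if (msix && WARN_ON(!pci_msix_can_alloc_dyn(pdev))) {
		pci_free_irq_vectors(pdev);
		return -EOPNOTSUPP;
	}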

Reinette

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 7/8] vfio/pci: Support dynamic MSI-x
  2023-03-17 21:58   ` Alex Williamson
@ 2023-03-17 22:54     ` Reinette Chatre
  0 siblings, 0 replies; 20+ messages in thread
From: Reinette Chatre @ 2023-03-17 22:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, tglx, darwi,
	kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu, tom.zanussi,
	linux-kernel

Hi Alex,

On 3/17/2023 2:58 PM, Alex Williamson wrote:
> On Wed, 15 Mar 2023 13:59:27 -0700
> Reinette Chatre <reinette.chatre@intel.com> wrote:
> 

...
 
>> +static void vfio_irq_ctx_free(struct vfio_pci_core_device *vdev,
>> +			      unsigned long index)
>> +{
>> +	struct vfio_pci_irq_ctx *ctx;
>> +
>> +	ctx = xa_load(&vdev->ctx, index);
>> +	if (ctx) {
>> +		xa_erase(&vdev->ctx, index);
>> +		kfree(ctx);
>> +	}
>> +}
> 
> The only places calling this have a known valid ctx, so it seems
> redundant that we xa_load it again.  Should ctx be a function arg to
> reduce this to simply xa_erase() + kfree()?

Good point. Will do.
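
That is, something along these lines (a sketch of the suggested signature
change):

static void vfio_irq_ctx_free(struct vfio_pci_core_device *vdev,
			      struct vfio_pci_irq_ctx *ctx, unsigned long index)
{
	xa_erase(&vdev->ctx, index);
	kfree(ctx);
}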

...

>> +	if (!ctx) {
>> +		ret = vfio_irq_ctx_alloc_single(vdev, vector);
>> +		if (ret)
>> +			return ret;
>> +		ctx = vfio_irq_ctx_get(vdev, vector);
> 
> This suggests vfio_irq_ctx_alloc_single() should return ctx.
> 

Thank you. Yes, will do.
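
A sketch of what I have in mind (the kzalloc()/xa_insert() details mirror
what I expect the earlier vfio_irq_ctx_alloc_single() patch to do and may
differ in the actual code):

static struct vfio_pci_irq_ctx *
vfio_irq_ctx_alloc(struct vfio_pci_core_device *vdev, unsigned long index)
{
	struct vfio_pci_irq_ctx *ctx;
	int ret;

	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
	if (!ctx)
		return NULL;

	ret = xa_insert(&vdev->ctx, index, ctx, GFP_KERNEL_ACCOUNT);
	if (ret) {
		kfree(ctx);
		return NULL;
	}

	return ctx;
}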

>> @@ -464,25 +506,38 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
>>  		goto out_free_name;
>>  	}
>>  
>> -	/*
>> -	 * The MSIx vector table resides in device memory which may be cleared
>> -	 * via backdoor resets. We don't allow direct access to the vector
>> -	 * table so even if a userspace driver attempts to save/restore around
>> -	 * such a reset it would be unsuccessful. To avoid this, restore the
>> -	 * cached value of the message prior to enabling.
>> -	 */
>>  	cmd = vfio_pci_memory_lock_and_enable(vdev);
>>  	if (msix) {
>> -		struct msi_msg msg;
>> -
>> -		get_cached_msi_msg(irq, &msg);
>> -		pci_write_msi_msg(irq, &msg);
>> +		if (irq == -EINVAL) {
>> +			msix_map = pci_msix_alloc_irq_at(pdev, vector, NULL);
> 
> It looks to me like we need to support MSI-X with both NORESIZE
> behavior and dynamic allocation based on pci_msix_can_alloc_dyn().
> It's not entirely clear to me where this is and isn't supported, but
> the existence of the test helper suggests we can't assume support.

As I mentioned in my other response ([1]), I cannot see how pci_msix_can_alloc_dyn()
can return false. Even so, yes, I can rework this series to support both the
original and dynamic MSI-X allocation mechanisms.

>> +			if (msix_map.index < 0) {
>> +				vfio_pci_memory_unlock_and_restore(vdev, cmd);
>> +				ret = msix_map.index;
>> +				goto out_put_eventfd_ctx;
>> +			}
>> +			irq = msix_map.virq;
>> +		} else {
>> +			/*
>> +			 * The MSIx vector table resides in device memory which
>> +			 * may be cleared via backdoor resets. We don't allow
>> +			 * direct access to the vector table so even if a
>> +			 * userspace driver attempts to save/restore around
>> +			 * such a reset it would be unsuccessful. To avoid
>> +			 * this, restore the cached value of the message prior
>> +			 * to enabling.
>> +			 */
>> +			struct msi_msg msg;
>> +
>> +			get_cached_msi_msg(irq, &msg);
>> +			pci_write_msi_msg(irq, &msg);
>> +		}
> 
> I don't follow when this latter branch is ever taken in the new flow.
> It's stated earlier that ctx and irq are coupled, and I believe so is
> trigger.  So if we had a previous ctx and irq (and trigger), we removed
> it and irq is now always -EINVAL here.  Thanks,

From what I understand, MSI-X can be enabled without providing any triggers.
That will result in the ctx and irq existing, but no trigger. When a trigger
is assigned later, it will run the latter branch.
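
For illustration, a minimal user space sequence that reaches that branch
could look like this (a sketch; device_fd is assumed to be an already
opened vfio device fd and error handling is omitted):

	/* Needs <linux/vfio.h>, <sys/ioctl.h>, <sys/eventfd.h>, <stdlib.h>. */
	size_t argsz = sizeof(struct vfio_irq_set) + 2 * sizeof(int32_t);
	struct vfio_irq_set *irq_set = calloc(1, argsz);
	int32_t *fds = (int32_t *)irq_set->data;

	/* Enable MSI-X with two vectors but no triggers (fd == -1). */
	irq_set->argsz = argsz;
	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
	irq_set->start = 0;
	irq_set->count = 2;
	fds[0] = fds[1] = -1;
	ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set);

	/*
	 * Later: attach an eventfd to vector 1 only. ctx and irq already
	 * exist for this vector while ctx->trigger is still NULL, so
	 * vfio_msi_set_vector_signal() takes the branch that restores the
	 * cached MSI message.
	 */
	irq_set->argsz = sizeof(struct vfio_irq_set) + sizeof(int32_t);
	irq_set->start = 1;
	irq_set->count = 1;
	fds[0] = eventfd(0, EFD_CLOEXEC);
	ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set);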

Thank you very much

Reinette


[1] https://lore.kernel.org/lkml/61296e93-6268-05cd-e876-680e07645a16@intel.com/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 8/8] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X
  2023-03-17 22:21     ` Reinette Chatre
@ 2023-03-17 23:01       ` Alex Williamson
  2023-03-20 15:49         ` Reinette Chatre
  2023-03-20 16:12         ` Jason Gunthorpe
  0 siblings, 2 replies; 20+ messages in thread
From: Alex Williamson @ 2023-03-17 23:01 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, tglx, darwi,
	kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu, tom.zanussi,
	linux-kernel

On Fri, 17 Mar 2023 15:21:09 -0700
Reinette Chatre <reinette.chatre@intel.com> wrote:

> Hi Alex,
> 
> On 3/17/2023 2:05 PM, Alex Williamson wrote:
> > On Wed, 15 Mar 2023 13:59:28 -0700
> > Reinette Chatre <reinette.chatre@intel.com> wrote:
> >   
> >> Dynamic MSI-X is supported. Clear VFIO_IRQ_INFO_NORESIZE
> >> to provide guidance to user space.
> >>
> >> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> >> ---
> >>  drivers/vfio/pci/vfio_pci_core.c | 2 +-
> >>  include/uapi/linux/vfio.h        | 3 +++
> >>  2 files changed, 4 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >> index ae0e161c7fc9..1d071ee212a7 100644
> >> --- a/drivers/vfio/pci/vfio_pci_core.c
> >> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >> @@ -1111,7 +1111,7 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
> >>  	if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
> >>  		info.flags |=
> >>  			(VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED);
> >> -	else
> >> +	else if (info.index != VFIO_PCI_MSIX_IRQ_INDEX)
> >>  		info.flags |= VFIO_IRQ_INFO_NORESIZE;
> >>    
> > 
> > I think we need to check pci_msix_can_alloc_dyn(), right?  Thanks,  
> 
> Can pci_msix_can_alloc_dyn() ever return false?
> 
> I cannot see how pci_msix_can_alloc_dyn() can return false when
> considering the VFIO PCI MSI-X flow:
> 
> vfio_msi_enable(..., ..., msix == true) 
>   pci_alloc_irq_vectors(..., ..., ..., flag == PCI_IRQ_MSIX) 
>     pci_alloc_irq_vectors_affinity() 
>       __pci_enable_msix_range() 
>         pci_setup_msix_device_domain() 
>           pci_create_device_domain(..., &pci_msix_template, ...)
> 
> The template used above, pci_msix_template, has MSI_FLAG_PCI_MSIX_ALLOC_DYN
> hardcoded. This is the flag that pci_msix_can_alloc_dyn() tests for.
> 
> If indeed pci_msix_can_alloc_dyn() can return false in the VFIO PCI MSI-X
> usage then this series needs to be reworked to continue supporting
> existing allocation mechanism as well as dynamic MSI-X allocation. Which
> allocation mechanism to use would depend on pci_msix_can_alloc_dyn().
> 
> Alternatively, if you agree that VFIO PCI MSI-X can expect
> pci_msix_can_alloc_dyn() to always return true then I should perhaps
> add a test in vfio_msi_enable() that confirms this is the case. This would
> cause vfio_msi_enable() to fail if dynamic MSI-X is not possible, perhaps even
> complain loudly with a WARN.

pci_setup_msix_device_domain() says it returns true if:

 *  True when:
 *      - The device does not have a MSI parent irq domain associated,
 *        which keeps the legacy architecture specific and the global
 *        PCI/MSI domain models working
 *      - The MSI-X domain exists already
 *      - The MSI-X domain was successfully allocated

That first one seems concerning, dynamic allocation only works on irq
domain configurations.  What does that exclude and do we care about any
of them for vfio-pci?  Minimally this suggests a dependency on
IRQ_DOMAIN, which we don't currently have, but I'm not sure if
supporting irq domains is the same as having irq domains.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 8/8] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X
  2023-03-17 23:01       ` Alex Williamson
@ 2023-03-20 15:49         ` Reinette Chatre
  2023-03-20 16:12         ` Jason Gunthorpe
  1 sibling, 0 replies; 20+ messages in thread
From: Reinette Chatre @ 2023-03-20 15:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, tglx, darwi,
	kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu, tom.zanussi,
	linux-kernel

Hi Alex,

On 3/17/2023 4:01 PM, Alex Williamson wrote:
> On Fri, 17 Mar 2023 15:21:09 -0700
> Reinette Chatre <reinette.chatre@intel.com> wrote:
> 
>> Hi Alex,
>>
>> On 3/17/2023 2:05 PM, Alex Williamson wrote:
>>> On Wed, 15 Mar 2023 13:59:28 -0700
>>> Reinette Chatre <reinette.chatre@intel.com> wrote:
>>>   
>>>> Dynamic MSI-X is supported. Clear VFIO_IRQ_INFO_NORESIZE
>>>> to provide guidance to user space.
>>>>
>>>> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
>>>> ---
>>>>  drivers/vfio/pci/vfio_pci_core.c | 2 +-
>>>>  include/uapi/linux/vfio.h        | 3 +++
>>>>  2 files changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>>>> index ae0e161c7fc9..1d071ee212a7 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_core.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>>> @@ -1111,7 +1111,7 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
>>>>  	if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
>>>>  		info.flags |=
>>>>  			(VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED);
>>>> -	else
>>>> +	else if (info.index != VFIO_PCI_MSIX_IRQ_INDEX)
>>>>  		info.flags |= VFIO_IRQ_INFO_NORESIZE;
>>>>    
>>>
>>> I think we need to check pci_msix_can_alloc_dyn(), right?  Thanks,  
>>
>> Can pci_msix_can_alloc_dyn() ever return false?
>>
>> I cannot see how pci_msix_can_alloc_dyn() can return false when
>> considering the VFIO PCI MSI-X flow:
>>
>> vfio_msi_enable(..., ..., msix == true) 
>>   pci_alloc_irq_vectors(..., ..., ..., flag == PCI_IRQ_MSIX) 
>>     pci_alloc_irq_vectors_affinity() 
>>       __pci_enable_msix_range() 
>>         pci_setup_msix_device_domain() 
>>           pci_create_device_domain(..., &pci_msix_template, ...)
>>
>> The template used above, pci_msix_template, has MSI_FLAG_PCI_MSIX_ALLOC_DYN
>> hardcoded. This is the flag that pci_msix_can_alloc_dyn() tests for.
>>
>> If indeed pci_msix_can_alloc_dyn() can return false in the VFIO PCI MSI-X
>> usage then this series needs to be reworked to continue supporting
>> existing allocation mechanism as well as dynamic MSI-X allocation. Which
>> allocation mechanism to use would depend on pci_msix_can_alloc_dyn().
>>
>> Alternatively, if you agree that VFIO PCI MSI-X can expect
>> pci_msix_can_alloc_dyn() to always return true then I should perhaps
>> add a test in vfio_msi_enable() that confirms this is the case. This would
>> cause vfio_msi_enable() to fail if dynamic MSI-X is not possible, perhaps even
>> complain loudly with a WARN.
> 
> pci_setup_msix_device_domain() says it returns true if:
> 
>  *  True when:
>  *      - The device does not have a MSI parent irq domain associated,
>  *        which keeps the legacy architecture specific and the global
>  *        PCI/MSI domain models working
>  *      - The MSI-X domain exists already
>  *      - The MSI-X domain was successfully allocated
> 
> That first one seems concerning, dynamic allocation only works on irq
> domain configurations.  What does that exclude and do we care about any
> of them for vfio-pci?  Minimally this suggests a dependency on
> IRQ_DOMAIN, which we don't currently have, but I'm not sure if
> supporting irq domains is the same as having irq domains.  Thanks,

Just to confirm, as I mentioned in [1] I do plan to rework this solution
to support both allocation mechanisms, using pci_msix_can_alloc_dyn()
to pick which one to use. Thank you very much for pointing out this
gap to me.
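
For the VFIO_DEVICE_GET_IRQ_INFO side that could end up looking something
like the sketch below, with a matching pci_msix_can_alloc_dyn() check
guarding the pci_msix_alloc_irq_at() path in vfio_msi_set_vector_signal()
(the final patch may structure this differently):

	if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
		info.flags |=
			(VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED);
	else if (info.index != VFIO_PCI_MSIX_IRQ_INDEX ||
		 !pci_msix_can_alloc_dyn(vdev->pdev))
		info.flags |= VFIO_IRQ_INFO_NORESIZE;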

Reinette

[1] https://lore.kernel.org/lkml/e2d3f5a6-0a36-f19d-8f19-748197c3308d@intel.com/



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 8/8] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X
  2023-03-17 23:01       ` Alex Williamson
  2023-03-20 15:49         ` Reinette Chatre
@ 2023-03-20 16:12         ` Jason Gunthorpe
  1 sibling, 0 replies; 20+ messages in thread
From: Jason Gunthorpe @ 2023-03-20 16:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Reinette Chatre, yishaih, shameerali.kolothum.thodi, kevin.tian,
	tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, linux-kernel

On Fri, Mar 17, 2023 at 05:01:49PM -0600, Alex Williamson wrote:

> pci_setup_msix_device_domain() says it returns true if:
> 
>  *  True when:
>  *      - The device does not have a MSI parent irq domain associated,
>  *        which keeps the legacy architecture specific and the global
>  *        PCI/MSI domain models working
>  *      - The MSI-X domain exists already
>  *      - The MSI-X domain was successfully allocated
> 
> That first one seems concerning, dynamic allocation only works on irq
> domain configurations.  What does that exclude and do we care about any
> of them for vfio-pci?  

Several archs and other weird things override the MSI-X programming, e.g.
by turning it into a hypervisor call or something. These were not
converted to dynamic mode.

So at the VFIO level you can end up with MSI support that is done
through legacy paths that don't support dynamic allocation, e.g. on
POWER, Xen, etc.

Jason

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-03-20 16:21 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-15 20:59 [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
2023-03-15 20:59 ` [RFC PATCH 1/8] vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable Reinette Chatre
2023-03-15 20:59 ` [RFC PATCH 2/8] vfio/pci: Remove negative check on unsigned vector Reinette Chatre
2023-03-15 20:59 ` [RFC PATCH 3/8] vfio/pci: Prepare for dynamic interrupt context storage Reinette Chatre
2023-03-15 20:59 ` [RFC PATCH 4/8] vfio/pci: Use xarray for " Reinette Chatre
2023-03-15 20:59 ` [RFC PATCH 5/8] vfio/pci: Remove interrupt context counter Reinette Chatre
2023-03-15 20:59 ` [RFC PATCH 6/8] vfio/pci: Move to single error path Reinette Chatre
2023-03-15 20:59 ` [RFC PATCH 7/8] vfio/pci: Support dynamic MSI-x Reinette Chatre
2023-03-17 21:58   ` Alex Williamson
2023-03-17 22:54     ` Reinette Chatre
2023-03-15 20:59 ` [RFC PATCH 8/8] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X Reinette Chatre
2023-03-17 21:05   ` Alex Williamson
2023-03-17 22:21     ` Reinette Chatre
2023-03-17 23:01       ` Alex Williamson
2023-03-20 15:49         ` Reinette Chatre
2023-03-20 16:12         ` Jason Gunthorpe
2023-03-16 21:56 ` [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Alex Williamson
2023-03-16 23:38   ` Reinette Chatre
2023-03-16 23:52     ` Tian, Kevin
2023-03-17 16:48       ` Reinette Chatre
