linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH V4 00/11] vfio/pci: Support dynamic allocation of MSI-X interrupts
@ 2023-04-27 17:35 Reinette Chatre
  2023-04-27 17:35 ` [PATCH V4 01/11] vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable Reinette Chatre
                   ` (10 more replies)
  0 siblings, 11 replies; 33+ messages in thread
From: Reinette Chatre @ 2023-04-27 17:35 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

Changes since V3:
- V3: https://lore.kernel.org/lkml/cover.1681837892.git.reinette.chatre@intel.com/
- Be considerate about layout and size with changes to
  struct vfio_pci_core_device. Keep flags together and transition all to
  use bitfields. (Alex and Jason)
- Do not free dynamically allocated interrupts on error path. (Alex)
- Please refer to individual patches for localized changes.

Changes since V2:
- V2: https://lore.kernel.org/lkml/cover.1680038771.git.reinette.chatre@intel.com/
- During testing of V2 "kernel test robot" reported issues resulting from
  include/linux/pci.h missing a stub for pci_msix_can_alloc_dyn() when
  CONFIG_PCI_MSI=n. A separate fix was sent to address this. The fix can
  be found in the kernel (since v6.3-rc7) as
  commit 195d8e5da3ac ("PCI/MSI: Provide missing stub for pci_msix_can_alloc_dyn()")
- Biggest change is the transition to "active contexts" for both MSI and MSI-X.
  Interrupt contexts have always been allocated when the interrupts are
  allocated while they are only used while interrupts are
  enabled. In this series interrupt contexts are made dynamic, while doing
  so their allocation is moved to match how they are used: allocated when
  interrupts are enabled. Whether a Linux interrupt number exists determines
  whether an interrupt can be enabled.
  Previous policy (up to V2) that an allocated interrupt has an interrupt
  context no longer applies. Instead, an interrupt context has a
  handler/trigger, aka "active contexts". (Alex)
- Re-ordered patches in support of "active contexts".
- Only free interrupts on MSI-X teardown and otherwise use the
  allocated interrupts as a cache. (Alex)
- Using unsigned int for the vector broke the unwind loop within
  vfio_msi_set_block(). (Alex)
- Introduce new "has_dyn_msix" property of virtual device instead of
  querying support every time. (Alex)
- Some smaller changes, please refer to individual patches.

Changes since RFC V1:
- RFC V1: https://lore.kernel.org/lkml/cover.1678911529.git.reinette.chatre@intel.com/
- Improved changelogs.
- Simplify interface so that vfio_irq_ctx_alloc_single() returns pointer to
  allocated context. (Alex)
- Remove vfio_irq_ctx_range_allocated() and associated attempts to maintain
  invalid error path behavior. (Alex and Kevin)
- Add pointer to interrupt context as function parameter to
  vfio_irq_ctx_free(). (Alex)
- Ensure variables are initialized. (Dan Carpenter)
- Only support dynamic allocation if device supports it. (Alex)

Qemu allocates interrupts incrementally at the time the guest unmasks an
interrupt, for example each time a Linux guest runs request_irq().

Dynamic allocation of MSI-X interrupts was not possible until v6.2 [1].
This prompted Qemu to, when allocating a new interrupt, first release all
previously allocated interrupts (including disable of MSI-X) followed
by re-allocation of all interrupts that includes the new interrupt.
Please see [2] for a detailed discussion about this issue.

Releasing and re-allocating interrupts may be acceptable if all
interrupts are unmasked during device initialization. If unmasking of
interrupts occur during runtime this may result in lost interrupts.
For example, consider an accelerator device with multiple work queues,
each work queue having a dedicated interrupt. A work queue can be
enabled at any time with its associated interrupt unmasked while other
work queues are already active. Having all interrupts released and MSI-X
disabled to enable the new work queue will impact active work queues.

This series builds on the recent interrupt sub-system core changes
that added support for dynamic MSI-X allocation after initial MSI-X
enabling.

Add support for dynamic MSI-X allocation to vfio-pci. A flag
indicating lack of support for dynamic allocation already exist:
VFIO_IRQ_INFO_NORESIZE and has always been set for MSI and MSI-X. With
support for dynamic MSI-X the flag is cleared for MSI-X when supported,
enabling Qemu to modify its behavior.

Any feedback is appreciated

Reinette

[1] commit 34026364df8e ("PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X")
[2] https://lore.kernel.org/kvm/MWHPR11MB188603D0D809C1079F5817DC8C099@MWHPR11MB1886.namprd11.prod.outlook.com/#t

Reinette Chatre (11):
  vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable
  vfio/pci: Remove negative check on unsigned vector
  vfio/pci: Prepare for dynamic interrupt context storage
  vfio/pci: Move to single error path
  vfio/pci: Use xarray for interrupt context storage
  vfio/pci: Remove interrupt context counter
  vfio/pci: Update stale comment
  vfio/pci: Use bitfield for struct vfio_pci_core_device flags
  vfio/pci: Probe and store ability to support dynamic MSI-X
  vfio/pci: Support dynamic MSI-X
  vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X

 drivers/vfio/pci/vfio_pci_core.c  |   8 +-
 drivers/vfio/pci/vfio_pci_intrs.c | 298 ++++++++++++++++++++----------
 include/linux/vfio_pci_core.h     |  26 +--
 include/uapi/linux/vfio.h         |   3 +
 4 files changed, 221 insertions(+), 114 deletions(-)


base-commit: 6a8f57ae2eb07ab39a6f0ccad60c760743051026
-- 
2.34.1


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH V4 01/11] vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable
  2023-04-27 17:35 [PATCH V4 00/11] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
@ 2023-04-27 17:35 ` Reinette Chatre
  2023-04-28  6:28   ` Tian, Kevin
  2023-04-27 17:35 ` [PATCH V4 02/11] vfio/pci: Remove negative check on unsigned vector Reinette Chatre
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Reinette Chatre @ 2023-04-27 17:35 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

vfio_msi_disable() releases all previously allocated state
associated with each interrupt before disabling MSI/MSI-X.

vfio_msi_disable() iterates twice over the interrupt state:
first directly with a for loop to do virqfd cleanup, followed
by another for loop within vfio_msi_set_block() that removes
the interrupt handler and its associated state using
vfio_msi_set_vector_signal().

Simplify interrupt cleanup by iterating over allocated interrupts
once.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
No changes since V3.

Changes since V2:
- Improve accuracy of changelog.

 drivers/vfio/pci/vfio_pci_intrs.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index bffb0741518b..6a9c6a143cc3 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -426,10 +426,9 @@ static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 	for (i = 0; i < vdev->num_ctx; i++) {
 		vfio_virqfd_disable(&vdev->ctx[i].unmask);
 		vfio_virqfd_disable(&vdev->ctx[i].mask);
+		vfio_msi_set_vector_signal(vdev, i, -1, msix);
 	}
 
-	vfio_msi_set_block(vdev, 0, vdev->num_ctx, NULL, msix);
-
 	cmd = vfio_pci_memory_lock_and_enable(vdev);
 	pci_free_irq_vectors(pdev);
 	vfio_pci_memory_unlock_and_restore(vdev, cmd);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH V4 02/11] vfio/pci: Remove negative check on unsigned vector
  2023-04-27 17:35 [PATCH V4 00/11] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
  2023-04-27 17:35 ` [PATCH V4 01/11] vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable Reinette Chatre
@ 2023-04-27 17:35 ` Reinette Chatre
  2023-04-28  6:29   ` Tian, Kevin
  2023-04-27 17:36 ` [PATCH V4 03/11] vfio/pci: Prepare for dynamic interrupt context storage Reinette Chatre
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Reinette Chatre @ 2023-04-27 17:35 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

User space provides the vector as an unsigned int that is checked
early for validity (vfio_set_irqs_validate_and_prepare()).

A later negative check of the provided vector is not necessary.

Remove the negative check and ensure the type used
for the vector is consistent as an unsigned int.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
No changes since V3.

Changes since V2:
- Rework unwind loop within vfio_msi_set_block() that required j
  to be an int. Rework results in both i and j used for vector, both
  moved to be unsigned int. (Alex)

 drivers/vfio/pci/vfio_pci_intrs.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 6a9c6a143cc3..258de57ef956 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -317,14 +317,14 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
 }
 
 static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
-				      int vector, int fd, bool msix)
+				      unsigned int vector, int fd, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	struct eventfd_ctx *trigger;
 	int irq, ret;
 	u16 cmd;
 
-	if (vector < 0 || vector >= vdev->num_ctx)
+	if (vector >= vdev->num_ctx)
 		return -EINVAL;
 
 	irq = pci_irq_vector(pdev, vector);
@@ -399,7 +399,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
 			      unsigned count, int32_t *fds, bool msix)
 {
-	int i, j, ret = 0;
+	unsigned int i, j;
+	int ret = 0;
 
 	if (start >= vdev->num_ctx || start + count > vdev->num_ctx)
 		return -EINVAL;
@@ -410,8 +411,8 @@ static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
 	}
 
 	if (ret) {
-		for (--j; j >= (int)start; j--)
-			vfio_msi_set_vector_signal(vdev, j, -1, msix);
+		for (i = start; i < j; i++)
+			vfio_msi_set_vector_signal(vdev, i, -1, msix);
 	}
 
 	return ret;
@@ -420,7 +421,7 @@ static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
 static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
-	int i;
+	unsigned int i;
 	u16 cmd;
 
 	for (i = 0; i < vdev->num_ctx; i++) {
@@ -542,7 +543,7 @@ static int vfio_pci_set_msi_trigger(struct vfio_pci_core_device *vdev,
 				    unsigned index, unsigned start,
 				    unsigned count, uint32_t flags, void *data)
 {
-	int i;
+	unsigned int i;
 	bool msix = (index == VFIO_PCI_MSIX_IRQ_INDEX) ? true : false;
 
 	if (irq_is(vdev, index) && !count && (flags & VFIO_IRQ_SET_DATA_NONE)) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH V4 03/11] vfio/pci: Prepare for dynamic interrupt context storage
  2023-04-27 17:35 [PATCH V4 00/11] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
  2023-04-27 17:35 ` [PATCH V4 01/11] vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable Reinette Chatre
  2023-04-27 17:35 ` [PATCH V4 02/11] vfio/pci: Remove negative check on unsigned vector Reinette Chatre
@ 2023-04-27 17:36 ` Reinette Chatre
  2023-04-28  6:33   ` Tian, Kevin
  2023-04-27 17:36 ` [PATCH V4 04/11] vfio/pci: Move to single error path Reinette Chatre
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Reinette Chatre @ 2023-04-27 17:36 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

Interrupt context storage is statically allocated at the time
interrupts are allocated. Following allocation, the interrupt
context is managed by directly accessing the elements of the
array using the vector as index.

It is possible to allocate additional MSI-X vectors after
MSI-X has been enabled. Dynamic storage of interrupt context
is needed to support adding new MSI-X vectors after initial
allocation.

Replace direct access of array elements with pointers to the
array elements. Doing so reduces impact of moving to a new data
structure. Move interactions with the array to helpers to
mostly contain changes needed to transition to a dynamic
data structure.

No functional change intended.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
No changes since V2.

Changes since RFC V1:
- Improve accuracy of changelog.

 drivers/vfio/pci/vfio_pci_intrs.c | 206 ++++++++++++++++++++----------
 1 file changed, 140 insertions(+), 66 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 258de57ef956..b664fbb6d2f2 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -48,6 +48,31 @@ static bool is_irq_none(struct vfio_pci_core_device *vdev)
 		 vdev->irq_type == VFIO_PCI_MSIX_IRQ_INDEX);
 }
 
+static
+struct vfio_pci_irq_ctx *vfio_irq_ctx_get(struct vfio_pci_core_device *vdev,
+					  unsigned long index)
+{
+	if (index >= vdev->num_ctx)
+		return NULL;
+	return &vdev->ctx[index];
+}
+
+static void vfio_irq_ctx_free_all(struct vfio_pci_core_device *vdev)
+{
+	kfree(vdev->ctx);
+}
+
+static int vfio_irq_ctx_alloc_num(struct vfio_pci_core_device *vdev,
+				  unsigned long num)
+{
+	vdev->ctx = kcalloc(num, sizeof(struct vfio_pci_irq_ctx),
+			    GFP_KERNEL_ACCOUNT);
+	if (!vdev->ctx)
+		return -ENOMEM;
+
+	return 0;
+}
+
 /*
  * INTx
  */
@@ -55,17 +80,28 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
 {
 	struct vfio_pci_core_device *vdev = opaque;
 
-	if (likely(is_intx(vdev) && !vdev->virq_disabled))
-		eventfd_signal(vdev->ctx[0].trigger, 1);
+	if (likely(is_intx(vdev) && !vdev->virq_disabled)) {
+		struct vfio_pci_irq_ctx *ctx;
+
+		ctx = vfio_irq_ctx_get(vdev, 0);
+		if (!ctx)
+			return;
+		eventfd_signal(ctx->trigger, 1);
+	}
 }
 
 /* Returns true if the INTx vfio_pci_irq_ctx.masked value is changed. */
 bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_irq_ctx *ctx;
 	unsigned long flags;
 	bool masked_changed = false;
 
+	ctx = vfio_irq_ctx_get(vdev, 0);
+	if (!ctx)
+		return masked_changed;
+
 	spin_lock_irqsave(&vdev->irqlock, flags);
 
 	/*
@@ -77,7 +113,7 @@ bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 	if (unlikely(!is_intx(vdev))) {
 		if (vdev->pci_2_3)
 			pci_intx(pdev, 0);
-	} else if (!vdev->ctx[0].masked) {
+	} else if (!ctx->masked) {
 		/*
 		 * Can't use check_and_mask here because we always want to
 		 * mask, not just when something is pending.
@@ -87,7 +123,7 @@ bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 		else
 			disable_irq_nosync(pdev->irq);
 
-		vdev->ctx[0].masked = true;
+		ctx->masked = true;
 		masked_changed = true;
 	}
 
@@ -105,9 +141,14 @@ static int vfio_pci_intx_unmask_handler(void *opaque, void *unused)
 {
 	struct vfio_pci_core_device *vdev = opaque;
 	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_irq_ctx *ctx;
 	unsigned long flags;
 	int ret = 0;
 
+	ctx = vfio_irq_ctx_get(vdev, 0);
+	if (!ctx)
+		return ret;
+
 	spin_lock_irqsave(&vdev->irqlock, flags);
 
 	/*
@@ -117,7 +158,7 @@ static int vfio_pci_intx_unmask_handler(void *opaque, void *unused)
 	if (unlikely(!is_intx(vdev))) {
 		if (vdev->pci_2_3)
 			pci_intx(pdev, 1);
-	} else if (vdev->ctx[0].masked && !vdev->virq_disabled) {
+	} else if (ctx->masked && !vdev->virq_disabled) {
 		/*
 		 * A pending interrupt here would immediately trigger,
 		 * but we can avoid that overhead by just re-sending
@@ -129,7 +170,7 @@ static int vfio_pci_intx_unmask_handler(void *opaque, void *unused)
 		} else
 			enable_irq(pdev->irq);
 
-		vdev->ctx[0].masked = (ret > 0);
+		ctx->masked = (ret > 0);
 	}
 
 	spin_unlock_irqrestore(&vdev->irqlock, flags);
@@ -146,18 +187,23 @@ void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev)
 static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
 {
 	struct vfio_pci_core_device *vdev = dev_id;
+	struct vfio_pci_irq_ctx *ctx;
 	unsigned long flags;
 	int ret = IRQ_NONE;
 
+	ctx = vfio_irq_ctx_get(vdev, 0);
+	if (!ctx)
+		return ret;
+
 	spin_lock_irqsave(&vdev->irqlock, flags);
 
 	if (!vdev->pci_2_3) {
 		disable_irq_nosync(vdev->pdev->irq);
-		vdev->ctx[0].masked = true;
+		ctx->masked = true;
 		ret = IRQ_HANDLED;
-	} else if (!vdev->ctx[0].masked &&  /* may be shared */
+	} else if (!ctx->masked &&  /* may be shared */
 		   pci_check_and_mask_intx(vdev->pdev)) {
-		vdev->ctx[0].masked = true;
+		ctx->masked = true;
 		ret = IRQ_HANDLED;
 	}
 
@@ -171,15 +217,24 @@ static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
 
 static int vfio_intx_enable(struct vfio_pci_core_device *vdev)
 {
+	struct vfio_pci_irq_ctx *ctx;
+	int ret;
+
 	if (!is_irq_none(vdev))
 		return -EINVAL;
 
 	if (!vdev->pdev->irq)
 		return -ENODEV;
 
-	vdev->ctx = kzalloc(sizeof(struct vfio_pci_irq_ctx), GFP_KERNEL_ACCOUNT);
-	if (!vdev->ctx)
-		return -ENOMEM;
+	ret = vfio_irq_ctx_alloc_num(vdev, 1);
+	if (ret)
+		return ret;
+
+	ctx = vfio_irq_ctx_get(vdev, 0);
+	if (!ctx) {
+		vfio_irq_ctx_free_all(vdev);
+		return -EINVAL;
+	}
 
 	vdev->num_ctx = 1;
 
@@ -189,9 +244,9 @@ static int vfio_intx_enable(struct vfio_pci_core_device *vdev)
 	 * here, non-PCI-2.3 devices will have to wait until the
 	 * interrupt is enabled.
 	 */
-	vdev->ctx[0].masked = vdev->virq_disabled;
+	ctx->masked = vdev->virq_disabled;
 	if (vdev->pci_2_3)
-		pci_intx(vdev->pdev, !vdev->ctx[0].masked);
+		pci_intx(vdev->pdev, !ctx->masked);
 
 	vdev->irq_type = VFIO_PCI_INTX_IRQ_INDEX;
 
@@ -202,41 +257,46 @@ static int vfio_intx_set_signal(struct vfio_pci_core_device *vdev, int fd)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	unsigned long irqflags = IRQF_SHARED;
+	struct vfio_pci_irq_ctx *ctx;
 	struct eventfd_ctx *trigger;
 	unsigned long flags;
 	int ret;
 
-	if (vdev->ctx[0].trigger) {
+	ctx = vfio_irq_ctx_get(vdev, 0);
+	if (!ctx)
+		return -EINVAL;
+
+	if (ctx->trigger) {
 		free_irq(pdev->irq, vdev);
-		kfree(vdev->ctx[0].name);
-		eventfd_ctx_put(vdev->ctx[0].trigger);
-		vdev->ctx[0].trigger = NULL;
+		kfree(ctx->name);
+		eventfd_ctx_put(ctx->trigger);
+		ctx->trigger = NULL;
 	}
 
 	if (fd < 0) /* Disable only */
 		return 0;
 
-	vdev->ctx[0].name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-intx(%s)",
-				      pci_name(pdev));
-	if (!vdev->ctx[0].name)
+	ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-intx(%s)",
+			      pci_name(pdev));
+	if (!ctx->name)
 		return -ENOMEM;
 
 	trigger = eventfd_ctx_fdget(fd);
 	if (IS_ERR(trigger)) {
-		kfree(vdev->ctx[0].name);
+		kfree(ctx->name);
 		return PTR_ERR(trigger);
 	}
 
-	vdev->ctx[0].trigger = trigger;
+	ctx->trigger = trigger;
 
 	if (!vdev->pci_2_3)
 		irqflags = 0;
 
 	ret = request_irq(pdev->irq, vfio_intx_handler,
-			  irqflags, vdev->ctx[0].name, vdev);
+			  irqflags, ctx->name, vdev);
 	if (ret) {
-		vdev->ctx[0].trigger = NULL;
-		kfree(vdev->ctx[0].name);
+		ctx->trigger = NULL;
+		kfree(ctx->name);
 		eventfd_ctx_put(trigger);
 		return ret;
 	}
@@ -246,7 +306,7 @@ static int vfio_intx_set_signal(struct vfio_pci_core_device *vdev, int fd)
 	 * disable_irq won't.
 	 */
 	spin_lock_irqsave(&vdev->irqlock, flags);
-	if (!vdev->pci_2_3 && vdev->ctx[0].masked)
+	if (!vdev->pci_2_3 && ctx->masked)
 		disable_irq_nosync(pdev->irq);
 	spin_unlock_irqrestore(&vdev->irqlock, flags);
 
@@ -255,12 +315,17 @@ static int vfio_intx_set_signal(struct vfio_pci_core_device *vdev, int fd)
 
 static void vfio_intx_disable(struct vfio_pci_core_device *vdev)
 {
-	vfio_virqfd_disable(&vdev->ctx[0].unmask);
-	vfio_virqfd_disable(&vdev->ctx[0].mask);
+	struct vfio_pci_irq_ctx *ctx;
+
+	ctx = vfio_irq_ctx_get(vdev, 0);
+	if (ctx) {
+		vfio_virqfd_disable(&ctx->unmask);
+		vfio_virqfd_disable(&ctx->mask);
+	}
 	vfio_intx_set_signal(vdev, -1);
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
 	vdev->num_ctx = 0;
-	kfree(vdev->ctx);
+	vfio_irq_ctx_free_all(vdev);
 }
 
 /*
@@ -284,10 +349,9 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
 	if (!is_irq_none(vdev))
 		return -EINVAL;
 
-	vdev->ctx = kcalloc(nvec, sizeof(struct vfio_pci_irq_ctx),
-			    GFP_KERNEL_ACCOUNT);
-	if (!vdev->ctx)
-		return -ENOMEM;
+	ret = vfio_irq_ctx_alloc_num(vdev, nvec);
+	if (ret)
+		return ret;
 
 	/* return the number of supported vectors if we can't get all: */
 	cmd = vfio_pci_memory_lock_and_enable(vdev);
@@ -296,7 +360,7 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
 		if (ret > 0)
 			pci_free_irq_vectors(pdev);
 		vfio_pci_memory_unlock_and_restore(vdev, cmd);
-		kfree(vdev->ctx);
+		vfio_irq_ctx_free_all(vdev);
 		return ret;
 	}
 	vfio_pci_memory_unlock_and_restore(vdev, cmd);
@@ -320,6 +384,7 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 				      unsigned int vector, int fd, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_irq_ctx *ctx;
 	struct eventfd_ctx *trigger;
 	int irq, ret;
 	u16 cmd;
@@ -327,33 +392,33 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 	if (vector >= vdev->num_ctx)
 		return -EINVAL;
 
+	ctx = vfio_irq_ctx_get(vdev, vector);
+	if (!ctx)
+		return -EINVAL;
 	irq = pci_irq_vector(pdev, vector);
 
-	if (vdev->ctx[vector].trigger) {
-		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
+	if (ctx->trigger) {
+		irq_bypass_unregister_producer(&ctx->producer);
 
 		cmd = vfio_pci_memory_lock_and_enable(vdev);
-		free_irq(irq, vdev->ctx[vector].trigger);
+		free_irq(irq, ctx->trigger);
 		vfio_pci_memory_unlock_and_restore(vdev, cmd);
-
-		kfree(vdev->ctx[vector].name);
-		eventfd_ctx_put(vdev->ctx[vector].trigger);
-		vdev->ctx[vector].trigger = NULL;
+		kfree(ctx->name);
+		eventfd_ctx_put(ctx->trigger);
+		ctx->trigger = NULL;
 	}
 
 	if (fd < 0)
 		return 0;
 
-	vdev->ctx[vector].name = kasprintf(GFP_KERNEL_ACCOUNT,
-					   "vfio-msi%s[%d](%s)",
-					   msix ? "x" : "", vector,
-					   pci_name(pdev));
-	if (!vdev->ctx[vector].name)
+	ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-msi%s[%d](%s)",
+			      msix ? "x" : "", vector, pci_name(pdev));
+	if (!ctx->name)
 		return -ENOMEM;
 
 	trigger = eventfd_ctx_fdget(fd);
 	if (IS_ERR(trigger)) {
-		kfree(vdev->ctx[vector].name);
+		kfree(ctx->name);
 		return PTR_ERR(trigger);
 	}
 
@@ -372,26 +437,25 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 		pci_write_msi_msg(irq, &msg);
 	}
 
-	ret = request_irq(irq, vfio_msihandler, 0,
-			  vdev->ctx[vector].name, trigger);
+	ret = request_irq(irq, vfio_msihandler, 0, ctx->name, trigger);
 	vfio_pci_memory_unlock_and_restore(vdev, cmd);
 	if (ret) {
-		kfree(vdev->ctx[vector].name);
+		kfree(ctx->name);
 		eventfd_ctx_put(trigger);
 		return ret;
 	}
 
-	vdev->ctx[vector].producer.token = trigger;
-	vdev->ctx[vector].producer.irq = irq;
-	ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
+	ctx->producer.token = trigger;
+	ctx->producer.irq = irq;
+	ret = irq_bypass_register_producer(&ctx->producer);
 	if (unlikely(ret)) {
 		dev_info(&pdev->dev,
 		"irq bypass producer (token %p) registration fails: %d\n",
-		vdev->ctx[vector].producer.token, ret);
+		ctx->producer.token, ret);
 
-		vdev->ctx[vector].producer.token = NULL;
+		ctx->producer.token = NULL;
 	}
-	vdev->ctx[vector].trigger = trigger;
+	ctx->trigger = trigger;
 
 	return 0;
 }
@@ -421,13 +485,17 @@ static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
 static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_irq_ctx *ctx;
 	unsigned int i;
 	u16 cmd;
 
 	for (i = 0; i < vdev->num_ctx; i++) {
-		vfio_virqfd_disable(&vdev->ctx[i].unmask);
-		vfio_virqfd_disable(&vdev->ctx[i].mask);
-		vfio_msi_set_vector_signal(vdev, i, -1, msix);
+		ctx = vfio_irq_ctx_get(vdev, i);
+		if (ctx) {
+			vfio_virqfd_disable(&ctx->unmask);
+			vfio_virqfd_disable(&ctx->mask);
+			vfio_msi_set_vector_signal(vdev, i, -1, msix);
+		}
 	}
 
 	cmd = vfio_pci_memory_lock_and_enable(vdev);
@@ -443,7 +511,7 @@ static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
 	vdev->num_ctx = 0;
-	kfree(vdev->ctx);
+	vfio_irq_ctx_free_all(vdev);
 }
 
 /*
@@ -463,14 +531,18 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_core_device *vdev,
 		if (unmask)
 			vfio_pci_intx_unmask(vdev);
 	} else if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+		struct vfio_pci_irq_ctx *ctx = vfio_irq_ctx_get(vdev, 0);
 		int32_t fd = *(int32_t *)data;
+
+		if (!ctx)
+			return -EINVAL;
 		if (fd >= 0)
 			return vfio_virqfd_enable((void *) vdev,
 						  vfio_pci_intx_unmask_handler,
 						  vfio_send_intx_eventfd, NULL,
-						  &vdev->ctx[0].unmask, fd);
+						  &ctx->unmask, fd);
 
-		vfio_virqfd_disable(&vdev->ctx[0].unmask);
+		vfio_virqfd_disable(&ctx->unmask);
 	}
 
 	return 0;
@@ -543,6 +615,7 @@ static int vfio_pci_set_msi_trigger(struct vfio_pci_core_device *vdev,
 				    unsigned index, unsigned start,
 				    unsigned count, uint32_t flags, void *data)
 {
+	struct vfio_pci_irq_ctx *ctx;
 	unsigned int i;
 	bool msix = (index == VFIO_PCI_MSIX_IRQ_INDEX) ? true : false;
 
@@ -577,14 +650,15 @@ static int vfio_pci_set_msi_trigger(struct vfio_pci_core_device *vdev,
 		return -EINVAL;
 
 	for (i = start; i < start + count; i++) {
-		if (!vdev->ctx[i].trigger)
+		ctx = vfio_irq_ctx_get(vdev, i);
+		if (!ctx || !ctx->trigger)
 			continue;
 		if (flags & VFIO_IRQ_SET_DATA_NONE) {
-			eventfd_signal(vdev->ctx[i].trigger, 1);
+			eventfd_signal(ctx->trigger, 1);
 		} else if (flags & VFIO_IRQ_SET_DATA_BOOL) {
 			uint8_t *bools = data;
 			if (bools[i - start])
-				eventfd_signal(vdev->ctx[i].trigger, 1);
+				eventfd_signal(ctx->trigger, 1);
 		}
 	}
 	return 0;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH V4 04/11] vfio/pci: Move to single error path
  2023-04-27 17:35 [PATCH V4 00/11] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
                   ` (2 preceding siblings ...)
  2023-04-27 17:36 ` [PATCH V4 03/11] vfio/pci: Prepare for dynamic interrupt context storage Reinette Chatre
@ 2023-04-27 17:36 ` Reinette Chatre
  2023-04-28  6:34   ` Tian, Kevin
  2023-04-27 17:36 ` [PATCH V4 05/11] vfio/pci: Use xarray for interrupt context storage Reinette Chatre
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Reinette Chatre @ 2023-04-27 17:36 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

Enabling and disabling of an interrupt involves several steps
that can fail. Cleanup after failure is done when the error
is encountered, resulting in some repetitive code.

Support for dynamic contexts will introduce more steps during
interrupt enabling and disabling.

Transition to centralized exit path in preparation for dynamic
contexts to eliminate duplicate error handling code.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
No changes since V3.

Changes since V2:
- Move patch to earlier in series in support of the change to
  dynamic context management.
- Do not add the "ctx->name = NULL" in error path. It is not done
  in baseline and will not be needed when transitioning to
  dynamic context management.
- Update changelog to not make this change specific to dynamic
  MSI-X.

Changes since RFC V1:
- Improve changelog.

 drivers/vfio/pci/vfio_pci_intrs.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index b664fbb6d2f2..9e17e59a4d60 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -418,8 +418,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 
 	trigger = eventfd_ctx_fdget(fd);
 	if (IS_ERR(trigger)) {
-		kfree(ctx->name);
-		return PTR_ERR(trigger);
+		ret = PTR_ERR(trigger);
+		goto out_free_name;
 	}
 
 	/*
@@ -439,11 +439,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 
 	ret = request_irq(irq, vfio_msihandler, 0, ctx->name, trigger);
 	vfio_pci_memory_unlock_and_restore(vdev, cmd);
-	if (ret) {
-		kfree(ctx->name);
-		eventfd_ctx_put(trigger);
-		return ret;
-	}
+	if (ret)
+		goto out_put_eventfd_ctx;
 
 	ctx->producer.token = trigger;
 	ctx->producer.irq = irq;
@@ -458,6 +455,12 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 	ctx->trigger = trigger;
 
 	return 0;
+
+out_put_eventfd_ctx:
+	eventfd_ctx_put(trigger);
+out_free_name:
+	kfree(ctx->name);
+	return ret;
 }
 
 static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH V4 05/11] vfio/pci: Use xarray for interrupt context storage
  2023-04-27 17:35 [PATCH V4 00/11] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
                   ` (3 preceding siblings ...)
  2023-04-27 17:36 ` [PATCH V4 04/11] vfio/pci: Move to single error path Reinette Chatre
@ 2023-04-27 17:36 ` Reinette Chatre
  2023-04-28  6:35   ` Tian, Kevin
  2023-04-27 17:36 ` [PATCH V4 06/11] vfio/pci: Remove interrupt context counter Reinette Chatre
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Reinette Chatre @ 2023-04-27 17:36 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

Interrupt context is statically allocated at the time interrupts
are allocated. Following allocation, the context is managed by
directly accessing the elements of the array using the vector
as index. The storage is released when interrupts are disabled.

It is possible to dynamically allocate a single MSI-X interrupt
after MSI-X is enabled. A dynamic storage for interrupt context
is needed to support this. Replace the interrupt context array with an
xarray (similar to what the core uses as store for MSI descriptors)
that can support the dynamic expansion while maintaining the
custom that uses the vector as index.

With a dynamic storage it is no longer required to pre-allocate
interrupt contexts at the time the interrupts are allocated.
MSI and MSI-X interrupt contexts are only used when interrupts are
enabled. Their allocation can thus be delayed until interrupt enabling.
Only enabled interrupts will have associated interrupt contexts.
Whether an interrupt has been allocated (a Linux irq number exists
for it) becomes the criteria for whether an interrupt can be enabled.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/lkml/20230404122444.59e36a99.alex.williamson@redhat.com/
---
No changes since V3.

Changes since V2:
- Only allocate contexts as they are used, or "active". (Alex)
- Move vfio_irq_ctx_free() from later patch to prevent open-coding
  the same within vfio_irq_ctx_free_all(). This evolved into
  vfio_irq_ctx_free() used for dynamic context allocation and
  vfio_irq_ctx_free_all() removed because of it. (Alex)
- With vfio_irq_ctx_alloc_num() removed, rename vfio_irq_ctx_alloc_single()
  to vfio_irq_ctx_alloc().

Changes since RFC V1:
- Let vfio_irq_ctx_alloc_single() return pointer to allocated
  context. (Alex)
- Transition INTx allocation to simplified vfio_irq_ctx_alloc_single().
- Improve accuracy of changelog.

 drivers/vfio/pci/vfio_pci_core.c  |  1 +
 drivers/vfio/pci/vfio_pci_intrs.c | 91 ++++++++++++++++---------------
 include/linux/vfio_pci_core.h     |  2 +-
 3 files changed, 48 insertions(+), 46 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..ae0e161c7fc9 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2102,6 +2102,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
 	INIT_LIST_HEAD(&vdev->vma_list);
 	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
 	init_rwsem(&vdev->memory_lock);
+	xa_init(&vdev->ctx);
 
 	return 0;
 }
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 9e17e59a4d60..117cd384b3ad 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -52,25 +52,33 @@ static
 struct vfio_pci_irq_ctx *vfio_irq_ctx_get(struct vfio_pci_core_device *vdev,
 					  unsigned long index)
 {
-	if (index >= vdev->num_ctx)
-		return NULL;
-	return &vdev->ctx[index];
+	return xa_load(&vdev->ctx, index);
 }
 
-static void vfio_irq_ctx_free_all(struct vfio_pci_core_device *vdev)
+static void vfio_irq_ctx_free(struct vfio_pci_core_device *vdev,
+			      struct vfio_pci_irq_ctx *ctx, unsigned long index)
 {
-	kfree(vdev->ctx);
+	xa_erase(&vdev->ctx, index);
+	kfree(ctx);
 }
 
-static int vfio_irq_ctx_alloc_num(struct vfio_pci_core_device *vdev,
-				  unsigned long num)
+static struct vfio_pci_irq_ctx *
+vfio_irq_ctx_alloc(struct vfio_pci_core_device *vdev, unsigned long index)
 {
-	vdev->ctx = kcalloc(num, sizeof(struct vfio_pci_irq_ctx),
-			    GFP_KERNEL_ACCOUNT);
-	if (!vdev->ctx)
-		return -ENOMEM;
+	struct vfio_pci_irq_ctx *ctx;
+	int ret;
 
-	return 0;
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
+	if (!ctx)
+		return NULL;
+
+	ret = xa_insert(&vdev->ctx, index, ctx, GFP_KERNEL_ACCOUNT);
+	if (ret) {
+		kfree(ctx);
+		return NULL;
+	}
+
+	return ctx;
 }
 
 /*
@@ -218,7 +226,6 @@ static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
 static int vfio_intx_enable(struct vfio_pci_core_device *vdev)
 {
 	struct vfio_pci_irq_ctx *ctx;
-	int ret;
 
 	if (!is_irq_none(vdev))
 		return -EINVAL;
@@ -226,15 +233,9 @@ static int vfio_intx_enable(struct vfio_pci_core_device *vdev)
 	if (!vdev->pdev->irq)
 		return -ENODEV;
 
-	ret = vfio_irq_ctx_alloc_num(vdev, 1);
-	if (ret)
-		return ret;
-
-	ctx = vfio_irq_ctx_get(vdev, 0);
-	if (!ctx) {
-		vfio_irq_ctx_free_all(vdev);
-		return -EINVAL;
-	}
+	ctx = vfio_irq_ctx_alloc(vdev, 0);
+	if (!ctx)
+		return -ENOMEM;
 
 	vdev->num_ctx = 1;
 
@@ -325,7 +326,7 @@ static void vfio_intx_disable(struct vfio_pci_core_device *vdev)
 	vfio_intx_set_signal(vdev, -1);
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
 	vdev->num_ctx = 0;
-	vfio_irq_ctx_free_all(vdev);
+	vfio_irq_ctx_free(vdev, ctx, 0);
 }
 
 /*
@@ -349,10 +350,6 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
 	if (!is_irq_none(vdev))
 		return -EINVAL;
 
-	ret = vfio_irq_ctx_alloc_num(vdev, nvec);
-	if (ret)
-		return ret;
-
 	/* return the number of supported vectors if we can't get all: */
 	cmd = vfio_pci_memory_lock_and_enable(vdev);
 	ret = pci_alloc_irq_vectors(pdev, 1, nvec, flag);
@@ -360,7 +357,6 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
 		if (ret > 0)
 			pci_free_irq_vectors(pdev);
 		vfio_pci_memory_unlock_and_restore(vdev, cmd);
-		vfio_irq_ctx_free_all(vdev);
 		return ret;
 	}
 	vfio_pci_memory_unlock_and_restore(vdev, cmd);
@@ -392,12 +388,13 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 	if (vector >= vdev->num_ctx)
 		return -EINVAL;
 
-	ctx = vfio_irq_ctx_get(vdev, vector);
-	if (!ctx)
-		return -EINVAL;
 	irq = pci_irq_vector(pdev, vector);
+	if (irq < 0)
+		return -EINVAL;
 
-	if (ctx->trigger) {
+	ctx = vfio_irq_ctx_get(vdev, vector);
+
+	if (ctx) {
 		irq_bypass_unregister_producer(&ctx->producer);
 
 		cmd = vfio_pci_memory_lock_and_enable(vdev);
@@ -405,16 +402,22 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 		vfio_pci_memory_unlock_and_restore(vdev, cmd);
 		kfree(ctx->name);
 		eventfd_ctx_put(ctx->trigger);
-		ctx->trigger = NULL;
+		vfio_irq_ctx_free(vdev, ctx, vector);
 	}
 
 	if (fd < 0)
 		return 0;
 
+	ctx = vfio_irq_ctx_alloc(vdev, vector);
+	if (!ctx)
+		return -ENOMEM;
+
 	ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-msi%s[%d](%s)",
 			      msix ? "x" : "", vector, pci_name(pdev));
-	if (!ctx->name)
-		return -ENOMEM;
+	if (!ctx->name) {
+		ret = -ENOMEM;
+		goto out_free_ctx;
+	}
 
 	trigger = eventfd_ctx_fdget(fd);
 	if (IS_ERR(trigger)) {
@@ -460,6 +463,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 	eventfd_ctx_put(trigger);
 out_free_name:
 	kfree(ctx->name);
+out_free_ctx:
+	vfio_irq_ctx_free(vdev, ctx, vector);
 	return ret;
 }
 
@@ -489,16 +494,13 @@ static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	struct vfio_pci_irq_ctx *ctx;
-	unsigned int i;
+	unsigned long i;
 	u16 cmd;
 
-	for (i = 0; i < vdev->num_ctx; i++) {
-		ctx = vfio_irq_ctx_get(vdev, i);
-		if (ctx) {
-			vfio_virqfd_disable(&ctx->unmask);
-			vfio_virqfd_disable(&ctx->mask);
-			vfio_msi_set_vector_signal(vdev, i, -1, msix);
-		}
+	xa_for_each(&vdev->ctx, i, ctx) {
+		vfio_virqfd_disable(&ctx->unmask);
+		vfio_virqfd_disable(&ctx->mask);
+		vfio_msi_set_vector_signal(vdev, i, -1, msix);
 	}
 
 	cmd = vfio_pci_memory_lock_and_enable(vdev);
@@ -514,7 +516,6 @@ static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
 	vdev->num_ctx = 0;
-	vfio_irq_ctx_free_all(vdev);
 }
 
 /*
@@ -654,7 +655,7 @@ static int vfio_pci_set_msi_trigger(struct vfio_pci_core_device *vdev,
 
 	for (i = start; i < start + count; i++) {
 		ctx = vfio_irq_ctx_get(vdev, i);
-		if (!ctx || !ctx->trigger)
+		if (!ctx)
 			continue;
 		if (flags & VFIO_IRQ_SET_DATA_NONE) {
 			eventfd_signal(ctx->trigger, 1);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 367fd79226a3..61d7873a3973 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -59,7 +59,7 @@ struct vfio_pci_core_device {
 	struct perm_bits	*msi_perm;
 	spinlock_t		irqlock;
 	struct mutex		igate;
-	struct vfio_pci_irq_ctx	*ctx;
+	struct xarray		ctx;
 	int			num_ctx;
 	int			irq_type;
 	int			num_regions;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH V4 06/11] vfio/pci: Remove interrupt context counter
  2023-04-27 17:35 [PATCH V4 00/11] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
                   ` (4 preceding siblings ...)
  2023-04-27 17:36 ` [PATCH V4 05/11] vfio/pci: Use xarray for interrupt context storage Reinette Chatre
@ 2023-04-27 17:36 ` Reinette Chatre
  2023-04-28  6:36   ` Tian, Kevin
  2023-04-27 17:36 ` [PATCH V4 07/11] vfio/pci: Update stale comment Reinette Chatre
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Reinette Chatre @ 2023-04-27 17:36 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

struct vfio_pci_core_device::num_ctx counts how many interrupt
contexts have been allocated. When all interrupt contexts are
allocated simultaneously num_ctx provides the upper bound of all
vectors that can be used as indices into the interrupt context
array.

With the upcoming support for dynamic MSI-X the number of
interrupt contexts does not necessarily span the range of allocated
interrupts. Consequently, num_ctx is no longer a trusted upper bound
for valid indices.

Stop using num_ctx to determine if a provided vector is valid. Use
the existence of allocated interrupt.

This changes behavior on the error path when user space provides
an invalid vector range. Behavior changes from early exit without
any modifications to possible modifications to valid vectors within
the invalid range. This is acceptable considering that an invalid
range is not a valid scenario, see link to discussion.

The checks that ensure that user space provides a range of vectors
that is valid for the device are untouched.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/lkml/20230316155646.07ae266f.alex.williamson@redhat.com/
---
No changes since V3.

Changes since V2:
- Update changelog to reflect change in policy that existence of
  allocated interrupt is validity check, not existence of context
  (which is now dynamically allocated).

Changes since RFC V1:
- Remove vfio_irq_ctx_range_allocated(). (Alex and Kevin).

 drivers/vfio/pci/vfio_pci_intrs.c | 13 +------------
 include/linux/vfio_pci_core.h     |  1 -
 2 files changed, 1 insertion(+), 13 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 117cd384b3ad..5e3de004f4cb 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -237,8 +237,6 @@ static int vfio_intx_enable(struct vfio_pci_core_device *vdev)
 	if (!ctx)
 		return -ENOMEM;
 
-	vdev->num_ctx = 1;
-
 	/*
 	 * If the virtual interrupt is masked, restore it.  Devices
 	 * supporting DisINTx can be masked at the hardware level
@@ -325,7 +323,6 @@ static void vfio_intx_disable(struct vfio_pci_core_device *vdev)
 	}
 	vfio_intx_set_signal(vdev, -1);
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
-	vdev->num_ctx = 0;
 	vfio_irq_ctx_free(vdev, ctx, 0);
 }
 
@@ -361,7 +358,6 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
 	}
 	vfio_pci_memory_unlock_and_restore(vdev, cmd);
 
-	vdev->num_ctx = nvec;
 	vdev->irq_type = msix ? VFIO_PCI_MSIX_IRQ_INDEX :
 				VFIO_PCI_MSI_IRQ_INDEX;
 
@@ -385,9 +381,6 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 	int irq, ret;
 	u16 cmd;
 
-	if (vector >= vdev->num_ctx)
-		return -EINVAL;
-
 	irq = pci_irq_vector(pdev, vector);
 	if (irq < 0)
 		return -EINVAL;
@@ -474,9 +467,6 @@ static int vfio_msi_set_block(struct vfio_pci_core_device *vdev, unsigned start,
 	unsigned int i, j;
 	int ret = 0;
 
-	if (start >= vdev->num_ctx || start + count > vdev->num_ctx)
-		return -EINVAL;
-
 	for (i = 0, j = start; i < count && !ret; i++, j++) {
 		int fd = fds ? fds[i] : -1;
 		ret = vfio_msi_set_vector_signal(vdev, j, fd, msix);
@@ -515,7 +505,6 @@ static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 		pci_intx(pdev, 0);
 
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
-	vdev->num_ctx = 0;
 }
 
 /*
@@ -650,7 +639,7 @@ static int vfio_pci_set_msi_trigger(struct vfio_pci_core_device *vdev,
 		return ret;
 	}
 
-	if (!irq_is(vdev, index) || start + count > vdev->num_ctx)
+	if (!irq_is(vdev, index))
 		return -EINVAL;
 
 	for (i = start; i < start + count; i++) {
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 61d7873a3973..148fd1ae6c1c 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -60,7 +60,6 @@ struct vfio_pci_core_device {
 	spinlock_t		irqlock;
 	struct mutex		igate;
 	struct xarray		ctx;
-	int			num_ctx;
 	int			irq_type;
 	int			num_regions;
 	struct vfio_pci_region	*region;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH V4 07/11] vfio/pci: Update stale comment
  2023-04-27 17:35 [PATCH V4 00/11] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
                   ` (5 preceding siblings ...)
  2023-04-27 17:36 ` [PATCH V4 06/11] vfio/pci: Remove interrupt context counter Reinette Chatre
@ 2023-04-27 17:36 ` Reinette Chatre
  2023-04-28  6:42   ` Tian, Kevin
  2023-04-27 17:36 ` [PATCH V4 08/11] vfio/pci: Use bitfield for struct vfio_pci_core_device flags Reinette Chatre
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Reinette Chatre @ 2023-04-27 17:36 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

In preparation for surrounding code change it is helpful to
ensure that existing comments are accurate.

Remove inaccurate comment about direct access and update
the rest of the comment to reflect the purpose of writing
the cached MSI message to the device.

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/lkml/20230330164050.0069e2a5.alex.williamson@redhat.com/
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
No changes since V3.

Changes since V2:
- New patch.

 drivers/vfio/pci/vfio_pci_intrs.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 5e3de004f4cb..bdda7f46c2be 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -419,11 +419,9 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 	}
 
 	/*
-	 * The MSIx vector table resides in device memory which may be cleared
-	 * via backdoor resets. We don't allow direct access to the vector
-	 * table so even if a userspace driver attempts to save/restore around
-	 * such a reset it would be unsuccessful. To avoid this, restore the
-	 * cached value of the message prior to enabling.
+	 * If the vector was previously allocated, refresh the on-device
+	 * message data before enabling in case it had been cleared or
+	 * corrupted since writing.
 	 */
 	cmd = vfio_pci_memory_lock_and_enable(vdev);
 	if (msix) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH V4 08/11] vfio/pci: Use bitfield for struct vfio_pci_core_device flags
  2023-04-27 17:35 [PATCH V4 00/11] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
                   ` (6 preceding siblings ...)
  2023-04-27 17:36 ` [PATCH V4 07/11] vfio/pci: Update stale comment Reinette Chatre
@ 2023-04-27 17:36 ` Reinette Chatre
  2023-04-28  6:43   ` Tian, Kevin
  2023-04-27 17:36 ` [PATCH V4 09/11] vfio/pci: Probe and store ability to support dynamic MSI-X Reinette Chatre
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Reinette Chatre @ 2023-04-27 17:36 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

struct vfio_pci_core_device contains eleven boolean flags.
Boolean flags clearly indicate their usage but space usage
starts to be a concern when there are many.

An upcoming change adds another boolean flag to
struct vfio_pci_core_device, thereby increasing the concern
that the boolean flags are consuming unnecessary space.

Transition the boolean flags to use bitfields. On a system that
uses one byte per boolean this reduces the space consumed
by existing flags from 11 bytes to 2 bytes with room for
a few more flags without increasing the structure's size.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes since V3:
- New patch. (Jason)

 include/linux/vfio_pci_core.h | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 148fd1ae6c1c..adb47e2914d7 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -68,17 +68,17 @@ struct vfio_pci_core_device {
 	u16			msix_size;
 	u32			msix_offset;
 	u32			rbar[7];
-	bool			pci_2_3;
-	bool			virq_disabled;
-	bool			reset_works;
-	bool			extended_caps;
-	bool			bardirty;
-	bool			has_vga;
-	bool			needs_reset;
-	bool			nointx;
-	bool			needs_pm_restore;
-	bool			pm_intx_masked;
-	bool			pm_runtime_engaged;
+	bool			pci_2_3:1;
+	bool			virq_disabled:1;
+	bool			reset_works:1;
+	bool			extended_caps:1;
+	bool			bardirty:1;
+	bool			has_vga:1;
+	bool			needs_reset:1;
+	bool			nointx:1;
+	bool			needs_pm_restore:1;
+	bool			pm_intx_masked:1;
+	bool			pm_runtime_engaged:1;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH V4 09/11] vfio/pci: Probe and store ability to support dynamic MSI-X
  2023-04-27 17:35 [PATCH V4 00/11] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
                   ` (7 preceding siblings ...)
  2023-04-27 17:36 ` [PATCH V4 08/11] vfio/pci: Use bitfield for struct vfio_pci_core_device flags Reinette Chatre
@ 2023-04-27 17:36 ` Reinette Chatre
  2023-04-28  6:43   ` Tian, Kevin
  2023-04-27 17:36 ` [PATCH V4 10/11] vfio/pci: Support " Reinette Chatre
  2023-04-27 17:36 ` [PATCH V4 11/11] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X Reinette Chatre
  10 siblings, 1 reply; 33+ messages in thread
From: Reinette Chatre @ 2023-04-27 17:36 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

Not all MSI-X devices support dynamic MSI-X allocation. Whether
a device supports dynamic MSI-X should be queried using
pci_msix_can_alloc_dyn().

Instead of scattering code with pci_msix_can_alloc_dyn(),
probe this ability once and store it as a property of the
virtual device.

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes since V3:
- Move field to improve structure layout. (Alex)
- Use bitfield. (Jason)

Changes since V2:
- New patch. (Alex)

 drivers/vfio/pci/vfio_pci_core.c | 5 ++++-
 include/linux/vfio_pci_core.h    | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ae0e161c7fc9..a3635a8e54c8 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -530,8 +530,11 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 		vdev->msix_bar = table & PCI_MSIX_TABLE_BIR;
 		vdev->msix_offset = table & PCI_MSIX_TABLE_OFFSET;
 		vdev->msix_size = ((flags & PCI_MSIX_FLAGS_QSIZE) + 1) * 16;
-	} else
+		vdev->has_dyn_msix = pci_msix_can_alloc_dyn(pdev);
+	} else {
 		vdev->msix_bar = 0xFF;
+		vdev->has_dyn_msix = false;
+	}
 
 	if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
 		vdev->has_vga = true;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index adb47e2914d7..562e8754869d 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -68,6 +68,7 @@ struct vfio_pci_core_device {
 	u16			msix_size;
 	u32			msix_offset;
 	u32			rbar[7];
+	bool			has_dyn_msix:1;
 	bool			pci_2_3:1;
 	bool			virq_disabled:1;
 	bool			reset_works:1;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH V4 10/11] vfio/pci: Support dynamic MSI-X
  2023-04-27 17:35 [PATCH V4 00/11] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
                   ` (8 preceding siblings ...)
  2023-04-27 17:36 ` [PATCH V4 09/11] vfio/pci: Probe and store ability to support dynamic MSI-X Reinette Chatre
@ 2023-04-27 17:36 ` Reinette Chatre
  2023-04-28  6:50   ` Tian, Kevin
  2023-04-27 17:36 ` [PATCH V4 11/11] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X Reinette Chatre
  10 siblings, 1 reply; 33+ messages in thread
From: Reinette Chatre @ 2023-04-27 17:36 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be
allocated after MSI-X enabling.

Use dynamic MSI-X (if supported by the device) to allocate an interrupt
after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
the time a valid eventfd is assigned. This is different behavior from
a range provided during MSI-X enabling where interrupts are allocated
for the entire range whether a valid eventfd is provided for each
interrupt or not.

The PCI-MSIX API requires that some number of irqs are allocated for
an initial set of vectors when enabling MSI-X on the device. When
dynamic MSIX allocation is not supported, the vector table, and thus
the allocated irq set can only be resized by disabling and re-enabling
MSI-X with a different range. In that case the irq allocation is
essentially a cache for configuring vectors within the previously
allocated vector range. When dynamic MSI-X allocation is supported,
the API still requires some initial set of irqs to be allocated, but
also supports allocating and freeing specific irq vectors both
within and beyond the initially allocated range.

For consistency between modes, as well as to reduce latency and improve
reliability of allocations, and also simplicity, this implementation
only releases irqs via pci_free_irq_vectors() when either the interrupt
mode changes or the device is released.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/lkml/20230403211841.0e206b67.alex.williamson@redhat.com/
---
Changes since V3:
- Remove vfio_msi_free_irq(). (Alex)
- Rework changelog. (Alex)

Changes since V2:
- Move vfio_irq_ctx_free() to earlier in series to support
  earlier usage. (Alex)
- Use consistent terms in changelog: MSI-x changed to MSI-X.
- Make dynamic interrupt context creation generic across all
  MSI/MSI-X interrupts. This resulted in code moving to earlier
  in series as part of xarray introduction patch. (Alex)
- Remove the local allow_dyn_alloc and direct calling of
  pci_msix_can_alloc_dyn(), use the new vdev->has_dyn_msix
  introduced earlier instead. (Alex)
- Stop tracking new allocations (remove "new_ctx"). (Alex)
- Introduce new wrapper that returns Linux interrupt number or
  dynamically allocate a new interrupt. Wrapper can be used for
  all interrupt cases. (Alex)
- Only free dynamic MSI-X interrupts on MSI-X teardown. (Alex)

Changes since RFC V1:
- Add pointer to interrupt context as function parameter to
  vfio_irq_ctx_free(). (Alex)
- Initialize new_ctx to false. (Dan Carpenter)
- Only support dynamic allocation if device supports it. (Alex)

 drivers/vfio/pci/vfio_pci_intrs.c | 47 +++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index bdda7f46c2be..8340135b09fa 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -372,27 +372,56 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
 	return 0;
 }
 
+/*
+ * Return Linux IRQ number of an MSI or MSI-X device interrupt vector.
+ * If a Linux IRQ number is not available then a new interrupt will be
+ * allocated if dynamic MSI-X is supported.
+ */
+static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
+			      unsigned int vector, bool msix)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct msi_map map;
+	int irq;
+	u16 cmd;
+
+	irq = pci_irq_vector(pdev, vector);
+	if (irq > 0 || !msix || !vdev->has_dyn_msix)
+		return irq;
+
+	cmd = vfio_pci_memory_lock_and_enable(vdev);
+	map = pci_msix_alloc_irq_at(pdev, vector, NULL);
+	vfio_pci_memory_unlock_and_restore(vdev, cmd);
+
+	return map.index < 0 ? map.index : map.virq;
+}
+
+/*
+ * Where is vfio_msi_free_irq() ?
+ *
+ * Allocated interrupts are maintained, essentially forming a cache that
+ * subsequent allocations can draw from. Interrupts are freed using
+ * pci_free_irq_vectors() when MSI/MSI-X is disabled.
+ */
+
 static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 				      unsigned int vector, int fd, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	struct vfio_pci_irq_ctx *ctx;
 	struct eventfd_ctx *trigger;
-	int irq, ret;
+	int irq = -EINVAL, ret;
 	u16 cmd;
 
-	irq = pci_irq_vector(pdev, vector);
-	if (irq < 0)
-		return -EINVAL;
-
 	ctx = vfio_irq_ctx_get(vdev, vector);
 
 	if (ctx) {
 		irq_bypass_unregister_producer(&ctx->producer);
-
+		irq = pci_irq_vector(pdev, vector);
 		cmd = vfio_pci_memory_lock_and_enable(vdev);
 		free_irq(irq, ctx->trigger);
 		vfio_pci_memory_unlock_and_restore(vdev, cmd);
+		/* Interrupt stays allocated, will be freed at MSI-X disable. */
 		kfree(ctx->name);
 		eventfd_ctx_put(ctx->trigger);
 		vfio_irq_ctx_free(vdev, ctx, vector);
@@ -401,6 +430,12 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 	if (fd < 0)
 		return 0;
 
+	if (irq == -EINVAL) {
+		irq = vfio_msi_alloc_irq(vdev, vector, msix);
+		if (irq < 0)
+			return irq;
+	}
+
 	ctx = vfio_irq_ctx_alloc(vdev, vector);
 	if (!ctx)
 		return -ENOMEM;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH V4 11/11] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X
  2023-04-27 17:35 [PATCH V4 00/11] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
                   ` (9 preceding siblings ...)
  2023-04-27 17:36 ` [PATCH V4 10/11] vfio/pci: Support " Reinette Chatre
@ 2023-04-27 17:36 ` Reinette Chatre
  2023-04-28  6:50   ` Tian, Kevin
  10 siblings, 1 reply; 33+ messages in thread
From: Reinette Chatre @ 2023-04-27 17:36 UTC (permalink / raw)
  To: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, alex.williamson
  Cc: tglx, darwi, kvm, dave.jiang, jing2.liu, ashok.raj, fenghua.yu,
	tom.zanussi, reinette.chatre, linux-kernel

Dynamic MSI-X is supported. Clear VFIO_IRQ_INFO_NORESIZE
to provide guidance to user space.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes since V3:
- Remove unnecessary test from condition. (Alex)

Changes since V2:
- Use new vdev->has_dyn_msix property instead of calling
  pci_msix_can_alloc_dyn() directly. (Alex)

Changes since RFC V1:
- Only advertise VFIO_IRQ_INFO_NORESIZE for MSI-X devices that
  can actually support dynamic allocation. (Alex)

 drivers/vfio/pci/vfio_pci_core.c | 2 +-
 include/uapi/linux/vfio.h        | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a3635a8e54c8..ec7e662de033 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1114,7 +1114,7 @@ static int vfio_pci_ioctl_get_irq_info(struct vfio_pci_core_device *vdev,
 	if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
 		info.flags |=
 			(VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED);
-	else
+	else if (info.index != VFIO_PCI_MSIX_IRQ_INDEX || !vdev->has_dyn_msix)
 		info.flags |= VFIO_IRQ_INFO_NORESIZE;
 
 	return copy_to_user(arg, &info, minsz) ? -EFAULT : 0;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0552e8dcf0cb..1a36134cae5c 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -511,6 +511,9 @@ struct vfio_region_info_cap_nvlink2_lnkspd {
  * then add and unmask vectors, it's up to userspace to make the decision
  * whether to allocate the maximum supported number of vectors or tear
  * down setup and incrementally increase the vectors as each is enabled.
+ * Absence of the NORESIZE flag indicates that vectors can be enabled
+ * and disabled dynamically without impacting other vectors within the
+ * index.
  */
 struct vfio_irq_info {
 	__u32	argsz;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 01/11] vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable
  2023-04-27 17:35 ` [PATCH V4 01/11] vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable Reinette Chatre
@ 2023-04-28  6:28   ` Tian, Kevin
  0 siblings, 0 replies; 33+ messages in thread
From: Tian, Kevin @ 2023-04-28  6:28 UTC (permalink / raw)
  To: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi,
	alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Friday, April 28, 2023 1:36 AM
> 
> vfio_msi_disable() releases all previously allocated state
> associated with each interrupt before disabling MSI/MSI-X.
> 
> vfio_msi_disable() iterates twice over the interrupt state:
> first directly with a for loop to do virqfd cleanup, followed
> by another for loop within vfio_msi_set_block() that removes
> the interrupt handler and its associated state using
> vfio_msi_set_vector_signal().
> 
> Simplify interrupt cleanup by iterating over allocated interrupts
> once.
> 
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 02/11] vfio/pci: Remove negative check on unsigned vector
  2023-04-27 17:35 ` [PATCH V4 02/11] vfio/pci: Remove negative check on unsigned vector Reinette Chatre
@ 2023-04-28  6:29   ` Tian, Kevin
  0 siblings, 0 replies; 33+ messages in thread
From: Tian, Kevin @ 2023-04-28  6:29 UTC (permalink / raw)
  To: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi,
	alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Friday, April 28, 2023 1:36 AM
> 
> User space provides the vector as an unsigned int that is checked
> early for validity (vfio_set_irqs_validate_and_prepare()).
> 
> A later negative check of the provided vector is not necessary.
> 
> Remove the negative check and ensure the type used
> for the vector is consistent as an unsigned int.
> 
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 03/11] vfio/pci: Prepare for dynamic interrupt context storage
  2023-04-27 17:36 ` [PATCH V4 03/11] vfio/pci: Prepare for dynamic interrupt context storage Reinette Chatre
@ 2023-04-28  6:33   ` Tian, Kevin
  2023-04-28 18:24     ` Reinette Chatre
  0 siblings, 1 reply; 33+ messages in thread
From: Tian, Kevin @ 2023-04-28  6:33 UTC (permalink / raw)
  To: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi,
	alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Friday, April 28, 2023 1:36 AM
>
> @@ -55,17 +80,28 @@ static void vfio_send_intx_eventfd(void *opaque,
> void *unused)
>  {
>  	struct vfio_pci_core_device *vdev = opaque;
> 
> -	if (likely(is_intx(vdev) && !vdev->virq_disabled))
> -		eventfd_signal(vdev->ctx[0].trigger, 1);
> +	if (likely(is_intx(vdev) && !vdev->virq_disabled)) {
> +		struct vfio_pci_irq_ctx *ctx;
> +
> +		ctx = vfio_irq_ctx_get(vdev, 0);
> +		if (!ctx)
> +			return;

if this error happens it implies a kernel bug since the same check
has been done in vfio_intx_enable(). Then should be a WARN_ON().

ditto for other intx functions which can be called only after intx
is enabled.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 04/11] vfio/pci: Move to single error path
  2023-04-27 17:36 ` [PATCH V4 04/11] vfio/pci: Move to single error path Reinette Chatre
@ 2023-04-28  6:34   ` Tian, Kevin
  0 siblings, 0 replies; 33+ messages in thread
From: Tian, Kevin @ 2023-04-28  6:34 UTC (permalink / raw)
  To: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi,
	alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Friday, April 28, 2023 1:36 AM
> 
> Enabling and disabling of an interrupt involves several steps
> that can fail. Cleanup after failure is done when the error
> is encountered, resulting in some repetitive code.
> 
> Support for dynamic contexts will introduce more steps during
> interrupt enabling and disabling.
> 
> Transition to centralized exit path in preparation for dynamic
> contexts to eliminate duplicate error handling code.
> 
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 05/11] vfio/pci: Use xarray for interrupt context storage
  2023-04-27 17:36 ` [PATCH V4 05/11] vfio/pci: Use xarray for interrupt context storage Reinette Chatre
@ 2023-04-28  6:35   ` Tian, Kevin
  0 siblings, 0 replies; 33+ messages in thread
From: Tian, Kevin @ 2023-04-28  6:35 UTC (permalink / raw)
  To: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi,
	alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Friday, April 28, 2023 1:36 AM
> 
> Interrupt context is statically allocated at the time interrupts
> are allocated. Following allocation, the context is managed by
> directly accessing the elements of the array using the vector
> as index. The storage is released when interrupts are disabled.
> 
> It is possible to dynamically allocate a single MSI-X interrupt
> after MSI-X is enabled. A dynamic storage for interrupt context
> is needed to support this. Replace the interrupt context array with an
> xarray (similar to what the core uses as store for MSI descriptors)
> that can support the dynamic expansion while maintaining the
> custom that uses the vector as index.
> 
> With a dynamic storage it is no longer required to pre-allocate
> interrupt contexts at the time the interrupts are allocated.
> MSI and MSI-X interrupt contexts are only used when interrupts are
> enabled. Their allocation can thus be delayed until interrupt enabling.
> Only enabled interrupts will have associated interrupt contexts.
> Whether an interrupt has been allocated (a Linux irq number exists
> for it) becomes the criteria for whether an interrupt can be enabled.
> 
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 06/11] vfio/pci: Remove interrupt context counter
  2023-04-27 17:36 ` [PATCH V4 06/11] vfio/pci: Remove interrupt context counter Reinette Chatre
@ 2023-04-28  6:36   ` Tian, Kevin
  0 siblings, 0 replies; 33+ messages in thread
From: Tian, Kevin @ 2023-04-28  6:36 UTC (permalink / raw)
  To: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi,
	alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Friday, April 28, 2023 1:36 AM
> 
> struct vfio_pci_core_device::num_ctx counts how many interrupt
> contexts have been allocated. When all interrupt contexts are
> allocated simultaneously num_ctx provides the upper bound of all
> vectors that can be used as indices into the interrupt context
> array.
> 
> With the upcoming support for dynamic MSI-X the number of
> interrupt contexts does not necessarily span the range of allocated
> interrupts. Consequently, num_ctx is no longer a trusted upper bound
> for valid indices.
> 
> Stop using num_ctx to determine if a provided vector is valid. Use
> the existence of allocated interrupt.
> 
> This changes behavior on the error path when user space provides
> an invalid vector range. Behavior changes from early exit without
> any modifications to possible modifications to valid vectors within
> the invalid range. This is acceptable considering that an invalid
> range is not a valid scenario, see link to discussion.
> 
> The checks that ensure that user space provides a range of vectors
> that is valid for the device are untouched.
> 
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 07/11] vfio/pci: Update stale comment
  2023-04-27 17:36 ` [PATCH V4 07/11] vfio/pci: Update stale comment Reinette Chatre
@ 2023-04-28  6:42   ` Tian, Kevin
  2023-04-28 18:24     ` Reinette Chatre
  0 siblings, 1 reply; 33+ messages in thread
From: Tian, Kevin @ 2023-04-28  6:42 UTC (permalink / raw)
  To: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi,
	alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Friday, April 28, 2023 1:36 AM
> 
> @@ -419,11 +419,9 @@ static int vfio_msi_set_vector_signal(struct
> vfio_pci_core_device *vdev,
>  	}
> 
>  	/*
> -	 * The MSIx vector table resides in device memory which may be
> cleared
> -	 * via backdoor resets. We don't allow direct access to the vector
> -	 * table so even if a userspace driver attempts to save/restore
> around
> -	 * such a reset it would be unsuccessful. To avoid this, restore the
> -	 * cached value of the message prior to enabling.
> +	 * If the vector was previously allocated, refresh the on-device
> +	 * message data before enabling in case it had been cleared or
> +	 * corrupted since writing.
>  	 */
>  	cmd = vfio_pci_memory_lock_and_enable(vdev);
>  	if (msix) {

What about keeping backdoor reset as an example, e.g.

"in case it had been cleared or corrupted (e.g. due to backdoor resets) ..."

otherwise one may doubt whether it is a more severe problem causing
the corruption inadvertently w/o userspace driver's awareness and then
whether just restoring the data is sufficient to move on.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 08/11] vfio/pci: Use bitfield for struct vfio_pci_core_device flags
  2023-04-27 17:36 ` [PATCH V4 08/11] vfio/pci: Use bitfield for struct vfio_pci_core_device flags Reinette Chatre
@ 2023-04-28  6:43   ` Tian, Kevin
  0 siblings, 0 replies; 33+ messages in thread
From: Tian, Kevin @ 2023-04-28  6:43 UTC (permalink / raw)
  To: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi,
	alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Friday, April 28, 2023 1:36 AM
> 
> struct vfio_pci_core_device contains eleven boolean flags.
> Boolean flags clearly indicate their usage but space usage
> starts to be a concern when there are many.
> 
> An upcoming change adds another boolean flag to
> struct vfio_pci_core_device, thereby increasing the concern
> that the boolean flags are consuming unnecessary space.
> 
> Transition the boolean flags to use bitfields. On a system that
> uses one byte per boolean this reduces the space consumed
> by existing flags from 11 bytes to 2 bytes with room for
> a few more flags without increasing the structure's size.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 09/11] vfio/pci: Probe and store ability to support dynamic MSI-X
  2023-04-27 17:36 ` [PATCH V4 09/11] vfio/pci: Probe and store ability to support dynamic MSI-X Reinette Chatre
@ 2023-04-28  6:43   ` Tian, Kevin
  0 siblings, 0 replies; 33+ messages in thread
From: Tian, Kevin @ 2023-04-28  6:43 UTC (permalink / raw)
  To: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi,
	alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Friday, April 28, 2023 1:36 AM
> 
> Not all MSI-X devices support dynamic MSI-X allocation. Whether
> a device supports dynamic MSI-X should be queried using
> pci_msix_can_alloc_dyn().
> 
> Instead of scattering code with pci_msix_can_alloc_dyn(),
> probe this ability once and store it as a property of the
> virtual device.
> 
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 10/11] vfio/pci: Support dynamic MSI-X
  2023-04-27 17:36 ` [PATCH V4 10/11] vfio/pci: Support " Reinette Chatre
@ 2023-04-28  6:50   ` Tian, Kevin
  2023-04-28 18:35     ` Reinette Chatre
  0 siblings, 1 reply; 33+ messages in thread
From: Tian, Kevin @ 2023-04-28  6:50 UTC (permalink / raw)
  To: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi,
	alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Friday, April 28, 2023 1:36 AM
> 
> pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be
> allocated after MSI-X enabling.
> 
> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
> the time a valid eventfd is assigned. This is different behavior from
> a range provided during MSI-X enabling where interrupts are allocated
> for the entire range whether a valid eventfd is provided for each
> interrupt or not.
> 
> The PCI-MSIX API requires that some number of irqs are allocated for
> an initial set of vectors when enabling MSI-X on the device. When
> dynamic MSIX allocation is not supported, the vector table, and thus
> the allocated irq set can only be resized by disabling and re-enabling
> MSI-X with a different range. In that case the irq allocation is
> essentially a cache for configuring vectors within the previously
> allocated vector range. When dynamic MSI-X allocation is supported,
> the API still requires some initial set of irqs to be allocated, but
> also supports allocating and freeing specific irq vectors both
> within and beyond the initially allocated range.
> 
> For consistency between modes, as well as to reduce latency and improve
> reliability of allocations, and also simplicity, this implementation
> only releases irqs via pci_free_irq_vectors() when either the interrupt
> mode changes or the device is released.

It improves the reliability of allocations from the calling device p.o.v.

But system-wide this is not efficient use of irqs and not releasing them
timely may affect the reliability of allocations for other devices.

Should this behavior be something configurable?

> 
> +/*
> + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector.
> + * If a Linux IRQ number is not available then a new interrupt will be
> + * allocated if dynamic MSI-X is supported.
> + */
> +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
> +			      unsigned int vector, bool msix)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct msi_map map;
> +	int irq;
> +	u16 cmd;
> +
> +	irq = pci_irq_vector(pdev, vector);
> +	if (irq > 0 || !msix || !vdev->has_dyn_msix)
> +		return irq;

if (irq >= 0 || ...)

> +
> +/*
> + * Where is vfio_msi_free_irq() ?
> + *
> + * Allocated interrupts are maintained, essentially forming a cache that
> + * subsequent allocations can draw from. Interrupts are freed using
> + * pci_free_irq_vectors() when MSI/MSI-X is disabled.
> + */

Probably merge it with the comment of vfio_msi_alloc_irq()?

> @@ -401,6 +430,12 @@ static int vfio_msi_set_vector_signal(struct
> vfio_pci_core_device *vdev,
>  	if (fd < 0)
>  		return 0;
> 
> +	if (irq == -EINVAL) {
> +		irq = vfio_msi_alloc_irq(vdev, vector, msix);
> +		if (irq < 0)
> +			return irq;
> +	}
> +
>  	ctx = vfio_irq_ctx_alloc(vdev, vector);
>  	if (!ctx)
>  		return -ENOMEM;

This doesn't read clean that an irq is allocated but not released
in the error unwind.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 11/11] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X
  2023-04-27 17:36 ` [PATCH V4 11/11] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X Reinette Chatre
@ 2023-04-28  6:50   ` Tian, Kevin
  0 siblings, 0 replies; 33+ messages in thread
From: Tian, Kevin @ 2023-04-28  6:50 UTC (permalink / raw)
  To: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi,
	alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Friday, April 28, 2023 1:36 AM
> 
> Dynamic MSI-X is supported. Clear VFIO_IRQ_INFO_NORESIZE
> to provide guidance to user space.
> 
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V4 03/11] vfio/pci: Prepare for dynamic interrupt context storage
  2023-04-28  6:33   ` Tian, Kevin
@ 2023-04-28 18:24     ` Reinette Chatre
  2023-05-05  7:21       ` Tian, Kevin
  0 siblings, 1 reply; 33+ messages in thread
From: Reinette Chatre @ 2023-04-28 18:24 UTC (permalink / raw)
  To: Tian, Kevin, jgg, yishaih, shameerali.kolothum.thodi, alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

Hi Kevin,

On 4/27/2023 11:33 PM, Tian, Kevin wrote:
>> From: Chatre, Reinette <reinette.chatre@intel.com>
>> Sent: Friday, April 28, 2023 1:36 AM
>>
>> @@ -55,17 +80,28 @@ static void vfio_send_intx_eventfd(void *opaque,
>> void *unused)
>>  {
>>  	struct vfio_pci_core_device *vdev = opaque;
>>
>> -	if (likely(is_intx(vdev) && !vdev->virq_disabled))
>> -		eventfd_signal(vdev->ctx[0].trigger, 1);
>> +	if (likely(is_intx(vdev) && !vdev->virq_disabled)) {
>> +		struct vfio_pci_irq_ctx *ctx;
>> +
>> +		ctx = vfio_irq_ctx_get(vdev, 0);
>> +		if (!ctx)
>> +			return;
> 
> if this error happens it implies a kernel bug since the same check
> has been done in vfio_intx_enable(). Then should be a WARN_ON().

Sure. Considering that if these are triggered it may result
in many instances, so perhaps WARN_ON_ONCE()?
 
> ditto for other intx functions which can be called only after intx
> is enabled.

It seems the instances in this category can be identified as the places
where the array contents is currently used without any checks.

I am planning on the following changes:

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index d8dae99cf6d9..b549f5c97cb8 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -92,7 +92,7 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
 		struct vfio_pci_irq_ctx *ctx;
 
 		ctx = vfio_irq_ctx_get(vdev, 0);
-		if (!ctx)
+		if (WARN_ON_ONCE(!ctx))
 			return;
 		eventfd_signal(ctx->trigger, 1);
 	}
@@ -107,7 +107,7 @@ bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 	bool masked_changed = false;
 
 	ctx = vfio_irq_ctx_get(vdev, 0);
-	if (!ctx)
+	if (WARN_ON_ONCE(!ctx))
 		return masked_changed;
 
 	spin_lock_irqsave(&vdev->irqlock, flags);
@@ -154,7 +154,7 @@ static int vfio_pci_intx_unmask_handler(void *opaque, void *unused)
 	int ret = 0;
 
 	ctx = vfio_irq_ctx_get(vdev, 0);
-	if (!ctx)
+	if (WARN_ON_ONCE(!ctx))
 		return ret;
 
 	spin_lock_irqsave(&vdev->irqlock, flags);
@@ -200,7 +200,7 @@ static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
 	int ret = IRQ_NONE;
 
 	ctx = vfio_irq_ctx_get(vdev, 0);
-	if (!ctx)
+	if (WARN_ON_ONCE(!ctx))
 		return ret;
 
 	spin_lock_irqsave(&vdev->irqlock, flags);
@@ -264,7 +264,7 @@ static int vfio_intx_set_signal(struct vfio_pci_core_device *vdev, int fd)
 	int ret;
 
 	ctx = vfio_irq_ctx_get(vdev, 0);
-	if (!ctx)
+	if (WARN_ON_ONCE(!ctx))
 		return -EINVAL;
 
 	if (ctx->trigger) {
@@ -320,6 +320,7 @@ static void vfio_intx_disable(struct vfio_pci_core_device *vdev)
 
 	dev_dbg(&vdev->pdev->dev, "%s:%d Disabling INTx\n", __func__, __LINE__);
 	ctx = vfio_irq_ctx_get(vdev, 0);
+	WARN_ON_ONCE(!ctx);
 	if (ctx) {
 		vfio_virqfd_disable(&ctx->unmask);
 		vfio_virqfd_disable(&ctx->mask);
@@ -586,7 +587,7 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_core_device *vdev,
 		struct vfio_pci_irq_ctx *ctx = vfio_irq_ctx_get(vdev, 0);
 		int32_t fd = *(int32_t *)data;
 
-		if (!ctx)
+		if (WARN_ON_ONCE(!ctx))
 			return -EINVAL;
 		if (fd >= 0)
 			return vfio_virqfd_enable((void *) vdev,



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH V4 07/11] vfio/pci: Update stale comment
  2023-04-28  6:42   ` Tian, Kevin
@ 2023-04-28 18:24     ` Reinette Chatre
  0 siblings, 0 replies; 33+ messages in thread
From: Reinette Chatre @ 2023-04-28 18:24 UTC (permalink / raw)
  To: Tian, Kevin, jgg, yishaih, shameerali.kolothum.thodi, alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

Hi Kevin,

On 4/27/2023 11:42 PM, Tian, Kevin wrote:
>> From: Chatre, Reinette <reinette.chatre@intel.com>
>> Sent: Friday, April 28, 2023 1:36 AM
>>
>> @@ -419,11 +419,9 @@ static int vfio_msi_set_vector_signal(struct
>> vfio_pci_core_device *vdev,
>>  	}
>>
>>  	/*
>> -	 * The MSIx vector table resides in device memory which may be
>> cleared
>> -	 * via backdoor resets. We don't allow direct access to the vector
>> -	 * table so even if a userspace driver attempts to save/restore
>> around
>> -	 * such a reset it would be unsuccessful. To avoid this, restore the
>> -	 * cached value of the message prior to enabling.
>> +	 * If the vector was previously allocated, refresh the on-device
>> +	 * message data before enabling in case it had been cleared or
>> +	 * corrupted since writing.
>>  	 */
>>  	cmd = vfio_pci_memory_lock_and_enable(vdev);
>>  	if (msix) {
> 
> What about keeping backdoor reset as an example, e.g.
> 
> "in case it had been cleared or corrupted (e.g. due to backdoor resets) ..."
> 
> otherwise one may doubt whether it is a more severe problem causing
> the corruption inadvertently w/o userspace driver's awareness and then
> whether just restoring the data is sufficient to move on.

Will do. Thank you very much.

Reinette


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V4 10/11] vfio/pci: Support dynamic MSI-X
  2023-04-28  6:50   ` Tian, Kevin
@ 2023-04-28 18:35     ` Reinette Chatre
  2023-05-05  8:10       ` Tian, Kevin
  0 siblings, 1 reply; 33+ messages in thread
From: Reinette Chatre @ 2023-04-28 18:35 UTC (permalink / raw)
  To: Tian, Kevin, jgg, yishaih, shameerali.kolothum.thodi, alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

Hi Kevin,

On 4/27/2023 11:50 PM, Tian, Kevin wrote:
>> From: Chatre, Reinette <reinette.chatre@intel.com>
>> Sent: Friday, April 28, 2023 1:36 AM
>>
>> pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be
>> allocated after MSI-X enabling.
>>
>> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
>> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
>> the time a valid eventfd is assigned. This is different behavior from
>> a range provided during MSI-X enabling where interrupts are allocated
>> for the entire range whether a valid eventfd is provided for each
>> interrupt or not.
>>
>> The PCI-MSIX API requires that some number of irqs are allocated for
>> an initial set of vectors when enabling MSI-X on the device. When
>> dynamic MSIX allocation is not supported, the vector table, and thus
>> the allocated irq set can only be resized by disabling and re-enabling
>> MSI-X with a different range. In that case the irq allocation is
>> essentially a cache for configuring vectors within the previously
>> allocated vector range. When dynamic MSI-X allocation is supported,
>> the API still requires some initial set of irqs to be allocated, but
>> also supports allocating and freeing specific irq vectors both
>> within and beyond the initially allocated range.
>>
>> For consistency between modes, as well as to reduce latency and improve
>> reliability of allocations, and also simplicity, this implementation
>> only releases irqs via pci_free_irq_vectors() when either the interrupt
>> mode changes or the device is released.
> 
> It improves the reliability of allocations from the calling device p.o.v.
> 
> But system-wide this is not efficient use of irqs and not releasing them
> timely may affect the reliability of allocations for other devices.

Could you please elaborate how other devices may be impacted?

> Should this behavior be something configurable?

This is not clear to me and I look to you for guidance here. From practical
side it looks like configuration via module parameters is supported but
whether it should be done is not clear to me.

When considering this we need to think about what the user may expect when
turning on/off the configuration. For example, MSI-X continues to allocate a
range of interrupts during enabling. These have always been treated as a
"cache" (interrupts remain allocated, whether they have an associated trigger
or not). If there is new configurable behavior, do you expect that the
driver needs to distinguish between the original "cache" that the user is
used to and the new dynamic allocations? That is, should a dynamic MSI-X
capable device always free interrupts when user space removes an eventfd
or should only interrupts that were allocated dynamically be freed dynamically?

>> +/*
>> + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector.
>> + * If a Linux IRQ number is not available then a new interrupt will be
>> + * allocated if dynamic MSI-X is supported.
>> + */
>> +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
>> +			      unsigned int vector, bool msix)
>> +{
>> +	struct pci_dev *pdev = vdev->pdev;
>> +	struct msi_map map;
>> +	int irq;
>> +	u16 cmd;
>> +
>> +	irq = pci_irq_vector(pdev, vector);
>> +	if (irq > 0 || !msix || !vdev->has_dyn_msix)
>> +		return irq;
> 
> if (irq >= 0 || ...)
> 

I am not sure about this request because pci_irq_vector() cannot return 0.
The Linux interrupt number will be > 0 on success. 0 means "not found"
(see msi_get_virq()), which is translated to -EINVAL by pci_irq_vector().

>> +
>> +/*
>> + * Where is vfio_msi_free_irq() ?
>> + *
>> + * Allocated interrupts are maintained, essentially forming a cache that
>> + * subsequent allocations can draw from. Interrupts are freed using
>> + * pci_free_irq_vectors() when MSI/MSI-X is disabled.
>> + */
> 
> Probably merge it with the comment of vfio_msi_alloc_irq()?

Sure, will do.

> 
>> @@ -401,6 +430,12 @@ static int vfio_msi_set_vector_signal(struct
>> vfio_pci_core_device *vdev,
>>  	if (fd < 0)
>>  		return 0;
>>
>> +	if (irq == -EINVAL) {
>> +		irq = vfio_msi_alloc_irq(vdev, vector, msix);
>> +		if (irq < 0)
>> +			return irq;
>> +	}
>> +
>>  	ctx = vfio_irq_ctx_alloc(vdev, vector);
>>  	if (!ctx)
>>  		return -ENOMEM;
> 
> This doesn't read clean that an irq is allocated but not released
> in the error unwind.

I can add a comment similar to the location where the trigger is released:
	/* Interrupt stays allocated, will be freed at MSI-X disable. */

Reinette



^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 03/11] vfio/pci: Prepare for dynamic interrupt context storage
  2023-04-28 18:24     ` Reinette Chatre
@ 2023-05-05  7:21       ` Tian, Kevin
  2023-05-08 22:52         ` Reinette Chatre
  0 siblings, 1 reply; 33+ messages in thread
From: Tian, Kevin @ 2023-05-05  7:21 UTC (permalink / raw)
  To: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi,
	alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Saturday, April 29, 2023 2:24 AM
> 
> Hi Kevin,
> 
> On 4/27/2023 11:33 PM, Tian, Kevin wrote:
> >> From: Chatre, Reinette <reinette.chatre@intel.com>
> >> Sent: Friday, April 28, 2023 1:36 AM
> >>
> >> @@ -55,17 +80,28 @@ static void vfio_send_intx_eventfd(void *opaque,
> >> void *unused)
> >>  {
> >>  	struct vfio_pci_core_device *vdev = opaque;
> >>
> >> -	if (likely(is_intx(vdev) && !vdev->virq_disabled))
> >> -		eventfd_signal(vdev->ctx[0].trigger, 1);
> >> +	if (likely(is_intx(vdev) && !vdev->virq_disabled)) {
> >> +		struct vfio_pci_irq_ctx *ctx;
> >> +
> >> +		ctx = vfio_irq_ctx_get(vdev, 0);
> >> +		if (!ctx)
> >> +			return;
> >
> > if this error happens it implies a kernel bug since the same check
> > has been done in vfio_intx_enable(). Then should be a WARN_ON().
> 
> Sure. Considering that if these are triggered it may result
> in many instances, so perhaps WARN_ON_ONCE()?

yes.

> 
> > ditto for other intx functions which can be called only after intx
> > is enabled.
> 
> It seems the instances in this category can be identified as the places
> where the array contents is currently used without any checks.
> 
> I am planning on the following changes:
> 

that looks good to me

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 10/11] vfio/pci: Support dynamic MSI-X
  2023-04-28 18:35     ` Reinette Chatre
@ 2023-05-05  8:10       ` Tian, Kevin
  2023-05-05 15:28         ` Alex Williamson
  2023-05-05 17:21         ` Reinette Chatre
  0 siblings, 2 replies; 33+ messages in thread
From: Tian, Kevin @ 2023-05-05  8:10 UTC (permalink / raw)
  To: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi,
	alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Saturday, April 29, 2023 2:35 AM
> 
> Hi Kevin,
> 
> On 4/27/2023 11:50 PM, Tian, Kevin wrote:
> >> From: Chatre, Reinette <reinette.chatre@intel.com>
> >> Sent: Friday, April 28, 2023 1:36 AM
> >>
> >> pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be
> >> allocated after MSI-X enabling.
> >>
> >> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
> >> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
> >> the time a valid eventfd is assigned. This is different behavior from
> >> a range provided during MSI-X enabling where interrupts are allocated
> >> for the entire range whether a valid eventfd is provided for each
> >> interrupt or not.
> >>
> >> The PCI-MSIX API requires that some number of irqs are allocated for
> >> an initial set of vectors when enabling MSI-X on the device. When
> >> dynamic MSIX allocation is not supported, the vector table, and thus
> >> the allocated irq set can only be resized by disabling and re-enabling
> >> MSI-X with a different range. In that case the irq allocation is
> >> essentially a cache for configuring vectors within the previously
> >> allocated vector range. When dynamic MSI-X allocation is supported,
> >> the API still requires some initial set of irqs to be allocated, but
> >> also supports allocating and freeing specific irq vectors both
> >> within and beyond the initially allocated range.
> >>
> >> For consistency between modes, as well as to reduce latency and improve
> >> reliability of allocations, and also simplicity, this implementation
> >> only releases irqs via pci_free_irq_vectors() when either the interrupt
> >> mode changes or the device is released.
> >
> > It improves the reliability of allocations from the calling device p.o.v.
> >
> > But system-wide this is not efficient use of irqs and not releasing them
> > timely may affect the reliability of allocations for other devices.
> 
> Could you please elaborate how other devices may be impacted?

the more this devices reserves the less remains for others, e.g. irte entries.

> 
> > Should this behavior be something configurable?
> 
> This is not clear to me and I look to you for guidance here. From practical
> side it looks like configuration via module parameters is supported but
> whether it should be done is not clear to me.
> 
> When considering this we need to think about what the user may expect
> when
> turning on/off the configuration. For example, MSI-X continues to allocate a
> range of interrupts during enabling. These have always been treated as a
> "cache" (interrupts remain allocated, whether they have an associated
> trigger
> or not). If there is new configurable behavior, do you expect that the
> driver needs to distinguish between the original "cache" that the user is
> used to and the new dynamic allocations? That is, should a dynamic MSI-X
> capable device always free interrupts when user space removes an eventfd
> or should only interrupts that were allocated dynamically be freed
> dynamically?

That looks tricky. Probably that is why Alex suggested doing this simple
scheme and it is on par with the old logic anyway. So I'll withdraw this
comment.

> 
> >> +/*
> >> + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector.
> >> + * If a Linux IRQ number is not available then a new interrupt will be
> >> + * allocated if dynamic MSI-X is supported.
> >> + */
> >> +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
> >> +			      unsigned int vector, bool msix)
> >> +{
> >> +	struct pci_dev *pdev = vdev->pdev;
> >> +	struct msi_map map;
> >> +	int irq;
> >> +	u16 cmd;
> >> +
> >> +	irq = pci_irq_vector(pdev, vector);
> >> +	if (irq > 0 || !msix || !vdev->has_dyn_msix)
> >> +		return irq;
> >
> > if (irq >= 0 || ...)
> >
> 
> I am not sure about this request because pci_irq_vector() cannot return 0.
> The Linux interrupt number will be > 0 on success. 0 means "not found"
> (see msi_get_virq()), which is translated to -EINVAL by pci_irq_vector().
> 

There is a subtle difference between the description and the code of
pci_irq_vector().

/**
 * pci_irq_vector() - Get Linux IRQ number of a device interrupt vector
 * @dev: the PCI device to operate on
 * @nr:  device-relative interrupt vector index (0-based); has different
 *       meanings, depending on interrupt mode:
 *
 *         * MSI-X     the index in the MSI-X vector table
 *         * MSI       the index of the enabled MSI vectors
 *         * INTx      must be 0
 *
 * Return: the Linux IRQ number, or -EINVAL if @nr is out of range
 */

From above '0' is a valid irq number.

then in following code:

	irq = msi_get_virq(&dev->dev, nr);
	return irq ? irq : -EINVAL;

'0' is obviously invalid for msi.

I didn't realize the msi part when reading the patch. It left me in
confusion that '0' is unhandled as here we only check ">0" while in
other places "-EINVAL" is checked.

Not big matter but it sounds slightly clearer to me to follow the
description of pci_irq_vector() instead of its internal detail. 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V4 10/11] vfio/pci: Support dynamic MSI-X
  2023-05-05  8:10       ` Tian, Kevin
@ 2023-05-05 15:28         ` Alex Williamson
  2023-05-06  8:15           ` Tian, Kevin
  2023-05-05 17:21         ` Reinette Chatre
  1 sibling, 1 reply; 33+ messages in thread
From: Alex Williamson @ 2023-05-05 15:28 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi, tglx,
	darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu, Fenghua,
	tom.zanussi, linux-kernel

On Fri, 5 May 2023 08:10:33 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Chatre, Reinette <reinette.chatre@intel.com>
> > Sent: Saturday, April 29, 2023 2:35 AM
> > 
> > Hi Kevin,
> > 
> > On 4/27/2023 11:50 PM, Tian, Kevin wrote:  
> > >> From: Chatre, Reinette <reinette.chatre@intel.com>
> > >> Sent: Friday, April 28, 2023 1:36 AM
> > >>
> > >> pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be
> > >> allocated after MSI-X enabling.
> > >>
> > >> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
> > >> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
> > >> the time a valid eventfd is assigned. This is different behavior from
> > >> a range provided during MSI-X enabling where interrupts are allocated
> > >> for the entire range whether a valid eventfd is provided for each
> > >> interrupt or not.
> > >>
> > >> The PCI-MSIX API requires that some number of irqs are allocated for
> > >> an initial set of vectors when enabling MSI-X on the device. When
> > >> dynamic MSIX allocation is not supported, the vector table, and thus
> > >> the allocated irq set can only be resized by disabling and re-enabling
> > >> MSI-X with a different range. In that case the irq allocation is
> > >> essentially a cache for configuring vectors within the previously
> > >> allocated vector range. When dynamic MSI-X allocation is supported,
> > >> the API still requires some initial set of irqs to be allocated, but
> > >> also supports allocating and freeing specific irq vectors both
> > >> within and beyond the initially allocated range.
> > >>
> > >> For consistency between modes, as well as to reduce latency and improve
> > >> reliability of allocations, and also simplicity, this implementation
> > >> only releases irqs via pci_free_irq_vectors() when either the interrupt
> > >> mode changes or the device is released.  
> > >
> > > It improves the reliability of allocations from the calling device p.o.v.
> > >
> > > But system-wide this is not efficient use of irqs and not releasing them
> > > timely may affect the reliability of allocations for other devices.  
> > 
> > Could you please elaborate how other devices may be impacted?  
> 
> the more this devices reserves the less remains for others, e.g. irte entries.
> 
> >   
> > > Should this behavior be something configurable?  
> > 
> > This is not clear to me and I look to you for guidance here. From practical
> > side it looks like configuration via module parameters is supported but
> > whether it should be done is not clear to me.
> > 
> > When considering this we need to think about what the user may expect
> > when
> > turning on/off the configuration. For example, MSI-X continues to allocate a
> > range of interrupts during enabling. These have always been treated as a
> > "cache" (interrupts remain allocated, whether they have an associated
> > trigger
> > or not). If there is new configurable behavior, do you expect that the
> > driver needs to distinguish between the original "cache" that the user is
> > used to and the new dynamic allocations? That is, should a dynamic MSI-X
> > capable device always free interrupts when user space removes an eventfd
> > or should only interrupts that were allocated dynamically be freed
> > dynamically?  
> 
> That looks tricky. Probably that is why Alex suggested doing this simple
> scheme and it is on par with the old logic anyway. So I'll withdraw this
> comment.

Don't forget we're also releasing the irq reservations when the guest
changes interrupt mode, ex. reboot, so the "caching" is really only
within a session of the guest/userspace driver where it would be
unusual to have an unused reservation for an extended period.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V4 10/11] vfio/pci: Support dynamic MSI-X
  2023-05-05  8:10       ` Tian, Kevin
  2023-05-05 15:28         ` Alex Williamson
@ 2023-05-05 17:21         ` Reinette Chatre
  2023-05-06  8:13           ` Tian, Kevin
  1 sibling, 1 reply; 33+ messages in thread
From: Reinette Chatre @ 2023-05-05 17:21 UTC (permalink / raw)
  To: Tian, Kevin, jgg, yishaih, shameerali.kolothum.thodi, alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

Hi Kevin,

On 5/5/2023 1:10 AM, Tian, Kevin wrote:
>> From: Chatre, Reinette <reinette.chatre@intel.com>
>> Sent: Saturday, April 29, 2023 2:35 AM
>> On 4/27/2023 11:50 PM, Tian, Kevin wrote:
>>>> From: Chatre, Reinette <reinette.chatre@intel.com>
>>>> Sent: Friday, April 28, 2023 1:36 AM

...

>>>> +/*
>>>> + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector.
>>>> + * If a Linux IRQ number is not available then a new interrupt will be
>>>> + * allocated if dynamic MSI-X is supported.
>>>> + */
>>>> +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
>>>> +			      unsigned int vector, bool msix)
>>>> +{
>>>> +	struct pci_dev *pdev = vdev->pdev;
>>>> +	struct msi_map map;
>>>> +	int irq;
>>>> +	u16 cmd;
>>>> +
>>>> +	irq = pci_irq_vector(pdev, vector);
>>>> +	if (irq > 0 || !msix || !vdev->has_dyn_msix)
>>>> +		return irq;
>>>
>>> if (irq >= 0 || ...)
>>>
>>
>> I am not sure about this request because pci_irq_vector() cannot return 0.
>> The Linux interrupt number will be > 0 on success. 0 means "not found"
>> (see msi_get_virq()), which is translated to -EINVAL by pci_irq_vector().
>>
> 
> There is a subtle difference between the description and the code of
> pci_irq_vector().
> 
> /**
>  * pci_irq_vector() - Get Linux IRQ number of a device interrupt vector
>  * @dev: the PCI device to operate on
>  * @nr:  device-relative interrupt vector index (0-based); has different
>  *       meanings, depending on interrupt mode:
>  *
>  *         * MSI-X     the index in the MSI-X vector table
>  *         * MSI       the index of the enabled MSI vectors
>  *         * INTx      must be 0
>  *
>  * Return: the Linux IRQ number, or -EINVAL if @nr is out of range
>  */
> 
> From above '0' is a valid irq number.
> 
> then in following code:
> 
> 	irq = msi_get_virq(&dev->dev, nr);
> 	return irq ? irq : -EINVAL;
> 
> '0' is obviously invalid for msi.
> 
> I didn't realize the msi part when reading the patch. It left me in
> confusion that '0' is unhandled as here we only check ">0" while in
> other places "-EINVAL" is checked.
> 
> Not big matter but it sounds slightly clearer to me to follow the
> description of pci_irq_vector() instead of its internal detail. 

I can add an explicit check for '0' and, as you confirmed, this is
invalid for MSI and thus I think it should be treated as an error.
This is perhaps another candidate for a WARN considering that
pci_irq_vector() returning a '0' for MSI indicates a kernel problem .

I now consider taking guidance from pci_irq_get_affinity(). Note that
pci_irq_get_affinity() contains:

const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr)
{
	int idx, irq = pci_irq_vector(dev, nr);
	...
	if (WARN_ON_ONCE(irq <= 0))
		return NULL;
	...
}	


Would you be ok with something like below?

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index b549f5c97cb8..a8e96254f953 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -393,6 +393,8 @@ static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
 	u16 cmd;
 
 	irq = pci_irq_vector(pdev, vector);
+	if (WARN_ON_ONCE(irq == 0))
+		return -EINVAL;
 	if (irq > 0 || !msix || !vdev->has_dyn_msix)
 		return irq;

I would prefer that vfio_msi_alloc_irq() returns negative errors. This enables
callers to in turn just return the error code on failure (note that dynamic
allocation can return different error codes), not needing to translate 0 into
an error.

Reinette









^ permalink raw reply related	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 10/11] vfio/pci: Support dynamic MSI-X
  2023-05-05 17:21         ` Reinette Chatre
@ 2023-05-06  8:13           ` Tian, Kevin
  0 siblings, 0 replies; 33+ messages in thread
From: Tian, Kevin @ 2023-05-06  8:13 UTC (permalink / raw)
  To: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi,
	alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Saturday, May 6, 2023 1:21 AM
> 
> Would you be ok with something like below?
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index b549f5c97cb8..a8e96254f953 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -393,6 +393,8 @@ static int vfio_msi_alloc_irq(struct
> vfio_pci_core_device *vdev,
>  	u16 cmd;
> 
>  	irq = pci_irq_vector(pdev, vector);
> +	if (WARN_ON_ONCE(irq == 0))
> +		return -EINVAL;
>  	if (irq > 0 || !msix || !vdev->has_dyn_msix)
>  		return irq;
> 
> I would prefer that vfio_msi_alloc_irq() returns negative errors. This enables
> callers to in turn just return the error code on failure (note that dynamic
> allocation can return different error codes), not needing to translate 0 into
> an error.
> 

This looks good to me. Thanks.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH V4 10/11] vfio/pci: Support dynamic MSI-X
  2023-05-05 15:28         ` Alex Williamson
@ 2023-05-06  8:15           ` Tian, Kevin
  0 siblings, 0 replies; 33+ messages in thread
From: Tian, Kevin @ 2023-05-06  8:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Chatre, Reinette, jgg, yishaih, shameerali.kolothum.thodi, tglx,
	darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu, Fenghua,
	tom.zanussi, linux-kernel

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, May 5, 2023 11:28 PM
> 
> On Fri, 5 May 2023 08:10:33 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Chatre, Reinette <reinette.chatre@intel.com>
> > > Sent: Saturday, April 29, 2023 2:35 AM
> > >
> > > Hi Kevin,
> > >
> > > On 4/27/2023 11:50 PM, Tian, Kevin wrote:
> > >
> > > > Should this behavior be something configurable?
> > >
> > > This is not clear to me and I look to you for guidance here. From practical
> > > side it looks like configuration via module parameters is supported but
> > > whether it should be done is not clear to me.
> > >
> > > When considering this we need to think about what the user may expect
> > > when
> > > turning on/off the configuration. For example, MSI-X continues to
> allocate a
> > > range of interrupts during enabling. These have always been treated as a
> > > "cache" (interrupts remain allocated, whether they have an associated
> > > trigger
> > > or not). If there is new configurable behavior, do you expect that the
> > > driver needs to distinguish between the original "cache" that the user is
> > > used to and the new dynamic allocations? That is, should a dynamic MSI-
> X
> > > capable device always free interrupts when user space removes an
> eventfd
> > > or should only interrupts that were allocated dynamically be freed
> > > dynamically?
> >
> > That looks tricky. Probably that is why Alex suggested doing this simple
> > scheme and it is on par with the old logic anyway. So I'll withdraw this
> > comment.
> 
> Don't forget we're also releasing the irq reservations when the guest
> changes interrupt mode, ex. reboot, so the "caching" is really only
> within a session of the guest/userspace driver where it would be
> unusual to have an unused reservation for an extended period.  Thanks,
> 

Yeah, that makes sense.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V4 03/11] vfio/pci: Prepare for dynamic interrupt context storage
  2023-05-05  7:21       ` Tian, Kevin
@ 2023-05-08 22:52         ` Reinette Chatre
  0 siblings, 0 replies; 33+ messages in thread
From: Reinette Chatre @ 2023-05-08 22:52 UTC (permalink / raw)
  To: Tian, Kevin, jgg, yishaih, shameerali.kolothum.thodi, alex.williamson
  Cc: tglx, darwi, kvm, Jiang, Dave, Liu, Jing2, Raj, Ashok, Yu,
	Fenghua, tom.zanussi, linux-kernel

Hi Kevin,

On 5/5/2023 12:21 AM, Tian, Kevin wrote:
>> From: Chatre, Reinette <reinette.chatre@intel.com>
>> Sent: Saturday, April 29, 2023 2:24 AM
>>
>> Hi Kevin,
>>
>> On 4/27/2023 11:33 PM, Tian, Kevin wrote:
>>>> From: Chatre, Reinette <reinette.chatre@intel.com>
>>>> Sent: Friday, April 28, 2023 1:36 AM
>>>>
>>>> @@ -55,17 +80,28 @@ static void vfio_send_intx_eventfd(void *opaque,
>>>> void *unused)
>>>>  {
>>>>  	struct vfio_pci_core_device *vdev = opaque;
>>>>
>>>> -	if (likely(is_intx(vdev) && !vdev->virq_disabled))
>>>> -		eventfd_signal(vdev->ctx[0].trigger, 1);
>>>> +	if (likely(is_intx(vdev) && !vdev->virq_disabled)) {
>>>> +		struct vfio_pci_irq_ctx *ctx;
>>>> +
>>>> +		ctx = vfio_irq_ctx_get(vdev, 0);
>>>> +		if (!ctx)
>>>> +			return;
>>>
>>> if this error happens it implies a kernel bug since the same check
>>> has been done in vfio_intx_enable(). Then should be a WARN_ON().
>>
>> Sure. Considering that if these are triggered it may result
>> in many instances, so perhaps WARN_ON_ONCE()?
> 
> yes.
> 
>>
>>> ditto for other intx functions which can be called only after intx
>>> is enabled.
>>
>> It seems the instances in this category can be identified as the places
>> where the array contents is currently used without any checks.
>>
>> I am planning on the following changes:
>>
> 
> that looks good to me

Thank you so much for this guidance. After adding these WARNs two of them
were actually encountered and revealed that the interrupt context was retrieved
too early in vfio_pci_intx_mask() and vfio_pci_intx_unmask_handler() in
the scenario where these calls are triggered via config writes while INTx is
disabled. This will be fixed in the next version.

Reinette

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2023-05-08 22:53 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-27 17:35 [PATCH V4 00/11] vfio/pci: Support dynamic allocation of MSI-X interrupts Reinette Chatre
2023-04-27 17:35 ` [PATCH V4 01/11] vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable Reinette Chatre
2023-04-28  6:28   ` Tian, Kevin
2023-04-27 17:35 ` [PATCH V4 02/11] vfio/pci: Remove negative check on unsigned vector Reinette Chatre
2023-04-28  6:29   ` Tian, Kevin
2023-04-27 17:36 ` [PATCH V4 03/11] vfio/pci: Prepare for dynamic interrupt context storage Reinette Chatre
2023-04-28  6:33   ` Tian, Kevin
2023-04-28 18:24     ` Reinette Chatre
2023-05-05  7:21       ` Tian, Kevin
2023-05-08 22:52         ` Reinette Chatre
2023-04-27 17:36 ` [PATCH V4 04/11] vfio/pci: Move to single error path Reinette Chatre
2023-04-28  6:34   ` Tian, Kevin
2023-04-27 17:36 ` [PATCH V4 05/11] vfio/pci: Use xarray for interrupt context storage Reinette Chatre
2023-04-28  6:35   ` Tian, Kevin
2023-04-27 17:36 ` [PATCH V4 06/11] vfio/pci: Remove interrupt context counter Reinette Chatre
2023-04-28  6:36   ` Tian, Kevin
2023-04-27 17:36 ` [PATCH V4 07/11] vfio/pci: Update stale comment Reinette Chatre
2023-04-28  6:42   ` Tian, Kevin
2023-04-28 18:24     ` Reinette Chatre
2023-04-27 17:36 ` [PATCH V4 08/11] vfio/pci: Use bitfield for struct vfio_pci_core_device flags Reinette Chatre
2023-04-28  6:43   ` Tian, Kevin
2023-04-27 17:36 ` [PATCH V4 09/11] vfio/pci: Probe and store ability to support dynamic MSI-X Reinette Chatre
2023-04-28  6:43   ` Tian, Kevin
2023-04-27 17:36 ` [PATCH V4 10/11] vfio/pci: Support " Reinette Chatre
2023-04-28  6:50   ` Tian, Kevin
2023-04-28 18:35     ` Reinette Chatre
2023-05-05  8:10       ` Tian, Kevin
2023-05-05 15:28         ` Alex Williamson
2023-05-06  8:15           ` Tian, Kevin
2023-05-05 17:21         ` Reinette Chatre
2023-05-06  8:13           ` Tian, Kevin
2023-04-27 17:36 ` [PATCH V4 11/11] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X Reinette Chatre
2023-04-28  6:50   ` Tian, Kevin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).