[RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors

linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors
@ 2019-06-22  0:19 Megha Dey
  2019-06-22  0:19 ` [RFC V1 RESEND 1/6] PCI/MSI: New structures/macros for dynamic MSI-X allocation Megha Dey
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Megha Dey @ 2019-06-22  0:19 UTC (permalink / raw)
  To: bhelgaas, linux-pci, linux-kernel
  Cc: tglx, marc.zyngier, ashok.raj, jacob.jun.pan, megha.dey, Megha Dey

Currently, MSI-X vector enabling and allocation for a PCIe device is
static i.e. a device driver gets only one chance to enable a specific
number of MSI-X vectors, usually during device probe. Also, in many
cases, drivers usually reserve more than required number of vectors
anticipating their use, which unnecessarily blocks resources that
could have been made available to other devices. Lastly, there is no
way for drivers to reserve more vectors, if the MSI-x has already been
enabled for that device.

Hence, a dynamic MSI-X kernel infrastructure can benefit drivers by
deferring MSI-X allocation to post probe phase, where actual demand
information is available.

This patchset enables the dynamic allocation/de-allocation of MSI-X
vectors by introducing 2 new APIs:
pci_alloc_irq_vectors_dyn() and pci_free_irq_vectors_grp():

We have had requests from some of the NIC/RDMA users who have lots of
interrupt resources and would like to allocate them on demand,
instead of using an all or none approach.

The APIs are fairly well tested (multiple allocations/deallocations),
but we have no early adopters yet. Hence, sending this series as an
RFC for review and comments.

The patches are based out of Linux 5.2-rc5.

Megha Dey (6):
  PCI/MSI: New structures/macros for dynamic MSI-X allocation
  PCI/MSI: Dynamic allocation of MSI-X vectors by group
  x86: Introduce the dynamic teardown function
  PCI/MSI: Introduce new structure to manage MSI-x entries
  PCI/MSI: Free MSI-X resources by group
  Documentation: PCI/MSI: Document dynamic MSI-X infrastructure

 Documentation/PCI/MSI-HOWTO.txt |  38 +++++
 arch/x86/include/asm/x86_init.h |   1 +
 arch/x86/kernel/x86_init.c      |   6 +
 drivers/pci/msi.c               | 363 +++++++++++++++++++++++++++++++++++++---
 drivers/pci/probe.c             |   9 +
 include/linux/device.h          |   3 +
 include/linux/msi.h             |  13 ++
 include/linux/pci.h             |  61 +++++++
 kernel/irq/msi.c                |  34 +++-
 9 files changed, 497 insertions(+), 31 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC V1 RESEND 1/6] PCI/MSI: New structures/macros for dynamic MSI-X allocation
  2019-06-22  0:19 [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors Megha Dey
@ 2019-06-22  0:19 ` Megha Dey
  2019-06-22  0:19 ` [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group Megha Dey
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Megha Dey @ 2019-06-22  0:19 UTC (permalink / raw)
  To: bhelgaas, linux-pci, linux-kernel
  Cc: tglx, marc.zyngier, ashok.raj, jacob.jun.pan, megha.dey, Megha Dey

This is a preparatory patch to introduce the dynamic allocation of
MSI-X vectors. In this patch, we add new structure members and macros
which will be consumed by the API that will dynamically allocate
MSI-X vectors.

Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 include/linux/device.h | 3 +++
 include/linux/msi.h    | 9 +++++++++
 include/linux/pci.h    | 6 ++++++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 848fc71..99d4951 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -967,6 +967,7 @@ struct dev_links_info {
  *              device.
  * @dma_coherent: this particular device is dma coherent, even if the
  *		architecture supports non-coherent devices.
+ * @grp_first_desc: Pointer to the first msi_desc in every MSI-X group
  *
  * At the lowest level, every device in a Linux system is represented by an
  * instance of struct device. The device structure contains the information
@@ -1062,6 +1063,8 @@ struct device {
     defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
 	bool			dma_coherent:1;
 #endif
+	/* For dynamic MSI-X allocation */
+	struct msi_desc		*grp_first_desc;
 };
 
 static inline struct device *kobj_to_dev(struct kobject *kobj)
diff --git a/include/linux/msi.h b/include/linux/msi.h
index d48e919..91273cd 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -74,6 +74,7 @@ struct ti_sci_inta_msi_desc {
  * @default_irq:[PCI MSI/X] The default pre-assigned non-MSI irq
  * @mask_pos:	[PCI MSI]   Mask register position
  * @mask_base:	[PCI MSI-X] Mask register base address
+ * @group_id:	[PCI MSI-X] group to which this descriptor belongs
  * @platform:	[platform]  Platform device specific msi descriptor data
  * @fsl_mc:	[fsl-mc]    FSL MC device specific msi descriptor data
  * @inta:	[INTA]	    TISCI based INTA specific msi descriptor data
@@ -107,6 +108,7 @@ struct msi_desc {
 				u8	mask_pos;
 				void __iomem *mask_base;
 			};
+			unsigned int	group_id;
 		};
 
 		/*
@@ -131,6 +133,10 @@ struct msi_desc {
 	list_for_each_entry((desc), dev_to_msi_list((dev)), list)
 #define for_each_msi_entry_safe(desc, tmp, dev)	\
 	list_for_each_entry_safe((desc), (tmp), dev_to_msi_list((dev)), list)
+/* Iterate through MSI entries of device dev starting from a given desc */
+#define for_each_msi_entry_from(desc, dev)				\
+	desc = (*dev).grp_first_desc;					\
+	list_for_each_entry_from((desc), dev_to_msi_list((dev)), list)	\
 
 #ifdef CONFIG_IRQ_MSI_IOMMU
 static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
@@ -159,6 +165,9 @@ static inline void msi_desc_set_iommu_cookie(struct msi_desc *desc,
 #define first_pci_msi_entry(pdev)	first_msi_entry(&(pdev)->dev)
 #define for_each_pci_msi_entry(desc, pdev)	\
 	for_each_msi_entry((desc), &(pdev)->dev)
+/* Iterate through PCI-MSI entries of pdev starting from a given desc */
+#define for_each_pci_msi_entry_from(desc, pdev)        \
+	for_each_msi_entry_from((desc), &(pdev)->dev)
 
 struct pci_dev *msi_desc_to_pci_dev(struct msi_desc *desc);
 void *msi_desc_to_pci_sysdata(struct msi_desc *desc);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index dd436da..b9a1d41 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -465,6 +465,12 @@ struct pci_dev {
 	char		*driver_override; /* Driver name to force a match */
 
 	unsigned long	priv_flags;	/* Private flags for the PCI driver */
+	/* For dynamic MSI-X allocation */
+	unsigned int	num_msix;	/* Number of MSI-X vectors supported */
+	void __iomem	*base;		/* Holds base address of MSI-X table */
+	struct idr	*grp_idr;       /* IDR to assign group to MSI-X vecs */
+	unsigned long	*entry;         /* Bitmap to represent MSI-X entries */
+	bool		one_shot;	/* If true, oneshot MSI-X allocation */
 };
 
 static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group
  2019-06-22  0:19 [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors Megha Dey
  2019-06-22  0:19 ` [RFC V1 RESEND 1/6] PCI/MSI: New structures/macros for dynamic MSI-X allocation Megha Dey
@ 2019-06-22  0:19 ` Megha Dey
  2019-06-29  7:59   ` Thomas Gleixner
  2019-06-22  0:19 ` [RFC V1 RESEND 3/6] x86: Introduce the dynamic teardown function Megha Dey
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Megha Dey @ 2019-06-22  0:19 UTC (permalink / raw)
  To: bhelgaas, linux-pci, linux-kernel
  Cc: tglx, marc.zyngier, ashok.raj, jacob.jun.pan, megha.dey, Megha Dey

Currently, MSI-X vector enabling and allocation for a PCIe device is
static i.e. a device driver gets only one chance to enable a specific
number of MSI-X vectors, usually during device probe. Also, in many
cases, drivers usually reserve more than required number of vectors
anticipating their use, which unnecessarily blocks resources that
could have been made available to other devices. Lastly, there is no
way for drivers to reserve more vectors, if the MSI-X has already been
enabled for that device.

Hence, a dynamic MSI-X kernel infrastructure can benefit drivers by
deferring MSI-X allocation to post probe phase, where actual demand
information is available.

This patch enables the dynamic allocation of MSI-X vectors even after
MSI-X is enabled for a PCIe device by introducing a new API:
pci_alloc_irq_vectors_dyn().

This API can be called multiple times by the driver. The MSI-X vectors
allocated each time are associated with a group ID. If the existing
static allocation is used, a default group ID of -1 is assigned. The
existing pci_alloc_irq_vectors() and the new pci_alloc_irq_vectors_dyn()
API cannot be used alongside each other.

Lastly, in order to obtain the Linux IRQ number associated with any
vector in a group, a new API pci_irq_vector_group() has been introduced.

Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 drivers/pci/msi.c   | 203 +++++++++++++++++++++++++++++++++++++++++++++-------
 drivers/pci/probe.c |   8 +++
 include/linux/pci.h |  37 ++++++++++
 kernel/irq/msi.c    |   8 +--
 4 files changed, 226 insertions(+), 30 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index e039b74..73ad9bf 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -102,7 +102,7 @@ int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 	if (type == PCI_CAP_ID_MSI && nvec > 1)
 		return 1;
 
-	for_each_pci_msi_entry(entry, dev) {
+	for_each_pci_msi_entry_from(entry, dev) {
 		ret = arch_setup_msi_irq(dev, entry);
 		if (ret < 0)
 			return ret;
@@ -468,7 +468,7 @@ static int populate_msi_sysfs(struct pci_dev *pdev)
 	int i;
 
 	/* Determine how many msi entries we have */
-	for_each_pci_msi_entry(entry, pdev)
+	for_each_pci_msi_entry_from(entry, pdev)
 		num_msi += entry->nvec_used;
 	if (!num_msi)
 		return 0;
@@ -477,7 +477,7 @@ static int populate_msi_sysfs(struct pci_dev *pdev)
 	msi_attrs = kcalloc(num_msi + 1, sizeof(void *), GFP_KERNEL);
 	if (!msi_attrs)
 		return -ENOMEM;
-	for_each_pci_msi_entry(entry, pdev) {
+	for_each_pci_msi_entry_from(entry, pdev) {
 		for (i = 0; i < entry->nvec_used; i++) {
 			msi_dev_attr = kzalloc(sizeof(*msi_dev_attr), GFP_KERNEL);
 			if (!msi_dev_attr)
@@ -506,7 +506,11 @@ static int populate_msi_sysfs(struct pci_dev *pdev)
 		goto error_irq_group;
 	msi_irq_groups[0] = msi_irq_group;
 
-	ret = sysfs_create_groups(&pdev->dev.kobj, msi_irq_groups);
+	if (!pdev->msix_enabled)
+		ret = sysfs_create_group(&pdev->dev.kobj, msi_irq_group);
+	else
+		ret = sysfs_merge_group(&pdev->dev.kobj, msi_irq_group);
+
 	if (ret)
 		goto error_irq_groups;
 	pdev->msi_irq_groups = msi_irq_groups;
@@ -574,7 +578,7 @@ static int msi_verify_entries(struct pci_dev *dev)
 {
 	struct msi_desc *entry;
 
-	for_each_pci_msi_entry(entry, dev) {
+	for_each_pci_msi_entry_from(entry, dev) {
 		if (!dev->no_64bit_msi || !entry->msg.address_hi)
 			continue;
 		pci_err(dev, "Device has broken 64-bit MSI but arch"
@@ -615,6 +619,9 @@ static int msi_capability_init(struct pci_dev *dev, int nvec,
 
 	list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
 
+	 dev->dev.grp_first_desc = list_last_entry
+			(dev_to_msi_list(&dev->dev), struct msi_desc, list);
+
 	/* Configure MSI capability structure */
 	ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
 	if (ret) {
@@ -669,7 +676,7 @@ static void __iomem *msix_map_region(struct pci_dev *dev, unsigned nr_entries)
 
 static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
 			      struct msix_entry *entries, int nvec,
-			      struct irq_affinity *affd)
+			      struct irq_affinity *affd, int group)
 {
 	struct irq_affinity_desc *curmsk, *masks = NULL;
 	struct msi_desc *entry;
@@ -698,8 +705,20 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
 			entry->msi_attrib.entry_nr = i;
 		entry->msi_attrib.default_irq	= dev->irq;
 		entry->mask_base		= base;
+		entry->group_id			= group;
 
 		list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
+
+		/*
+		 * Save the pointer to the first msi_desc entry of every
+		 * MSI-X group. This pointer is used by other functions
+		 * as the starting point to iterate through each of the
+		 * entries in that particular group.
+		 */
+		if (!i)
+			dev->dev.grp_first_desc = list_last_entry
+			(dev_to_msi_list(&dev->dev), struct msi_desc, list);
+
 		if (masks)
 			curmsk++;
 	}
@@ -715,7 +734,7 @@ static void msix_program_entries(struct pci_dev *dev,
 	struct msi_desc *entry;
 	int i = 0;
 
-	for_each_pci_msi_entry(entry, dev) {
+	for_each_pci_msi_entry_from(entry, dev) {
 		if (entries)
 			entries[i++].vector = entry->irq;
 		entry->masked = readl(pci_msix_desc_addr(entry) +
@@ -730,28 +749,31 @@ static void msix_program_entries(struct pci_dev *dev,
  * @entries: pointer to an array of struct msix_entry entries
  * @nvec: number of @entries
  * @affd: Optional pointer to enable automatic affinity assignement
+ * @group: The Group ID to be allocated to the msi-x vectors
  *
  * Setup the MSI-X capability structure of device function with a
  * single MSI-X irq. A return of zero indicates the successful setup of
  * requested MSI-X entries with allocated irqs or non-zero for otherwise.
  **/
 static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
-				int nvec, struct irq_affinity *affd)
+				int nvec, struct irq_affinity *affd, int group)
 {
 	int ret;
 	u16 control;
-	void __iomem *base;
 
 	/* Ensure MSI-X is disabled while it is set up */
 	pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE, 0);
 
 	pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control);
+
 	/* Request & Map MSI-X table region */
-	base = msix_map_region(dev, msix_table_size(control));
-	if (!base)
-		return -ENOMEM;
+	if (!dev->msix_enabled) {
+		dev->base = msix_map_region(dev, msix_table_size(control));
+		if (!dev->base)
+			return -ENOMEM;
+	}
 
-	ret = msix_setup_entries(dev, base, entries, nvec, affd);
+	ret = msix_setup_entries(dev, dev->base, entries, nvec, affd, group);
 	if (ret)
 		return ret;
 
@@ -784,6 +806,7 @@ static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
 	pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_MASKALL, 0);
 
 	pcibios_free_irq(dev);
+
 	return 0;
 
 out_avail:
@@ -795,7 +818,7 @@ static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
 		struct msi_desc *entry;
 		int avail = 0;
 
-		for_each_pci_msi_entry(entry, dev) {
+		for_each_pci_msi_entry_from(entry, dev) {
 			if (entry->irq != 0)
 				avail++;
 		}
@@ -932,7 +955,8 @@ int pci_msix_vec_count(struct pci_dev *dev)
 EXPORT_SYMBOL(pci_msix_vec_count);
 
 static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
-			     int nvec, struct irq_affinity *affd)
+			     int nvec, struct irq_affinity *affd,
+			     bool one_shot, int group)
 {
 	int nr_entries;
 	int i, j;
@@ -963,7 +987,7 @@ static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
 		pci_info(dev, "can't enable MSI-X (MSI IRQ already assigned)\n");
 		return -EINVAL;
 	}
-	return msix_capability_init(dev, entries, nvec, affd);
+	return msix_capability_init(dev, entries, nvec, affd, group);
 }
 
 static void pci_msix_shutdown(struct pci_dev *dev)
@@ -1079,16 +1103,14 @@ EXPORT_SYMBOL(pci_enable_msi);
 
 static int __pci_enable_msix_range(struct pci_dev *dev,
 				   struct msix_entry *entries, int minvec,
-				   int maxvec, struct irq_affinity *affd)
+				   int maxvec, struct irq_affinity *affd,
+				   bool one_shot, int group)
 {
 	int rc, nvec = maxvec;
 
 	if (maxvec < minvec)
 		return -ERANGE;
 
-	if (WARN_ON_ONCE(dev->msix_enabled))
-		return -EINVAL;
-
 	for (;;) {
 		if (affd) {
 			nvec = irq_calc_affinity_vectors(minvec, nvec, affd);
@@ -1096,7 +1118,8 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
 				return -ENOSPC;
 		}
 
-		rc = __pci_enable_msix(dev, entries, nvec, affd);
+		rc = __pci_enable_msix(dev, entries, nvec, affd, one_shot,
+									group);
 		if (rc == 0)
 			return nvec;
 
@@ -1127,7 +1150,8 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
 int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
 		int minvec, int maxvec)
 {
-	return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL);
+	return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL,
+								false, -1);
 }
 EXPORT_SYMBOL(pci_enable_msix_range);
 
@@ -1153,9 +1177,49 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 				   unsigned int max_vecs, unsigned int flags,
 				   struct irq_affinity *affd)
 {
+	int group = -1;
+
+	dev->one_shot = true;
+
+	return pci_alloc_irq_vectors_affinity_dyn(dev, min_vecs, max_vecs,
+					flags, NULL, &group, dev->one_shot);
+}
+EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);
+
+/**
+ * pci_alloc_irq_vectors_affinity_dyn - allocate multiple IRQs for a device
+ * dynamically. Can be called multiple times.
+ * @dev:		PCI device to operate on
+ * @min_vecs:		minimum number of vectors required (must be >= 1)
+ * @max_vecs:		maximum (desired) number of vectors
+ * @flags:		flags or quirks for the allocation
+ * @affd:		optional description of the affinity requirements
+ * @group_id:		group ID assigned to vectors allocated
+ * @one_shot:		true if dynamic MSI-X allocation is disabled, else false
+ *
+ * Allocate up to @max_vecs interrupt vectors for @dev, using MSI-X. Return
+ * the number of vectors allocated (which might be smaller than @max_vecs)
+ * if successful, or a negative error code on error. If less than @min_vecs
+ * interrupt vectors are available for @dev the function will fail with -ENOSPC.
+ * Assign a unique group ID to the set of vectors being allocated.
+ *
+ * To get the Linux IRQ number used for a vector that can be passed to
+ * request_irq() use the pci_irq_vector_group() helper.
+ */
+int pci_alloc_irq_vectors_affinity_dyn(struct pci_dev *dev,
+				       unsigned int min_vecs,
+				       unsigned int max_vecs,
+				       unsigned int flags,
+				       struct irq_affinity *affd,
+				       int *group_id, bool one_shot)
+{
+
 	struct irq_affinity msi_default_affd = {0};
-	int msix_vecs = -ENOSPC;
+	int msix_vecs = -ENOSPC, i;
 	int msi_vecs = -ENOSPC;
+	struct msix_entry *entries = NULL;
+	struct msi_desc *entry;
+	int p = 0;
 
 	if (flags & PCI_IRQ_AFFINITY) {
 		if (!affd)
@@ -1166,8 +1230,54 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 	}
 
 	if (flags & PCI_IRQ_MSIX) {
-		msix_vecs = __pci_enable_msix_range(dev, NULL, min_vecs,
-						    max_vecs, affd);
+		if (dev->msix_enabled) {
+			if (one_shot) {
+				goto err_alloc;
+			} else {
+				for_each_pci_msi_entry(entry, dev) {
+					if (entry->group_id != -1)
+						p = 1;
+				}
+				if (!p)
+					goto err_alloc;
+			}
+		} else {
+			dev->num_msix = pci_msix_vec_count(dev);
+			dev->entry = kcalloc(BITS_TO_LONGS(dev->num_msix),
+					sizeof(unsigned long), GFP_KERNEL);
+			if (!dev->entry)
+				return -ENOMEM;
+		}
+
+		entries = kcalloc(max_vecs, sizeof(struct msix_entry),
+								GFP_KERNEL);
+		if (entries == NULL)
+			return -ENOMEM;
+
+		for (i = 0; i < max_vecs; i++) {
+			entries[i].entry = find_first_zero_bit(
+						dev->entry, dev->num_msix);
+			if (entries[i].entry == dev->num_msix)
+				return -ENOSPC;
+			set_bit(entries[i].entry, dev->entry);
+		}
+
+		if (!one_shot) {
+			/* Assign a unique group ID */
+			*group_id = idr_alloc(dev->grp_idr, NULL,
+						0, dev->num_msix, GFP_KERNEL);
+			if (*group_id < 0) {
+				if (*group_id == -ENOSPC)
+					pci_err(dev, "No free group IDs\n");
+				return *group_id;
+			}
+		}
+
+		msix_vecs = __pci_enable_msix_range(dev, entries, min_vecs,
+					max_vecs, affd, one_shot, *group_id);
+
+		kfree(entries);
+
 		if (msix_vecs > 0)
 			return msix_vecs;
 	}
@@ -1197,8 +1307,12 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 	if (msix_vecs == -ENOSPC)
 		return -ENOSPC;
 	return msi_vecs;
+
+err_alloc:
+	WARN_ON_ONCE(dev->msix_enabled);
+	return -EINVAL;
 }
-EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);
+EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity_dyn);
 
 /**
  * pci_free_irq_vectors - free previously allocated IRQs for a device
@@ -1248,6 +1362,43 @@ int pci_irq_vector(struct pci_dev *dev, unsigned int nr)
 EXPORT_SYMBOL(pci_irq_vector);
 
 /**
+ * pci_irq_vector_group - return the IRQ number of a device vector associated
+ * with a group
+ * @dev: PCI device to operate on
+ * @nr: device-relative interrupt vector index (0-based)
+ * @group_id: group from which IRQ number should be returned
+ */
+int pci_irq_vector_group(struct pci_dev *dev, unsigned int nr,
+						unsigned int group_id)
+{
+	if (dev->msix_enabled) {
+		struct msi_desc *entry;
+		int i = 0, grp_present = 0;
+
+		for_each_pci_msi_entry(entry, dev) {
+			if (entry->group_id == group_id) {
+				grp_present = 1;
+				if (i == nr)
+					return entry->irq;
+				i++;
+			}
+		}
+
+		if (!grp_present) {
+			pci_err(dev, "Group %d not present\n", group_id);
+			return -EINVAL;
+		}
+
+		pci_err(dev, "Interrupt vector index %d does not exist in "
+						"group %d\n", nr, group_id);
+	}
+
+	pci_err(dev, "MSI-X not enabled\n");
+	return -EINVAL;
+}
+EXPORT_SYMBOL(pci_irq_vector_group);
+
+/**
  * pci_irq_get_affinity - return the affinity of a particular msi vector
  * @dev:	PCI device to operate on
  * @nr:		device-relative interrupt vector index (0-based).
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 0e8e2c1..491c1cf 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2426,6 +2426,14 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
 	if (!dev)
 		return NULL;
 
+	/* For dynamic MSI-x */
+	dev->grp_idr = kzalloc(sizeof(struct idr), GFP_KERNEL);
+	if (!dev->grp_idr)
+		return NULL;
+
+	/* Initialise the IDR structures */
+	idr_init(dev->grp_idr);
+
 	INIT_LIST_HEAD(&dev->bus_list);
 	dev->dev.type = &pci_dev_type;
 	dev->bus = pci_bus_get(bus);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b9a1d41..c56462c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1411,9 +1411,17 @@ static inline int pci_enable_msix_exact(struct pci_dev *dev,
 int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 				   unsigned int max_vecs, unsigned int flags,
 				   struct irq_affinity *affd);
+int pci_alloc_irq_vectors_affinity_dyn(struct pci_dev *dev,
+				       unsigned int min_vecs,
+				       unsigned int max_vecs,
+				       unsigned int flags,
+				       struct irq_affinity *affd,
+				       int *group_id, bool one_shot);
 
 void pci_free_irq_vectors(struct pci_dev *dev);
 int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
+int pci_irq_vector_group(struct pci_dev *dev, unsigned int nr,
+						unsigned int group_id);
 const struct cpumask *pci_irq_get_affinity(struct pci_dev *pdev, int vec);
 int pci_irq_get_node(struct pci_dev *pdev, int vec);
 
@@ -1443,6 +1451,17 @@ pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 	return -ENOSPC;
 }
 
+static inline int
+pci_alloc_irq_vectors_affinity_dyn(struct pci_dev *dev, unsigned int min_vecs,
+				   unsigned int max_vecs, unsigned int flags,
+				   struct irq_affinity *aff_desc,
+				   int *group_id, bool one_shot)
+{
+	if ((flags & PCI_IRQ_LEGACY) && min_vecs == 1 && dev->irq)
+		return 1;
+	return -ENOSPC;
+}
+
 static inline void pci_free_irq_vectors(struct pci_dev *dev)
 {
 }
@@ -1453,6 +1472,15 @@ static inline int pci_irq_vector(struct pci_dev *dev, unsigned int nr)
 		return -EINVAL;
 	return dev->irq;
 }
+
+static inline int pci_irq_vector_group(struct pci_dev *dev, unsigned int nr,
+							unsigned int group_id)
+{
+	if (WARN_ON_ONCE(nr > 0))
+		return -EINVAL;
+	return dev->irq;
+}
+
 static inline const struct cpumask *pci_irq_get_affinity(struct pci_dev *pdev,
 		int vec)
 {
@@ -1473,6 +1501,15 @@ pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
 					      NULL);
 }
 
+static inline int
+pci_alloc_irq_vectors_dyn(struct pci_dev *dev, unsigned int min_vecs,
+			  unsigned int max_vecs, unsigned int flags,
+			  int *group_id)
+{
+	return pci_alloc_irq_vectors_affinity_dyn(dev, min_vecs, max_vecs,
+					  flags, NULL, group_id, false);
+}
+
 /**
  * pci_irqd_intx_xlate() - Translate PCI INTx value to an IRQ domain hwirq
  * @d: the INTx IRQ domain
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index ad26fbc..5cfa931 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -411,7 +411,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 	if (ret)
 		return ret;
 
-	for_each_msi_entry(desc, dev) {
+	for_each_msi_entry_from(desc, dev) {
 		ops->set_desc(&arg, desc);
 
 		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
@@ -437,7 +437,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 
 	can_reserve = msi_check_reservation_mode(domain, info, dev);
 
-	for_each_msi_entry(desc, dev) {
+	for_each_msi_entry_from(desc, dev) {
 		virq = desc->irq;
 		if (desc->nvec_used == 1)
 			dev_dbg(dev, "irq %d for MSI\n", virq);
@@ -465,7 +465,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 	 * so request_irq() will assign the final vector.
 	 */
 	if (can_reserve) {
-		for_each_msi_entry(desc, dev) {
+		for_each_msi_entry_from(desc, dev) {
 			irq_data = irq_domain_get_irq_data(domain, desc->irq);
 			irqd_clr_activated(irq_data);
 		}
@@ -473,7 +473,7 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 	return 0;
 
 cleanup:
-	for_each_msi_entry(desc, dev) {
+	for_each_msi_entry_from(desc, dev) {
 		struct irq_data *irqd;
 
 		if (desc->irq == virq)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC V1 RESEND 3/6] x86: Introduce the dynamic teardown function
  2019-06-22  0:19 [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors Megha Dey
  2019-06-22  0:19 ` [RFC V1 RESEND 1/6] PCI/MSI: New structures/macros for dynamic MSI-X allocation Megha Dey
  2019-06-22  0:19 ` [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group Megha Dey
@ 2019-06-22  0:19 ` Megha Dey
  2019-06-29  8:01   ` Thomas Gleixner
  2019-06-22  0:19 ` [RFC V1 RESEND 4/6] PCI/MSI: Introduce new structure to manage MSI-x entries Megha Dey
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Megha Dey @ 2019-06-22  0:19 UTC (permalink / raw)
  To: bhelgaas, linux-pci, linux-kernel
  Cc: tglx, marc.zyngier, ashok.raj, jacob.jun.pan, megha.dey, Megha Dey

This is a preparatory patch to introduce disabling of MSI-X vectors
belonging to a particular group. In this patch, we introduce a x86
specific mechanism to teardown the IRQ vectors belonging to a
particular group.

Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 arch/x86/include/asm/x86_init.h |  1 +
 arch/x86/kernel/x86_init.c      |  6 ++++++
 drivers/pci/msi.c               | 18 ++++++++++++++++++
 include/linux/msi.h             |  2 ++
 4 files changed, 27 insertions(+)

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index b85a7c5..50f26a0 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -283,6 +283,7 @@ struct pci_dev;
 struct x86_msi_ops {
 	int (*setup_msi_irqs)(struct pci_dev *dev, int nvec, int type);
 	void (*teardown_msi_irq)(unsigned int irq);
+	void (*teardown_msi_irqs_grp)(struct pci_dev *dev, int group_id);
 	void (*teardown_msi_irqs)(struct pci_dev *dev);
 	void (*restore_msi_irqs)(struct pci_dev *dev);
 };
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 50a2b49..794e7d4 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -127,6 +127,7 @@ EXPORT_SYMBOL_GPL(x86_platform);
 struct x86_msi_ops x86_msi __ro_after_init = {
 	.setup_msi_irqs		= native_setup_msi_irqs,
 	.teardown_msi_irq	= native_teardown_msi_irq,
+	.teardown_msi_irqs_grp	= default_teardown_msi_irqs_grp,
 	.teardown_msi_irqs	= default_teardown_msi_irqs,
 	.restore_msi_irqs	= default_restore_msi_irqs,
 };
@@ -142,6 +143,11 @@ void arch_teardown_msi_irqs(struct pci_dev *dev)
 	x86_msi.teardown_msi_irqs(dev);
 }
 
+void arch_teardown_msi_irqs_grp(struct pci_dev *dev, int group_id)
+{
+	x86_msi.teardown_msi_irqs_grp(dev, group_id);
+}
+
 void arch_teardown_msi_irq(unsigned int irq)
 {
 	x86_msi.teardown_msi_irq(irq);
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 73ad9bf..fd7fa6e 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -133,6 +133,24 @@ void __weak arch_teardown_msi_irqs(struct pci_dev *dev)
 	return default_teardown_msi_irqs(dev);
 }
 
+void default_teardown_msi_irqs_grp(struct pci_dev *dev, int group_id)
+{
+	int i;
+	struct msi_desc *entry;
+
+	for_each_pci_msi_entry(entry, dev) {
+		if (entry->group_id == group_id && entry->irq) {
+			for (i = 0; i < entry->nvec_used; i++)
+				arch_teardown_msi_irq(entry->irq + i);
+		}
+	}
+}
+
+void __weak arch_teardown_msi_irqs_grp(struct pci_dev *dev, int group_id)
+{
+	return default_teardown_msi_irqs_grp(dev, group_id);
+}
+
 static void default_restore_msi_irq(struct pci_dev *dev, int irq)
 {
 	struct msi_desc *entry;
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 91273cd..e61ba24 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -202,9 +202,11 @@ int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
 void arch_teardown_msi_irq(unsigned int irq);
 int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
 void arch_teardown_msi_irqs(struct pci_dev *dev);
+void arch_teardown_msi_irqs_grp(struct pci_dev *dev, int group_id);
 void arch_restore_msi_irqs(struct pci_dev *dev);
 
 void default_teardown_msi_irqs(struct pci_dev *dev);
+void default_teardown_msi_irqs_grp(struct pci_dev *dev, int group_id);
 void default_restore_msi_irqs(struct pci_dev *dev);
 
 struct msi_controller {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC V1 RESEND 4/6] PCI/MSI: Introduce new structure to manage MSI-x entries
  2019-06-22  0:19 [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors Megha Dey
                   ` (2 preceding siblings ...)
  2019-06-22  0:19 ` [RFC V1 RESEND 3/6] x86: Introduce the dynamic teardown function Megha Dey
@ 2019-06-22  0:19 ` Megha Dey
  2019-06-22  0:19 ` [RFC V1 RESEND 5/6] PCI/MSI: Free MSI-X resources by group Megha Dey
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Megha Dey @ 2019-06-22  0:19 UTC (permalink / raw)
  To: bhelgaas, linux-pci, linux-kernel
  Cc: tglx, marc.zyngier, ashok.raj, jacob.jun.pan, megha.dey, Megha Dey

This is a preparatory patch to introduce disabling of MSI-X vectors
belonging to a particular group. In this patch, we introduce a new
structure msix_sysfs, which manages sysfs entries for dynamically
allocated MSI-X vectors belonging to a particular group.

Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 drivers/pci/msi.c   | 12 +++++++++++-
 drivers/pci/probe.c |  1 +
 include/linux/pci.h |  9 +++++++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index fd7fa6e..e947243 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -479,10 +479,11 @@ static int populate_msi_sysfs(struct pci_dev *pdev)
 	struct device_attribute *msi_dev_attr;
 	struct attribute_group *msi_irq_group;
 	const struct attribute_group **msi_irq_groups;
+	struct msix_sysfs *msix_sysfs_entry;
 	struct msi_desc *entry;
 	int ret = -ENOMEM;
 	int num_msi = 0;
-	int count = 0;
+	int count = 0, group = -1;
 	int i;
 
 	/* Determine how many msi entries we have */
@@ -509,6 +510,8 @@ static int populate_msi_sysfs(struct pci_dev *pdev)
 				goto error_attrs;
 			msi_dev_attr->attr.mode = S_IRUGO;
 			msi_dev_attr->show = msi_mode_show;
+			if (!i)
+				group = entry->group_id;
 			++count;
 		}
 	}
@@ -524,6 +527,13 @@ static int populate_msi_sysfs(struct pci_dev *pdev)
 		goto error_irq_group;
 	msi_irq_groups[0] = msi_irq_group;
 
+	msix_sysfs_entry = kzalloc(sizeof(*msix_sysfs_entry) * 2, GFP_KERNEL);
+	msix_sysfs_entry->msi_irq_group = msi_irq_group;
+	msix_sysfs_entry->group_id = group;
+	msix_sysfs_entry->vecs_in_grp = count;
+	INIT_LIST_HEAD(&msix_sysfs_entry->list);
+	list_add_tail(&msix_sysfs_entry->list, &pdev->msix_sysfs);
+
 	if (!pdev->msix_enabled)
 		ret = sysfs_create_group(&pdev->dev.kobj, msi_irq_group);
 	else
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 491c1cf..bb20ef6 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2435,6 +2435,7 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
 	idr_init(dev->grp_idr);
 
 	INIT_LIST_HEAD(&dev->bus_list);
+	INIT_LIST_HEAD(&dev->msix_sysfs);
 	dev->dev.type = &pci_dev_type;
 	dev->bus = pci_bus_get(bus);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index c56462c..73385c0 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -471,6 +471,7 @@ struct pci_dev {
 	struct idr	*grp_idr;       /* IDR to assign group to MSI-X vecs */
 	unsigned long	*entry;         /* Bitmap to represent MSI-X entries */
 	bool		one_shot;	/* If true, oneshot MSI-X allocation */
+	struct list_head	msix_sysfs; /* sysfs entries for MSI-X group */
 };
 
 static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
@@ -1390,6 +1391,14 @@ struct msix_entry {
 	u16	entry;	/* Driver uses to specify entry, OS writes */
 };
 
+/* Manage sysfs entries for dynamically allocated MSI-X vectors */
+struct msix_sysfs {
+	struct	attribute_group *msi_irq_group;
+	struct	list_head list;
+	int	group_id;
+	int	vecs_in_grp;
+};
+
 #ifdef CONFIG_PCI_MSI
 int pci_msi_vec_count(struct pci_dev *dev);
 void pci_disable_msi(struct pci_dev *dev);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC V1 RESEND 5/6] PCI/MSI: Free MSI-X resources by group
  2019-06-22  0:19 [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors Megha Dey
                   ` (3 preceding siblings ...)
  2019-06-22  0:19 ` [RFC V1 RESEND 4/6] PCI/MSI: Introduce new structure to manage MSI-x entries Megha Dey
@ 2019-06-22  0:19 ` Megha Dey
  2019-06-29  8:08   ` Thomas Gleixner
  2019-06-22  0:19 ` [RFC V1 RESEND 6/6] Documentation: PCI/MSI: Document dynamic MSI-X infrastructure Megha Dey
  2019-08-02  0:24 ` [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors Bjorn Helgaas
  6 siblings, 1 reply; 24+ messages in thread
From: Megha Dey @ 2019-06-22  0:19 UTC (permalink / raw)
  To: bhelgaas, linux-pci, linux-kernel
  Cc: tglx, marc.zyngier, ashok.raj, jacob.jun.pan, megha.dey, Megha Dey

Currently, the pci_free_irq_vectors() frees all the allocated resources
associated with a PCIe device when the device is being shut down. With
the introduction of dynamic allocation of MSI-X vectors by group ID,
there should exist an API which can free the resources allocated only
to a particular group, which can be called even if the device is not
being shut down. The pci_free_irq_vectors_grp() function provides this
type of interface.

The existing pci_free_irq_vectors() can be called along side this API.

Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 drivers/pci/msi.c   | 130 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/msi.h |   2 +
 include/linux/pci.h |   9 ++++
 kernel/irq/msi.c    |  26 +++++++++++
 4 files changed, 167 insertions(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index e947243..342e267 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -53,9 +53,23 @@ static void pci_msi_teardown_msi_irqs(struct pci_dev *dev)
 	else
 		arch_teardown_msi_irqs(dev);
 }
+
+static void pci_msi_teardown_msi_irqs_grp(struct pci_dev *dev, int group_id)
+{
+	struct irq_domain *domain;
+
+	domain = dev_get_msi_domain(&dev->dev);
+
+	if (domain && irq_domain_is_hierarchy(domain))
+		msi_domain_free_irqs_grp(domain, &dev->dev, group_id);
+	else
+		arch_teardown_msi_irqs_grp(dev, group_id);
+}
+
 #else
 #define pci_msi_setup_msi_irqs		arch_setup_msi_irqs
 #define pci_msi_teardown_msi_irqs	arch_teardown_msi_irqs
+#define pci_msi_teardown_msi_irqs_grp	default_teardown_msi_irqs_grp
 #endif
 
 /* Arch hooks */
@@ -373,6 +387,7 @@ static void free_msi_irqs(struct pci_dev *dev)
 
 	list_for_each_entry_safe(entry, tmp, msi_list, list) {
 		if (entry->msi_attrib.is_msix) {
+			clear_bit(entry->msi_attrib.entry_nr, dev->entry);
 			if (list_is_last(&entry->list, msi_list))
 				iounmap(entry->mask_base);
 		}
@@ -381,6 +396,8 @@ static void free_msi_irqs(struct pci_dev *dev)
 		free_msi_entry(entry);
 	}
 
+	idr_destroy(dev->grp_idr);
+
 	if (dev->msi_irq_groups) {
 		sysfs_remove_groups(&dev->dev.kobj, dev->msi_irq_groups);
 		msi_attrs = dev->msi_irq_groups[0]->attrs;
@@ -398,6 +415,60 @@ static void free_msi_irqs(struct pci_dev *dev)
 	}
 }
 
+static const char msix_sysfs_grp[] = "msi_irqs";
+
+static int free_msi_irqs_grp(struct pci_dev *dev, int group_id)
+{
+	struct list_head *msi_list = dev_to_msi_list(&dev->dev);
+	struct msi_desc *entry, *tmp;
+	struct attribute **msi_attrs;
+	struct device_attribute *dev_attr;
+	int i;
+	long vec;
+	struct msix_sysfs *msix_sysfs_entry, *tmp_msix;
+	struct list_head *pci_msix = &dev->msix_sysfs;
+	int num_vec = 0;
+
+	for_each_pci_msi_entry(entry, dev) {
+		if (entry->group_id == group_id && entry->irq)
+			for (i = 0; i < entry->nvec_used; i++)
+				BUG_ON(irq_has_action(entry->irq + i));
+	}
+
+	pci_msi_teardown_msi_irqs_grp(dev, group_id);
+
+	list_for_each_entry_safe(entry, tmp, msi_list, list) {
+		if (entry->group_id == group_id) {
+			clear_bit(entry->msi_attrib.entry_nr, dev->entry);
+			list_del(&entry->list);
+			free_msi_entry(entry);
+		}
+	}
+
+	list_for_each_entry_safe(msix_sysfs_entry, tmp_msix, pci_msix, list) {
+		if (msix_sysfs_entry->group_id == group_id) {
+			msi_attrs = msix_sysfs_entry->msi_irq_group->attrs;
+			for (i = 0; i < msix_sysfs_entry->vecs_in_grp; i++) {
+				if (!i)
+					num_vec = msix_sysfs_entry->vecs_in_grp;
+				dev_attr = container_of(msi_attrs[i],
+						struct device_attribute, attr);
+				sysfs_remove_file_from_group(&dev->dev.kobj,
+					&dev_attr->attr, msix_sysfs_grp);
+				if (kstrtol(dev_attr->attr.name, 10, &vec))
+					return -EINVAL;
+				kfree(dev_attr->attr.name);
+				kfree(dev_attr);
+			}
+			msix_sysfs_entry->msi_irq_group = NULL;
+			list_del(&msix_sysfs_entry->list);
+			idr_remove(dev->grp_idr, group_id);
+			kfree(msix_sysfs_entry);
+		}
+	}
+	return num_vec;
+}
+
 static void pci_intx_for_msi(struct pci_dev *dev, int enable)
 {
 	if (!(dev->dev_flags & PCI_DEV_FLAGS_MSI_INTX_DISABLE_BUG))
@@ -1052,6 +1123,45 @@ void pci_disable_msix(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_disable_msix);
 
+static void pci_msix_shutdown_grp(struct pci_dev *dev, int group_id)
+{
+	struct msi_desc *entry;
+	int grp_present = 0;
+
+	if (pci_dev_is_disconnected(dev)) {
+		dev->msix_enabled = 0;
+		return;
+	}
+
+	/* Return the device with MSI-X masked as initial states */
+	for_each_pci_msi_entry(entry, dev) {
+		if (entry->group_id == group_id) {
+			/* Keep cached states to be restored */
+			__pci_msix_desc_mask_irq(entry, 1);
+			grp_present = 1;
+		}
+	}
+
+	if (!grp_present) {
+		pci_err(dev, "Group to be disabled not present\n");
+		return;
+	}
+}
+
+int pci_disable_msix_grp(struct pci_dev *dev, int group_id)
+{
+	int num_vecs;
+
+	if (!pci_msi_enable || !dev)
+		return -EINVAL;
+
+	pci_msix_shutdown_grp(dev, group_id);
+	num_vecs = free_msi_irqs_grp(dev, group_id);
+
+	return num_vecs;
+}
+EXPORT_SYMBOL(pci_disable_msix_grp);
+
 void pci_no_msi(void)
 {
 	pci_msi_enable = 0;
@@ -1356,6 +1466,26 @@ void pci_free_irq_vectors(struct pci_dev *dev)
 EXPORT_SYMBOL(pci_free_irq_vectors);
 
 /**
+ * pci_free_irq_vectors_grp - free previously allocated IRQs for a
+ * device associated with a group
+ * @dev:		PCI device to operate on
+ * @group_id:		group to be freed
+ *
+ * Undoes the allocations and enabling in pci_alloc_irq_vectors_dyn().
+ * Can be only called for MSIx vectors.
+ */
+int pci_free_irq_vectors_grp(struct pci_dev *dev, int group_id)
+{
+	if (group_id < 0) {
+		pci_err(dev, "Group should be > 0\n");
+		return -EINVAL;
+	}
+
+	return pci_disable_msix_grp(dev, group_id);
+}
+EXPORT_SYMBOL(pci_free_irq_vectors_grp);
+
+/**
  * pci_irq_vector - return Linux IRQ number of a device vector
  * @dev: PCI device to operate on
  * @nr: device-relative interrupt vector index (0-based).
diff --git a/include/linux/msi.h b/include/linux/msi.h
index e61ba24..78929ad 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -333,6 +333,8 @@ struct irq_domain *msi_create_irq_domain(struct fwnode_handle *fwnode,
 int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 			  int nvec);
 void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
+void msi_domain_free_irqs_grp(struct irq_domain *domain, struct device *dev,
+								int group_id);
 struct msi_domain_info *msi_get_domain_info(struct irq_domain *domain);
 
 struct irq_domain *platform_msi_create_irq_domain(struct fwnode_handle *fwnode,
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 73385c0..944e539 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1404,6 +1404,7 @@ int pci_msi_vec_count(struct pci_dev *dev);
 void pci_disable_msi(struct pci_dev *dev);
 int pci_msix_vec_count(struct pci_dev *dev);
 void pci_disable_msix(struct pci_dev *dev);
+int pci_disable_msix_grp(struct pci_dev *dev, int group_id);
 void pci_restore_msi_state(struct pci_dev *dev);
 int pci_msi_enabled(void);
 int pci_enable_msi(struct pci_dev *dev);
@@ -1428,6 +1429,7 @@ int pci_alloc_irq_vectors_affinity_dyn(struct pci_dev *dev,
 				       int *group_id, bool one_shot);
 
 void pci_free_irq_vectors(struct pci_dev *dev);
+int pci_free_irq_vectors_grp(struct pci_dev *dev, int group_id);
 int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
 int pci_irq_vector_group(struct pci_dev *dev, unsigned int nr,
 						unsigned int group_id);
@@ -1439,6 +1441,8 @@ static inline int pci_msi_vec_count(struct pci_dev *dev) { return -ENOSYS; }
 static inline void pci_disable_msi(struct pci_dev *dev) { }
 static inline int pci_msix_vec_count(struct pci_dev *dev) { return -ENOSYS; }
 static inline void pci_disable_msix(struct pci_dev *dev) { }
+static inline int pci_disable_msix_grp(struct pci_dev *dev, int group_id)
+							{ return -ENOSYS; }
 static inline void pci_restore_msi_state(struct pci_dev *dev) { }
 static inline int pci_msi_enabled(void) { return 0; }
 static inline int pci_enable_msi(struct pci_dev *dev)
@@ -1475,6 +1479,11 @@ static inline void pci_free_irq_vectors(struct pci_dev *dev)
 {
 }
 
+static inline void pci_free_irq_vectors_grp(struct pci_dev *dev, int group_id)
+{
+	return 0;
+}
+
 static inline int pci_irq_vector(struct pci_dev *dev, unsigned int nr)
 {
 	if (WARN_ON_ONCE(nr > 0))
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 5cfa931..d73a5dc 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -511,6 +511,32 @@ void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
 }
 
 /**
+ * msi_domain_free_irqs_grp - Free interrupts belonging to a group from
+ * a MSI interrupt @domain associated to @dev
+ * @domain:	The domain to managing the interrupts
+ * @dev:	Pointer to device struct of the device for which the interrupt
+ *		should be freed
+ * @group_id:	The group ID to be freed
+ */
+void msi_domain_free_irqs_grp(struct irq_domain *domain, struct device *dev,
+								int group_id)
+{
+	struct msi_desc *desc;
+
+	for_each_msi_entry(desc, dev) {
+		/*
+		 * We might have failed to allocate an MSI early
+		 * enough that there is no IRQ associated to this
+		 * entry. If that's the case, don't do anything.
+		 */
+		if (desc->group_id == group_id && desc->irq) {
+			irq_domain_free_irqs(desc->irq, desc->nvec_used);
+			desc->irq = 0;
+		}
+	}
+}
+
+/**
  * msi_get_domain_info - Get the MSI interrupt domain info for @domain
  * @domain:	The interrupt domain to retrieve data from
  *
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC V1 RESEND 6/6] Documentation: PCI/MSI: Document dynamic MSI-X infrastructure
  2019-06-22  0:19 [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors Megha Dey
                   ` (4 preceding siblings ...)
  2019-06-22  0:19 ` [RFC V1 RESEND 5/6] PCI/MSI: Free MSI-X resources by group Megha Dey
@ 2019-06-22  0:19 ` Megha Dey
  2019-08-02  0:24 ` [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors Bjorn Helgaas
  6 siblings, 0 replies; 24+ messages in thread
From: Megha Dey @ 2019-06-22  0:19 UTC (permalink / raw)
  To: bhelgaas, linux-pci, linux-kernel
  Cc: tglx, marc.zyngier, ashok.raj, jacob.jun.pan, megha.dey, Megha Dey

Add Documentation for the newly introduced dynamic allocation
and deallocation of MSI-X vectors.

Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Megha Dey <megha.dey@linux.intel.com>
---
 Documentation/PCI/MSI-HOWTO.txt | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index 618e13d..5f6daf4 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -156,6 +156,44 @@ the driver can specify that only MSI or MSI-X is acceptable:
 	if (nvec < 0)
 		goto out_err;
 
+4.2.1 Dynamic MSI-X Allocation:
+
+The pci_alloc_irq_vectors() API is a one-shot method to allocate MSI resources
+i.e. they cannot be called multiple times. In order to allocate MSI-X vectors
+post probe phase, multiple times, use the following API:
+
+  int pci_alloc_irq_vectors_dyn(struct pci_dev *dev, unsigned int min_vecs,
+                unsigned int max_vecs, unsigned int flags, int *group_id);
+
+This API allocates up to max_vecs interrupt vectors for a PCI device. It returns
+the number of vectors allocated or a negative error. If the device has a
+requirement for a minimum number of vectors the driver can pass a min_vecs
+argument set to this limit, and the PCI core will return -ENOSPC if it can't
+meet the minimum number of vectors. This API is only to be used for MSI-X vectors.
+
+A group ID pointer is passed which gets populated by this function. A unique
+group_id will associated with all the MSI-X vectors allocated each time this
+function is called:
+
+	int group_id;
+	nvec = pci_alloc_irq_vectors_dyn(pdev, minvecs, maxvecs,
+					flags | PCI_IRQ_MSIX, &group_id);
+	if (nvec < 0)
+		goto out_err;
+
+To get the Linux IRQ numbers to pass to request_irq() and free_irq(), use the
+following function:
+
+  int pci_irq_vec_grp(struct pci_dev *dev, unsigned int nr, unsigned int group_id);
+
+In order to free the MSI-X resources associated with a particular group, use
+the following function:
+
+  int pci_free_irq_vectors_grp(struct pci_dev *dev, int group_id);
+
+For example, to delete the group allocated with the pci_alloc_irq_vectors_dyn(),
+	nvec = pci_free_irq_vectors_grp(pdev, group_id);
+
 4.3 Legacy APIs
 
 The following old APIs to enable and disable MSI or MSI-X interrupts should
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group
  2019-06-22  0:19 ` [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group Megha Dey
@ 2019-06-29  7:59   ` Thomas Gleixner
  2019-08-06 19:05     ` Megha Dey
  2021-01-07 22:30     ` Jacob Keller
  0 siblings, 2 replies; 24+ messages in thread
From: Thomas Gleixner @ 2019-06-29  7:59 UTC (permalink / raw)
  To: Megha Dey
  Cc: bhelgaas, linux-pci, linux-kernel, marc.zyngier, ashok.raj,
	jacob.jun.pan, megha.dey

Megha,

On Fri, 21 Jun 2019, Megha Dey wrote:

> Currently, MSI-X vector enabling and allocation for a PCIe device is
> static i.e. a device driver gets only one chance to enable a specific
> number of MSI-X vectors, usually during device probe. Also, in many
> cases, drivers usually reserve more than required number of vectors
> anticipating their use, which unnecessarily blocks resources that
> could have been made available to other devices. Lastly, there is no
> way for drivers to reserve more vectors, if the MSI-X has already been
> enabled for that device.

It would be helpful if the justification for this would contain a real
world example instead of some handwaving. Also this whole set lacks a
pointer to a driver which makes use of it.

> Hence, a dynamic MSI-X kernel infrastructure can benefit drivers by
> deferring MSI-X allocation to post probe phase, where actual demand
> information is available.
> 
> This patch enables the dynamic allocation of MSI-X vectors even after

git grep 'This patch' Documentation/process/submitting-patches.rst

> MSI-X is enabled for a PCIe device by introducing a new API:
> pci_alloc_irq_vectors_dyn().
> 
> This API can be called multiple times by the driver. The MSI-X vectors
> allocated each time are associated with a group ID. If the existing
> static allocation is used, a default group ID of -1 is assigned. The

Why -1? Why not start from group ID 0 which is the most natural thing,

> +
>  	/* Configure MSI capability structure */
>  	ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
>  	if (ret) {
> @@ -669,7 +676,7 @@ static void __iomem *msix_map_region(struct pci_dev *dev, unsigned nr_entries)
>  
>  static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
>  			      struct msix_entry *entries, int nvec,
> -			      struct irq_affinity *affd)
> +			      struct irq_affinity *affd, int group)
>  {
>  	struct irq_affinity_desc *curmsk, *masks = NULL;
>  	struct msi_desc *entry;
> @@ -698,8 +705,20 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
>  			entry->msi_attrib.entry_nr = i;
>  		entry->msi_attrib.default_irq	= dev->irq;
>  		entry->mask_base		= base;
> +		entry->group_id			= group;
>  
>  		list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));

What protects other users which might concurrently iterate the list from
falling on their nose?

Right now the device entry list is set up and then considered immutable
until the driver shuts down. So there is no need for serialization.

With that scheme entries are populated piecewise and are visible and might
be in use.

> +
> +		/*
> +		 * Save the pointer to the first msi_desc entry of every
> +		 * MSI-X group. This pointer is used by other functions
> +		 * as the starting point to iterate through each of the
> +		 * entries in that particular group.
> +		 */
> +		if (!i)
> +			dev->dev.grp_first_desc = list_last_entry
> +			(dev_to_msi_list(&dev->dev), struct msi_desc, list);

How is that supposed to work? The pointer gets overwritten on every
invocation of that interface. I assume this is merily an intermediate
storage for setup. Shudder.

> +
>  		if (masks)
>  			curmsk++;
>  	}
> @@ -715,7 +734,7 @@ static void msix_program_entries(struct pci_dev *dev,
>  	struct msi_desc *entry;
>  	int i = 0;
>  
> -	for_each_pci_msi_entry(entry, dev) {
> +	for_each_pci_msi_entry_from(entry, dev) {

  > +/* Iterate through MSI entries of device dev starting from a given desc */
  > +#define for_each_msi_entry_from(desc, dev)                             \
  > +       desc = (*dev).grp_first_desc;                                   \
  > +       list_for_each_entry_from((desc), dev_to_msi_list((dev)), list)  \

So this hides the whole group stuff behind a hideous iterator.

for_each_pci_msi_entry_from() ? from what? from the device? Sane iterators
which have a _from naming, have also a from argument.

>  		if (entries)
>  			entries[i++].vector = entry->irq;
>  		entry->masked = readl(pci_msix_desc_addr(entry) +
> @@ -730,28 +749,31 @@ static void msix_program_entries(struct pci_dev *dev,
>   * @entries: pointer to an array of struct msix_entry entries
>   * @nvec: number of @entries
>   * @affd: Optional pointer to enable automatic affinity assignement
> + * @group: The Group ID to be allocated to the msi-x vectors

That suggest that the group id is allocated in this function. Which is not
true. The id is handed in so the entries can be associated to that group.

>   *
>   * Setup the MSI-X capability structure of device function with a
>   * single MSI-X irq. A return of zero indicates the successful setup of
>   * requested MSI-X entries with allocated irqs or non-zero for otherwise.
>   **/
>  static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
> -				int nvec, struct irq_affinity *affd)
> +				int nvec, struct irq_affinity *affd, int group)
>  {
>  	int ret;
>  	u16 control;
> -	void __iomem *base;
>  
>  	/* Ensure MSI-X is disabled while it is set up */
>  	pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE, 0);
>  
>  	pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control);
> +
>  	/* Request & Map MSI-X table region */
> -	base = msix_map_region(dev, msix_table_size(control));
> -	if (!base)
> -		return -ENOMEM;
> +	if (!dev->msix_enabled) {
> +		dev->base = msix_map_region(dev, msix_table_size(control));
> +		if (!dev->base)
> +			return -ENOMEM;
> +	}
>  
> -	ret = msix_setup_entries(dev, base, entries, nvec, affd);
> +	ret = msix_setup_entries(dev, dev->base, entries, nvec, affd, group);
>  	if (ret)
>  		return ret;

Any error exit in this function will leave MSIx disabled. That means if
this is a subsequent group allocation which fails for whatever reason, this
will render all existing and possibly already in use interrupts unusable.

>  static int __pci_enable_msix_range(struct pci_dev *dev,
>  				   struct msix_entry *entries, int minvec,
> -				   int maxvec, struct irq_affinity *affd)
> +				   int maxvec, struct irq_affinity *affd,
> +				   bool one_shot, int group)
>  {
>  	int rc, nvec = maxvec;
>  
>  	if (maxvec < minvec)
>  		return -ERANGE;
>  
> -	if (WARN_ON_ONCE(dev->msix_enabled))
> -		return -EINVAL;

So any misbehaving PCI driver can now call into this without being caught.

> -
>  	for (;;) {
>  		if (affd) {
>  			nvec = irq_calc_affinity_vectors(minvec, nvec, affd);
> @@ -1096,7 +1118,8 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
>  				return -ENOSPC;
>  		}
>  
> -		rc = __pci_enable_msix(dev, entries, nvec, affd);
> +		rc = __pci_enable_msix(dev, entries, nvec, affd, one_shot,
> +									group);
>  		if (rc == 0)
>  			return nvec;
>  
> @@ -1127,7 +1150,8 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
>  int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
>  		int minvec, int maxvec)
>  {
> -	return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL);
> +	return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL,
> +								false, -1);
>  }
>  EXPORT_SYMBOL(pci_enable_msix_range);
>  
> @@ -1153,9 +1177,49 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  				   unsigned int max_vecs, unsigned int flags,
>  				   struct irq_affinity *affd)
>  {
> +	int group = -1;
> +
> +	dev->one_shot = true;

So instead of checking first whether this was already initialized you
unconditionally allow this function to be called and cause havoc.

> +
> +	return pci_alloc_irq_vectors_affinity_dyn(dev, min_vecs, max_vecs,
> +					flags, NULL, &group, dev->one_shot);
> +}
> +EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);
> +
> +/**
> + * pci_alloc_irq_vectors_affinity_dyn - allocate multiple IRQs for a device
> + * dynamically. Can be called multiple times.
> + * @dev:		PCI device to operate on
> + * @min_vecs:		minimum number of vectors required (must be >= 1)
> + * @max_vecs:		maximum (desired) number of vectors
> + * @flags:		flags or quirks for the allocation
> + * @affd:		optional description of the affinity requirements
> + * @group_id:		group ID assigned to vectors allocated

This is again misleading. It suggest that this is a group id handed in,
while it is a pointer to an int where the group id will be stored.

> + * @one_shot:		true if dynamic MSI-X allocation is disabled, else false

What? You have a interface for dynamically allocating the vectors and that
has an argument to disable dynamic allocation?

> + * Allocate up to @max_vecs interrupt vectors for @dev, using MSI-X. Return
> + * the number of vectors allocated (which might be smaller than @max_vecs)
> + * if successful, or a negative error code on error. If less than @min_vecs
> + * interrupt vectors are available for @dev the function will fail with -ENOSPC.
> + * Assign a unique group ID to the set of vectors being allocated.
> + *
> + * To get the Linux IRQ number used for a vector that can be passed to
> + * request_irq() use the pci_irq_vector_group() helper.
> + */
> +int pci_alloc_irq_vectors_affinity_dyn(struct pci_dev *dev,
> +				       unsigned int min_vecs,
> +				       unsigned int max_vecs,
> +				       unsigned int flags,
> +				       struct irq_affinity *affd,
> +				       int *group_id, bool one_shot)
> +{
> +
>  	struct irq_affinity msi_default_affd = {0};
> -	int msix_vecs = -ENOSPC;
> +	int msix_vecs = -ENOSPC, i;
>  	int msi_vecs = -ENOSPC;
> +	struct msix_entry *entries = NULL;
> +	struct msi_desc *entry;
> +	int p = 0;
>  
>  	if (flags & PCI_IRQ_AFFINITY) {
>  		if (!affd)
> @@ -1166,8 +1230,54 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  	}
>  
>  	if (flags & PCI_IRQ_MSIX) {
> -		msix_vecs = __pci_enable_msix_range(dev, NULL, min_vecs,
> -						    max_vecs, affd);
> +		if (dev->msix_enabled) {
> +			if (one_shot) {
> +				goto err_alloc;
> +			} else {
> +				for_each_pci_msi_entry(entry, dev) {
> +					if (entry->group_id != -1)
> +						p = 1;
> +				}
> +				if (!p)
> +					goto err_alloc;
> +			}
> +		} else {
> +			dev->num_msix = pci_msix_vec_count(dev);
> +			dev->entry = kcalloc(BITS_TO_LONGS(dev->num_msix),
> +					sizeof(unsigned long), GFP_KERNEL);
> +			if (!dev->entry)
> +				return -ENOMEM;
> +		}
> +
> +		entries = kcalloc(max_vecs, sizeof(struct msix_entry),
> +								GFP_KERNEL);
> +		if (entries == NULL)
> +			return -ENOMEM;
> +
> +		for (i = 0; i < max_vecs; i++) {
> +			entries[i].entry = find_first_zero_bit(
> +						dev->entry, dev->num_msix);
> +			if (entries[i].entry == dev->num_msix)
> +				return -ENOSPC;

Leaks entries and dev->entry

> +			set_bit(entries[i].entry, dev->entry);
> +		}
> +
> +		if (!one_shot) {
> +			/* Assign a unique group ID */
> +			*group_id = idr_alloc(dev->grp_idr, NULL,
> +						0, dev->num_msix, GFP_KERNEL);

Why on earth do you need an IDR for this? The group ID is merily a
cookie. So just having an u64 which starts from 0 and is incremented after
each invocation should be good enough. You can catch the wraparound case,
but even if you just use an unsigned int, then it takes 4G invocations to
reach that point.

> +			if (*group_id < 0) {
> +				if (*group_id == -ENOSPC)
> +					pci_err(dev, "No free group IDs\n");
> +				return *group_id;
> +			}
> +		}
> +
> +		msix_vecs = __pci_enable_msix_range(dev, entries, min_vecs,
> +					max_vecs, affd, one_shot, *group_id);
> +
> +		kfree(entries);
> +
>  		if (msix_vecs > 0)
>  			return msix_vecs;
>  	}
> @@ -1197,8 +1307,12 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  	if (msix_vecs == -ENOSPC)
>  		return -ENOSPC;
>  	return msi_vecs;
> +
> +err_alloc:
> +	WARN_ON_ONCE(dev->msix_enabled);

Oh well. 

> +	return -EINVAL;
>  }
> -EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);
> +EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity_dyn);
>  
> +	/* For dynamic MSI-x */
> +	dev->grp_idr = kzalloc(sizeof(struct idr), GFP_KERNEL);
> +	if (!dev->grp_idr)
> +		return NULL;
> +
> +	/* Initialise the IDR structures */
> +	idr_init(dev->grp_idr);

As already pointed out, that's overengineered.

First of all this patch is doing too many things at once. These changes
need to be split up in reviewable bits and pieces.

But I consider this approach as a POC and not something which can be meant
as a maintainable solution. It just duct tapes this new functionality into
the existing code thereby breaking things left and right. And even if you
can 'fix' these issues with more duct tape it won't be maintainable at all.

If you want to support group based allocations, then the PCI/MSI facility
has to be refactored from ground up.

  1) Introduce the concept of groups by adding a group list head to struct
     pci_dev. Ideally you create a new struct pci_dev_msi or whatever where
     all this muck goes into.

  2) Change the existing code to treat the current allocation mode as a
     group allocation. Keep the entries in a new struct msi_entry_group and
     have a group id, list head and the entries in there.

     Take care of protecting the group list.

     Assign group id 0 and add the entry_group to the list in the pci device.

     Rework the related functions so they are group aware.

     This can be split into preparatory and functional pieces, i.e. multiple
     patches.

  3) Split out the 'first time' enablement code into separate functions and
     store the relevant state in struct pci_dev_msi

  4) Rename pci_alloc_irq_vectors_affinity() to
     pci_alloc_irq_vectors_affinity_group() and add a group_id pointer
     argument.

     Make pci_alloc_irq_vectors_affinity() a wrapper function which hands
     in a NULL group id pointer and does proper sanity checking.

  5) Create similar constructs for related functionality

  6) Enable the group allocation mode for subsequent allocations

Thanks,

	tglx


  


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 3/6] x86: Introduce the dynamic teardown function
  2019-06-22  0:19 ` [RFC V1 RESEND 3/6] x86: Introduce the dynamic teardown function Megha Dey
@ 2019-06-29  8:01   ` Thomas Gleixner
  2019-08-06 19:06     ` Megha Dey
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2019-06-29  8:01 UTC (permalink / raw)
  To: Megha Dey
  Cc: bhelgaas, linux-pci, linux-kernel, marc.zyngier, ashok.raj,
	jacob.jun.pan, megha.dey

Megha,

On Fri, 21 Jun 2019, Megha Dey wrote:
>  
> +void default_teardown_msi_irqs_grp(struct pci_dev *dev, int group_id)
> +{
> +	int i;
> +	struct msi_desc *entry;
> +
> +	for_each_pci_msi_entry(entry, dev) {
> +		if (entry->group_id == group_id && entry->irq) {
> +			for (i = 0; i < entry->nvec_used; i++)
> +				arch_teardown_msi_irq(entry->irq + i);

With proper group management this whole group_id muck goes away. You hand
in a group and clean it up and if done right then you don't need a new
interface at all simply because everything is group based.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 5/6] PCI/MSI: Free MSI-X resources by group
  2019-06-22  0:19 ` [RFC V1 RESEND 5/6] PCI/MSI: Free MSI-X resources by group Megha Dey
@ 2019-06-29  8:08   ` Thomas Gleixner
  2019-08-06 19:09     ` Megha Dey
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2019-06-29  8:08 UTC (permalink / raw)
  To: Megha Dey
  Cc: bhelgaas, linux-pci, linux-kernel, marc.zyngier, ashok.raj,
	jacob.jun.pan, megha.dey

Megha,

On Fri, 21 Jun 2019, Megha Dey wrote:
> +static int free_msi_irqs_grp(struct pci_dev *dev, int group_id)
> +{

> +
> +	for_each_pci_msi_entry(entry, dev) {
> +		if (entry->group_id == group_id && entry->irq)
> +			for (i = 0; i < entry->nvec_used; i++)
> +				BUG_ON(irq_has_action(entry->irq + i));

BUG_ON is wrong here. This can and must be handled gracefully.

> +	}
> +
> +	pci_msi_teardown_msi_irqs_grp(dev, group_id);
> +
> +	list_for_each_entry_safe(entry, tmp, msi_list, list) {
> +		if (entry->group_id == group_id) {
> +			clear_bit(entry->msi_attrib.entry_nr, dev->entry);
> +			list_del(&entry->list);
> +			free_msi_entry(entry);
> +		}
> +	}
> +
> +	list_for_each_entry_safe(msix_sysfs_entry, tmp_msix, pci_msix, list) {
> +		if (msix_sysfs_entry->group_id == group_id) {

Again. Proper group management makes all of that just straight forward and
not yet another special case.

> +			msi_attrs = msix_sysfs_entry->msi_irq_group->attrs;
>  
> +static void pci_msix_shutdown_grp(struct pci_dev *dev, int group_id)
> +{
> +	struct msi_desc *entry;
> +	int grp_present = 0;
> +
> +	if (pci_dev_is_disconnected(dev)) {
> +		dev->msix_enabled = 0;

Huch? What's that? I can't figure out why this is needed and of course it
completely lacks a comment explaining this. 

> +		return;
> +	}
> +
> +	/* Return the device with MSI-X masked as initial states */
> +	for_each_pci_msi_entry(entry, dev) {
> +		if (entry->group_id == group_id) {
> +			/* Keep cached states to be restored */
> +			__pci_msix_desc_mask_irq(entry, 1);
> +			grp_present = 1;
> +		}
> +	}
> +
> +	if (!grp_present) {
> +		pci_err(dev, "Group to be disabled not present\n");
> +		return;

So you print an error and silently return

> +	}
> +}
> +
> +int pci_disable_msix_grp(struct pci_dev *dev, int group_id)
> +{
> +	int num_vecs;
> +
> +	if (!pci_msi_enable || !dev)
> +		return -EINVAL;
> +
> +	pci_msix_shutdown_grp(dev, group_id);
> +	num_vecs = free_msi_irqs_grp(dev, group_id);

just to call in another function which has to do the same group_id lookup
muck again.

> +
> +	return num_vecs;
> +}
> +EXPORT_SYMBOL(pci_disable_msix_grp);

Why is this exposed ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors
  2019-06-22  0:19 [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors Megha Dey
                   ` (5 preceding siblings ...)
  2019-06-22  0:19 ` [RFC V1 RESEND 6/6] Documentation: PCI/MSI: Document dynamic MSI-X infrastructure Megha Dey
@ 2019-08-02  0:24 ` Bjorn Helgaas
  2019-08-06 19:12   ` Megha Dey
  6 siblings, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2019-08-02  0:24 UTC (permalink / raw)
  To: Megha Dey
  Cc: linux-pci, linux-kernel, tglx, marc.zyngier, ashok.raj,
	jacob.jun.pan, megha.dey

Hi Megha,

On Fri, Jun 21, 2019 at 05:19:32PM -0700, Megha Dey wrote:
> Currently, MSI-X vector enabling and allocation for a PCIe device is
> static i.e. a device driver gets only one chance to enable a specific
> number of MSI-X vectors, usually during device probe. Also, in many
> cases, drivers usually reserve more than required number of vectors
> anticipating their use, which unnecessarily blocks resources that
> could have been made available to other devices. Lastly, there is no
> way for drivers to reserve more vectors, if the MSI-x has already been
> enabled for that device.
>  
> Hence, a dynamic MSI-X kernel infrastructure can benefit drivers by
> deferring MSI-X allocation to post probe phase, where actual demand
> information is available.
>  
> This patchset enables the dynamic allocation/de-allocation of MSI-X
> vectors by introducing 2 new APIs:
> pci_alloc_irq_vectors_dyn() and pci_free_irq_vectors_grp():
> 
> We have had requests from some of the NIC/RDMA users who have lots of
> interrupt resources and would like to allocate them on demand,
> instead of using an all or none approach.
> 
> The APIs are fairly well tested (multiple allocations/deallocations),
> but we have no early adopters yet. Hence, sending this series as an
> RFC for review and comments.
> 
> The patches are based out of Linux 5.2-rc5.
> 
> Megha Dey (6):
>   PCI/MSI: New structures/macros for dynamic MSI-X allocation
>   PCI/MSI: Dynamic allocation of MSI-X vectors by group
>   x86: Introduce the dynamic teardown function
>   PCI/MSI: Introduce new structure to manage MSI-x entries

s/MSI-x/MSI-X/
If "entries" here means the same as "vectors" above, please use the
same word.

>   PCI/MSI: Free MSI-X resources by group

Is "resources" the same as "vectors"?

>   Documentation: PCI/MSI: Document dynamic MSI-X infrastructure

When you post a v2 after addressing Thomas' comments, please make
these subject lines imperative sentences beginning with a descriptive
verb.  You can run "git log --oneline drivers/pci" to see the style.
If you're adding a specific interface or structure, mention it by name
in the subject if it's practical.  The "x86" line needs a little more
context; I assume it should include "IRQ", "MSI-X", or something.

>  Documentation/PCI/MSI-HOWTO.txt |  38 +++++
>  arch/x86/include/asm/x86_init.h |   1 +
>  arch/x86/kernel/x86_init.c      |   6 +
>  drivers/pci/msi.c               | 363 +++++++++++++++++++++++++++++++++++++---
>  drivers/pci/probe.c             |   9 +
>  include/linux/device.h          |   3 +
>  include/linux/msi.h             |  13 ++
>  include/linux/pci.h             |  61 +++++++
>  kernel/irq/msi.c                |  34 +++-
>  9 files changed, 497 insertions(+), 31 deletions(-)
> 
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group
  2019-06-29  7:59   ` Thomas Gleixner
@ 2019-08-06 19:05     ` Megha Dey
  2019-08-07 13:56       ` Thomas Gleixner
  2021-01-07 22:30     ` Jacob Keller
  1 sibling, 1 reply; 24+ messages in thread
From: Megha Dey @ 2019-08-06 19:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bhelgaas, linux-pci, linux-kernel, marc.zyngier, ashok.raj,
	jacob.jun.pan

On Sat, 2019-06-29 at 09:59 +0200, Thomas Gleixner wrote:
> Megha,
> 
> On Fri, 21 Jun 2019, Megha Dey wrote:
> 
> > 
> > Currently, MSI-X vector enabling and allocation for a PCIe device
> > is
> > static i.e. a device driver gets only one chance to enable a
> > specific
> > number of MSI-X vectors, usually during device probe. Also, in many
> > cases, drivers usually reserve more than required number of vectors
> > anticipating their use, which unnecessarily blocks resources that
> > could have been made available to other devices. Lastly, there is
> > no
> > way for drivers to reserve more vectors, if the MSI-X has already
> > been
> > enabled for that device.
> It would be helpful if the justification for this would contain a
> real
> world example instead of some handwaving. Also this whole set lacks a
> pointer to a driver which makes use of it.

Hi Thomas,

Totally agreed. The request to add a dynamic MSI-X infrastructure came
from some driver teams internally and currently they do not have
bandwidth to come up with relevant test cases. <sigh>

But we hope that this patch set could serve as a precursor to the
interrupt message store (IMS) patch set, and we can use this patch set
as the baseline for the IMS patches.

> 
> > 
> > Hence, a dynamic MSI-X kernel infrastructure can benefit drivers by
> > deferring MSI-X allocation to post probe phase, where actual demand
> > information is available.
> > 
> > This patch enables the dynamic allocation of MSI-X vectors even
> > after
> git grep 'This patch' Documentation/process/submitting-patches.rst
> 

Yeah, will change this in V2.

> > 
> > MSI-X is enabled for a PCIe device by introducing a new API:
> > pci_alloc_irq_vectors_dyn().
> > 
> > This API can be called multiple times by the driver. The MSI-X
> > vectors
> > allocated each time are associated with a group ID. If the existing
> > static allocation is used, a default group ID of -1 is assigned.
> > The
> Why -1? Why not start from group ID 0 which is the most natural
> thing,
> 

According to this scheme, we had divided static/dynamic allocation of
the MSI-X vectors, hence the statically allocated vectors were assigned
a default group ID -1. The dynamic allocation group IDs would start
from 0.

But now if we are going to treat the current static allocation of MSI-X 
vectors also as a group ID, the IDs should start from 0.

> > 
> > +
> >  	/* Configure MSI capability structure */
> >  	ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
> >  	if (ret) {
> > @@ -669,7 +676,7 @@ static void __iomem *msix_map_region(struct
> > pci_dev *dev, unsigned nr_entries)
> >  
> >  static int msix_setup_entries(struct pci_dev *dev, void __iomem
> > *base,
> >  			      struct msix_entry *entries, int
> > nvec,
> > -			      struct irq_affinity *affd)
> > +			      struct irq_affinity *affd, int
> > group)
> >  {
> >  	struct irq_affinity_desc *curmsk, *masks = NULL;
> >  	struct msi_desc *entry;
> > @@ -698,8 +705,20 @@ static int msix_setup_entries(struct pci_dev
> > *dev, void __iomem *base,
> >  			entry->msi_attrib.entry_nr = i;
> >  		entry->msi_attrib.default_irq	= dev->irq;
> >  		entry->mask_base		= base;
> > +		entry->group_id			= group;
> >  
> >  		list_add_tail(&entry->list, dev_to_msi_list(&dev-
> > >dev));
> What protects other users which might concurrently iterate the list
> from
> falling on their nose?
> 
> Right now the device entry list is set up and then considered
> immutable
> until the driver shuts down. So there is no need for serialization.
> 
> With that scheme entries are populated piecewise and are visible and
> might
> be in use.

Yeah, good point, I will add a mutex_lock to protect the list.

According to your proposal, now we will have 2 lists; one for the
groups and one for the entries belonging to that group, so I will add
the necessary protection for both of these lists.

> > 
> > +
> > +		/*
> > +		 * Save the pointer to the first msi_desc entry of
> > every
> > +		 * MSI-X group. This pointer is used by other
> > functions
> > +		 * as the starting point to iterate through each
> > of the
> > +		 * entries in that particular group.
> > +		 */
> > +		if (!i)
> > +			dev->dev.grp_first_desc = list_last_entry
> > +			(dev_to_msi_list(&dev->dev), struct
> > msi_desc, list);
> How is that supposed to work? The pointer gets overwritten on every
> invocation of that interface. I assume this is merily an intermediate
> storage for setup. Shudder.
> 

Yes, you are right.

The grp_first_desc is simply a temporary storage to store the
first msi_desc entry of every group, which can be used by other
functions to iterate through the entries belonging to that group only,
using the for_each_pci_msi_entry/ for_each_msi_entry_from macro. It is
not the cleanest of solutions, I agree.

With your proposal of supporting a separate group list, I don't think
there will be a need to use this kind of temporary storage variable.

> > 
> > +
> >  		if (masks)
> >  			curmsk++;
> >  	}
> > @@ -715,7 +734,7 @@ static void msix_program_entries(struct pci_dev
> > *dev,
> >  	struct msi_desc *entry;
> >  	int i = 0;
> >  
> > -	for_each_pci_msi_entry(entry, dev) {
> > +	for_each_pci_msi_entry_from(entry, dev) {
>   > +/* Iterate through MSI entries of device dev starting from a
> given desc */
>   > +#define for_each_msi_entry_from(desc,
> dev)                             \
>   > +       desc =
> (*dev).grp_first_desc;                                   \
>   > +       list_for_each_entry_from((desc), dev_to_msi_list((dev)),
> list)  \
> 
> So this hides the whole group stuff behind a hideous iterator.
> 
> for_each_pci_msi_entry_from() ? from what? from the device? Sane
> iterators
> which have a _from naming, have also a from argument.
> 

This was meant to be "iterate over all the entries belonging to a
group", sorry if that was not clear. 

The current 'for_each_pci_msi_entry' macro iterates through all the
msi_desc entries belonging to a particular device. Since we have a
piecewise allocation of the MSI-X vectors with this change, we would
want to iterate only through the newly added entries, i.e the entries
allocated to the current group.

In V2, I will introduce a new macro, 'for_each_pci_msi_entry_group',
which will only iterate through the msi_desc entries belonging to a
particular group.

> > 
> >  		if (entries)
> >  			entries[i++].vector = entry->irq;
> >  		entry->masked = readl(pci_msix_desc_addr(entry) +
> > @@ -730,28 +749,31 @@ static void msix_program_entries(struct
> > pci_dev *dev,
> >   * @entries: pointer to an array of struct msix_entry entries
> >   * @nvec: number of @entries
> >   * @affd: Optional pointer to enable automatic affinity
> > assignement
> > + * @group: The Group ID to be allocated to the msi-x vectors
> That suggest that the group id is allocated in this function. Which
> is not
> true. The id is handed in so the entries can be associated to that
> group.
> 

Yeah , my bad , I will update this in V2

> > 
> >   *
> >   * Setup the MSI-X capability structure of device function with a
> >   * single MSI-X irq. A return of zero indicates the successful
> > setup of
> >   * requested MSI-X entries with allocated irqs or non-zero for
> > otherwise.
> >   **/
> >  static int msix_capability_init(struct pci_dev *dev, struct
> > msix_entry *entries,
> > -				int nvec, struct irq_affinity
> > *affd)
> > +				int nvec, struct irq_affinity
> > *affd, int group)
> >  {
> >  	int ret;
> >  	u16 control;
> > -	void __iomem *base;
> >  
> >  	/* Ensure MSI-X is disabled while it is set up */
> >  	pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE,
> > 0);
> >  
> >  	pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS,
> > &control);
> > +
> >  	/* Request & Map MSI-X table region */
> > -	base = msix_map_region(dev, msix_table_size(control));
> > -	if (!base)
> > -		return -ENOMEM;
> > +	if (!dev->msix_enabled) {
> > +		dev->base = msix_map_region(dev,
> > msix_table_size(control));
> > +		if (!dev->base)
> > +			return -ENOMEM;
> > +	}
> >  
> > -	ret = msix_setup_entries(dev, base, entries, nvec, affd);
> > +	ret = msix_setup_entries(dev, dev->base, entries, nvec,
> > affd, group);
> >  	if (ret)
> >  		return ret;
> Any error exit in this function will leave MSIx disabled. That means
> if
> this is a subsequent group allocation which fails for whatever
> reason, this
> will render all existing and possibly already in use interrupts
> unusable.
> 

Hmmm yeah, I hadn't thought about this!

So according to the code, we must 'Ensure MSI-X is disabled while it is
set up'. MSI-X would be disabled until the setup of the new vectors is
complete, even if we do not take the error exit right?

Earlier this was not a problem since we disable the MSI-X, setup all
the vectors at once, and then enable the MSI-X once and for all. 

I am not sure how to avoid disabling of MSI-X here.

> > 
> >  static int __pci_enable_msix_range(struct pci_dev *dev,
> >  				   struct msix_entry *entries, int
> > minvec,
> > -				   int maxvec, struct irq_affinity
> > *affd)
> > +				   int maxvec, struct irq_affinity
> > *affd,
> > +				   bool one_shot, int group)
> >  {
> >  	int rc, nvec = maxvec;
> >  
> >  	if (maxvec < minvec)
> >  		return -ERANGE;
> >  
> > -	if (WARN_ON_ONCE(dev->msix_enabled))
> > -		return -EINVAL;
> So any misbehaving PCI driver can now call into this without being
> caught.
> 

I do not understand what misbehaving PCI driver means :(

Basically this statement is what denies multiple MSI-X vector
allocations, and I wanted to remove it so that we could do just that.

Please let me know how I could change this.

> > 
> > -
> >  	for (;;) {
> >  		if (affd) {
> >  			nvec = irq_calc_affinity_vectors(minvec,
> > nvec, affd);
> > @@ -1096,7 +1118,8 @@ static int __pci_enable_msix_range(struct
> > pci_dev *dev,
> >  				return -ENOSPC;
> >  		}
> >  
> > -		rc = __pci_enable_msix(dev, entries, nvec, affd);
> > +		rc = __pci_enable_msix(dev, entries, nvec, affd,
> > one_shot,
> > +									
> > group);
> >  		if (rc == 0)
> >  			return nvec;
> >  
> > @@ -1127,7 +1150,8 @@ static int __pci_enable_msix_range(struct
> > pci_dev *dev,
> >  int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry
> > *entries,
> >  		int minvec, int maxvec)
> >  {
> > -	return __pci_enable_msix_range(dev, entries, minvec,
> > maxvec, NULL);
> > +	return __pci_enable_msix_range(dev, entries, minvec,
> > maxvec, NULL,
> > +								fa
> > lse, -1);
> >  }
> >  EXPORT_SYMBOL(pci_enable_msix_range);
> >  
> > @@ -1153,9 +1177,49 @@ int pci_alloc_irq_vectors_affinity(struct
> > pci_dev *dev, unsigned int min_vecs,
> >  				   unsigned int max_vecs, unsigned
> > int flags,
> >  				   struct irq_affinity *affd)
> >  {
> > +	int group = -1;
> > +
> > +	dev->one_shot = true;
> So instead of checking first whether this was already initialized you
> unconditionally allow this function to be called and cause havoc.
> 

Ok, got your point.

With your proposal, we can remove the one_shot member, as we will
change to use the current allocation as a group allocation itself.

> > 
> > +
> > +	return pci_alloc_irq_vectors_affinity_dyn(dev, min_vecs,
> > max_vecs,
> > +					flags, NULL, &group, dev-
> > >one_shot);
> > +}
> > +EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);
> > +
> > +/**
> > + * pci_alloc_irq_vectors_affinity_dyn - allocate multiple IRQs for
> > a device
> > + * dynamically. Can be called multiple times.
> > + * @dev:		PCI device to operate on
> > + * @min_vecs:		minimum number of vectors required
> > (must be >= 1)
> > + * @max_vecs:		maximum (desired) number of vectors
> > + * @flags:		flags or quirks for the allocation
> > + * @affd:		optional description of the affinity
> > requirements
> > + * @group_id:		group ID assigned to vectors
> > allocated
> This is again misleading. It suggest that this is a group id handed
> in,
> while it is a pointer to an int where the group id will be stored.
> 

Yeah will update this in V2.

> > 
> > + * @one_shot:		true if dynamic MSI-X allocation is
> > disabled, else false
> What? You have a interface for dynamically allocating the vectors and
> that
> has an argument to disable dynamic allocation?

Well my intention was to let the user choose if they wanted the
one_shot(static) or the dynamic approach to allocate MSI-X vectors.

Of course when we change to your suggested approach, we will not have
this issue, since everything will be a group allocation.

> 
> > 
> > + * Allocate up to @max_vecs interrupt vectors for @dev, using MSI-
> > X. Return
> > + * the number of vectors allocated (which might be smaller than
> > @max_vecs)
> > + * if successful, or a negative error code on error. If less than
> > @min_vecs
> > + * interrupt vectors are available for @dev the function will fail
> > with -ENOSPC.
> > + * Assign a unique group ID to the set of vectors being allocated.
> > + *
> > + * To get the Linux IRQ number used for a vector that can be
> > passed to
> > + * request_irq() use the pci_irq_vector_group() helper.
> > + */
> > +int pci_alloc_irq_vectors_affinity_dyn(struct pci_dev *dev,
> > +				       unsigned int min_vecs,
> > +				       unsigned int max_vecs,
> > +				       unsigned int flags,
> > +				       struct irq_affinity *affd,
> > +				       int *group_id, bool
> > one_shot)
> > +{
> > +
> >  	struct irq_affinity msi_default_affd = {0};
> > -	int msix_vecs = -ENOSPC;
> > +	int msix_vecs = -ENOSPC, i;
> >  	int msi_vecs = -ENOSPC;
> > +	struct msix_entry *entries = NULL;
> > +	struct msi_desc *entry;
> > +	int p = 0;
> >  
> >  	if (flags & PCI_IRQ_AFFINITY) {
> >  		if (!affd)
> > @@ -1166,8 +1230,54 @@ int pci_alloc_irq_vectors_affinity(struct
> > pci_dev *dev, unsigned int min_vecs,
> >  	}
> >  
> >  	if (flags & PCI_IRQ_MSIX) {
> > -		msix_vecs = __pci_enable_msix_range(dev, NULL,
> > min_vecs,
> > -						    max_vecs,
> > affd);
> > +		if (dev->msix_enabled) {
> > +			if (one_shot) {
> > +				goto err_alloc;
> > +			} else {
> > +				for_each_pci_msi_entry(entry, dev)
> > {
> > +					if (entry->group_id != -1)
> > +						p = 1;
> > +				}
> > +				if (!p)
> > +					goto err_alloc;
> > +			}
> > +		} else {
> > +			dev->num_msix = pci_msix_vec_count(dev);
> > +			dev->entry = kcalloc(BITS_TO_LONGS(dev-
> > >num_msix),
> > +					sizeof(unsigned long),
> > GFP_KERNEL);
> > +			if (!dev->entry)
> > +				return -ENOMEM;
> > +		}
> > +
> > +		entries = kcalloc(max_vecs, sizeof(struct
> > msix_entry),
> > +								GF
> > P_KERNEL);
> > +		if (entries == NULL)
> > +			return -ENOMEM;
> > +
> > +		for (i = 0; i < max_vecs; i++) {
> > +			entries[i].entry = find_first_zero_bit(
> > +						dev->entry, dev-
> > >num_msix);
> > +			if (entries[i].entry == dev->num_msix)
> > +				return -ENOSPC;
> Leaks entries and dev->entry
> 

Hmm, I do a kfree(entries) later. But yeah the dev->entry has a memory
leak.

> > 
> > +			set_bit(entries[i].entry, dev->entry);
> > +		}
> > +
> > +		if (!one_shot) {
> > +			/* Assign a unique group ID */
> > +			*group_id = idr_alloc(dev->grp_idr, NULL,
> > +						0, dev->num_msix,
> > GFP_KERNEL);
> Why on earth do you need an IDR for this? The group ID is merily a
> cookie. So just having an u64 which starts from 0 and is incremented
> after
> each invocation should be good enough. You can catch the wraparound
> case,
> but even if you just use an unsigned int, then it takes 4G
> invocations to
> reach that point.
> 

I thought an IDR is better to manage the group IDs i.e to track
the group ID which have been allocated/freed, so that the group ID
could be reused.

For now MSI-X can have a maximum of 2048 interrupts, but going forward
the IMS interrupts have no theoretical limit. Since it seems unlikely
that there will be > 2^64 IMS vectors for any system, for now we can
have the simple approach suggested by you, and make changes later if
necessary.

So, in V2 the group ID will only be a monotonically increasing number,
which does not get reused even if the group is deleted at some point.

> > 
> > +			if (*group_id < 0) {
> > +				if (*group_id == -ENOSPC)
> > +					pci_err(dev, "No free
> > group IDs\n");
> > +				return *group_id;
> > +			}
> > +		}
> > +
> > +		msix_vecs = __pci_enable_msix_range(dev, entries,
> > min_vecs,
> > +					max_vecs, affd, one_shot,
> > *group_id);
> > +
> > +		kfree(entries);
> > +
> >  		if (msix_vecs > 0)
> >  			return msix_vecs;
> >  	}
> > @@ -1197,8 +1307,12 @@ int pci_alloc_irq_vectors_affinity(struct
> > pci_dev *dev, unsigned int min_vecs,
> >  	if (msix_vecs == -ENOSPC)
> >  		return -ENOSPC;
> >  	return msi_vecs;
> > +
> > +err_alloc:
> > +	WARN_ON_ONCE(dev->msix_enabled);
> Oh well.
> 

Will remove this in V2.

> > 
> > +	return -EINVAL;
> >  }
> > -EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);
> > +EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity_dyn);
> >  
> > +	/* For dynamic MSI-x */
> > +	dev->grp_idr = kzalloc(sizeof(struct idr), GFP_KERNEL);
> > +	if (!dev->grp_idr)
> > +		return NULL;
> > +
> > +	/* Initialise the IDR structures */
> > +	idr_init(dev->grp_idr);
> As already pointed out, that's overengineered.
> 
> First of all this patch is doing too many things at once. These
> changes
> need to be split up in reviewable bits and pieces.

Yes, will submit smaller changes in V2.

> 
> But I consider this approach as a POC and not something which can be
> meant
> as a maintainable solution. It just duct tapes this new functionality
> into
> the existing code thereby breaking things left and right. And even if
> you
> can 'fix' these issues with more duct tape it won't be maintainable
> at all.
> 

Yeah I agree that this is not maintainable in the long run.

> If you want to support group based allocations, then the PCI/MSI
> facility
> has to be refactored from ground up.
> 
>   1) Introduce the concept of groups by adding a group list head to
> struct
>      pci_dev. Ideally you create a new struct pci_dev_msi or whatever
> where
>      all this muck goes into.
> 

I think we can use the existing list_head 'msi_list' in the struct
device for this, instead of having a new list_head for the group. So
now instead of msi_list being a list of all the msi_desc entries, it
will have a list of the different groups associated with the device.

IMHO, since IMS is non PCI compliant, having this group_list_head would
be better off in struct device than struct pci_dev, which would enable
code reuse.

>   2) Change the existing code to treat the current allocation mode as
> a
>      group allocation. Keep the entries in a new struct
> msi_entry_group and
>      have a group id, list head and the entries in there.
> 

I am thinking of something like this, please let me know if this is
what you are suggesting:

1. Introduce a new msi_entry_group struct:
struct msi_entry_grp {
  int group_id; // monotonically increasing group_id
  int num_vecs; // number of msi_desc entries per group
  struct list_head group_list; // Added to msi_list in struct device
  struct list_head entry_list; // list of msi_desc entries for this grp
}

2. Add a new 'for_each_pci_msi_entry_group' macro. This macro should
only iterate through the msi_desc entries belonging to a group.

3. The existing for_each_pci_msi_entry, needs to be modified so that it
is backward compatible. This macro should still be able to iterate
through all the entries in all the groups. 

>      Take care of protecting the group list.
> 
>      Assign group id 0 and add the entry_group to the list in the pci
> device.
> 
>      Rework the related functions so they are group aware.
> 
>      This can be split into preparatory and functional pieces, i.e.
> multiple
>      patches.
> 
>   3) Split out the 'first time' enablement code into separate
> functions and
>      store the relevant state in struct pci_dev_msi
> 
>   4) Rename pci_alloc_irq_vectors_affinity() to
>      pci_alloc_irq_vectors_affinity_group() and add a group_id
> pointer
>      argument.
> 
>      Make pci_alloc_irq_vectors_affinity() a wrapper function which
> hands
>      in a NULL group id pointer and does proper sanity checking.
> 
>   5) Create similar constructs for related functionality
> 
>   6) Enable the group allocation mode for subsequent allocations
> 

All this makes perfect sense. Thanks for the detailed feedback Thomas,
I will send out a V2 shortly.

> Thanks,
> 
> 	tglx
> 
> 
>   
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 3/6] x86: Introduce the dynamic teardown function
  2019-06-29  8:01   ` Thomas Gleixner
@ 2019-08-06 19:06     ` Megha Dey
  0 siblings, 0 replies; 24+ messages in thread
From: Megha Dey @ 2019-08-06 19:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bhelgaas, linux-pci, linux-kernel, marc.zyngier, ashok.raj,
	jacob.jun.pan, megha.dey

On Sat, 2019-06-29 at 10:01 +0200, Thomas Gleixner wrote:
> Megha,
> 
> On Fri, 21 Jun 2019, Megha Dey wrote:
> > 
> >  
> > +void default_teardown_msi_irqs_grp(struct pci_dev *dev, int
> > group_id)
> > +{
> > +	int i;
> > +	struct msi_desc *entry;
> > +
> > +	for_each_pci_msi_entry(entry, dev) {
> > +		if (entry->group_id == group_id && entry->irq) {
> > +			for (i = 0; i < entry->nvec_used; i++)
> > +				arch_teardown_msi_irq(entry->irq +
> > i);
> With proper group management this whole group_id muck goes away. You
> hand
> in a group and clean it up and if done right then you don't need a
> new
> interface at all simply because everything is group based.
> 

Yeah , with the new proposal, this will hopefully be much cleaner.

> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 5/6] PCI/MSI: Free MSI-X resources by group
  2019-06-29  8:08   ` Thomas Gleixner
@ 2019-08-06 19:09     ` Megha Dey
  2019-08-11  7:18       ` Thomas Gleixner
  0 siblings, 1 reply; 24+ messages in thread
From: Megha Dey @ 2019-08-06 19:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bhelgaas, linux-pci, linux-kernel, marc.zyngier, ashok.raj,
	jacob.jun.pan, megha.dey

On Sat, 2019-06-29 at 10:08 +0200, Thomas Gleixner wrote:
> Megha,
> 
> On Fri, 21 Jun 2019, Megha Dey wrote:
> > 
> > +static int free_msi_irqs_grp(struct pci_dev *dev, int group_id)
> > +{
> > 
> > +
> > +	for_each_pci_msi_entry(entry, dev) {
> > +		if (entry->group_id == group_id && entry->irq)
> > +			for (i = 0; i < entry->nvec_used; i++)
> > +				BUG_ON(irq_has_action(entry->irq +
> > i));
> BUG_ON is wrong here. This can and must be handled gracefully.
> 

Hmm, I reused this code from the 'free_msi_irqs' function. I am not
sure why it is wrong to use BUG_ON here but ok to use it there, please
let me know.

> > 
> > +	}
> > +
> > +	pci_msi_teardown_msi_irqs_grp(dev, group_id);
> > +
> > +	list_for_each_entry_safe(entry, tmp, msi_list, list) {
> > +		if (entry->group_id == group_id) {
> > +			clear_bit(entry->msi_attrib.entry_nr, dev-
> > >entry);
> > +			list_del(&entry->list);
> > +			free_msi_entry(entry);
> > +		}
> > +	}
> > +
> > +	list_for_each_entry_safe(msix_sysfs_entry, tmp_msix,
> > pci_msix, list) {
> > +		if (msix_sysfs_entry->group_id == group_id) {
> Again. Proper group management makes all of that just straight
> forward and
> not yet another special case.
> 

Yeah, the new proposal of having a group_list would get rid of this.

> > 
> > +			msi_attrs = msix_sysfs_entry-
> > >msi_irq_group->attrs;
> >  
> > +static void pci_msix_shutdown_grp(struct pci_dev *dev, int
> > group_id)
> > +{
> > +	struct msi_desc *entry;
> > +	int grp_present = 0;
> > +
> > +	if (pci_dev_is_disconnected(dev)) {
> > +		dev->msix_enabled = 0;
> Huch? What's that? I can't figure out why this is needed and of
> course it
> completely lacks a comment explaining this. 
> 

Again, I have reused this code from the pci_msix_shutdown() function.
So for the group case, this is not required?

> > 
> > +		return;
> > +	}
> > +
> > +	/* Return the device with MSI-X masked as initial states
> > */
> > +	for_each_pci_msi_entry(entry, dev) {
> > +		if (entry->group_id == group_id) {
> > +			/* Keep cached states to be restored */
> > +			__pci_msix_desc_mask_irq(entry, 1);
> > +			grp_present = 1;
> > +		}
> > +	}
> > +
> > +	if (!grp_present) {
> > +		pci_err(dev, "Group to be disabled not
> > present\n");
> > +		return;
> So you print an error and silently return
> 

This is a void function, hence no error value can be returned. What do
you think is the right thing to do if someone wants to delete a group
which is not present?

> > 
> > +	}
> > +}
> > +
> > +int pci_disable_msix_grp(struct pci_dev *dev, int group_id)
> > +{
> > +	int num_vecs;
> > +
> > +	if (!pci_msi_enable || !dev)
> > +		return -EINVAL;
> > +
> > +	pci_msix_shutdown_grp(dev, group_id);
> > +	num_vecs = free_msi_irqs_grp(dev, group_id);
> just to call in another function which has to do the same group_id
> lookup
> muck again.

Even with the new proposal, we are to have 2 sets of functions: one to
delete all the msic_desc entries associated with the device, and the
other to delete those only belonging a 'user specified' group. So we do
need to pass a group_id to these functions right? Yes, internally the
deletion would be straightforward with the new approach.

> 
> > 
> > +
> > +	return num_vecs;
> > +}
> > +EXPORT_SYMBOL(pci_disable_msix_grp);
> Why is this exposed ?
> 

As before, I just followed what pci_disable_msix() did <sigh>. Looks
like pci_disable_msix is called from a variety of drivers, thus it is
exposed. Currently, no one will use the pci_disable_msix_grp()
externally, thus it need not be exposed for now.

> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors
  2019-08-02  0:24 ` [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors Bjorn Helgaas
@ 2019-08-06 19:12   ` Megha Dey
  0 siblings, 0 replies; 24+ messages in thread
From: Megha Dey @ 2019-08-06 19:12 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, linux-kernel, tglx, marc.zyngier, ashok.raj, jacob.jun.pan

On Thu, 2019-08-01 at 19:24 -0500, Bjorn Helgaas wrote:
> Hi Megha,
> 
> On Fri, Jun 21, 2019 at 05:19:32PM -0700, Megha Dey wrote:
> > 
> > Currently, MSI-X vector enabling and allocation for a PCIe device
> > is
> > static i.e. a device driver gets only one chance to enable a
> > specific
> > number of MSI-X vectors, usually during device probe. Also, in many
> > cases, drivers usually reserve more than required number of vectors
> > anticipating their use, which unnecessarily blocks resources that
> > could have been made available to other devices. Lastly, there is
> > no
> > way for drivers to reserve more vectors, if the MSI-x has already
> > been
> > enabled for that device.
> >  
> > Hence, a dynamic MSI-X kernel infrastructure can benefit drivers by
> > deferring MSI-X allocation to post probe phase, where actual demand
> > information is available.
> >  
> > This patchset enables the dynamic allocation/de-allocation of MSI-X
> > vectors by introducing 2 new APIs:
> > pci_alloc_irq_vectors_dyn() and pci_free_irq_vectors_grp():
> > 
> > We have had requests from some of the NIC/RDMA users who have lots
> > of
> > interrupt resources and would like to allocate them on demand,
> > instead of using an all or none approach.
> > 
> > The APIs are fairly well tested (multiple
> > allocations/deallocations),
> > but we have no early adopters yet. Hence, sending this series as an
> > RFC for review and comments.
> > 
> > The patches are based out of Linux 5.2-rc5.
> > 
> > Megha Dey (6):
> >   PCI/MSI: New structures/macros for dynamic MSI-X allocation
> >   PCI/MSI: Dynamic allocation of MSI-X vectors by group
> >   x86: Introduce the dynamic teardown function
> >   PCI/MSI: Introduce new structure to manage MSI-x entries
> s/MSI-x/MSI-X/
> If "entries" here means the same as "vectors" above, please use the
> same word.
> 

Hi Bjorn,

Well, here entries basically mean the msi_desc entries. I thought MSI
vectors are used for each address/data pair, hence used the term
entries. Will update it to vectors to ensure uniformity.

> > 
> >   PCI/MSI: Free MSI-X resources by group
> Is "resources" the same as "vectors"?
> 

Yes, will update this in V2.

> > 
> >   Documentation: PCI/MSI: Document dynamic MSI-X infrastructure
> When you post a v2 after addressing Thomas' comments, please make
> these subject lines imperative sentences beginning with a descriptive
> verb.  You can run "git log --oneline drivers/pci" to see the style.

Sure, I will update the subject lines in V2.

> If you're adding a specific interface or structure, mention it by
> name
> in the subject if it's practical.  The "x86" line needs a little more
> context; I assume it should include "IRQ", "MSI-X", or something.
> 

True, will change this in V2.

> > 
> >  Documentation/PCI/MSI-HOWTO.txt |  38 +++++
> >  arch/x86/include/asm/x86_init.h |   1 +
> >  arch/x86/kernel/x86_init.c      |   6 +
> >  drivers/pci/msi.c               | 363
> > +++++++++++++++++++++++++++++++++++++---
> >  drivers/pci/probe.c             |   9 +
> >  include/linux/device.h          |   3 +
> >  include/linux/msi.h             |  13 ++
> >  include/linux/pci.h             |  61 +++++++
> >  kernel/irq/msi.c                |  34 +++-
> >  9 files changed, 497 insertions(+), 31 deletions(-)
> > 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group
  2019-08-06 19:05     ` Megha Dey
@ 2019-08-07 13:56       ` Thomas Gleixner
  2019-08-07 14:18         ` Marc Zyngier
  2019-08-12 17:47         ` Megha Dey
  0 siblings, 2 replies; 24+ messages in thread
From: Thomas Gleixner @ 2019-08-07 13:56 UTC (permalink / raw)
  To: Megha Dey
  Cc: bhelgaas, linux-pci, linux-kernel, marc.zyngier, ashok.raj,
	jacob.jun.pan

[-- Attachment #1: Type: text/plain, Size: 8186 bytes --]

Megha,

On Tue, 6 Aug 2019, Megha Dey wrote:
> On Sat, 2019-06-29 at 09:59 +0200, Thomas Gleixner wrote:
> > On Fri, 21 Jun 2019, Megha Dey wrote:
> 
> Totally agreed. The request to add a dynamic MSI-X infrastructure came
> from some driver teams internally and currently they do not have
> bandwidth to come up with relevant test cases. <sigh>

Hahahaha.

> But we hope that this patch set could serve as a precursor to the
> interrupt message store (IMS) patch set, and we can use this patch set
> as the baseline for the IMS patches.

If IMS needs the same functionality, then we need to think about it
slightly differently because IMS is not necessarily tied to PCI.
 
IMS has some similarity to the ARM GIC ITS stuff IIRC, which already
provides these things outside of PCI. Marc?

We probably need some generic infrastructure for this so PCI and everything
else can use it.

> > > +		/*
> > > +		 * Save the pointer to the first msi_desc entry of
> > > every
> > > +		 * MSI-X group. This pointer is used by other
> > > functions
> > > +		 * as the starting point to iterate through each
> > > of the
> > > +		 * entries in that particular group.
> > > +		 */
> > > +		if (!i)
> > > +			dev->dev.grp_first_desc = list_last_entry
> > > +			(dev_to_msi_list(&dev->dev), struct
> > > msi_desc, list);
> > How is that supposed to work? The pointer gets overwritten on every
> > invocation of that interface. I assume this is merily an intermediate
> > storage for setup. Shudder.
> > 
> 
> Yes, you are right.
> 
> The grp_first_desc is simply a temporary storage to store the
> first msi_desc entry of every group, which can be used by other
> functions to iterate through the entries belonging to that group only,
> using the for_each_pci_msi_entry/ for_each_msi_entry_from macro. It is
> not the cleanest of solutions, I agree.

Yeah, it's too ugly to exist.

> With your proposal of supporting a separate group list, I don't think
> there will be a need to use this kind of temporary storage variable.

Exactly.

> > > -	for_each_pci_msi_entry(entry, dev) {
> > > +	for_each_pci_msi_entry_from(entry, dev) {
> >   > +/* Iterate through MSI entries of device dev starting from a
> > given desc */
> >   > +#define for_each_msi_entry_from(desc,
> > dev)                             \
> >   > +       desc =
> > (*dev).grp_first_desc;                                   \
> >   > +       list_for_each_entry_from((desc), dev_to_msi_list((dev)),
> > list)  \
> > 
> > So this hides the whole group stuff behind a hideous iterator.
> > 
> > for_each_pci_msi_entry_from() ? from what? from the device? Sane
> > iterators
> > which have a _from naming, have also a from argument.
> > 
> 
> This was meant to be "iterate over all the entries belonging to a
> group", sorry if that was not clear. 
> 
> The current 'for_each_pci_msi_entry' macro iterates through all the
> msi_desc entries belonging to a particular device. Since we have a
> piecewise allocation of the MSI-X vectors with this change, we would
> want to iterate only through the newly added entries, i.e the entries
> allocated to the current group.

I understand that, but please make macros and function names so they are
halfways self explaining and intuitive.

 > In V2, I will introduce a new macro, 'for_each_pci_msi_entry_group',
> which will only iterate through the msi_desc entries belonging to a
> particular group.

for_each_pci_msi_entry_group()

is ambiguous. It could mean to iterate over the groups. 

for_each_pci_msi_entry_in_group()

avoids that.
 
> > > -	ret = msix_setup_entries(dev, base, entries, nvec, affd);
> > > +	ret = msix_setup_entries(dev, dev->base, entries, nvec,
> > > affd, group);
> > >  	if (ret)
> > >  		return ret;
> > Any error exit in this function will leave MSIx disabled. That means
> > if
> > this is a subsequent group allocation which fails for whatever
> > reason, this
> > will render all existing and possibly already in use interrupts
> > unusable.
> > 
> 
> Hmmm yeah, I hadn't thought about this!
> 
> So according to the code, we must 'Ensure MSI-X is disabled while it is
> set up'. MSI-X would be disabled until the setup of the new vectors is
> complete, even if we do not take the error exit right?
> 
> Earlier this was not a problem since we disable the MSI-X, setup all
> the vectors at once, and then enable the MSI-X once and for all. 
> 
> I am not sure how to avoid disabling of MSI-X here.

The problem with your code is that is keeps it disabled in case of an
error, which makes all existing users (groups) starve.

But, yes there is also the question what happens during the time when
interrupts are raised on already configured devices exactly during the time
where MSI-X is disabled temporarily to setup a new group. I fear that will
end up with lost interrupts and/or spurious interrupts via the legacy
INT[ABCD]. That really needs to be investigated _before_ we go there.

> > >  static int __pci_enable_msix_range(struct pci_dev *dev,
> > >  				   struct msix_entry *entries, int
> > > minvec,
> > > -				   int maxvec, struct irq_affinity
> > > *affd)
> > > +				   int maxvec, struct irq_affinity
> > > *affd,
> > > +				   bool one_shot, int group)
> > >  {
> > >  	int rc, nvec = maxvec;
> > >  
> > >  	if (maxvec < minvec)
> > >  		return -ERANGE;
> > >  
> > > -	if (WARN_ON_ONCE(dev->msix_enabled))
> > > -		return -EINVAL;
> > So any misbehaving PCI driver can now call into this without being
> > caught.
> > 
> 
> I do not understand what misbehaving PCI driver means :(

The one which calls into that interface _AFTER_ msix is enabled. We catch
that right now and reject it.

> Basically this statement is what denies multiple MSI-X vector
> allocations, and I wanted to remove it so that we could do just that.
>
> Please let me know how I could change this.

There are several ways to do that, but it needs to be made conditionally on
things like 'device has group mode support' ...
 
> > If you want to support group based allocations, then the PCI/MSI
> > facility
> > has to be refactored from ground up.
> > 
> >   1) Introduce the concept of groups by adding a group list head to
> > struct
> >      pci_dev. Ideally you create a new struct pci_dev_msi or whatever
> > where
> >      all this muck goes into.
> > 
> 
> I think we can use the existing list_head 'msi_list' in the struct
> device for this, instead of having a new list_head for the group. So
> now instead of msi_list being a list of all the msi_desc entries, it
> will have a list of the different groups associated with the device.
> 
> IMHO, since IMS is non PCI compliant, having this group_list_head would
> be better off in struct device than struct pci_dev, which would enable
> code reuse.

Sure, but then we really need to look at the IMS requirements in order not
to rewrite this whole thing over and over.

> >   2) Change the existing code to treat the current allocation mode as
> > a
> >      group allocation. Keep the entries in a new struct
> > msi_entry_group and
> >      have a group id, list head and the entries in there.
> > 
> 
> I am thinking of something like this, please let me know if this is
> what you are suggesting:
> 
> 1. Introduce a new msi_entry_group struct:
> struct msi_entry_grp {
>   int group_id; // monotonically increasing group_id
>   int num_vecs; // number of msi_desc entries per group
>   struct list_head group_list; // Added to msi_list in struct device
>   struct list_head entry_list; // list of msi_desc entries for this grp
> }

Looks about right.
 
> 2. Add a new 'for_each_pci_msi_entry_group' macro. This macro should
> only iterate through the msi_desc entries belonging to a group.

See above.
 
> 3. The existing for_each_pci_msi_entry, needs to be modified so that it
> is backward compatible. This macro should still be able to iterate
> through all the entries in all the groups. 

I'm not sure. It might be just the thing which iterates over group 0, which
is the default for all devices which do not use/support group mode, but
let's see.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group
  2019-08-07 13:56       ` Thomas Gleixner
@ 2019-08-07 14:18         ` Marc Zyngier
  2019-08-11  7:20           ` Thomas Gleixner
  2019-08-12 17:48           ` Megha Dey
  2019-08-12 17:47         ` Megha Dey
  1 sibling, 2 replies; 24+ messages in thread
From: Marc Zyngier @ 2019-08-07 14:18 UTC (permalink / raw)
  To: Thomas Gleixner, Megha Dey
  Cc: bhelgaas, linux-pci, linux-kernel, ashok.raj, jacob.jun.pan

On 07/08/2019 14:56, Thomas Gleixner wrote:
> Megha,
> 
> On Tue, 6 Aug 2019, Megha Dey wrote:
>> On Sat, 2019-06-29 at 09:59 +0200, Thomas Gleixner wrote:
>>> On Fri, 21 Jun 2019, Megha Dey wrote:
>>
>> Totally agreed. The request to add a dynamic MSI-X infrastructure came
>> from some driver teams internally and currently they do not have
>> bandwidth to come up with relevant test cases. <sigh>
> 
> Hahahaha.
> 
>> But we hope that this patch set could serve as a precursor to the
>> interrupt message store (IMS) patch set, and we can use this patch set
>> as the baseline for the IMS patches.
> 
> If IMS needs the same functionality, then we need to think about it
> slightly differently because IMS is not necessarily tied to PCI.
>  
> IMS has some similarity to the ARM GIC ITS stuff IIRC, which already
> provides these things outside of PCI. Marc?

Indeed. We have MSI-like functionality almost everywhere, and make heavy
use of the generic MSI framework. Platform-MSI is probably the most
generic example we have (it's the Far West transposed to MSIs).

> We probably need some generic infrastructure for this so PCI and everything
> else can use it.

Indeed. Overall, I'd like the concept of MSI on whatever bus to have one
single behaviour across the board, as long as it makes sense for that
bus (nobody needs another PCI MultiMSI, for example).

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 5/6] PCI/MSI: Free MSI-X resources by group
  2019-08-06 19:09     ` Megha Dey
@ 2019-08-11  7:18       ` Thomas Gleixner
  2019-08-12 18:13         ` Megha Dey
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2019-08-11  7:18 UTC (permalink / raw)
  To: Megha Dey
  Cc: bhelgaas, linux-pci, linux-kernel, marc.zyngier, ashok.raj,
	jacob.jun.pan, megha.dey

[-- Attachment #1: Type: text/plain, Size: 3233 bytes --]

On Tue, 6 Aug 2019, Megha Dey wrote:

> On Sat, 2019-06-29 at 10:08 +0200, Thomas Gleixner wrote:
> > Megha,
> > 
> > On Fri, 21 Jun 2019, Megha Dey wrote:
> > > 
> > > +static int free_msi_irqs_grp(struct pci_dev *dev, int group_id)
> > > +{
> > > 
> > > +
> > > +	for_each_pci_msi_entry(entry, dev) {
> > > +		if (entry->group_id == group_id && entry->irq)
> > > +			for (i = 0; i < entry->nvec_used; i++)
> > > +				BUG_ON(irq_has_action(entry->irq +
> > > i));
> > BUG_ON is wrong here. This can and must be handled gracefully.
> > 
> 
> Hmm, I reused this code from the 'free_msi_irqs' function. I am not
> sure why it is wrong to use BUG_ON here but ok to use it there, please
> let me know.

We are not adding BUG_ON() anymore except for situations where there is
absolutely no way out. Just because there is still older code having
BUG_ON() does not make it any better. Copying it surely is no
justification.

If there is really no way out, then you need to explain it.
 
> > > +static void pci_msix_shutdown_grp(struct pci_dev *dev, int
> > > group_id)
> > > +{
> > > +	struct msi_desc *entry;
> > > +	int grp_present = 0;
> > > +
> > > +	if (pci_dev_is_disconnected(dev)) {
> > > +		dev->msix_enabled = 0;
> > Huch? What's that? I can't figure out why this is needed and of
> > course it
> > completely lacks a comment explaining this. 
> > 
> 
> Again, I have reused this code from the pci_msix_shutdown() function.
> So for the group case, this is not required?

Copy and paste is not an argument, really. Can this happen here? If so,
then please add a comment.

> > > 
> > > +		return;
> > > +	}
> > > +
> > > +	/* Return the device with MSI-X masked as initial states
> > > */
> > > +	for_each_pci_msi_entry(entry, dev) {
> > > +		if (entry->group_id == group_id) {
> > > +			/* Keep cached states to be restored */
> > > +			__pci_msix_desc_mask_irq(entry, 1);
> > > +			grp_present = 1;
> > > +		}
> > > +	}
> > > +
> > > +	if (!grp_present) {
> > > +		pci_err(dev, "Group to be disabled not
> > > present\n");
> > > +		return;
> > So you print an error and silently return
> > 
> 
> This is a void function, hence no error value can be returned. What do
> you think is the right thing to do if someone wants to delete a group
> which is not present?

Well, you made it a void function. 
 
> > > 
> > > +	}
> > > +}
> > > +
> > > +int pci_disable_msix_grp(struct pci_dev *dev, int group_id)
> > > +{
> > > +	int num_vecs;
> > > +
> > > +	if (!pci_msi_enable || !dev)
> > > +		return -EINVAL;
> > > +
> > > +	pci_msix_shutdown_grp(dev, group_id);
> > > +	num_vecs = free_msi_irqs_grp(dev, group_id);
> > just to call in another function which has to do the same group_id
> > lookup
> > muck again.
> 
> Even with the new proposal, we are to have 2 sets of functions: one to
> delete all the msic_desc entries associated with the device, and the
> other to delete those only belonging a 'user specified' group. So we do
> need to pass a group_id to these functions right? Yes, internally the
> deletion would be straightforward with the new approach.

That does not matter. If pci_msix_shutdown_grp() does not find a group, why
proceeding instead of having a proper error return and telling the caller?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group
  2019-08-07 14:18         ` Marc Zyngier
@ 2019-08-11  7:20           ` Thomas Gleixner
  2019-08-12 17:54             ` Megha Dey
  2019-08-12 17:48           ` Megha Dey
  1 sibling, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2019-08-11  7:20 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Megha Dey, bhelgaas, linux-pci, linux-kernel, ashok.raj, jacob.jun.pan

On Wed, 7 Aug 2019, Marc Zyngier wrote:
> On 07/08/2019 14:56, Thomas Gleixner wrote:
> > On Tue, 6 Aug 2019, Megha Dey wrote:
> >> On Sat, 2019-06-29 at 09:59 +0200, Thomas Gleixner wrote:
> >>> On Fri, 21 Jun 2019, Megha Dey wrote:
> >>
> >> Totally agreed. The request to add a dynamic MSI-X infrastructure came
> >> from some driver teams internally and currently they do not have
> >> bandwidth to come up with relevant test cases. <sigh>
> > 
> > Hahahaha.
> > 
> >> But we hope that this patch set could serve as a precursor to the
> >> interrupt message store (IMS) patch set, and we can use this patch set
> >> as the baseline for the IMS patches.
> > 
> > If IMS needs the same functionality, then we need to think about it
> > slightly differently because IMS is not necessarily tied to PCI.
> >  
> > IMS has some similarity to the ARM GIC ITS stuff IIRC, which already
> > provides these things outside of PCI. Marc?
> 
> Indeed. We have MSI-like functionality almost everywhere, and make heavy
> use of the generic MSI framework. Platform-MSI is probably the most
> generic example we have (it's the Far West transposed to MSIs).
> 
> > We probably need some generic infrastructure for this so PCI and everything
> > else can use it.
> 
> Indeed. Overall, I'd like the concept of MSI on whatever bus to have one
> single behaviour across the board, as long as it makes sense for that
> bus (nobody needs another PCI MultiMSI, for example).

Right.

@Intel: Is there documentation and perhaps early code for that IMS muck to
	look at?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group
  2019-08-07 13:56       ` Thomas Gleixner
  2019-08-07 14:18         ` Marc Zyngier
@ 2019-08-12 17:47         ` Megha Dey
  1 sibling, 0 replies; 24+ messages in thread
From: Megha Dey @ 2019-08-12 17:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bhelgaas, linux-pci, linux-kernel, marc.zyngier, ashok.raj,
	jacob.jun.pan

On Wed, 2019-08-07 at 15:56 +0200, Thomas Gleixner wrote:
> Megha,
> 
> On Tue, 6 Aug 2019, Megha Dey wrote:
> > 
> > On Sat, 2019-06-29 at 09:59 +0200, Thomas Gleixner wrote:
> > > 
> > > On Fri, 21 Jun 2019, Megha Dey wrote:
> > Totally agreed. The request to add a dynamic MSI-X infrastructure
> > came
> > from some driver teams internally and currently they do not have
> > bandwidth to come up with relevant test cases. <sigh>
> Hahahaha.
> 
> > 
> > But we hope that this patch set could serve as a precursor to the
> > interrupt message store (IMS) patch set, and we can use this patch
> > set
> > as the baseline for the IMS patches.
> If IMS needs the same functionality, then we need to think about it
> slightly differently because IMS is not necessarily tied to PCI.
>  
> IMS has some similarity to the ARM GIC ITS stuff IIRC, which already
> provides these things outside of PCI. Marc?
> 
> We probably need some generic infrastructure for this so PCI and
> everything
> else can use it.

Yeah agreed, I will look at the ARM GIC ITS to see how we can
consolidate this code.

> 
> > 
> > > 
> > > > 
> > > > +		/*
> > > > +		 * Save the pointer to the first msi_desc
> > > > entry of
> > > > every
> > > > +		 * MSI-X group. This pointer is used by other
> > > > functions
> > > > +		 * as the starting point to iterate through
> > > > each
> > > > of the
> > > > +		 * entries in that particular group.
> > > > +		 */
> > > > +		if (!i)
> > > > +			dev->dev.grp_first_desc =
> > > > list_last_entry
> > > > +			(dev_to_msi_list(&dev->dev), struct
> > > > msi_desc, list);
> > > How is that supposed to work? The pointer gets overwritten on
> > > every
> > > invocation of that interface. I assume this is merily an
> > > intermediate
> > > storage for setup. Shudder.
> > > 
> > Yes, you are right.
> > 
> > The grp_first_desc is simply a temporary storage to store the
> > first msi_desc entry of every group, which can be used by other
> > functions to iterate through the entries belonging to that group
> > only,
> > using the for_each_pci_msi_entry/ for_each_msi_entry_from macro. It
> > is
> > not the cleanest of solutions, I agree.
> Yeah, it's too ugly to exist.
> 

will get rid of it.

> > 
> > With your proposal of supporting a separate group list, I don't
> > think
> > there will be a need to use this kind of temporary storage
> > variable.
> Exactly.
> 
> > 
> > > 
> > > > 
> > > > -	for_each_pci_msi_entry(entry, dev) {
> > > > +	for_each_pci_msi_entry_from(entry, dev) {
> > >   > +/* Iterate through MSI entries of device dev starting from a
> > > given desc */
> > >   > +#define for_each_msi_entry_from(desc,
> > > dev)                             \
> > >   > +       desc =
> > > (*dev).grp_first_desc;                                   \
> > >   > +       list_for_each_entry_from((desc),
> > > dev_to_msi_list((dev)),
> > > list)  \
> > > 
> > > So this hides the whole group stuff behind a hideous iterator.
> > > 
> > > for_each_pci_msi_entry_from() ? from what? from the device? Sane
> > > iterators
> > > which have a _from naming, have also a from argument.
> > > 
> > This was meant to be "iterate over all the entries belonging to a
> > group", sorry if that was not clear. 
> > 
> > The current 'for_each_pci_msi_entry' macro iterates through all the
> > msi_desc entries belonging to a particular device. Since we have a
> > piecewise allocation of the MSI-X vectors with this change, we
> > would
> > want to iterate only through the newly added entries, i.e the
> > entries
> > allocated to the current group.
> I understand that, but please make macros and function names so they
> are
> halfways self explaining and intuitive.
> 

yeah, point taken.

>  > In V2, I will introduce a new macro,
> 'for_each_pci_msi_entry_group',
> > 
> > which will only iterate through the msi_desc entries belonging to a
> > particular group.
> for_each_pci_msi_entry_group()
> 
> is ambiguous. It could mean to iterate over the groups. 
> 
> for_each_pci_msi_entry_in_group()
> 
> avoids that.

Yes, agreed.
>  
> > 
> > > 
> > > > 
> > > > -	ret = msix_setup_entries(dev, base, entries, nvec,
> > > > affd);
> > > > +	ret = msix_setup_entries(dev, dev->base, entries,
> > > > nvec,
> > > > affd, group);
> > > >  	if (ret)
> > > >  		return ret;
> > > Any error exit in this function will leave MSIx disabled. That
> > > means
> > > if
> > > this is a subsequent group allocation which fails for whatever
> > > reason, this
> > > will render all existing and possibly already in use interrupts
> > > unusable.
> > > 
> > Hmmm yeah, I hadn't thought about this!
> > 
> > So according to the code, we must 'Ensure MSI-X is disabled while
> > it is
> > set up'. MSI-X would be disabled until the setup of the new vectors
> > is
> > complete, even if we do not take the error exit right?
> > 
> > Earlier this was not a problem since we disable the MSI-X, setup
> > all
> > the vectors at once, and then enable the MSI-X once and for all. 
> > 
> > I am not sure how to avoid disabling of MSI-X here.
> The problem with your code is that is keeps it disabled in case of an
> error, which makes all existing users (groups) starve.
> 
> But, yes there is also the question what happens during the time when
> interrupts are raised on already configured devices exactly during
> the time
> where MSI-X is disabled temporarily to setup a new group. I fear that
> will
> end up with lost interrupts and/or spurious interrupts via the legacy
> INT[ABCD]. That really needs to be investigated _before_ we go there.
> 

Hmm, yeah that is my concern, I do not know what the behavior would be.
Any experiments that I can run to know more?

> > 
> > > 
> > > > 
> > > >  static int __pci_enable_msix_range(struct pci_dev *dev,
> > > >  				   struct msix_entry *entries,
> > > > int
> > > > minvec,
> > > > -				   int maxvec, struct
> > > > irq_affinity
> > > > *affd)
> > > > +				   int maxvec, struct
> > > > irq_affinity
> > > > *affd,
> > > > +				   bool one_shot, int group)
> > > >  {
> > > >  	int rc, nvec = maxvec;
> > > >  
> > > >  	if (maxvec < minvec)
> > > >  		return -ERANGE;
> > > >  
> > > > -	if (WARN_ON_ONCE(dev->msix_enabled))
> > > > -		return -EINVAL;
> > > So any misbehaving PCI driver can now call into this without
> > > being
> > > caught.
> > > 
> > I do not understand what misbehaving PCI driver means :(
> The one which calls into that interface _AFTER_ msix is enabled. We
> catch
> that right now and reject it.
> 

Right.
> > 
> > Basically this statement is what denies multiple MSI-X vector
> > allocations, and I wanted to remove it so that we could do just
> > that.
> > 
> > Please let me know how I could change this.
> There are several ways to do that, but it needs to be made
> conditionally on
> things like 'device has group mode support' ...
>  

Ok.
> > 
> > > 
> > > If you want to support group based allocations, then the PCI/MSI
> > > facility
> > > has to be refactored from ground up.
> > > 
> > >   1) Introduce the concept of groups by adding a group list head
> > > to
> > > struct
> > >      pci_dev. Ideally you create a new struct pci_dev_msi or
> > > whatever
> > > where
> > >      all this muck goes into.
> > > 
> > I think we can use the existing list_head 'msi_list' in the struct
> > device for this, instead of having a new list_head for the group.
> > So
> > now instead of msi_list being a list of all the msi_desc entries,
> > it
> > will have a list of the different groups associated with the
> > device.
> > 
> > IMHO, since IMS is non PCI compliant, having this group_list_head
> > would
> > be better off in struct device than struct pci_dev, which would
> > enable
> > code reuse.
> Sure, but then we really need to look at the IMS requirements in
> order not
> to rewrite this whole thing over and over.
> 

Yeah, since IMS is non PCI compliant, if we make all these changes at
the struct device level unlike the struct pci_dev level, I think we can
reuse the same code for all MSI/ MSI like interrupts.

Having said that I will send out an RFC of the initial IMS code, to
avoid any sort of code duplication.
> > 
> > > 
> > >   2) Change the existing code to treat the current allocation
> > > mode as
> > > a
> > >      group allocation. Keep the entries in a new struct
> > > msi_entry_group and
> > >      have a group id, list head and the entries in there.
> > > 
> > I am thinking of something like this, please let me know if this is
> > what you are suggesting:
> > 
> > 1. Introduce a new msi_entry_group struct:
> > struct msi_entry_grp {
> >   int group_id; // monotonically increasing group_id
> >   int num_vecs; // number of msi_desc entries per group
> >   struct list_head group_list; // Added to msi_list in struct
> > device
> >   struct list_head entry_list; // list of msi_desc entries for this
> > grp
> > }
> Looks about right.

Ok! 
> > 
> > 2. Add a new 'for_each_pci_msi_entry_group' macro. This macro
> > should
> > only iterate through the msi_desc entries belonging to a group.
> See above.
>  

will update this is V2.

> > 
> > 3. The existing for_each_pci_msi_entry, needs to be modified so
> > that it
> > is backward compatible. This macro should still be able to iterate
> > through all the entries in all the groups. 
> I'm not sure. It might be just the thing which iterates over group 0,
> which
> is the default for all devices which do not use/support group mode,
> but
> let's see.
>  

Yeah, I am currently trying to get this approach to work i.e. reworking
the for_each_pci_msi_entry macro to be group aware (not gotten it to
work thus far, though).

> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group
  2019-08-07 14:18         ` Marc Zyngier
  2019-08-11  7:20           ` Thomas Gleixner
@ 2019-08-12 17:48           ` Megha Dey
  1 sibling, 0 replies; 24+ messages in thread
From: Megha Dey @ 2019-08-12 17:48 UTC (permalink / raw)
  To: Marc Zyngier, Thomas Gleixner
  Cc: bhelgaas, linux-pci, linux-kernel, ashok.raj, jacob.jun.pan

On Wed, 2019-08-07 at 15:18 +0100, Marc Zyngier wrote:
> On 07/08/2019 14:56, Thomas Gleixner wrote:
> > 
> > Megha,
> > 
> > On Tue, 6 Aug 2019, Megha Dey wrote:
> > > 
> > > On Sat, 2019-06-29 at 09:59 +0200, Thomas Gleixner wrote:
> > > > 
> > > > On Fri, 21 Jun 2019, Megha Dey wrote:
> > > Totally agreed. The request to add a dynamic MSI-X infrastructure
> > > came
> > > from some driver teams internally and currently they do not have
> > > bandwidth to come up with relevant test cases. <sigh>
> > Hahahaha.
> > 
> > > 
> > > But we hope that this patch set could serve as a precursor to the
> > > interrupt message store (IMS) patch set, and we can use this
> > > patch set
> > > as the baseline for the IMS patches.
> > If IMS needs the same functionality, then we need to think about it
> > slightly differently because IMS is not necessarily tied to PCI.
> >  
> > IMS has some similarity to the ARM GIC ITS stuff IIRC, which
> > already
> > provides these things outside of PCI. Marc?
> Indeed. We have MSI-like functionality almost everywhere, and make
> heavy
> use of the generic MSI framework. Platform-MSI is probably the most
> generic example we have (it's the Far West transposed to MSIs).
> 
Ok I will have a look at the platform-msi code.
> > 
> > We probably need some generic infrastructure for this so PCI and
> > everything
> > else can use it.
> Indeed. Overall, I'd like the concept of MSI on whatever bus to have
> one
> single behaviour across the board, as long as it makes sense for that
> bus (nobody needs another PCI MultiMSI, for example).

Yeah, agreed.
> 
> Thanks,
> 
> 	M.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group
  2019-08-11  7:20           ` Thomas Gleixner
@ 2019-08-12 17:54             ` Megha Dey
  0 siblings, 0 replies; 24+ messages in thread
From: Megha Dey @ 2019-08-12 17:54 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier
  Cc: bhelgaas, linux-pci, linux-kernel, ashok.raj, jacob.jun.pan

On Sun, 2019-08-11 at 09:20 +0200, Thomas Gleixner wrote:
> On Wed, 7 Aug 2019, Marc Zyngier wrote:
> > 
> > On 07/08/2019 14:56, Thomas Gleixner wrote:
> > > 
> > > On Tue, 6 Aug 2019, Megha Dey wrote:
> > > > 
> > > > On Sat, 2019-06-29 at 09:59 +0200, Thomas Gleixner wrote:
> > > > > 
> > > > > On Fri, 21 Jun 2019, Megha Dey wrote:
> > > > Totally agreed. The request to add a dynamic MSI-X
> > > > infrastructure came
> > > > from some driver teams internally and currently they do not
> > > > have
> > > > bandwidth to come up with relevant test cases. <sigh>
> > > Hahahaha.
> > > 
> > > > 
> > > > But we hope that this patch set could serve as a precursor to
> > > > the
> > > > interrupt message store (IMS) patch set, and we can use this
> > > > patch set
> > > > as the baseline for the IMS patches.
> > > If IMS needs the same functionality, then we need to think about
> > > it
> > > slightly differently because IMS is not necessarily tied to PCI.
> > >  
> > > IMS has some similarity to the ARM GIC ITS stuff IIRC, which
> > > already
> > > provides these things outside of PCI. Marc?
> > Indeed. We have MSI-like functionality almost everywhere, and make
> > heavy
> > use of the generic MSI framework. Platform-MSI is probably the most
> > generic example we have (it's the Far West transposed to MSIs).
> > 
> > > 
> > > We probably need some generic infrastructure for this so PCI and
> > > everything
> > > else can use it.
> > Indeed. Overall, I'd like the concept of MSI on whatever bus to
> > have one
> > single behaviour across the board, as long as it makes sense for
> > that
> > bus (nobody needs another PCI MultiMSI, for example).
> Right.
> 
> @Intel: Is there documentation and perhaps early code for that IMS
> muck to
> 	look at?

Hi Thomas,

We have some early documentation on IMS which can be found here (part
of the Scalable I/O Virtualization spec):

https://software.intel.com/sites/default/files/managed/cc/0e/intel-scal
able-io-virtualization-technical-specification.pdf
(Section 3.4.1)

Also, I do have some initial IMS patches that we use internally for
testing. I will clean it up and send it as an RFC patchset shortly.
 
> 
> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 5/6] PCI/MSI: Free MSI-X resources by group
  2019-08-11  7:18       ` Thomas Gleixner
@ 2019-08-12 18:13         ` Megha Dey
  0 siblings, 0 replies; 24+ messages in thread
From: Megha Dey @ 2019-08-12 18:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bhelgaas, linux-pci, linux-kernel, marc.zyngier, ashok.raj,
	jacob.jun.pan

On Sun, 2019-08-11 at 09:18 +0200, Thomas Gleixner wrote:
> On Tue, 6 Aug 2019, Megha Dey wrote:
> 
> > 
> > On Sat, 2019-06-29 at 10:08 +0200, Thomas Gleixner wrote:
> > > 
> > > Megha,
> > > 
> > > On Fri, 21 Jun 2019, Megha Dey wrote:
> > > > 
> > > > 
> > > > +static int free_msi_irqs_grp(struct pci_dev *dev, int
> > > > group_id)
> > > > +{
> > > > 
> > > > +
> > > > +	for_each_pci_msi_entry(entry, dev) {
> > > > +		if (entry->group_id == group_id && entry->irq)
> > > > +			for (i = 0; i < entry->nvec_used; i++)
> > > > +				BUG_ON(irq_has_action(entry-
> > > > >irq +
> > > > i));
> > > BUG_ON is wrong here. This can and must be handled gracefully.
> > > 
> > Hmm, I reused this code from the 'free_msi_irqs' function. I am not
> > sure why it is wrong to use BUG_ON here but ok to use it there,
> > please
> > let me know.
> We are not adding BUG_ON() anymore except for situations where there
> is
> absolutely no way out. Just because there is still older code having
> BUG_ON() does not make it any better. Copying it surely is no
> justification.
> 
> If there is really no way out, then you need to explain it.
> 

Ok, got it. I will look into it closely to see if the BUG_ON is
absolutely required.

>  
> > 
> > > 
> > > > 
> > > > +static void pci_msix_shutdown_grp(struct pci_dev *dev, int
> > > > group_id)
> > > > +{
> > > > +	struct msi_desc *entry;
> > > > +	int grp_present = 0;
> > > > +
> > > > +	if (pci_dev_is_disconnected(dev)) {
> > > > +		dev->msix_enabled = 0;
> > > Huch? What's that? I can't figure out why this is needed and of
> > > course it
> > > completely lacks a comment explaining this. 
> > > 
> > Again, I have reused this code from the pci_msix_shutdown()
> > function.
> > So for the group case, this is not required?
> Copy and paste is not an argument, really. Can this happen here? If
> so,
> then please add a comment.
> 

Ok, will do.
> > 
> > > 
> > > > 
> > > > 
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	/* Return the device with MSI-X masked as initial
> > > > states
> > > > */
> > > > +	for_each_pci_msi_entry(entry, dev) {
> > > > +		if (entry->group_id == group_id) {
> > > > +			/* Keep cached states to be restored
> > > > */
> > > > +			__pci_msix_desc_mask_irq(entry, 1);
> > > > +			grp_present = 1;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	if (!grp_present) {
> > > > +		pci_err(dev, "Group to be disabled not
> > > > present\n");
> > > > +		return;
> > > So you print an error and silently return
> > > 
> > This is a void function, hence no error value can be returned. What
> > do
> > you think is the right thing to do if someone wants to delete a
> > group
> > which is not present?
> Well, you made it a void function. 

ok sure, got your point.
>  
> > 
> > > 
> > > > 
> > > > 
> > > > +	}
> > > > +}
> > > > +
> > > > +int pci_disable_msix_grp(struct pci_dev *dev, int group_id)
> > > > +{
> > > > +	int num_vecs;
> > > > +
> > > > +	if (!pci_msi_enable || !dev)
> > > > +		return -EINVAL;
> > > > +
> > > > +	pci_msix_shutdown_grp(dev, group_id);
> > > > +	num_vecs = free_msi_irqs_grp(dev, group_id);
> > > just to call in another function which has to do the same
> > > group_id
> > > lookup
> > > muck again.
> > Even with the new proposal, we are to have 2 sets of functions: one
> > to
> > delete all the msic_desc entries associated with the device, and
> > the
> > other to delete those only belonging a 'user specified' group. So
> > we do
> > need to pass a group_id to these functions right? Yes, internally
> > the
> > deletion would be straightforward with the new approach.
> That does not matter. If pci_msix_shutdown_grp() does not find a
> group, why
> proceeding instead of having a proper error return and telling the
> caller?
> 

Oh ok, I got it now, I will return a proper error code in the
pci_msi_shutdown_grp and do the free_msi_irqs_grp only if the group is
found.

> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group
  2019-06-29  7:59   ` Thomas Gleixner
  2019-08-06 19:05     ` Megha Dey
@ 2021-01-07 22:30     ` Jacob Keller
  1 sibling, 0 replies; 24+ messages in thread
From: Jacob Keller @ 2021-01-07 22:30 UTC (permalink / raw)
  To: Thomas Gleixner, Megha Dey
  Cc: bhelgaas, linux-pci, linux-kernel, marc.zyngier, ashok.raj,
	jacob.jun.pan, megha.dey

Hi Thomas,

On 6/29/2019 12:59 AM, Thomas Gleixner wrote:
> As already pointed out, that's overengineered.
> 
> First of all this patch is doing too many things at once. These changes
> need to be split up in reviewable bits and pieces.
> 

I am looking into this work as I want to implement ability to do grouped
partial allocations of MSI-X vectors over time in the ice Linux NIC driver.

> But I consider this approach as a POC and not something which can be meant
> as a maintainable solution. It just duct tapes this new functionality into
> the existing code thereby breaking things left and right. And even if you
> can 'fix' these issues with more duct tape it won't be maintainable at all.
> 
> If you want to support group based allocations, then the PCI/MSI facility
> has to be refactored from ground up.
> 

I agree that this is the right direction to go, but I am having some
trouble with following these steps when I started trying to implement
this stuff.

>   1) Introduce the concept of groups by adding a group list head to struct
>      pci_dev. Ideally you create a new struct pci_dev_msi or whatever where
>      all this muck goes into.
> 

So my big problem I keep running into is that struct msi_desc is used by
several code paths that aren't PCI. It looks a bit odd trying to
refactor things to support groups for the non-PCI bus code that uses
struct msi_desc...

I'd appreciate any further thoughts you have on the right way to go
forward here. I think that treated vector allocations as groups is a
huge improvement, as it will make it easier to manage allocating MSI-X
vectors without running into exhaustion issues due to over allocating.

But does this need to be introduced as part of the generic linux/msi.h
stuff? Doing this means refactoring a bunch of code paths which don't
seem to care about grouping. But I can't find a good way to handle this
grouping in just the PCI layer.

>   2) Change the existing code to treat the current allocation mode as a
>      group allocation. Keep the entries in a new struct msi_entry_group and
>      have a group id, list head and the entries in there.
> 
>      Take care of protecting the group list.
> 
>      Assign group id 0 and add the entry_group to the list in the pci device.
> 
>      Rework the related functions so they are group aware.
> 
>      This can be split into preparatory and functional pieces, i.e. multiple
>      patches.
> 

The locking issue here also seems somewhat problematic. A lot of paths
that access the msi list don't seem to take a lock today. So any change
that affects these users would force adding locks on all these flows.

I guess for PCI code we could just stop using dev->msi_list altogether,
and instead use a PCI specific struct pci_msi_group or something? This
would mean that any flow that the PCI layer needs would have to take the
group structure instead of or in addition to the device pointer... It's
not clear how much code actually crosses between the PCI and non-PCI
usages of struct msi_desc...

>   3) Split out the 'first time' enablement code into separate functions and
>      store the relevant state in struct pci_dev_msi
> 
>   4) Rename pci_alloc_irq_vectors_affinity() to
>      pci_alloc_irq_vectors_affinity_group() and add a group_id pointer
>      argument.
> 
>      Make pci_alloc_irq_vectors_affinity() a wrapper function which hands
>      in a NULL group id pointer and does proper sanity checking.
> 
>   5) Create similar constructs for related functionality
> 
>   6) Enable the group allocation mode for subsequent allocations
> 

The rest of the flow makes sense, but I have been struggling with
finding the right abstraction for handling the msi_desc groups.

Does my idea of separating the PCI layer code to using its own structure
 (and iterators I guess?) instead of relying on msi_list make sense? I
guess other code could be converted to groups as well, but I have been
trying to find a good path forward that has minimal chance of breaking
other users...

I'd appreciate any insight here.

Thanks,
Jake

> Thanks,
> 
> 	tglx
> 
> 
>   
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-01-07 22:31 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-22  0:19 [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors Megha Dey
2019-06-22  0:19 ` [RFC V1 RESEND 1/6] PCI/MSI: New structures/macros for dynamic MSI-X allocation Megha Dey
2019-06-22  0:19 ` [RFC V1 RESEND 2/6] PCI/MSI: Dynamic allocation of MSI-X vectors by group Megha Dey
2019-06-29  7:59   ` Thomas Gleixner
2019-08-06 19:05     ` Megha Dey
2019-08-07 13:56       ` Thomas Gleixner
2019-08-07 14:18         ` Marc Zyngier
2019-08-11  7:20           ` Thomas Gleixner
2019-08-12 17:54             ` Megha Dey
2019-08-12 17:48           ` Megha Dey
2019-08-12 17:47         ` Megha Dey
2021-01-07 22:30     ` Jacob Keller
2019-06-22  0:19 ` [RFC V1 RESEND 3/6] x86: Introduce the dynamic teardown function Megha Dey
2019-06-29  8:01   ` Thomas Gleixner
2019-08-06 19:06     ` Megha Dey
2019-06-22  0:19 ` [RFC V1 RESEND 4/6] PCI/MSI: Introduce new structure to manage MSI-x entries Megha Dey
2019-06-22  0:19 ` [RFC V1 RESEND 5/6] PCI/MSI: Free MSI-X resources by group Megha Dey
2019-06-29  8:08   ` Thomas Gleixner
2019-08-06 19:09     ` Megha Dey
2019-08-11  7:18       ` Thomas Gleixner
2019-08-12 18:13         ` Megha Dey
2019-06-22  0:19 ` [RFC V1 RESEND 6/6] Documentation: PCI/MSI: Document dynamic MSI-X infrastructure Megha Dey
2019-08-02  0:24 ` [RFC V1 RESEND 0/6] Introduce dynamic allocation/freeing of MSI-X vectors Bjorn Helgaas
2019-08-06 19:12   ` Megha Dey

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).