linux-kernel.vger.kernel.org archive mirror
* [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4
@ 2021-11-27  1:24 Thomas Gleixner
  2021-11-27  1:24 ` [patch 01/10] genirq/msi: Add range argument to alloc/free MSI domain ops Thomas Gleixner
                   ` (10 more replies)
  0 siblings, 11 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

This is finally the point where dynamically expanding MSI-X vectors after
enabling MSI-X is implemented.

The first three parts of this work can be found here:

    https://lore.kernel.org/r/20211126222700.862407977@linutronix.de
    https://lore.kernel.org/r/20211126224100.303046749@linutronix.de
    https://lore.kernel.org/r/20211126230957.239391799@linutronix.de

This last and smallest part of the overall series contains the following
changes:

   1) Prepare the core MSI irq domain code to handle range based allocation
      and free

   2) Prepare the PCI/MSI code to handle range based allocation and free
  
   3) Implement a new interface which allows the MSI-X vector space to be
      expanded after initialization (a short usage sketch follows this list)

   4) Enable support for the X86 PCI/MSI irq domains

      This is unfortunate, but some PCI/MSI irq domain implementations,
      e.g. the powerpc and x86/XEN irqdomain wrappers, are not really
      ready to support this out of the box.

      I looked at the 30 places which implement PCI/MSI irq domains and
      many of them look like they could support it out of the box, but as
      we have two which definitely don't, making this opt-in is the only
      safe option.
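
To illustrate what the new interface from 3) enables, here is a minimal
driver-side sketch (the device pointer and vector count are made up; the
functions themselves are introduced in patch 09):

    /* MSI-X is already enabled; add four more vectors on demand */
    ret = pci_msix_expand_vectors(pdev, 4);
    if (ret)
        dev_warn(&pdev->dev, "MSI-X expansion failed: %d\n", ret);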

I've tested this by hacking up the XHCI driver and it works like a charm.

There is certainly some more room for consolidating the PCI/MSI-X usage in
drivers, i.e. getting rid of pci_enable_msix*(), but this would have made
this overall series even larger and is an orthogonal issue.

This fourth series is based on:

     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v1-part-3

and also available from git:

     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v1-part-4

Thanks,

	tglx
---
 arch/powerpc/platforms/pseries/msi.c |    6 +-
 arch/x86/kernel/apic/msi.c           |    4 -
 arch/x86/pci/xen.c                   |   10 +--
 drivers/base/platform-msi.c          |    3 -
 drivers/pci/msi/irqdomain.c          |   39 ++++++++++----
 drivers/pci/msi/msi.c                |   97 +++++++++++++++++++++++++++--------
 drivers/pci/msi/msi.h                |    4 -
 include/linux/msi.h                  |   46 +++++++++++-----
 include/linux/pci.h                  |   13 ++++
 kernel/irq/msi.c                     |   75 +++++++++++++++------------
 10 files changed, 208 insertions(+), 89 deletions(-)


* [patch 01/10] genirq/msi: Add range argument to alloc/free MSI domain ops
  2021-11-27  1:24 [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:24   ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 02/10] genirq/msi: Add range argument to msi_domain_alloc/free_descs_locked() Thomas Gleixner
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

In preparation for supporting range allocations for MSI-X, add a range
argument to the MSI domain alloc/free function pointers and fixup all
affected places.

The range is supplied via a pointer to a struct msi_range which contains
the first and last MSI index and the number of vectors to allocate/free.

To support the sparse MSI-X allocations via pci_enable_msix_range() and
pci_enable_msix_exact(), the number of vectors can be smaller than the range
defined by the first and last MSI index. This can be cleaned up later by
converting these sparse allocations to an initial allocation on enable plus
an expansion of the vector space at the required indices.
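
As an illustration only (not part of the patch), a full-range operation and
a sparse MSI-X allocation would be described like this, with nvec being the
number of descriptors to allocate:

    /* Operate on all descriptors of the device */
    struct msi_range all    = { .first = 0, .last = UINT_MAX, .ndesc = nvec };

    /* Sparse: two vectors somewhere within indices 3..10 */
    struct msi_range sparse = { .first = 3, .last = 10, .ndesc = 2 };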

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/powerpc/platforms/pseries/msi.c |    6 +++---
 arch/x86/pci/xen.c                   |   10 +++++-----
 include/linux/msi.h                  |   30 +++++++++++++++++++++++-------
 kernel/irq/msi.c                     |   12 ++++++------
 4 files changed, 37 insertions(+), 21 deletions(-)

--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -450,13 +450,13 @@ static void pseries_msi_ops_msi_free(str
  * RTAS can not disable one MSI at a time. It's all or nothing. Do it
  * at the end after all IRQs have been freed.
  */
-static void pseries_msi_domain_free_irqs(struct irq_domain *domain,
-					 struct device *dev)
+static void pseries_msi_domain_free_irqs(struct irq_domain *domain, struct device *dev,
+					 struct msi_range *range)
 {
 	if (WARN_ON_ONCE(!dev_is_pci(dev)))
 		return;
 
-	__msi_domain_free_irqs(domain, dev);
+	__msi_domain_free_irqs(domain, dev, range);
 
 	rtas_disable_msi(to_pci_dev(dev));
 }
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -407,8 +407,8 @@ static void xen_pv_teardown_msi_irqs(str
 	xen_teardown_msi_irqs(dev);
 }
 
-static int xen_msi_domain_alloc_irqs(struct irq_domain *domain,
-				     struct device *dev,  int nvec)
+static int xen_msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
+				     struct msi_range *range)
 {
 	int type;
 
@@ -420,11 +420,11 @@ static int xen_msi_domain_alloc_irqs(str
 	else
 		type = PCI_CAP_ID_MSI;
 
-	return xen_msi_ops.setup_msi_irqs(to_pci_dev(dev), nvec, type);
+	return xen_msi_ops.setup_msi_irqs(to_pci_dev(dev), range->ndesc, type);
 }
 
-static void xen_msi_domain_free_irqs(struct irq_domain *domain,
-				     struct device *dev)
+static void xen_msi_domain_free_irqs(struct irq_domain *domain, struct device *dev,
+				     struct msi_range *range)
 {
 	if (WARN_ON_ONCE(!dev_is_pci(dev)))
 		return;
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -191,6 +191,23 @@ struct msi_device_data {
 	enum msi_desc_filter		__iter_filter;
 };
 
+/**
+ * msi_range - Descriptor for a MSI index range
+ * @first:	First index
+ * @last:	Last index (inclusive)
+ * @ndesc:	Number of descriptors for allocations
+ *
+ * @first = 0 and @last = UINT_MAX is the full range for an operation.
+ *
+ * Note: @ndesc can be less than the range defined by @first and @last to
+ * support sparse allocations from PCI/MSI-X.
+ */
+struct msi_range {
+	unsigned int	first;
+	unsigned int	last;
+	unsigned int	ndesc;
+};
+
 int msi_setup_device_data(struct device *dev);
 
 /* MSI device properties */
@@ -415,10 +432,10 @@ struct msi_domain_ops {
 				       msi_alloc_info_t *arg);
 	void		(*set_desc)(msi_alloc_info_t *arg,
 				    struct msi_desc *desc);
-	int		(*domain_alloc_irqs)(struct irq_domain *domain,
-					     struct device *dev, int nvec);
-	void		(*domain_free_irqs)(struct irq_domain *domain,
-					    struct device *dev);
+	int		(*domain_alloc_irqs)(struct irq_domain *domain, struct device *dev,
+					     struct msi_range *range);
+	void		(*domain_free_irqs)(struct irq_domain *domain, struct device *dev,
+					    struct msi_range *range);
 };
 
 /**
@@ -484,13 +501,12 @@ int msi_domain_set_affinity(struct irq_d
 struct irq_domain *msi_create_irq_domain(struct fwnode_handle *fwnode,
 					 struct msi_domain_info *info,
 					 struct irq_domain *parent);
-int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
-			    int nvec);
+int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, struct msi_range *range);
 int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev,
 				       int nvec);
 int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 			  int nvec);
-void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
+void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev, struct msi_range *range);
 void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev);
 void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
 struct msi_domain_info *msi_get_domain_info(struct irq_domain *domain);
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -869,8 +869,7 @@ static int msi_init_virq(struct irq_doma
 	return 0;
 }
 
-int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
-			    int nvec)
+int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, struct msi_range *range)
 {
 	struct msi_domain_info *info = domain->host_data;
 	struct msi_domain_ops *ops = info->ops;
@@ -880,7 +879,7 @@ int __msi_domain_alloc_irqs(struct irq_d
 	int allocated = 0;
 	int i, ret, virq;
 
-	ret = msi_domain_prepare_irqs(domain, dev, nvec, &arg);
+	ret = msi_domain_prepare_irqs(domain, dev, range->ndesc, &arg);
 	if (ret)
 		return ret;
 
@@ -960,6 +959,7 @@ int msi_domain_alloc_irqs_descs_locked(s
 				       int nvec)
 {
 	struct msi_domain_info *info = domain->host_data;
+	struct msi_range range = { .ndesc = nvec };
 	struct msi_domain_ops *ops = info->ops;
 	int ret;
 
@@ -969,7 +969,7 @@ int msi_domain_alloc_irqs_descs_locked(s
 	if (ret)
 		return ret;
 
-	ret = ops->domain_alloc_irqs(domain, dev, nvec);
+	ret = ops->domain_alloc_irqs(domain, dev, &range);
 	if (ret)
 		msi_domain_free_irqs_descs_locked(domain, dev);
 	return ret;
@@ -994,7 +994,7 @@ int msi_domain_alloc_irqs(struct irq_dom
 	return ret;
 }
 
-void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
+void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev, struct msi_range *range)
 {
 	struct msi_domain_info *info = domain->host_data;
 	struct irq_data *irqd;
@@ -1041,7 +1041,7 @@ void msi_domain_free_irqs_descs_locked(s
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	ops->domain_free_irqs(domain, dev);
+	ops->domain_free_irqs(domain, dev, NULL);
 	msi_domain_free_msi_descs(info, dev);
 }
 



* [patch 02/10] genirq/msi: Add range argument to msi_domain_alloc/free_descs_locked()
  2021-11-27  1:24 [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
  2021-11-27  1:24 ` [patch 01/10] genirq/msi: Add range argument to alloc/free MSI domain ops Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:25   ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 03/10] genirq/msi: Make MSI descriptor alloc/free ready for range allocations Thomas Gleixner
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

In preparation for supporting range allocations for MSI-X, add a range
argument to the msi_domain_alloc/free_descs_locked() functions and fixup
all affected places.

Hand in ranges which cover the current use case. They will be refined in
later steps.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |    6 ++++--
 include/linux/msi.h         |    5 ++---
 kernel/irq/msi.c            |   21 ++++++++++++---------
 3 files changed, 18 insertions(+), 14 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -10,22 +10,24 @@
 
 int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 {
+	struct msi_range range = { .first = 0, .last = UINT_MAX, .ndesc = nvec};
 	struct irq_domain *domain;
 
 	domain = dev_get_msi_domain(&dev->dev);
 	if (domain && irq_domain_is_hierarchy(domain))
-		return msi_domain_alloc_irqs_descs_locked(domain, &dev->dev, nvec);
+		return msi_domain_alloc_irqs_descs_locked(domain, &dev->dev, &range);
 
 	return pci_msi_legacy_setup_msi_irqs(dev, nvec, type);
 }
 
 void pci_msi_teardown_msi_irqs(struct pci_dev *dev)
 {
+	struct msi_range range = { .first = 0, .last = UINT_MAX, };
 	struct irq_domain *domain;
 
 	domain = dev_get_msi_domain(&dev->dev);
 	if (domain && irq_domain_is_hierarchy(domain))
-		msi_domain_free_irqs_descs_locked(domain, &dev->dev);
+		msi_domain_free_irqs_descs_locked(domain, &dev->dev, &range);
 	else
 		pci_msi_legacy_teardown_msi_irqs(dev);
 }
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -502,12 +502,11 @@ struct irq_domain *msi_create_irq_domain
 					 struct msi_domain_info *info,
 					 struct irq_domain *parent);
 int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, struct msi_range *range);
-int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev,
-				       int nvec);
+int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev,  struct msi_range *range);
 int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 			  int nvec);
 void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev, struct msi_range *range);
-void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev);
+void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev, struct msi_range *range);
 void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
 struct msi_domain_info *msi_get_domain_info(struct irq_domain *domain);
 
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -956,22 +956,21 @@ static int msi_domain_add_simple_msi_des
  * Return: %0 on success or an error code.
  */
 int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev,
-				       int nvec)
+				       struct msi_range *range)
 {
 	struct msi_domain_info *info = domain->host_data;
-	struct msi_range range = { .ndesc = nvec };
 	struct msi_domain_ops *ops = info->ops;
 	int ret;
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	ret = msi_domain_add_simple_msi_descs(info, dev, nvec);
+	ret = msi_domain_add_simple_msi_descs(info, dev, range->ndesc);
 	if (ret)
 		return ret;
 
-	ret = ops->domain_alloc_irqs(domain, dev, &range);
+	ret = ops->domain_alloc_irqs(domain, dev, range);
 	if (ret)
-		msi_domain_free_irqs_descs_locked(domain, dev);
+		msi_domain_free_irqs_descs_locked(domain, dev, range);
 	return ret;
 }
 
@@ -986,10 +985,11 @@ int msi_domain_alloc_irqs_descs_locked(s
  */
 int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, int nvec)
 {
+	struct msi_range range = { .first = 0, .last = UINT_MAX, .ndesc = nvec, };
 	int ret;
 
 	msi_lock_descs(dev);
-	ret = msi_domain_alloc_irqs_descs_locked(domain, dev, nvec);
+	ret = msi_domain_alloc_irqs_descs_locked(domain, dev, &range);
 	msi_unlock_descs(dev);
 	return ret;
 }
@@ -1034,14 +1034,15 @@ static void msi_domain_free_msi_descs(st
  * pair. Use this for MSI irqdomains which implement their own vector
  * allocation.
  */
-void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev)
+void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev,
+				       struct msi_range *range)
 {
 	struct msi_domain_info *info = domain->host_data;
 	struct msi_domain_ops *ops = info->ops;
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	ops->domain_free_irqs(domain, dev, NULL);
+	ops->domain_free_irqs(domain, dev, range);
 	msi_domain_free_msi_descs(info, dev);
 }
 
@@ -1053,8 +1054,10 @@ void msi_domain_free_irqs_descs_locked(s
  */
 void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
 {
+	struct msi_range range = { .first = 0, .last = UINT_MAX, };
+
 	msi_lock_descs(dev);
-	msi_domain_free_irqs_descs_locked(domain, dev);
+	msi_domain_free_irqs_descs_locked(domain, dev, &range);
 	msi_unlock_descs(dev);
 }
 



* [patch 03/10] genirq/msi: Make MSI descriptor alloc/free ready for range allocations
  2021-11-27  1:24 [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
  2021-11-27  1:24 ` [patch 01/10] genirq/msi: Add range argument to alloc/free MSI domain ops Thomas Gleixner
  2021-11-27  1:24 ` [patch 02/10] genirq/msi: Add range argument to msi_domain_alloc/free_descs_locked() Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:25   ` Thomas Gleixner
  2021-11-28 15:57   ` Marc Zyngier
  2021-11-27  1:24 ` [patch 05/10] genirq/msi: Add domain info flag MSI_FLAG_CAN_EXPAND Thomas Gleixner
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

Convert the MSI descriptor related functions to ranges and fixup the call
sites.
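
For illustration only (the index values are made up), freeing a sub-range of
descriptors with the converted interface looks roughly like this:

    struct msi_range range = { .first = 8, .last = 15, };

    msi_lock_descs(dev);
    msi_free_msi_descs_range(dev, MSI_DESC_ALL, &range);
    msi_unlock_descs(dev);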

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/base/platform-msi.c |    3 ++-
 include/linux/msi.h         |    7 ++++---
 kernel/irq/msi.c            |   38 +++++++++++++++++++-------------------
 3 files changed, 25 insertions(+), 23 deletions(-)

--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -320,11 +320,12 @@ struct irq_domain *
 void platform_msi_device_domain_free(struct irq_domain *domain, unsigned int virq,
 				     unsigned int nr_irqs)
 {
+	struct msi_range range = { .first = virq, .last = virq + nr_irqs - 1, };
 	struct platform_msi_priv_data *data = domain->host_data;
 
 	msi_lock_descs(data->dev);
 	irq_domain_free_irqs_common(domain, virq, nr_irqs);
-	msi_free_msi_descs_range(data->dev, MSI_DESC_ALL, virq, nr_irqs);
+	msi_free_msi_descs_range(data->dev, MSI_DESC_ALL, &range);
 	msi_unlock_descs(data->dev);
 }
 
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -321,8 +321,7 @@ static inline void pci_write_msi_msg(uns
 #endif /* CONFIG_PCI_MSI */
 
 int msi_add_msi_desc(struct device *dev, struct msi_desc *init_desc);
-void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter,
-			      unsigned int base_index, unsigned int ndesc);
+void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter, struct msi_range *range);
 
 /**
  * msi_free_msi_descs - Free MSI descriptors of a device
@@ -330,7 +329,9 @@ void msi_free_msi_descs_range(struct dev
  */
 static inline void msi_free_msi_descs(struct device *dev)
 {
-	msi_free_msi_descs_range(dev, MSI_DESC_ALL, 0, UINT_MAX);
+	struct msi_range range = { .first = 0, .last = UINT_MAX, };
+
+	msi_free_msi_descs_range(dev, MSI_DESC_ALL, &range);
 }
 
 void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -101,19 +101,19 @@ int msi_add_msi_desc(struct device *dev,
  *
  * Return: 0 on success or an appropriate failure code.
  */
-static int msi_add_simple_msi_descs(struct device *dev, unsigned int index, unsigned int ndesc)
+static int msi_add_simple_msi_descs(struct device *dev, struct msi_range *range)
 {
 	struct msi_desc *desc;
-	unsigned long i;
+	unsigned long idx;
 	int ret;
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	for (i = 0; i < ndesc; i++) {
+	for (idx = range->first; idx <= range->last; idx++) {
 		desc = msi_alloc_desc(dev, 1, NULL);
 		if (!desc)
 			goto fail_mem;
-		ret = msi_insert_desc(dev->msi.data, desc, index + i);
+		ret = msi_insert_desc(dev->msi.data, desc, idx);
 		if (ret)
 			goto fail;
 	}
@@ -122,7 +122,7 @@ static int msi_add_simple_msi_descs(stru
 fail_mem:
 	ret = -ENOMEM;
 fail:
-	msi_free_msi_descs_range(dev, MSI_DESC_NOTASSOCIATED, index, ndesc);
+	msi_free_msi_descs_range(dev, MSI_DESC_NOTASSOCIATED, range);
 	return ret;
 }
 
@@ -148,14 +148,14 @@ static bool msi_desc_match(struct msi_de
  * @ndesc:	Number of descriptors to free
  */
 void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter,
-			      unsigned int base_index, unsigned int ndesc)
+			      struct msi_range *range)
 {
 	struct msi_desc *desc;
 	unsigned long idx;
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	xa_for_each_range(&dev->msi.data->store, idx, desc, base_index, base_index + ndesc - 1) {
+	xa_for_each_range(&dev->msi.data->store, idx, desc, range->first, range->last) {
 		if (msi_desc_match(desc, filter)) {
 			xa_erase(&dev->msi.data->store, idx);
 			msi_free_desc(desc);
@@ -746,17 +746,18 @@ int msi_domain_prepare_irqs(struct irq_d
 int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
 			     int virq_base, int nvec, msi_alloc_info_t *arg)
 {
+	struct msi_range range = { .first = virq_base, .last = virq_base + nvec - 1 };
 	struct msi_domain_info *info = domain->host_data;
 	struct msi_domain_ops *ops = info->ops;
 	struct msi_desc *desc;
 	int ret, virq;
 
 	msi_lock_descs(dev);
-	ret = msi_add_simple_msi_descs(dev, virq_base, nvec);
+	ret = msi_add_simple_msi_descs(dev, &range);
 	if (ret)
 		goto unlock;
 
-	for (virq = virq_base; virq < virq_base + nvec; virq++) {
+	for (virq = range.first; virq <= range.last; virq++) {
 		desc = xa_load(&dev->msi.data->store, virq);
 		desc->irq = virq;
 
@@ -773,7 +774,7 @@ int msi_domain_populate_irqs(struct irq_
 fail:
 	for (--virq; virq >= virq_base; virq--)
 		irq_domain_free_irqs_common(domain, virq, 1);
-	msi_free_msi_descs_range(dev, MSI_DESC_ALL, virq_base, nvec);
+	msi_free_msi_descs_range(dev, MSI_DESC_ALL, &range);
 unlock:
 	msi_unlock_descs(dev);
 	return ret;
@@ -932,14 +933,13 @@ int __msi_domain_alloc_irqs(struct irq_d
 	return 0;
 }
 
-static int msi_domain_add_simple_msi_descs(struct msi_domain_info *info,
-					   struct device *dev,
-					   unsigned int num_descs)
+static int msi_domain_add_simple_msi_descs(struct msi_domain_info *info, struct device *dev,
+					   struct msi_range *range)
 {
 	if (!(info->flags & MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS))
 		return 0;
 
-	return msi_add_simple_msi_descs(dev, 0, num_descs);
+	return msi_add_simple_msi_descs(dev, range);
 }
 
 /**
@@ -964,7 +964,7 @@ int msi_domain_alloc_irqs_descs_locked(s
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	ret = msi_domain_add_simple_msi_descs(info, dev, range->ndesc);
+	ret = msi_domain_add_simple_msi_descs(info, dev, range);
 	if (ret)
 		return ret;
 
@@ -1017,11 +1017,11 @@ void __msi_domain_free_irqs(struct irq_d
 	}
 }
 
-static void msi_domain_free_msi_descs(struct msi_domain_info *info,
-				      struct device *dev)
+static void msi_domain_free_msi_descs(struct msi_domain_info *info, struct device *dev,
+				      struct msi_range *range)
 {
 	if (info->flags & MSI_FLAG_FREE_MSI_DESCS)
-		msi_free_msi_descs(dev);
+		msi_free_msi_descs_range(dev, MSI_DESC_ALL, range);
 }
 
 /**
@@ -1043,7 +1043,7 @@ void msi_domain_free_irqs_descs_locked(s
 	lockdep_assert_held(&dev->msi.data->mutex);
 
 	ops->domain_free_irqs(domain, dev, range);
-	msi_domain_free_msi_descs(info, dev);
+	msi_domain_free_msi_descs(info, dev, range);
 }
 
 /**



* [patch 04/10] genirq/msi: Prepare MSI domain alloc/free for range irq allocation
  2021-11-27  1:25 ` [patch 04/10] genirq/msi: Prepare MSI domain alloc/free for range irq allocation Thomas Gleixner
@ 2021-11-27  1:24   ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

Make the iterators in the allocation and free functions range based.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/irq/msi.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -877,6 +877,7 @@ int __msi_domain_alloc_irqs(struct irq_d
 	msi_alloc_info_t arg = { };
 	unsigned int vflags = 0;
 	struct msi_desc *desc;
+	unsigned long idx;
 	int allocated = 0;
 	int i, ret, virq;
 
@@ -906,7 +907,10 @@ int __msi_domain_alloc_irqs(struct irq_d
 			vflags |= VIRQ_NOMASK_QUIRK;
 	}
 
-	msi_for_each_desc(desc, dev, MSI_DESC_NOTASSOCIATED) {
+	xa_for_each_range(&dev->msi.data->store, idx, desc, range->first, range->last) {
+		if (!msi_desc_match(desc, MSI_DESC_NOTASSOCIATED))
+			continue;
+
 		ops->set_desc(&arg, desc);
 
 		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
@@ -999,10 +1003,14 @@ void __msi_domain_free_irqs(struct irq_d
 	struct msi_domain_info *info = domain->host_data;
 	struct irq_data *irqd;
 	struct msi_desc *desc;
+	unsigned long idx;
 	int i;
 
 	/* Only handle MSI entries which have an interrupt associated */
-	msi_for_each_desc(desc, dev, MSI_DESC_ASSOCIATED) {
+	xa_for_each_range(&dev->msi.data->store, idx, desc, range->first, range->last) {
+		if (!msi_desc_match(desc, MSI_DESC_ASSOCIATED))
+			continue;
+
 		/* Make sure all interrupts are deactivated */
 		for (i = 0; i < desc->nvec_used; i++) {
 			irqd = irq_domain_get_irq_data(domain, desc->irq + i);



* [patch 05/10] genirq/msi: Add domain info flag MSI_FLAG_CAN_EXPAND
  2021-11-27  1:24 [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
                   ` (2 preceding siblings ...)
  2021-11-27  1:24 ` [patch 03/10] genirq/msi: Make MSI descriptor alloc/free ready for range allocations Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:25   ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 06/10] PCI/MSI: Use range in allocation path Thomas Gleixner
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

Not all MSI domains support runtime expansion of PCI/MSI-X vectors. Add a
domain flag so implementations can opt in.
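
An irq domain opts in by setting the flag in its msi_domain_info, as the x86
PCI/MSI domains do in the last patch of this series. Sketch (the info struct
name is made up):

    static struct msi_domain_info my_msi_domain_info = {
        .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
                 MSI_FLAG_PCI_MSIX | MSI_FLAG_CAN_EXPAND,
        /* .ops, .chip, .handler etc. as before */
    };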

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |    2 ++
 1 file changed, 2 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -494,6 +494,8 @@ enum {
 	MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS	= (1 << 9),
 	/* Free MSI descriptors */
 	MSI_FLAG_FREE_MSI_DESCS		= (1 << 10),
+	/* MSI vectors can be expanded after initial setup */
+	MSI_FLAG_CAN_EXPAND		= (1 << 11),
 };
 
 int msi_domain_set_affinity(struct irq_data *data, const struct cpumask *mask,



* [patch 06/10] PCI/MSI: Use range in allocation path
  2021-11-27  1:24 [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
                   ` (3 preceding siblings ...)
  2021-11-27  1:24 ` [patch 05/10] genirq/msi: Add domain info flag MSI_FLAG_CAN_EXPAND Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:25   ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 08/10] PCI/MSI: Provide pci_msi_domain_supports_expand() Thomas Gleixner
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

Make the allocation path range based to prepare for runtime expansion of
MSI-X vectors.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |    7 +++----
 drivers/pci/msi/msi.c       |   34 +++++++++++++++++++++-------------
 drivers/pci/msi/msi.h       |    2 +-
 3 files changed, 25 insertions(+), 18 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -8,16 +8,15 @@
 
 #include "msi.h"
 
-int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int pci_msi_setup_msi_irqs(struct pci_dev *dev, struct msi_range *range, int type)
 {
-	struct msi_range range = { .first = 0, .last = UINT_MAX, .ndesc = nvec};
 	struct irq_domain *domain;
 
 	domain = dev_get_msi_domain(&dev->dev);
 	if (domain && irq_domain_is_hierarchy(domain))
-		return msi_domain_alloc_irqs_descs_locked(domain, &dev->dev, &range);
+		return msi_domain_alloc_irqs_descs_locked(domain, &dev->dev, range);
 
-	return pci_msi_legacy_setup_msi_irqs(dev, nvec, type);
+	return pci_msi_legacy_setup_msi_irqs(dev, range->ndesc, type);
 }
 
 void pci_msi_teardown_msi_irqs(struct pci_dev *dev)
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -370,14 +370,16 @@ static int msi_setup_msi_desc(struct pci
 	return ret;
 }
 
-static int msi_verify_entries(struct pci_dev *dev)
+static int msi_verify_entries(struct pci_dev *dev, struct msi_range *range)
 {
 	struct msi_desc *entry;
 
 	if (!dev->no_64bit_msi)
 		return 0;
 
-	msi_for_each_desc(entry, &dev->dev, MSI_DESC_ALL) {
+	msi_for_each_desc_from(entry, &dev->dev, MSI_DESC_ALL, range->first) {
+		if (entry->msi_index > range->last)
+			return 0;
 		if (entry->msg.address_hi) {
 			pci_err(dev, "arch assigned 64-bit MSI address %#x%08x but device only supports 32 bits\n",
 				entry->msg.address_hi, entry->msg.address_lo);
@@ -402,6 +404,7 @@ static int msi_verify_entries(struct pci
 static int msi_capability_init(struct pci_dev *dev, int nvec,
 			       struct irq_affinity *affd)
 {
+	struct msi_range range = { .first = 0, .last = 0, .ndesc = nvec, };
 	struct irq_affinity_desc *masks = NULL;
 	struct msi_desc *entry;
 	int ret;
@@ -421,11 +424,11 @@ static int msi_capability_init(struct pc
 	pci_msi_mask(entry, msi_multi_mask(entry));
 
 	/* Configure MSI capability structure */
-	ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
+	ret = pci_msi_setup_msi_irqs(dev, &range, PCI_CAP_ID_MSI);
 	if (ret)
 		goto err;
 
-	ret = msi_verify_entries(dev);
+	ret = msi_verify_entries(dev, &range);
 	if (ret)
 		goto err;
 
@@ -469,7 +472,8 @@ static void __iomem *msix_map_region(str
 }
 
 static int msix_setup_msi_descs(struct pci_dev *dev, void __iomem *base,
-				struct msix_entry *entries, int nvec,
+				struct msi_range *range,
+				struct msix_entry *entries,
 				struct irq_affinity_desc *masks)
 {
 	int ret, i, vec_count = pci_msix_vec_count(dev);
@@ -485,8 +489,8 @@ static int msix_setup_msi_descs(struct p
 	desc.pci.msi_attrib.default_irq	= dev->irq;
 	desc.pci.mask_base		= base;
 
-	for (i = 0, curmsk = masks; i < nvec; i++, curmsk++) {
-		desc.msi_index = entries ? entries[i].entry : i;
+	for (i = 0, curmsk = masks; i < range->ndesc; i++, curmsk++) {
+		desc.msi_index = entries ? entries[i].entry : range->first + i;
 		desc.affinity = masks ? curmsk : NULL;
 		desc.pci.msi_attrib.is_virtual = desc.msi_index >= vec_count;
 		desc.pci.msi_attrib.can_mask = !pci_msi_ignore_mask &&
@@ -500,6 +504,9 @@ static int msix_setup_msi_descs(struct p
 		ret = msi_add_msi_desc(&dev->dev, &desc);
 		if (ret)
 			break;
+
+		if (desc.msi_index > range->last)
+			range->last = desc.msi_index;
 	}
 
 	return ret;
@@ -530,28 +537,28 @@ static void msix_mask_all(void __iomem *
 }
 
 static int msix_setup_interrupts(struct pci_dev *dev, void __iomem *base,
-				 struct msix_entry *entries, int nvec,
+				 struct msi_range *range, struct msix_entry *entries,
 				 struct irq_affinity *affd)
 {
 	struct irq_affinity_desc *masks = NULL;
 	int ret;
 
 	if (affd)
-		masks = irq_create_affinity_masks(nvec, affd);
+		masks = irq_create_affinity_masks(range->ndesc, affd);
 
 	msi_lock_descs(&dev->dev);
-	ret = msix_setup_msi_descs(dev, base, entries, nvec, masks);
+	ret = msix_setup_msi_descs(dev, base, range, entries, masks);
 	if (ret)
 		goto out_free;
 
 	dev->dev.msi.data->properties = MSI_PROP_PCI_MSIX | MSI_PROP_64BIT;
 
-	ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
+	ret = pci_msi_setup_msi_irqs(dev, range, PCI_CAP_ID_MSIX);
 	if (ret)
 		goto out_free;
 
 	/* Check if all MSI entries honor device restrictions */
-	ret = msi_verify_entries(dev);
+	ret = msi_verify_entries(dev, range);
 	if (ret)
 		goto out_free;
 
@@ -580,6 +587,7 @@ static int msix_setup_interrupts(struct
 static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
 				int nvec, struct irq_affinity *affd)
 {
+	struct msi_range range = { .first = 0, .last = 0, .ndesc = nvec, };
 	void __iomem *base;
 	int ret, tsize;
 	u16 control;
@@ -606,7 +614,7 @@ static int msix_capability_init(struct p
 	/* Ensure that all table entries are masked. */
 	msix_mask_all(base, tsize);
 
-	ret = msix_setup_interrupts(dev, base, entries, nvec, affd);
+	ret = msix_setup_interrupts(dev, base, &range, entries, affd);
 	if (ret)
 		goto out_disable;
 
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -5,7 +5,7 @@
 
 #define msix_table_size(flags)	((flags & PCI_MSIX_FLAGS_QSIZE) + 1)
 
-extern int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
+extern int pci_msi_setup_msi_irqs(struct pci_dev *dev, struct msi_range *range, int type);
 extern void pci_msi_teardown_msi_irqs(struct pci_dev *dev);
 
 #ifdef CONFIG_PCI_MSI_ARCH_FALLBACKS



* [patch 07/10] PCI/MSI: Make free related functions range based
  2021-11-27  1:25 ` [patch 07/10] PCI/MSI: Make free related functions range based Thomas Gleixner
@ 2021-11-27  1:24   ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

In preparation for runtime expandable PCI/MSI-X vectors, convert the related
free functions to take ranges instead of assuming a zero-based vector
space.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |    5 ++---
 drivers/pci/msi/msi.c       |   24 ++++++++++++++++--------
 drivers/pci/msi/msi.h       |    2 +-
 3 files changed, 19 insertions(+), 12 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -19,14 +19,13 @@ int pci_msi_setup_msi_irqs(struct pci_de
 	return pci_msi_legacy_setup_msi_irqs(dev, range->ndesc, type);
 }
 
-void pci_msi_teardown_msi_irqs(struct pci_dev *dev)
+void pci_msi_teardown_msi_irqs(struct pci_dev *dev, struct msi_range *range)
 {
-	struct msi_range range = { .first = 0, .last = UINT_MAX, };
 	struct irq_domain *domain;
 
 	domain = dev_get_msi_domain(&dev->dev);
 	if (domain && irq_domain_is_hierarchy(domain))
-		msi_domain_free_irqs_descs_locked(domain, &dev->dev, &range);
+		msi_domain_free_irqs_descs_locked(domain, &dev->dev, range);
 	else
 		pci_msi_legacy_teardown_msi_irqs(dev);
 }
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -222,9 +222,12 @@ void pci_write_msi_msg(unsigned int irq,
 }
 EXPORT_SYMBOL_GPL(pci_write_msi_msg);
 
-static void free_msi_irqs(struct pci_dev *dev)
+static void free_msi_irqs(struct pci_dev *dev, struct msi_range *range, bool shutdown)
 {
-	pci_msi_teardown_msi_irqs(dev);
+	pci_msi_teardown_msi_irqs(dev, range);
+
+	if (!shutdown)
+		return;
 
 	if (dev->msix_base) {
 		iounmap(dev->msix_base);
@@ -443,7 +446,7 @@ static int msi_capability_init(struct pc
 
 err:
 	pci_msi_unmask(entry, msi_multi_mask(entry));
-	free_msi_irqs(dev);
+	free_msi_irqs(dev, &range, true);
 unlock:
 	msi_unlock_descs(&dev->dev);
 	kfree(masks);
@@ -538,7 +541,7 @@ static void msix_mask_all(void __iomem *
 
 static int msix_setup_interrupts(struct pci_dev *dev, void __iomem *base,
 				 struct msi_range *range, struct msix_entry *entries,
-				 struct irq_affinity *affd)
+				 struct irq_affinity *affd, bool expand)
 {
 	struct irq_affinity_desc *masks = NULL;
 	int ret;
@@ -566,7 +569,8 @@ static int msix_setup_interrupts(struct
 	goto out_unlock;
 
 out_free:
-	free_msi_irqs(dev);
+	free_msi_irqs(dev, range, !expand);
+
 out_unlock:
 	msi_unlock_descs(&dev->dev);
 	kfree(masks);
@@ -614,7 +618,7 @@ static int msix_capability_init(struct p
 	/* Ensure that all table entries are masked. */
 	msix_mask_all(base, tsize);
 
-	ret = msix_setup_interrupts(dev, base, &range, entries, affd);
+	ret = msix_setup_interrupts(dev, base, &range, entries, affd, false);
 	if (ret)
 		goto out_disable;
 
@@ -728,12 +732,14 @@ static void pci_msi_shutdown(struct pci_
 
 void pci_disable_msi(struct pci_dev *dev)
 {
+	struct msi_range range = { .first = 0, .last = 0, };
+
 	if (!pci_msi_enable || !dev || !dev->msi_enabled)
 		return;
 
 	msi_lock_descs(&dev->dev);
 	pci_msi_shutdown(dev);
-	free_msi_irqs(dev);
+	free_msi_irqs(dev, &range, true);
 	msi_unlock_descs(&dev->dev);
 }
 EXPORT_SYMBOL(pci_disable_msi);
@@ -817,12 +823,14 @@ static void pci_msix_shutdown(struct pci
 
 void pci_disable_msix(struct pci_dev *dev)
 {
+	struct msi_range range = { .first = 0, .last = UINT_MAX, };
+
 	if (!pci_msi_enable || !dev || !dev->msix_enabled)
 		return;
 
 	msi_lock_descs(&dev->dev);
 	pci_msix_shutdown(dev);
-	free_msi_irqs(dev);
+	free_msi_irqs(dev, &range, true);
 	msi_unlock_descs(&dev->dev);
 }
 EXPORT_SYMBOL(pci_disable_msix);
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -6,7 +6,7 @@
 #define msix_table_size(flags)	((flags & PCI_MSIX_FLAGS_QSIZE) + 1)
 
 extern int pci_msi_setup_msi_irqs(struct pci_dev *dev, struct msi_range *range, int type);
-extern void pci_msi_teardown_msi_irqs(struct pci_dev *dev);
+extern void pci_msi_teardown_msi_irqs(struct pci_dev *dev, struct msi_range *range);
 
 #ifdef CONFIG_PCI_MSI_ARCH_FALLBACKS
 extern int pci_msi_legacy_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);



* [patch 08/10] PCI/MSI: Provide pci_msi_domain_supports_expand()
  2021-11-27  1:24 [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
                   ` (4 preceding siblings ...)
  2021-11-27  1:24 ` [patch 06/10] PCI/MSI: Use range in allocation path Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:25   ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 10/10] x86/apic/msi: Support MSI-X vector expansion Thomas Gleixner
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

Not all irq domain implementations can support runtime MSI-X vector
expansion as they assume zero-based allocations or have other
restrictions.

The legacy PCI allocation functions are not suited for runtime vector
expansion either.

Add a function which allows drivers to query whether runtime MSI-X vector
expansion is supported.
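
A driver can use this e.g. at probe time to decide whether a small initial
allocation is sufficient because more vectors can be added later on demand.
Sketch (the vector counts are made up):

    unsigned int nvec = pci_msi_domain_supports_expand(pdev) ? 4 : 32;

    ret = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSIX);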

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |   29 +++++++++++++++++++++++------
 include/linux/msi.h         |    2 ++
 2 files changed, 25 insertions(+), 6 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -8,12 +8,18 @@
 
 #include "msi.h"
 
+static struct irq_domain *pci_get_msi_domain(struct pci_dev *dev)
+{
+	struct irq_domain *domain = dev_get_msi_domain(&dev->dev);
+
+	return domain && irq_domain_is_hierarchy(domain) ? domain : NULL;
+}
+
 int pci_msi_setup_msi_irqs(struct pci_dev *dev, struct msi_range *range, int type)
 {
-	struct irq_domain *domain;
+	struct irq_domain *domain = pci_get_msi_domain(dev);
 
-	domain = dev_get_msi_domain(&dev->dev);
-	if (domain && irq_domain_is_hierarchy(domain))
+	if (domain)
 		return msi_domain_alloc_irqs_descs_locked(domain, &dev->dev, range);
 
 	return pci_msi_legacy_setup_msi_irqs(dev, range->ndesc, type);
@@ -21,15 +27,26 @@ int pci_msi_setup_msi_irqs(struct pci_de
 
 void pci_msi_teardown_msi_irqs(struct pci_dev *dev, struct msi_range *range)
 {
-	struct irq_domain *domain;
+	struct irq_domain *domain = pci_get_msi_domain(dev);
 
-	domain = dev_get_msi_domain(&dev->dev);
-	if (domain && irq_domain_is_hierarchy(domain))
+	if (domain)
 		msi_domain_free_irqs_descs_locked(domain, &dev->dev, range);
 	else
 		pci_msi_legacy_teardown_msi_irqs(dev);
 }
 
+bool pci_msi_domain_supports_expand(struct pci_dev *dev)
+{
+	struct irq_domain *domain = pci_get_msi_domain(dev);
+	struct msi_domain_info *info;
+
+	if (!domain)
+		return false;
+
+	info = domain->host_data;
+	return info->flags & MSI_FLAG_CAN_EXPAND;
+}
+
 /**
  * pci_msi_domain_write_msg - Helper to write MSI message to PCI config space
  * @irq_data:	Pointer to interrupt data of the MSI interrupt
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -552,11 +552,13 @@ struct irq_domain *pci_msi_create_irq_do
 u32 pci_msi_domain_get_msi_rid(struct irq_domain *domain, struct pci_dev *pdev);
 struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev);
 bool pci_dev_has_special_msi_domain(struct pci_dev *pdev);
+bool pci_msi_domain_supports_expand(struct pci_dev *dev);
 #else
 static inline struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev)
 {
 	return NULL;
 }
+static inline bool pci_msi_domain_supports_expand(struct pci_dev *dev) { return false; }
 #endif /* CONFIG_PCI_MSI_IRQ_DOMAIN */
 
 #endif /* LINUX_MSI_H */



* [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
  2021-11-27  1:25 ` [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]() Thomas Gleixner
@ 2021-11-27  1:24   ` Thomas Gleixner
  2021-12-02  1:08   ` Dey, Megha
  1 sibling, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

Provide a new interface which allows the MSI-X vector space to be expanded
if the underlying irq domain implementation supports it.
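
A minimal usage sketch (the index and vector counts are made up): with MSI-X
already enabled, a driver can either expand right after the last active
index or at a specific MSI-X table index:

    /* Add two more vectors after the last active index */
    ret = pci_msix_expand_vectors(pdev, 2);

    /* Add one vector at MSI-X table index 5 */
    if (!ret)
        ret = pci_msix_expand_vectors_at(pdev, 5, 1);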

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/msi.c |   41 +++++++++++++++++++++++++++++++++++++++++
 include/linux/pci.h   |   13 +++++++++++++
 2 files changed, 54 insertions(+)

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -1025,6 +1025,47 @@ int pci_alloc_irq_vectors_affinity(struc
 EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);
 
 /**
+ * pci_msix_expand_vectors_at - Expand MSI-X interrupts for a device
+ *
+ * @dev:	PCI device to operate on
+ * @at:		Allocate at MSI-X index. If @at == PCI_MSIX_EXPAND_AUTO
+ *		the function expands automatically after the last
+ *		active index.
+ * @nvec:	Number of vectors to allocate
+ *
+ * Expand the MSI-X vectors of a device after an initial enablement and
+ * allocation.
+ *
+ * Return: 0 if the allocation was successful, an error code otherwise.
+ */
+int pci_msix_expand_vectors_at(struct pci_dev *dev, unsigned int at, unsigned int nvec)
+{
+	struct msi_device_data *md = dev->dev.msi.data;
+	struct msi_range range = { .ndesc = nvec, };
+	unsigned int max_vecs;
+	int ret;
+
+	if (!pci_msi_enable || !dev || !dev->msix_enabled || !md)
+		return -ENOTSUPP;
+
+	if (!pci_msi_domain_supports_expand(dev))
+		return -ENOTSUPP;
+
+	max_vecs = pci_msix_vec_count(dev);
+	if (!nvec || nvec > max_vecs)
+		return -EINVAL;
+
+	range.first = at == PCI_MSIX_EXPAND_AUTO ? md->num_descs : at;
+
+	if (range.first >= max_vecs || nvec > max_vecs - range.first)
+		return -ENOSPC;
+
+	ret = msix_setup_interrupts(dev, dev->msix_base, &range, NULL, NULL, true);
+	return ret <= 0 ? ret : -ENOSPC;
+}
+EXPORT_SYMBOL_GPL(pci_msix_expand_vectors_at);
+
+/**
  * pci_free_irq_vectors - free previously allocated IRQs for a device
  * @dev:		PCI device to operate on
  *
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1534,6 +1534,7 @@ static inline int pci_enable_msix_exact(
 int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 				   unsigned int max_vecs, unsigned int flags,
 				   struct irq_affinity *affd);
+int pci_msix_expand_vectors_at(struct pci_dev *dev, unsigned int at, unsigned int nvec);
 
 void pci_free_irq_vectors(struct pci_dev *dev);
 int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
@@ -1565,6 +1566,11 @@ pci_alloc_irq_vectors_affinity(struct pc
 	return -ENOSPC;
 }
 
+static inline int pci_msix_expand_vectors_at(struct pci_dev *dev, unsigned int at, unsigned int nvec)
+{
+	return -ENOTSUPP;
+}
+
 static inline void pci_free_irq_vectors(struct pci_dev *dev)
 {
 }
@@ -1582,6 +1588,13 @@ static inline const struct cpumask *pci_
 }
 #endif
 
+#define PCI_MSIX_EXPAND_AUTO	(UINT_MAX)
+
+static inline int pci_msix_expand_vectors(struct pci_dev *dev, unsigned int nvec)
+{
+	return pci_msix_expand_vectors_at(dev, PCI_MSIX_EXPAND_AUTO, nvec);
+}
+
 /**
  * pci_irqd_intx_xlate() - Translate PCI INTx value to an IRQ domain hwirq
  * @d: the INTx IRQ domain



* [patch 10/10] x86/apic/msi: Support MSI-X vector expansion
  2021-11-27  1:24 [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
                   ` (5 preceding siblings ...)
  2021-11-27  1:24 ` [patch 08/10] PCI/MSI: Provide pci_msi_domain_supports_expand() Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:25   ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

The X86 PCI/MSI irq domain implementation supports vector expansion out of
the box. Make it available.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/msi.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -178,7 +178,7 @@ static struct msi_domain_ops pci_msi_dom
 
 static struct msi_domain_info pci_msi_domain_info = {
 	.flags		= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
-			  MSI_FLAG_PCI_MSIX,
+			  MSI_FLAG_PCI_MSIX | MSI_FLAG_CAN_EXPAND,
 	.ops		= &pci_msi_domain_ops,
 	.chip		= &pci_msi_controller,
 	.handler	= handle_edge_irq,
@@ -226,7 +226,7 @@ static struct irq_chip pci_msi_ir_contro
 
 static struct msi_domain_info pci_msi_ir_domain_info = {
 	.flags		= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
-			  MSI_FLAG_MULTI_PCI_MSI | MSI_FLAG_PCI_MSIX,
+			  MSI_FLAG_MULTI_PCI_MSI | MSI_FLAG_PCI_MSIX | MSI_FLAG_CAN_EXPAND,
 	.ops		= &pci_msi_domain_ops,
 	.chip		= &pci_msi_ir_controller,
 	.handler	= handle_edge_irq,



* [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4
  2021-11-27  1:24 [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
                   ` (6 preceding siblings ...)
  2021-11-27  1:24 ` [patch 10/10] x86/apic/msi: Support MSI-X vector expansion Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:25 ` [patch 04/10] genirq/msi: Prepare MSI domain alloc/free for range irq allocation Thomas Gleixner
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

This is finally the point where dynamically expanding MSI-X vectors after
enabling MSI-X is implemented.

The first three parts of this work can be found here:

    https://lore.kernel.org/r/20211126222700.862407977@linutronix.de
    https://lore.kernel.org/r/20211126224100.303046749@linutronix.de
    https://lore.kernel.org/r/20211126230957.239391799@linutronix.de

This last and smallest part of the overall series contains the following
changes:

   1) Prepare the core MSI irq domain code to handle range based allocation
      and free

   2) Prepare the PCI/MSI code to handle range based allocation and free
  
   3) Implement a new interface which allows the MSI-X vector space to be
      expanded after initialization

   4) Enable support for the X86 PCI/MSI irq domains

      This is unfortunate, but some PCI/MSI irq domain implementations,
      e.g. the powerpc and x86/XEN irqdomain wrappers, are not really
      ready to support this out of the box.

      I looked at the 30 places which implement PCI/MSI irq domains and
      many of them look like they could support it out of the box, but as
      we have two which definitely don't, making this opt-in is the only
      safe option.

I've tested this by hacking up the XHCI driver and it works like a charm.

There is certainly some more room for consolidating the PCI/MSI-X usage in
drivers, i.e. getting rid of pci_enable_msix*(), but this would have made
this series even larger and is an orthogonal issue.

The series is based on:

     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v1-part-3

and also available from git:

     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v1-part-4

Thanks,

	tglx
---
 arch/powerpc/platforms/pseries/msi.c |    6 +-
 arch/x86/kernel/apic/msi.c           |    4 -
 arch/x86/pci/xen.c                   |   10 +--
 drivers/base/platform-msi.c          |    3 -
 drivers/pci/msi/irqdomain.c          |   39 ++++++++++----
 drivers/pci/msi/msi.c                |   97 +++++++++++++++++++++++++++--------
 drivers/pci/msi/msi.h                |    4 -
 include/linux/msi.h                  |   46 +++++++++++-----
 include/linux/pci.h                  |   13 ++++
 kernel/irq/msi.c                     |   75 +++++++++++++++------------
 10 files changed, 208 insertions(+), 89 deletions(-)


* [patch 01/10] genirq/msi: Add range argument to alloc/free MSI domain ops
  2021-11-27  1:24 ` [patch 01/10] genirq/msi: Add range argument to alloc/free MSI domain ops Thomas Gleixner
@ 2021-11-27  1:24   ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

In preparation for supporting range allocations for MSI-X, add a range
argument to the MSI domain alloc/free function pointers and fixup all
affected places.

The range is supplied via a pointer to a struct msi_range which contains
the first and last MSI index and the number of vectors to allocate/free.

To support the sparse MSI-X allocations via pci_enable_msix_range() and
pci_enable_msix_exact(), the number of vectors can be smaller than the range
defined by the first and last MSI index. This can be cleaned up later by
converting these sparse allocations to an initial allocation on enable plus
an expansion of the vector space at the required indices.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/powerpc/platforms/pseries/msi.c |    6 +++---
 arch/x86/pci/xen.c                   |   10 +++++-----
 include/linux/msi.h                  |   30 +++++++++++++++++++++++-------
 kernel/irq/msi.c                     |   12 ++++++------
 4 files changed, 37 insertions(+), 21 deletions(-)

--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -450,13 +450,13 @@ static void pseries_msi_ops_msi_free(str
  * RTAS can not disable one MSI at a time. It's all or nothing. Do it
  * at the end after all IRQs have been freed.
  */
-static void pseries_msi_domain_free_irqs(struct irq_domain *domain,
-					 struct device *dev)
+static void pseries_msi_domain_free_irqs(struct irq_domain *domain, struct device *dev,
+					 struct msi_range *range)
 {
 	if (WARN_ON_ONCE(!dev_is_pci(dev)))
 		return;
 
-	__msi_domain_free_irqs(domain, dev);
+	__msi_domain_free_irqs(domain, dev, range);
 
 	rtas_disable_msi(to_pci_dev(dev));
 }
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -407,8 +407,8 @@ static void xen_pv_teardown_msi_irqs(str
 	xen_teardown_msi_irqs(dev);
 }
 
-static int xen_msi_domain_alloc_irqs(struct irq_domain *domain,
-				     struct device *dev,  int nvec)
+static int xen_msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
+				     struct msi_range *range)
 {
 	int type;
 
@@ -420,11 +420,11 @@ static int xen_msi_domain_alloc_irqs(str
 	else
 		type = PCI_CAP_ID_MSI;
 
-	return xen_msi_ops.setup_msi_irqs(to_pci_dev(dev), nvec, type);
+	return xen_msi_ops.setup_msi_irqs(to_pci_dev(dev), range->ndesc, type);
 }
 
-static void xen_msi_domain_free_irqs(struct irq_domain *domain,
-				     struct device *dev)
+static void xen_msi_domain_free_irqs(struct irq_domain *domain, struct device *dev,
+				     struct msi_range *range)
 {
 	if (WARN_ON_ONCE(!dev_is_pci(dev)))
 		return;
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -191,6 +191,23 @@ struct msi_device_data {
 	enum msi_desc_filter		__iter_filter;
 };
 
+/**
+ * msi_range - Descriptor for a MSI index range
+ * @first:	First index
+ * @last:	Last index (inclusive)
+ * @ndesc:	Number of descriptors for allocations
+ *
+ * @first = 0 and @last = UINT_MAX is the full range for an operation.
+ *
+ * Note: @ndesc can be less than the range defined by @first and @last to
+ * support sparse allocations from PCI/MSI-X.
+ */
+struct msi_range {
+	unsigned int	first;
+	unsigned int	last;
+	unsigned int	ndesc;
+};
+
 int msi_setup_device_data(struct device *dev);
 
 /* MSI device properties */
@@ -415,10 +432,10 @@ struct msi_domain_ops {
 				       msi_alloc_info_t *arg);
 	void		(*set_desc)(msi_alloc_info_t *arg,
 				    struct msi_desc *desc);
-	int		(*domain_alloc_irqs)(struct irq_domain *domain,
-					     struct device *dev, int nvec);
-	void		(*domain_free_irqs)(struct irq_domain *domain,
-					    struct device *dev);
+	int		(*domain_alloc_irqs)(struct irq_domain *domain, struct device *dev,
+					     struct msi_range *range);
+	void		(*domain_free_irqs)(struct irq_domain *domain, struct device *dev,
+					    struct msi_range *range);
 };
 
 /**
@@ -484,13 +501,12 @@ int msi_domain_set_affinity(struct irq_d
 struct irq_domain *msi_create_irq_domain(struct fwnode_handle *fwnode,
 					 struct msi_domain_info *info,
 					 struct irq_domain *parent);
-int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
-			    int nvec);
+int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, struct msi_range *range);
 int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev,
 				       int nvec);
 int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 			  int nvec);
-void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
+void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev, struct msi_range *range);
 void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev);
 void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
 struct msi_domain_info *msi_get_domain_info(struct irq_domain *domain);
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -869,8 +869,7 @@ static int msi_init_virq(struct irq_doma
 	return 0;
 }
 
-int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
-			    int nvec)
+int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, struct msi_range *range)
 {
 	struct msi_domain_info *info = domain->host_data;
 	struct msi_domain_ops *ops = info->ops;
@@ -880,7 +879,7 @@ int __msi_domain_alloc_irqs(struct irq_d
 	int allocated = 0;
 	int i, ret, virq;
 
-	ret = msi_domain_prepare_irqs(domain, dev, nvec, &arg);
+	ret = msi_domain_prepare_irqs(domain, dev, range->ndesc, &arg);
 	if (ret)
 		return ret;
 
@@ -960,6 +959,7 @@ int msi_domain_alloc_irqs_descs_locked(s
 				       int nvec)
 {
 	struct msi_domain_info *info = domain->host_data;
+	struct msi_range range = { .ndesc = nvec };
 	struct msi_domain_ops *ops = info->ops;
 	int ret;
 
@@ -969,7 +969,7 @@ int msi_domain_alloc_irqs_descs_locked(s
 	if (ret)
 		return ret;
 
-	ret = ops->domain_alloc_irqs(domain, dev, nvec);
+	ret = ops->domain_alloc_irqs(domain, dev, &range);
 	if (ret)
 		msi_domain_free_irqs_descs_locked(domain, dev);
 	return ret;
@@ -994,7 +994,7 @@ int msi_domain_alloc_irqs(struct irq_dom
 	return ret;
 }
 
-void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
+void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev, struct msi_range *range)
 {
 	struct msi_domain_info *info = domain->host_data;
 	struct irq_data *irqd;
@@ -1041,7 +1041,7 @@ void msi_domain_free_irqs_descs_locked(s
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	ops->domain_free_irqs(domain, dev);
+	ops->domain_free_irqs(domain, dev, NULL);
 	msi_domain_free_msi_descs(info, dev);
 }
 

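As a purely illustrative sketch (not part of the patch itself), these are the typical ways the new struct msi_range gets filled in by the callers converted later in this series; 'nvec' stands for whatever vector count the caller has at hand and the variable names are made up:

	/* Operate on the full index space of a device */
	struct msi_range all    = { .first = 0, .last = UINT_MAX, };

	/* Allocate nvec descriptors anywhere in the full (possibly sparse) space */
	struct msi_range simple = { .first = 0, .last = UINT_MAX, .ndesc = nvec, };

	/* Operate on an explicit window, here MSI-X indices 8..11 */
	struct msi_range window = { .first = 8, .last = 11, .ndesc = 4, };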


* [patch 02/10] genirq/msi: Add range argument to msi_domain_alloc/free_descs_locked()
  2021-11-27  1:24 ` [patch 02/10] genirq/msi: Add range argument to msi_domain_alloc/free_descs_locked() Thomas Gleixner
@ 2021-11-27  1:25   ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:25 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

In preparation for supporting range allocations for MSI-X, add a range
argument to the msi_domain_alloc/free_descs_locked() functions and fix up
all affected places.

Hand in ranges which cover the current use case. They will be refined in
later steps.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |    6 ++++--
 include/linux/msi.h         |    5 ++---
 kernel/irq/msi.c            |   21 ++++++++++++---------
 3 files changed, 18 insertions(+), 14 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -10,22 +10,24 @@
 
 int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 {
+	struct msi_range range = { .first = 0, .last = UINT_MAX, .ndesc = nvec};
 	struct irq_domain *domain;
 
 	domain = dev_get_msi_domain(&dev->dev);
 	if (domain && irq_domain_is_hierarchy(domain))
-		return msi_domain_alloc_irqs_descs_locked(domain, &dev->dev, nvec);
+		return msi_domain_alloc_irqs_descs_locked(domain, &dev->dev, &range);
 
 	return pci_msi_legacy_setup_msi_irqs(dev, nvec, type);
 }
 
 void pci_msi_teardown_msi_irqs(struct pci_dev *dev)
 {
+	struct msi_range range = { .first = 0, .last = UINT_MAX, };
 	struct irq_domain *domain;
 
 	domain = dev_get_msi_domain(&dev->dev);
 	if (domain && irq_domain_is_hierarchy(domain))
-		msi_domain_free_irqs_descs_locked(domain, &dev->dev);
+		msi_domain_free_irqs_descs_locked(domain, &dev->dev, &range);
 	else
 		pci_msi_legacy_teardown_msi_irqs(dev);
 }
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -502,12 +502,11 @@ struct irq_domain *msi_create_irq_domain
 					 struct msi_domain_info *info,
 					 struct irq_domain *parent);
 int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, struct msi_range *range);
-int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev,
-				       int nvec);
+int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev,  struct msi_range *range);
 int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 			  int nvec);
 void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev, struct msi_range *range);
-void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev);
+void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev, struct msi_range *range);
 void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
 struct msi_domain_info *msi_get_domain_info(struct irq_domain *domain);
 
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -956,22 +956,21 @@ static int msi_domain_add_simple_msi_des
  * Return: %0 on success or an error code.
  */
 int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev,
-				       int nvec)
+				       struct msi_range *range)
 {
 	struct msi_domain_info *info = domain->host_data;
-	struct msi_range range = { .ndesc = nvec };
 	struct msi_domain_ops *ops = info->ops;
 	int ret;
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	ret = msi_domain_add_simple_msi_descs(info, dev, nvec);
+	ret = msi_domain_add_simple_msi_descs(info, dev, range->ndesc);
 	if (ret)
 		return ret;
 
-	ret = ops->domain_alloc_irqs(domain, dev, &range);
+	ret = ops->domain_alloc_irqs(domain, dev, range);
 	if (ret)
-		msi_domain_free_irqs_descs_locked(domain, dev);
+		msi_domain_free_irqs_descs_locked(domain, dev, range);
 	return ret;
 }
 
@@ -986,10 +985,11 @@ int msi_domain_alloc_irqs_descs_locked(s
  */
 int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, int nvec)
 {
+	struct msi_range range = { .first = 0, .last = UINT_MAX, .ndesc = nvec, };
 	int ret;
 
 	msi_lock_descs(dev);
-	ret = msi_domain_alloc_irqs_descs_locked(domain, dev, nvec);
+	ret = msi_domain_alloc_irqs_descs_locked(domain, dev, &range);
 	msi_unlock_descs(dev);
 	return ret;
 }
@@ -1034,14 +1034,15 @@ static void msi_domain_free_msi_descs(st
  * pair. Use this for MSI irqdomains which implement their own vector
  * allocation.
  */
-void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev)
+void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev,
+				       struct msi_range *range)
 {
 	struct msi_domain_info *info = domain->host_data;
 	struct msi_domain_ops *ops = info->ops;
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	ops->domain_free_irqs(domain, dev, NULL);
+	ops->domain_free_irqs(domain, dev, range);
 	msi_domain_free_msi_descs(info, dev);
 }
 
@@ -1053,8 +1054,10 @@ void msi_domain_free_irqs_descs_locked(s
  */
 void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
 {
+	struct msi_range range = { .first = 0, .last = UINT_MAX, };
+
 	msi_lock_descs(dev);
-	msi_domain_free_irqs_descs_locked(domain, dev);
+	msi_domain_free_irqs_descs_locked(domain, dev, &range);
 	msi_unlock_descs(dev);
 }
 



* [patch 03/10] genirq/msi: Make MSI descriptor alloc/free ready for range allocations
  2021-11-27  1:24 ` [patch 03/10] genirq/msi: Make MSI descriptor alloc/free ready for range allocations Thomas Gleixner
@ 2021-11-27  1:25   ` Thomas Gleixner
  2021-11-28 15:57   ` Marc Zyngier
  1 sibling, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:25 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

Convert the MSI descriptor-related functions to ranges and fix up the call
sites.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/base/platform-msi.c |    3 ++-
 include/linux/msi.h         |    7 ++++---
 kernel/irq/msi.c            |   38 +++++++++++++++++++-------------------
 3 files changed, 25 insertions(+), 23 deletions(-)

--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -320,11 +320,12 @@ struct irq_domain *
 void platform_msi_device_domain_free(struct irq_domain *domain, unsigned int virq,
 				     unsigned int nr_irqs)
 {
+	struct msi_range range = { .first = virq, .last = virq + nr_irqs - 1, };
 	struct platform_msi_priv_data *data = domain->host_data;
 
 	msi_lock_descs(data->dev);
 	irq_domain_free_irqs_common(domain, virq, nr_irqs);
-	msi_free_msi_descs_range(data->dev, MSI_DESC_ALL, virq, nr_irqs);
+	msi_free_msi_descs_range(data->dev, MSI_DESC_ALL, &range);
 	msi_unlock_descs(data->dev);
 }
 
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -321,8 +321,7 @@ static inline void pci_write_msi_msg(uns
 #endif /* CONFIG_PCI_MSI */
 
 int msi_add_msi_desc(struct device *dev, struct msi_desc *init_desc);
-void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter,
-			      unsigned int base_index, unsigned int ndesc);
+void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter, struct msi_range *range);
 
 /**
  * msi_free_msi_descs - Free MSI descriptors of a device
@@ -330,7 +329,9 @@ void msi_free_msi_descs_range(struct dev
  */
 static inline void msi_free_msi_descs(struct device *dev)
 {
-	msi_free_msi_descs_range(dev, MSI_DESC_ALL, 0, UINT_MAX);
+	struct msi_range range = { .first = 0, .last = UINT_MAX, };
+
+	msi_free_msi_descs_range(dev, MSI_DESC_ALL, &range);
 }
 
 void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -101,19 +101,19 @@ int msi_add_msi_desc(struct device *dev,
  *
  * Return: 0 on success or an appropriate failure code.
  */
-static int msi_add_simple_msi_descs(struct device *dev, unsigned int index, unsigned int ndesc)
+static int msi_add_simple_msi_descs(struct device *dev, struct msi_range *range)
 {
 	struct msi_desc *desc;
-	unsigned long i;
+	unsigned long idx;
 	int ret;
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	for (i = 0; i < ndesc; i++) {
+	for (idx = range->first; idx <= range->last; idx++) {
 		desc = msi_alloc_desc(dev, 1, NULL);
 		if (!desc)
 			goto fail_mem;
-		ret = msi_insert_desc(dev->msi.data, desc, index + i);
+		ret = msi_insert_desc(dev->msi.data, desc, idx);
 		if (ret)
 			goto fail;
 	}
@@ -122,7 +122,7 @@ static int msi_add_simple_msi_descs(stru
 fail_mem:
 	ret = -ENOMEM;
 fail:
-	msi_free_msi_descs_range(dev, MSI_DESC_NOTASSOCIATED, index, ndesc);
+	msi_free_msi_descs_range(dev, MSI_DESC_NOTASSOCIATED, range);
 	return ret;
 }
 
@@ -148,14 +148,14 @@ static bool msi_desc_match(struct msi_de
  * @ndesc:	Number of descriptors to free
  */
 void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter,
-			      unsigned int base_index, unsigned int ndesc)
+			      struct msi_range *range)
 {
 	struct msi_desc *desc;
 	unsigned long idx;
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	xa_for_each_range(&dev->msi.data->store, idx, desc, base_index, base_index + ndesc - 1) {
+	xa_for_each_range(&dev->msi.data->store, idx, desc, range->first, range->last) {
 		if (msi_desc_match(desc, filter)) {
 			xa_erase(&dev->msi.data->store, idx);
 			msi_free_desc(desc);
@@ -746,17 +746,18 @@ int msi_domain_prepare_irqs(struct irq_d
 int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
 			     int virq_base, int nvec, msi_alloc_info_t *arg)
 {
+	struct msi_range range = { .first = virq_base, .last = virq_base + nvec - 1 };
 	struct msi_domain_info *info = domain->host_data;
 	struct msi_domain_ops *ops = info->ops;
 	struct msi_desc *desc;
 	int ret, virq;
 
 	msi_lock_descs(dev);
-	ret = msi_add_simple_msi_descs(dev, virq_base, nvec);
+	ret = msi_add_simple_msi_descs(dev, &range);
 	if (ret)
 		goto unlock;
 
-	for (virq = virq_base; virq < virq_base + nvec; virq++) {
+	for (virq = range.first; virq <= range.last; virq++) {
 		desc = xa_load(&dev->msi.data->store, virq);
 		desc->irq = virq;
 
@@ -773,7 +774,7 @@ int msi_domain_populate_irqs(struct irq_
 fail:
 	for (--virq; virq >= virq_base; virq--)
 		irq_domain_free_irqs_common(domain, virq, 1);
-	msi_free_msi_descs_range(dev, MSI_DESC_ALL, virq_base, nvec);
+	msi_free_msi_descs_range(dev, MSI_DESC_ALL, &range);
 unlock:
 	msi_unlock_descs(dev);
 	return ret;
@@ -932,14 +933,13 @@ int __msi_domain_alloc_irqs(struct irq_d
 	return 0;
 }
 
-static int msi_domain_add_simple_msi_descs(struct msi_domain_info *info,
-					   struct device *dev,
-					   unsigned int num_descs)
+static int msi_domain_add_simple_msi_descs(struct msi_domain_info *info, struct device *dev,
+					   struct msi_range *range)
 {
 	if (!(info->flags & MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS))
 		return 0;
 
-	return msi_add_simple_msi_descs(dev, 0, num_descs);
+	return msi_add_simple_msi_descs(dev, range);
 }
 
 /**
@@ -964,7 +964,7 @@ int msi_domain_alloc_irqs_descs_locked(s
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	ret = msi_domain_add_simple_msi_descs(info, dev, range->ndesc);
+	ret = msi_domain_add_simple_msi_descs(info, dev, range);
 	if (ret)
 		return ret;
 
@@ -1017,11 +1017,11 @@ void __msi_domain_free_irqs(struct irq_d
 	}
 }
 
-static void msi_domain_free_msi_descs(struct msi_domain_info *info,
-				      struct device *dev)
+static void msi_domain_free_msi_descs(struct msi_domain_info *info, struct device *dev,
+				      struct msi_range *range)
 {
 	if (info->flags & MSI_FLAG_FREE_MSI_DESCS)
-		msi_free_msi_descs(dev);
+		msi_free_msi_descs_range(dev, MSI_DESC_ALL, range);
 }
 
 /**
@@ -1043,7 +1043,7 @@ void msi_domain_free_irqs_descs_locked(s
 	lockdep_assert_held(&dev->msi.data->mutex);
 
 	ops->domain_free_irqs(domain, dev, range);
-	msi_domain_free_msi_descs(info, dev);
+	msi_domain_free_msi_descs(info, dev, range);
 }
 
 /**



* [patch 04/10] genirq/msi: Prepare MSI domain alloc/free for range irq allocation
  2021-11-27  1:24 [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
                   ` (7 preceding siblings ...)
  2021-11-27  1:24 ` [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
@ 2021-11-27  1:25 ` Thomas Gleixner
  2021-11-27  1:24   ` Thomas Gleixner
  2021-11-27  1:25 ` [patch 07/10] PCI/MSI: Make free related functions range based Thomas Gleixner
  2021-11-27  1:25 ` [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]() Thomas Gleixner
  10 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:25 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

Make the iterators in the allocation and free functions range based.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/irq/msi.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -877,6 +877,7 @@ int __msi_domain_alloc_irqs(struct irq_d
 	msi_alloc_info_t arg = { };
 	unsigned int vflags = 0;
 	struct msi_desc *desc;
+	unsigned long idx;
 	int allocated = 0;
 	int i, ret, virq;
 
@@ -906,7 +907,10 @@ int __msi_domain_alloc_irqs(struct irq_d
 			vflags |= VIRQ_NOMASK_QUIRK;
 	}
 
-	msi_for_each_desc(desc, dev, MSI_DESC_NOTASSOCIATED) {
+	xa_for_each_range(&dev->msi.data->store, idx, desc, range->first, range->last) {
+		if (!msi_desc_match(desc, MSI_DESC_NOTASSOCIATED))
+			continue;
+
 		ops->set_desc(&arg, desc);
 
 		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
@@ -999,10 +1003,14 @@ void __msi_domain_free_irqs(struct irq_d
 	struct msi_domain_info *info = domain->host_data;
 	struct irq_data *irqd;
 	struct msi_desc *desc;
+	unsigned long idx;
 	int i;
 
 	/* Only handle MSI entries which have an interrupt associated */
-	msi_for_each_desc(desc, dev, MSI_DESC_ASSOCIATED) {
+	xa_for_each_range(&dev->msi.data->store, idx, desc, range->first, range->last) {
+		if (!msi_desc_match(desc, MSI_DESC_ASSOCIATED))
+			continue;
+
 		/* Make sure all interrupts are deactivated */
 		for (i = 0; i < desc->nvec_used; i++) {
 			irqd = irq_domain_get_irq_data(domain, desc->irq + i);



* [patch 05/10] genirq/msi: Add domain info flag MSI_FLAG_CAN_EXPAND
  2021-11-27  1:24 ` [patch 05/10] genirq/msi: Add domain info flag MSI_FLAG_CAN_EXPAND Thomas Gleixner
@ 2021-11-27  1:25   ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:25 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

Not all MSI domains support runtime expansion of PCI/MSI-X vectors. Add a
domain flag so implementations can opt in.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |    2 ++
 1 file changed, 2 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -494,6 +494,8 @@ enum {
 	MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS	= (1 << 9),
 	/* Free MSI descriptors */
 	MSI_FLAG_FREE_MSI_DESCS		= (1 << 10),
+	/* MSI vectors can be expanded after initial setup */
+	MSI_FLAG_CAN_EXPAND		= (1 << 11),
 };
 
 int msi_domain_set_affinity(struct irq_data *data, const struct cpumask *mask,



* [patch 06/10] PCI/MSI: Use range in allocation path
  2021-11-27  1:24 ` [patch 06/10] PCI/MSI: Use range in allocation path Thomas Gleixner
@ 2021-11-27  1:25   ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:25 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

Make the allocation path range based to prepare for runtime expansion of
MSI-X vectors.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |    7 +++----
 drivers/pci/msi/msi.c       |   34 +++++++++++++++++++++-------------
 drivers/pci/msi/msi.h       |    2 +-
 3 files changed, 25 insertions(+), 18 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -8,16 +8,15 @@
 
 #include "msi.h"
 
-int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int pci_msi_setup_msi_irqs(struct pci_dev *dev, struct msi_range *range, int type)
 {
-	struct msi_range range = { .first = 0, .last = UINT_MAX, .ndesc = nvec};
 	struct irq_domain *domain;
 
 	domain = dev_get_msi_domain(&dev->dev);
 	if (domain && irq_domain_is_hierarchy(domain))
-		return msi_domain_alloc_irqs_descs_locked(domain, &dev->dev, &range);
+		return msi_domain_alloc_irqs_descs_locked(domain, &dev->dev, range);
 
-	return pci_msi_legacy_setup_msi_irqs(dev, nvec, type);
+	return pci_msi_legacy_setup_msi_irqs(dev, range->ndesc, type);
 }
 
 void pci_msi_teardown_msi_irqs(struct pci_dev *dev)
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -370,14 +370,16 @@ static int msi_setup_msi_desc(struct pci
 	return ret;
 }
 
-static int msi_verify_entries(struct pci_dev *dev)
+static int msi_verify_entries(struct pci_dev *dev, struct msi_range *range)
 {
 	struct msi_desc *entry;
 
 	if (!dev->no_64bit_msi)
 		return 0;
 
-	msi_for_each_desc(entry, &dev->dev, MSI_DESC_ALL) {
+	msi_for_each_desc_from(entry, &dev->dev, MSI_DESC_ALL, range->first) {
+		if (entry->msi_index > range->last)
+			return 0;
 		if (entry->msg.address_hi) {
 			pci_err(dev, "arch assigned 64-bit MSI address %#x%08x but device only supports 32 bits\n",
 				entry->msg.address_hi, entry->msg.address_lo);
@@ -402,6 +404,7 @@ static int msi_verify_entries(struct pci
 static int msi_capability_init(struct pci_dev *dev, int nvec,
 			       struct irq_affinity *affd)
 {
+	struct msi_range range = { .first = 0, .last = 0, .ndesc = nvec, };
 	struct irq_affinity_desc *masks = NULL;
 	struct msi_desc *entry;
 	int ret;
@@ -421,11 +424,11 @@ static int msi_capability_init(struct pc
 	pci_msi_mask(entry, msi_multi_mask(entry));
 
 	/* Configure MSI capability structure */
-	ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
+	ret = pci_msi_setup_msi_irqs(dev, &range, PCI_CAP_ID_MSI);
 	if (ret)
 		goto err;
 
-	ret = msi_verify_entries(dev);
+	ret = msi_verify_entries(dev, &range);
 	if (ret)
 		goto err;
 
@@ -469,7 +472,8 @@ static void __iomem *msix_map_region(str
 }
 
 static int msix_setup_msi_descs(struct pci_dev *dev, void __iomem *base,
-				struct msix_entry *entries, int nvec,
+				struct msi_range *range,
+				struct msix_entry *entries,
 				struct irq_affinity_desc *masks)
 {
 	int ret, i, vec_count = pci_msix_vec_count(dev);
@@ -485,8 +489,8 @@ static int msix_setup_msi_descs(struct p
 	desc.pci.msi_attrib.default_irq	= dev->irq;
 	desc.pci.mask_base		= base;
 
-	for (i = 0, curmsk = masks; i < nvec; i++, curmsk++) {
-		desc.msi_index = entries ? entries[i].entry : i;
+	for (i = 0, curmsk = masks; i < range->ndesc; i++, curmsk++) {
+		desc.msi_index = entries ? entries[i].entry : range->first + i;
 		desc.affinity = masks ? curmsk : NULL;
 		desc.pci.msi_attrib.is_virtual = desc.msi_index >= vec_count;
 		desc.pci.msi_attrib.can_mask = !pci_msi_ignore_mask &&
@@ -500,6 +504,9 @@ static int msix_setup_msi_descs(struct p
 		ret = msi_add_msi_desc(&dev->dev, &desc);
 		if (ret)
 			break;
+
+		if (desc.msi_index > range->last)
+			range->last = desc.msi_index;
 	}
 
 	return ret;
@@ -530,28 +537,28 @@ static void msix_mask_all(void __iomem *
 }
 
 static int msix_setup_interrupts(struct pci_dev *dev, void __iomem *base,
-				 struct msix_entry *entries, int nvec,
+				 struct msi_range *range, struct msix_entry *entries,
 				 struct irq_affinity *affd)
 {
 	struct irq_affinity_desc *masks = NULL;
 	int ret;
 
 	if (affd)
-		masks = irq_create_affinity_masks(nvec, affd);
+		masks = irq_create_affinity_masks(range->ndesc, affd);
 
 	msi_lock_descs(&dev->dev);
-	ret = msix_setup_msi_descs(dev, base, entries, nvec, masks);
+	ret = msix_setup_msi_descs(dev, base, range, entries, masks);
 	if (ret)
 		goto out_free;
 
 	dev->dev.msi.data->properties = MSI_PROP_PCI_MSIX | MSI_PROP_64BIT;
 
-	ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
+	ret = pci_msi_setup_msi_irqs(dev, range, PCI_CAP_ID_MSIX);
 	if (ret)
 		goto out_free;
 
 	/* Check if all MSI entries honor device restrictions */
-	ret = msi_verify_entries(dev);
+	ret = msi_verify_entries(dev, range);
 	if (ret)
 		goto out_free;
 
@@ -580,6 +587,7 @@ static int msix_setup_interrupts(struct
 static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
 				int nvec, struct irq_affinity *affd)
 {
+	struct msi_range range = { .first = 0, .last = 0, .ndesc = nvec, };
 	void __iomem *base;
 	int ret, tsize;
 	u16 control;
@@ -606,7 +614,7 @@ static int msix_capability_init(struct p
 	/* Ensure that all table entries are masked. */
 	msix_mask_all(base, tsize);
 
-	ret = msix_setup_interrupts(dev, base, entries, nvec, affd);
+	ret = msix_setup_interrupts(dev, base, &range, entries, affd);
 	if (ret)
 		goto out_disable;
 
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -5,7 +5,7 @@
 
 #define msix_table_size(flags)	((flags & PCI_MSIX_FLAGS_QSIZE) + 1)
 
-extern int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
+extern int pci_msi_setup_msi_irqs(struct pci_dev *dev, struct msi_range *range, int type);
 extern void pci_msi_teardown_msi_irqs(struct pci_dev *dev);
 
 #ifdef CONFIG_PCI_MSI_ARCH_FALLBACKS



* [patch 07/10] PCI/MSI: Make free related functions range based
  2021-11-27  1:24 [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
                   ` (8 preceding siblings ...)
  2021-11-27  1:25 ` [patch 04/10] genirq/msi: Prepare MSI domain alloc/free for range irq allocation Thomas Gleixner
@ 2021-11-27  1:25 ` Thomas Gleixner
  2021-11-27  1:24   ` Thomas Gleixner
  2021-11-27  1:25 ` [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]() Thomas Gleixner
  10 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:25 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

In preparation for runtime-expandable PCI/MSI-X vectors, convert the related
free functions to take ranges instead of assuming a zero-based vector
space.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |    5 ++---
 drivers/pci/msi/msi.c       |   24 ++++++++++++++++--------
 drivers/pci/msi/msi.h       |    2 +-
 3 files changed, 19 insertions(+), 12 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -19,14 +19,13 @@ int pci_msi_setup_msi_irqs(struct pci_de
 	return pci_msi_legacy_setup_msi_irqs(dev, range->ndesc, type);
 }
 
-void pci_msi_teardown_msi_irqs(struct pci_dev *dev)
+void pci_msi_teardown_msi_irqs(struct pci_dev *dev, struct msi_range *range)
 {
-	struct msi_range range = { .first = 0, .last = UINT_MAX, };
 	struct irq_domain *domain;
 
 	domain = dev_get_msi_domain(&dev->dev);
 	if (domain && irq_domain_is_hierarchy(domain))
-		msi_domain_free_irqs_descs_locked(domain, &dev->dev, &range);
+		msi_domain_free_irqs_descs_locked(domain, &dev->dev, range);
 	else
 		pci_msi_legacy_teardown_msi_irqs(dev);
 }
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -222,9 +222,12 @@ void pci_write_msi_msg(unsigned int irq,
 }
 EXPORT_SYMBOL_GPL(pci_write_msi_msg);
 
-static void free_msi_irqs(struct pci_dev *dev)
+static void free_msi_irqs(struct pci_dev *dev, struct msi_range *range, bool shutdown)
 {
-	pci_msi_teardown_msi_irqs(dev);
+	pci_msi_teardown_msi_irqs(dev, range);
+
+	if (!shutdown)
+		return;
 
 	if (dev->msix_base) {
 		iounmap(dev->msix_base);
@@ -443,7 +446,7 @@ static int msi_capability_init(struct pc
 
 err:
 	pci_msi_unmask(entry, msi_multi_mask(entry));
-	free_msi_irqs(dev);
+	free_msi_irqs(dev, &range, true);
 unlock:
 	msi_unlock_descs(&dev->dev);
 	kfree(masks);
@@ -538,7 +541,7 @@ static void msix_mask_all(void __iomem *
 
 static int msix_setup_interrupts(struct pci_dev *dev, void __iomem *base,
 				 struct msi_range *range, struct msix_entry *entries,
-				 struct irq_affinity *affd)
+				 struct irq_affinity *affd, bool expand)
 {
 	struct irq_affinity_desc *masks = NULL;
 	int ret;
@@ -566,7 +569,8 @@ static int msix_setup_interrupts(struct
 	goto out_unlock;
 
 out_free:
-	free_msi_irqs(dev);
+	free_msi_irqs(dev, range, !expand);
+
 out_unlock:
 	msi_unlock_descs(&dev->dev);
 	kfree(masks);
@@ -614,7 +618,7 @@ static int msix_capability_init(struct p
 	/* Ensure that all table entries are masked. */
 	msix_mask_all(base, tsize);
 
-	ret = msix_setup_interrupts(dev, base, &range, entries, affd);
+	ret = msix_setup_interrupts(dev, base, &range, entries, affd, false);
 	if (ret)
 		goto out_disable;
 
@@ -728,12 +732,14 @@ static void pci_msi_shutdown(struct pci_
 
 void pci_disable_msi(struct pci_dev *dev)
 {
+	struct msi_range range = { .first = 0, .last = 0, };
+
 	if (!pci_msi_enable || !dev || !dev->msi_enabled)
 		return;
 
 	msi_lock_descs(&dev->dev);
 	pci_msi_shutdown(dev);
-	free_msi_irqs(dev);
+	free_msi_irqs(dev, &range, true);
 	msi_unlock_descs(&dev->dev);
 }
 EXPORT_SYMBOL(pci_disable_msi);
@@ -817,12 +823,14 @@ static void pci_msix_shutdown(struct pci
 
 void pci_disable_msix(struct pci_dev *dev)
 {
+	struct msi_range range = { .first = 0, .last = UINT_MAX, };
+
 	if (!pci_msi_enable || !dev || !dev->msix_enabled)
 		return;
 
 	msi_lock_descs(&dev->dev);
 	pci_msix_shutdown(dev);
-	free_msi_irqs(dev);
+	free_msi_irqs(dev, &range, true);
 	msi_unlock_descs(&dev->dev);
 }
 EXPORT_SYMBOL(pci_disable_msix);
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -6,7 +6,7 @@
 #define msix_table_size(flags)	((flags & PCI_MSIX_FLAGS_QSIZE) + 1)
 
 extern int pci_msi_setup_msi_irqs(struct pci_dev *dev, struct msi_range *range, int type);
-extern void pci_msi_teardown_msi_irqs(struct pci_dev *dev);
+extern void pci_msi_teardown_msi_irqs(struct pci_dev *dev, struct msi_range *range);
 
 #ifdef CONFIG_PCI_MSI_ARCH_FALLBACKS
 extern int pci_msi_legacy_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);



* [patch 08/10] PCI/MSI: Provide pci_msi_domain_supports_expand()
  2021-11-27  1:24 ` [patch 08/10] PCI/MSI: Provide pci_msi_domain_supports_expand() Thomas Gleixner
@ 2021-11-27  1:25   ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:25 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

Not all irq domain implementations can support runtime MSI-X vector
expansion as they assume zero based allocations or have other
restrictions.

The legacy PCI allocation functions are not suited for runtime vector
expansion either.

Add a function which allows querying whether runtime MSI-X vector expansion
is supported.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |   29 +++++++++++++++++++++++------
 include/linux/msi.h         |    2 ++
 2 files changed, 25 insertions(+), 6 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -8,12 +8,18 @@
 
 #include "msi.h"
 
+static struct irq_domain *pci_get_msi_domain(struct pci_dev *dev)
+{
+	struct irq_domain *domain = dev_get_msi_domain(&dev->dev);
+
+	return domain && irq_domain_is_hierarchy(domain) ? domain : NULL;
+}
+
 int pci_msi_setup_msi_irqs(struct pci_dev *dev, struct msi_range *range, int type)
 {
-	struct irq_domain *domain;
+	struct irq_domain *domain = pci_get_msi_domain(dev);
 
-	domain = dev_get_msi_domain(&dev->dev);
-	if (domain && irq_domain_is_hierarchy(domain))
+	if (domain)
 		return msi_domain_alloc_irqs_descs_locked(domain, &dev->dev, range);
 
 	return pci_msi_legacy_setup_msi_irqs(dev, range->ndesc, type);
@@ -21,15 +27,26 @@ int pci_msi_setup_msi_irqs(struct pci_de
 
 void pci_msi_teardown_msi_irqs(struct pci_dev *dev, struct msi_range *range)
 {
-	struct irq_domain *domain;
+	struct irq_domain *domain = pci_get_msi_domain(dev);
 
-	domain = dev_get_msi_domain(&dev->dev);
-	if (domain && irq_domain_is_hierarchy(domain))
+	if (domain)
 		msi_domain_free_irqs_descs_locked(domain, &dev->dev, range);
 	else
 		pci_msi_legacy_teardown_msi_irqs(dev);
 }
 
+bool pci_msi_domain_supports_expand(struct pci_dev *dev)
+{
+	struct irq_domain *domain = pci_get_msi_domain(dev);
+	struct msi_domain_info *info;
+
+	if (!domain)
+		return false;
+
+	info = domain->host_data;
+	return info->flags & MSI_FLAG_CAN_EXPAND;
+}
+
 /**
  * pci_msi_domain_write_msg - Helper to write MSI message to PCI config space
  * @irq_data:	Pointer to interrupt data of the MSI interrupt
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -552,11 +552,13 @@ struct irq_domain *pci_msi_create_irq_do
 u32 pci_msi_domain_get_msi_rid(struct irq_domain *domain, struct pci_dev *pdev);
 struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev);
 bool pci_dev_has_special_msi_domain(struct pci_dev *pdev);
+bool pci_msi_domain_supports_expand(struct pci_dev *dev);
 #else
 static inline struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev)
 {
 	return NULL;
 }
+static inline bool pci_msi_domain_supports_expand(struct pci_dev *dev) { return false; }
 #endif /* CONFIG_PCI_MSI_IRQ_DOMAIN */
 
 #endif /* LINUX_MSI_H */

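A minimal usage sketch (hypothetical driver code, not part of the patch; 'foo' and FOO_MIN_VECS are made-up names): a driver probes once whether the expansion path is usable and otherwise sticks to its static allocation:

	nvec = pci_alloc_irq_vectors(pdev, FOO_MIN_VECS, FOO_MIN_VECS, PCI_IRQ_MSIX);
	if (nvec < 0)
		return nvec;

	/* Only rely on post-enable expansion when the irq domain opted in */
	foo->can_expand = pci_msi_domain_supports_expand(pdev);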


* [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
  2021-11-27  1:24 [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
                   ` (9 preceding siblings ...)
  2021-11-27  1:25 ` [patch 07/10] PCI/MSI: Make free related functions range based Thomas Gleixner
@ 2021-11-27  1:25 ` Thomas Gleixner
  2021-11-27  1:24   ` Thomas Gleixner
  2021-12-02  1:08   ` Dey, Megha
  10 siblings, 2 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:25 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

Provide a new interface which allows expanding the MSI-X vector space if
the underlying irq domain implementation supports it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/msi.c |   41 +++++++++++++++++++++++++++++++++++++++++
 include/linux/pci.h   |   13 +++++++++++++
 2 files changed, 54 insertions(+)

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -1025,6 +1025,47 @@ int pci_alloc_irq_vectors_affinity(struc
 EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);
 
 /**
+ * pci_msix_expand_vectors_at - Expand MSI-X interrupts for a device
+ *
+ * @dev:	PCI device to operate on
+ * @at:		Allocate at MSI-X index. If @at == PCI_MSIX_EXPAND_AUTO
+ *		the function expands automatically after the last
+ *		active index.
+ * @nvec:	Number of vectors to allocate
+ *
+ * Expand the MSI-X vectors of a device after an initial enablement and
+ * allocation.
+ *
+ * Return: 0 if the allocation was successful, an error code otherwise.
+ */
+int pci_msix_expand_vectors_at(struct pci_dev *dev, unsigned int at, unsigned int nvec)
+{
+	struct msi_device_data *md = dev->dev.msi.data;
+	struct msi_range range = { .ndesc = nvec, };
+	unsigned int max_vecs;
+	int ret;
+
+	if (!pci_msi_enable || !dev || !dev->msix_enabled || !md)
+		return -ENOTSUPP;
+
+	if (!pci_msi_domain_supports_expand(dev))
+		return -ENOTSUPP;
+
+	max_vecs = pci_msix_vec_count(dev);
+	if (!nvec || nvec > max_vecs)
+		return -EINVAL;
+
+	range.first = at == PCI_MSIX_EXPAND_AUTO ? md->num_descs : at;
+
+	if (range.first >= max_vecs || nvec > max_vecs - range.first)
+		return -ENOSPC;
+
+	ret = msix_setup_interrupts(dev, dev->msix_base, &range, NULL, NULL, true);
+	return ret <= 0 ? ret : -ENOSPC;
+}
+EXPORT_SYMBOL_GPL(pci_msix_expand_vectors_at);
+
+/**
  * pci_free_irq_vectors - free previously allocated IRQs for a device
  * @dev:		PCI device to operate on
  *
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1534,6 +1534,7 @@ static inline int pci_enable_msix_exact(
 int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 				   unsigned int max_vecs, unsigned int flags,
 				   struct irq_affinity *affd);
+int pci_msix_expand_vectors_at(struct pci_dev *dev, unsigned int at, unsigned int nvec);
 
 void pci_free_irq_vectors(struct pci_dev *dev);
 int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
@@ -1565,6 +1566,11 @@ pci_alloc_irq_vectors_affinity(struct pc
 	return -ENOSPC;
 }
 
+static inline int pci_msix_expand_vectors_at(struct pci_dev *dev, unsigned int at, unsigned int nvec)
+{
+	return -ENOTSUPP;
+}
+
 static inline void pci_free_irq_vectors(struct pci_dev *dev)
 {
 }
@@ -1582,6 +1588,13 @@ static inline const struct cpumask *pci_
 }
 #endif
 
+#define PCI_MSIX_EXPAND_AUTO	(UINT_MAX)
+
+static inline int pci_msix_expand_vectors(struct pci_dev *dev, unsigned int nvec)
+{
+	return pci_msix_expand_vectors_at(dev, PCI_MSIX_EXPAND_AUTO, nvec);
+}
+
 /**
  * pci_irqd_intx_xlate() - Translate PCI INTx value to an IRQ domain hwirq
  * @d: the INTx IRQ domain

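Again a purely hypothetical usage sketch (the index 5, the handler and the names are made up) of how a driver bringing up an extra queue at runtime could use the new calls:

	/* Append one vector after the last active index */
	ret = pci_msix_expand_vectors(pdev, 1);

	/* Or allocate one vector at an explicit MSI-X index, here 5 */
	ret = pci_msix_expand_vectors_at(pdev, 5, 1);
	if (!ret) {
		irq = pci_irq_vector(pdev, 5);
		ret = request_irq(irq, foo_queue_handler, 0, "foo-queue", foo);
	}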


* [patch 10/10] x86/apic/msi: Support MSI-X vector expansion
  2021-11-27  1:24 ` [patch 10/10] x86/apic/msi: Support MSI-X vector expansion Thomas Gleixner
@ 2021-11-27  1:25   ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:25 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

The X86 PCI/MSI irq domain implementation supports vector expansion out of
the box. Make it available.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/msi.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -178,7 +178,7 @@ static struct msi_domain_ops pci_msi_dom
 
 static struct msi_domain_info pci_msi_domain_info = {
 	.flags		= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
-			  MSI_FLAG_PCI_MSIX,
+			  MSI_FLAG_PCI_MSIX | MSI_FLAG_CAN_EXPAND,
 	.ops		= &pci_msi_domain_ops,
 	.chip		= &pci_msi_controller,
 	.handler	= handle_edge_irq,
@@ -226,7 +226,7 @@ static struct irq_chip pci_msi_ir_contro
 
 static struct msi_domain_info pci_msi_ir_domain_info = {
 	.flags		= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
-			  MSI_FLAG_MULTI_PCI_MSI | MSI_FLAG_PCI_MSIX,
+			  MSI_FLAG_MULTI_PCI_MSI | MSI_FLAG_PCI_MSIX | MSI_FLAG_CAN_EXPAND,
 	.ops		= &pci_msi_domain_ops,
 	.chip		= &pci_msi_ir_controller,
 	.handler	= handle_edge_irq,



* Re: [patch 03/10] genirq/msi: Make MSI descriptor alloc/free ready for range allocations
  2021-11-27  1:24 ` [patch 03/10] genirq/msi: Make MSI descriptor alloc/free ready for range allocations Thomas Gleixner
  2021-11-27  1:25   ` Thomas Gleixner
@ 2021-11-28 15:57   ` Marc Zyngier
  2021-11-28 19:17     ` Thomas Gleixner
  1 sibling, 1 reply; 31+ messages in thread
From: Marc Zyngier @ 2021-11-28 15:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Bjorn Helgaas, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

On Sat, 27 Nov 2021 01:24:34 +0000,
Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> Convert the MSI descriptor related functions to ranges and fixup the call
> sites.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  drivers/base/platform-msi.c |    3 ++-
>  include/linux/msi.h         |    7 ++++---
>  kernel/irq/msi.c            |   38 +++++++++++++++++++-------------------
>  3 files changed, 25 insertions(+), 23 deletions(-)

This particular patch breaks one of my test boxes when allocating the
MSIs for the first SMMUv3 it encounters:

[   14.700206] arm-smmu-v3 arm-smmu-v3.0.auto: option mask 0x0
[   14.705848] arm-smmu-v3 arm-smmu-v3.0.auto: ias 48-bit, oas 48-bit (features 0x00041fff)
[   14.716184] arm-smmu-v3 arm-smmu-v3.0.auto: allocated 65536 entries for cmdq
[   14.723285] arm-smmu-v3 arm-smmu-v3.0.auto: allocated 128 entries for evtq
[   14.730170] arm-smmu-v3 arm-smmu-v3.0.auto: allocated 256 entries for priq
[   41.282305] watchdog: BUG: soft lockup - CPU#43 stuck for 26s! [swapper/0:1]
[   41.289383] Modules linked in:
[   41.292430] irq event stamp: 3906684
[   41.295994] hardirqs last  enabled at (3906683): [<ffffb73f677c54d8>] ___slab_alloc+0x7c8/0x8c0
[   41.304698] hardirqs last disabled at (3906684): [<ffffb73f6806d478>] el1_interrupt+0x38/0xb0
[   41.313220] softirqs last  enabled at (3798058): [<ffffb73f6746099c>] __do_softirq+0x40c/0x58c
[   41.321825] softirqs last disabled at (3798053): [<ffffb73f674ee610>] __irq_exit_rcu+0x120/0x160
[   41.330607] CPU: 43 PID: 1 Comm: swapper/0 Not tainted 5.16.0-rc2-00078-g76af42494903 #41
[   41.338775] Hardware name: WIWYNN Mt.Jade Server System B81.030Z1.0007/Mt.Jade Motherboard, BIOS 1.3.20210110 2021/01/10
[   41.349634] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   41.356585] pc : lock_is_held_type+0x124/0x20c
[   41.361019] lr : lock_is_held_type+0xe8/0x20c
[   41.365365] sp : ffff80001051b840
[   41.368669] pmr_save: 000000e0
[   41.371712] x29: ffff80001051b840 x28: ffff000034d80000 x27: ffff07ff87606600
[   41.378838] x26: 00000000000000e0 x25: 00000000ffffffff x24: ffffb73f68733d00
[   41.385964] x23: 0000000000000028 x22: ffff07ff87606fb8 x21: ffffb73f68bffc38
[   41.393090] x20: ffff07ff87606fe0 x19: 0000000000000002 x18: 0000000000000014
[   41.400217] x17: 0000000076a13aac x16: 00000000486de301 x15: 00000000ed5ff5e1
[   41.407342] x14: 00000000fda7b077 x13: 0000000000000006 x12: 00000000b82b73dd
[   41.414468] x11: ffff07ff87606fb8 x10: ffffb73f6952d000 x9 : ffffb73f675a3750
[   41.421594] x8 : 0000000000000000 x7 : ffffb73f68e16000 x6 : ffffb73f677c6d64
[   41.428720] x5 : 0000000000000000 x4 : 0000000000000001 x3 : 000000000000ffff
[   41.435846] x2 : ffff50ff17454000 x1 : 0000000000000000 x0 : 0000000000000000
[   41.442972] Call trace:
[   41.445407]  lock_is_held_type+0x124/0x20c
[   41.449494]  rcu_read_lock_sched_held+0x68/0xac
[   41.454018]  trace_lock_acquire+0x78/0x1c0
[   41.458107]  lock_acquire+0x40/0x90
[   41.461585]  fs_reclaim_acquire+0x90/0x114
[   41.465676]  kmem_cache_alloc_trace+0x80/0x300
[   41.470110]  msi_add_simple_msi_descs+0x70/0x150
[   41.474718]  msi_domain_alloc_irqs_descs_locked+0x90/0xfc
[   41.480106]  msi_domain_alloc_irqs+0x58/0xa0
[   41.484364]  platform_msi_domain_alloc_irqs+0x5c/0xa0
[   41.489410]  arm_smmu_device_probe+0xfc0/0x1230
[   41.493936]  platform_probe+0x74/0xe4
[   41.497590]  really_probe+0xc4/0x470
[   41.501156]  __driver_probe_device+0x11c/0x190
[   41.505589]  driver_probe_device+0x48/0x110
[   41.509761]  __driver_attach+0xe0/0x200
[   41.513585]  bus_for_each_dev+0x7c/0xe0
[   41.517412]  driver_attach+0x30/0x3c
[   41.520976]  bus_add_driver+0x150/0x230
[   41.524801]  driver_register+0x84/0x140
[   41.528626]  __platform_driver_register+0x34/0x40
[   41.533319]  arm_smmu_driver_init+0x2c/0x38
[   41.537496]  do_one_initcall+0x80/0x3d0
[   41.541321]  kernel_init_freeable+0x318/0x3a0
[   41.545672]  kernel_init+0x30/0x14c
[   41.549151]  ret_from_fork+0x10/0x20

The issue seems to be originating in the previous patch, where the
following line was added:

+	struct msi_range range = { .first = 0, .last = UINT_MAX, .ndesc = nvec, };

In that context, only 'ndesc' was used, and that was fine.

However, in the current patch, ndesc use is removed, only first/last
are considered, and UINT_MAX is... a lot of MSIs.

This fixes it:

diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index bef5b74a7268..a520bfd94a56 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -975,7 +975,7 @@ int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device
  */
 int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, int nvec)
 {
-	struct msi_range range = { .first = 0, .last = UINT_MAX, .ndesc = nvec, };
+	struct msi_range range = { .first = 0, .last = nvec - 1, .ndesc = nvec, };
 	int ret;
 
 	msi_lock_descs(dev);

However, it'd be good to clarify the use of range->ndesc.

[...]

> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -101,19 +101,19 @@ int msi_add_msi_desc(struct device *dev,
>   *
>   * Return: 0 on success or an appropriate failure code.
>   */
> -static int msi_add_simple_msi_descs(struct device *dev, unsigned int index, unsigned int ndesc)
> +static int msi_add_simple_msi_descs(struct device *dev, struct msi_range *range)

nit: most of the functions changed in this patch need to have their
documentation tidied up.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [patch 03/10] genirq/msi: Make MSI descriptor alloc/free ready for range allocations
  2021-11-28 15:57   ` Marc Zyngier
@ 2021-11-28 19:17     ` Thomas Gleixner
  2021-11-29 17:28       ` Thomas Gleixner
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-28 19:17 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: LKML, Bjorn Helgaas, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

On Sun, Nov 28 2021 at 15:57, Marc Zyngier wrote:
> On Sat, 27 Nov 2021 01:24:34 +0000,
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
> The issue seems to be originating in the previous patch, where the
> following line was added:
>
> +	struct msi_range range = { .first = 0, .last = UINT_MAX, .ndesc = nvec, };
>
> In that context, only 'ndesc' was used, and that was fine.
>
> However, in the current patch, ndesc use is removed, only first/last
> are considered, and UINT_MAX is... a lot of MSIs.
>
> This fixes it:
>
> diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
> index bef5b74a7268..a520bfd94a56 100644
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -975,7 +975,7 @@ int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device
>   */
>  int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, int nvec)
>  {
> -	struct msi_range range = { .first = 0, .last = UINT_MAX, .ndesc = nvec, };
> +	struct msi_range range = { .first = 0, .last = nvec - 1, .ndesc = nvec, };
>  	int ret;
>  
>  	msi_lock_descs(dev);
>
> However, it'd be good to clarify the use of range->ndesc.

Hrm. The stupid search should have terminated nevertheless. Let me stare at
it again.

>> -static int msi_add_simple_msi_descs(struct device *dev, unsigned int index, unsigned int ndesc)
>> +static int msi_add_simple_msi_descs(struct device *dev, struct msi_range *range)
>
> nit: most of the functions changed in this patch need to have their
> documentation tidied up.

Duh, yes.

Thanks,

        tglx


* Re: [patch 03/10] genirq/msi: Make MSI descriptor alloc/free ready for range allocations
  2021-11-28 19:17     ` Thomas Gleixner
@ 2021-11-29 17:28       ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-11-29 17:28 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: LKML, Bjorn Helgaas, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, Michael Ellerman,
	Andrew Cooper, Juergen Gross, linux-pci, xen-devel

On Sun, Nov 28 2021 at 20:17, Thomas Gleixner wrote:
> On Sun, Nov 28 2021 at 15:57, Marc Zyngier wrote:
> Hrm. The stupid search should terminated nevertheless. Let me stare at
> it again.

Found it. Just my inability to read xarray documentation.

Thanks,

        tglx


* Re: [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
  2021-11-27  1:25 ` [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]() Thomas Gleixner
  2021-11-27  1:24   ` Thomas Gleixner
@ 2021-12-02  1:08   ` Dey, Megha
  2021-12-02 10:16     ` Thomas Gleixner
  1 sibling, 1 reply; 31+ messages in thread
From: Dey, Megha @ 2021-12-02  1:08 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Ashok Raj, Michael Ellerman, Andrew Cooper,
	Juergen Gross, linux-pci, xen-devel

Hi Thomas,
On 11/26/2021 5:25 PM, Thomas Gleixner wrote:
> Provide a new interface which allows to expand the MSI-X vector space if
> the underlying irq domain implementation supports it.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   drivers/pci/msi/msi.c |   41 +++++++++++++++++++++++++++++++++++++++++
>   include/linux/pci.h   |   13 +++++++++++++
>   2 files changed, 54 insertions(+)
>
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -1025,6 +1025,47 @@ int pci_alloc_irq_vectors_affinity(struc
>   EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);
>   
>   /**
> + * pci_msix_expand_vectors_at - Expand MSI-X interrupts for a device
> + *
> + * @dev:	PCI device to operate on
> + * @at:		Allocate at MSI-X index. If @at == PCI_MSI_EXPAND_AUTO
> + *		the function expands automatically after the last
Not sure why some of these changes related to PCI_MSIX_EXPAND_AUTO and 
num_descs did not make it to the 'msi' branch.
Is this intentional?
> + *		active index.
> + * @nvec:	Number of vectors to allocate
> + *
> + * Expand the MSI-X vectors of a device after an initial enablement and
> + * allocation.
> + *
> + * Return: 0 if the allocation was successful, an error code otherwise.
> + */
> +int pci_msix_expand_vectors_at(struct pci_dev *dev, unsigned int at, unsigned int nvec)
> +{
> +	struct msi_device_data *md = dev->dev.msi.data;
> +	struct msi_range range = { .ndesc = nvec, };
> +	unsigned int max_vecs;
> +	int ret;
> +
> +	if (!pci_msi_enable || !dev || !dev->msix_enabled || !md)
> +		return -ENOTSUPP;
> +
> +	if (!pci_msi_domain_supports_expand(dev))
> +		return -ENOTSUPP;
> +
> +	max_vecs = pci_msix_vec_count(dev);
> +	if (!nvec || nvec > max_vecs)
> +		return -EINVAL;
> +
> +	range.first = at == PCI_MSIX_EXPAND_AUTO ? md->num_descs : at;
> +
> +	if (range.first >= max_vecs || nvec > max_vecs - range.first)
> +		return -ENOSPC;
> +
> +	ret = msix_setup_interrupts(dev, dev->msix_base, &range, NULL, NULL, true);
> +	return ret <= 0 ? ret : -ENOSPC;;
> +}
> +EXPORT_SYMBOL_GPL(pci_msix_expand_vectors_at);
> +
I am having trouble fully comprehending how this expansion scheme would
work.

For instance, say:
1. Driver requests for 5 vectors:
pci_enable_msix_range(dev, NULL, 5, 5)
=>num_descs = 5

2. Driver frees vectors at index 1,2:
range = {1, 2, 2};
pci_msi_teardown_msi_irqs(dev, range)
=>num_descs = 3; Current active vectors are at index: 0, 3, 4

3. Driver requests for 3 more vectors using the new API:
pci_msix_expand_vectors(dev, 3)
=>range.first = 3 => It will try to allocate index 3-5, but we already 
have 3,4 active?
Ideally, we would want indexes 1, 2 and 5 to be allocated for this request,
right?

Could you please let me know what I am missing?

With the 'range' approach, the issue is that we are trying to allocate
contiguous indexes. Perhaps we also need to check whether all the indexes in
the requested range are available and, if not, find a contiguous range large
enough to accommodate the request. But there will be fragmentation issues if
we choose to go this way...

I had a version of the dynamic MSI-X patch series (which never got sent 
out). For the expansion, I had the following:
pci_add_msix_irq_vector(pdev): On each invocation, add 1 MSI-X vector to 
the device and return the msi-x index assigned by the kernel (using a 
bitmap)
Correspondingly, pci_free_msix_irq_vector(pdev, irq) frees all the
allocated resources associated with the MSI-X interrupt with Linux IRQ
number 'irq'.
I had issues when trying to dynamically allocate more than 1 interrupt 
because I didn't have a clean way to communicate to the driver what 
indexes were assigned in the current allocation.

-Megha
>


* Re: [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
  2021-12-02  1:08   ` Dey, Megha
@ 2021-12-02 10:16     ` Thomas Gleixner
  2021-12-02 19:21       ` Raj, Ashok
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2021-12-02 10:16 UTC (permalink / raw)
  To: Dey, Megha, LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Ashok Raj, Michael Ellerman, Andrew Cooper,
	Juergen Gross, linux-pci, xen-devel

Megha,

On Wed, Dec 01 2021 at 17:08, Megha Dey wrote:
> On 11/26/2021 5:25 PM, Thomas Gleixner wrote:
>>   /**
>> + * pci_msix_expand_vectors_at - Expand MSI-X interrupts for a device
>> + *
>> + * @dev:	PCI device to operate on
>> + * @at:		Allocate at MSI-X index. If @at == PCI_MSI_EXPAND_AUTO
>> + *		the function expands automatically after the last
> Not sure why some of these changes related to PCI_MSIX_EXPAND_AUTO and 
> num_descs did not make it to the 'msi' branch.
> Is this intentional?

Yes, because I'm not happy about that magic.

>
> For instance, say:
> 1. Driver requests for 5 vectors:
> pci_enable_msix_range(dev, NULL, 5, 5)
> =>num_descs = 5

The driver should not use pci_enable_msix_range() in the first
place. But yes, it got 5 vectors now.

> 2. Driver frees vectors at index 1,2:
> range = {1, 2, 2};
> pci_msi_teardown_msi_irqs(dev, range)

That function is not accessible by drivers for a reason.

> =>num_descs = 3; Current active vectors are at index: 0, 3, 4

> 3. Driver requests for 3 more vectors using the new API:
> pci_msix_expand_vectors(dev, 3)
> =>range.first = 3 => It will try to allocate index 3-5, but we already 
> have 3,4 active?
> Ideally, we would want index 1,2 and 5 to be allocated for this request 
> right?
>
> Could you please let me know what I am missing?

You're missing the real world use case. The above is fiction.

If a driver would release 1 and 2 then it should explicitly reallocate
1 and 2 and not let the core decide to magically allocate something.

If the driver wants three more after freeing 1, 2 then the core could
just allocate 5, 6, 7, and would still fulfil the caller's request to
allocate three more, right?

And even if it just allocates one, then the caller still has to know the
index upfront. Why? Because it needs to know it in order to get the
Linux interrupt number via pci_irq_vector().

> Correspondingly, pci_free_msix_irq_vector(pdev, irq) frees all the 
> allocated resources associated with MSI-X interrupt with Linux IRQ 
> number 'irq'.
> I had issues when trying to dynamically allocate more than 1 interrupt 
> because I didn't have a clean way to communicate to the driver what 
> indexes were assigned in the current allocation.

If the driver is able to free a particular vector then it should know
exactly what it is doing and which index it is freeing. If it needs that
particular vector again, then it knows the index, right?

Let's look how MSI-X works in reality:

Each vector is associated to a particular function in the device. How
that association works is device dependent.

Some devices have hardwired associations, some allow the driver to
program the association in the device configuration and there is also a
combination of both.

So if the driver would free the vector for a particular functionality,
or not allocate it in the first place, then it exactly knows what it
freed and what it needs to allocate when it needs that functionality
(again).

What you are trying to create is a solution in search of a problem. You
cannot declare via a random allocation API how devices work. Neither can
you fix the VFIO issue in a sensible way.

VFIO starts with vector #0 allocated. The guest then unmasks vector #50.

With your magic interface VFIO has to allocate 49 vectors and then free
48 of them again or just keep 48 around for nothing which defeats the
purpose of on demand allocation completely.
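
With the explicit variant that is a single allocation; purely as an
illustration, with made-up VFIO-side names:

	if (!pci_msix_expand_vectors_at(pdev, 50, 1))
		vdev->host_irq[50] = pci_irq_vector(pdev, 50);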

Thanks,

        tglx







* Re: [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
  2021-12-02 10:16     ` Thomas Gleixner
@ 2021-12-02 19:21       ` Raj, Ashok
  2021-12-02 20:40         ` Thomas Gleixner
  0 siblings, 1 reply; 31+ messages in thread
From: Raj, Ashok @ 2021-12-02 19:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dey, Megha, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson,
	Kevin Tian, Jason Gunthorpe, Michael Ellerman, Andrew Cooper,
	Juergen Gross, linux-pci, xen-devel, Ashok Raj

Hi Thomas,

On Thu, Dec 02, 2021 at 11:16:39AM +0100, Thomas Gleixner wrote:
> Megha,
> 
> On Wed, Dec 01 2021 at 17:08, Megha Dey wrote:
> > On 11/26/2021 5:25 PM, Thomas Gleixner wrote:
> >>   /**
> >> + * pci_msix_expand_vectors_at - Expand MSI-X interrupts for a device
> >> + *
> >> + * @dev:	PCI device to operate on
> >> + * @at:		Allocate at MSI-X index. If @at == PCI_MSI_EXPAND_AUTO
> >> + *		the function expands automatically after the last
> > Not sure why some of these changes related to PCI_MSIX_EXPAND_AUTO and 
> > num_descs did not make it to the 'msi' branch.
> > Is this intentional?
> 
> Yes, because I'm not happy about that magic.
> 
> >
> > For instance, say:
> > 1. Driver requests for 5 vectors:
> > pci_enable_msix_range(dev, NULL, 5, 5)
> > =>num_descs = 5
> 
> Driver should not use with pci_enable_msix_range() in the first
> place. But yes, it got 5 vectors now.

Bad start with a deprecated interface :-). 

> 
> > 2. Driver frees vectors at index 1,2:
> > range = {1, 2, 2};
> > pci_msi_teardown_msi_irqs(dev, range)
> 
> That function is not accessible by drivers for a reason.
> 
> > =>num_descs = 3; Current active vectors are at index: 0, 3, 4
> 
> > 3. Driver requests for 3 more vectors using the new API:
> > pci_msix_expand_vectors(dev, 3)
> > =>range.first = 3 => It will try to allocate index 3-5, but we already 
> > have 3,4 active?
> > Ideally, we would want index 1,2 and 5 to be allocated for this request 
> > right?
> >
> > Could you please let me know what I am missing?
> 
> You're missing the real world use case. The above is fiction.

I don't think there is a valid use case for freeing specific vectors. It's
true some are special; IDXD has vector #0 like that. But I expect drivers to
acquire these special vectors once and never free them until driver
teardown time.

But there is a need to free on demand, for a subdevice constructed for idxd
pass-through, when the guest is torn down, host would need to free them.
Only growing on demand seems to only catch one part of the dynamic part.

IDXD also allocates interrupts only when the WQ is enabled, and frees when it's
disabled. 

> 
> If a driver would release 1 and 2 then it should explicitly reallocate
> 1 and 2 and not let the core decide to magically allocate something.
> 
> If the driver wants three more after freeing 1, 2 then the core could
> just allocate 5, 6, 7, and would still fulfil the callers request to
> allocate three more, right?

Since the core is already managing what's allocated and free, requiring
drivers to manage each allocated entry seems hard, while the core can
easily manage it. For the IDXD case, we don't really care which of the
IMS entries is being allocated and freed. It just wants one of the available
IMS entries. The assumption is that the driver would have acquired any special
ones upfront with alloc_irqs().


> 
> And even if it just allocates one, then the caller still has to know the
> index upfront. Why? Because it needs to know it in order to get the
> Linux interrupt number via pci_irq_vector().

If we were to allocate one, the new API can simply return the index
directly to the caller, and they call pci_irq_vector() to get the IRQ
number.
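
Something like this sketch, where pci_msix_alloc_vector_any() is just a
made-up name for this discussion and my_dev_handler() is a placeholder:

/* Hypothetical calling convention: the allocator picks a free index and
 * returns it (or a negative errno), the driver maps it to the irq. */
static int my_dev_add_one_vector(struct pci_dev *pdev)
{
        int index, irq;

        index = pci_msix_alloc_vector_any(pdev);
        if (index < 0)
                return index;

        irq = pci_irq_vector(pdev, index);      /* index -> Linux irq number */
        if (irq < 0)
                return irq;

        return request_irq(irq, my_dev_handler, 0, "my_dev", pdev);
}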

> 
> > Correspondingly, pci_free_msix_irq_vector(pdev, irq) frees all the 
> > allocated resources associated with MSI-X interrupt with Linux IRQ 
> > number 'irq'.
> > I had issues when trying to dynamically allocate more than 1 interrupt 
> > because I didn't have a clean way to communicate to the driver what 
> > indexes were assigned in the current allocation.
> 
> If the driver is able to free a particular vector then it should exactly
> know what it is doing and which index it is freeing. If it needs that
> particular vector again, then it knows the index, right?
> 
> Let's look how MSI-X works in reality:
> 
> Each vector is associated to a particular function in the device. How
> that association works is device dependent.
> 
> Some devices have hardwired associations, some allow the driver to
> program the association in the device configuration and there is also a
> combination of both.
> 
> So if the driver would free the vector for a particular functionality,
> or not allocate it in the first place, then it exactly knows what it
> freed and what it needs to allocate when it needs that functionality
> (again).

It doesn't *have* to be that all vectors are special. Some of them are
special in that they are all acquired during driver load time. These are allocated
once and never freed. The rest are for say completion interrupts or such and 
such that go with work queues. These can dynamically be allocated and
freed.

The driver doesn't really care which index it wants or what the next index
should be. But it has to remember the allocated ones so it can pass down
for the free. Maybe the one we did a while back

https://lore.kernel.org/lkml/1561162778-12669-1-git-send-email-megha.dey@linux.intel.com/

This has a group handle, and kept adding things to it.

> 
> What you are trying to create is a solution in search of a problem. You
> cannot declare via a random allocation API how devices work. You neither
> can fix the VFIO issue in a sensible way.
> 
> VFIO starts with vector #0 allocated. The guest then unmasks vector #50.
> 
> With your magic interface VFIO has to allocate 49 vectors and then free
> 48 of them again or just keep 48 around for nothing which defeats the
> purpose of on demand allocation completely.

This use case is broken already, the VFIO case sort of assumes things are
growing in sequence. Today it doesn't have a hint on which entry is being
unmasked I suppose. So VFIO simply releases everything, adds N more than
currently allocated. 

If there is a real world need for allocating a
specific vector #50, maybe we should add an alloc_exact() type and the core
can check whether #50 is still available.

Maybe for MSIx we don't have a need to shrink based on current usage. IMS
requires both grow and shrink. But it might be odd to have 2 domains behave
quite differently.

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
  2021-12-02 19:21       ` Raj, Ashok
@ 2021-12-02 20:40         ` Thomas Gleixner
  2021-12-03  0:45           ` Raj, Ashok
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2021-12-02 20:40 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Dey, Megha, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson,
	Kevin Tian, Jason Gunthorpe, Michael Ellerman, Andrew Cooper,
	Juergen Gross, linux-pci, xen-devel, Ashok Raj

Ashok,

On Thu, Dec 02 2021 at 11:21, Ashok Raj wrote:
> On Thu, Dec 02, 2021 at 11:16:39AM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 17:08, Megha Dey wrote:
>> You're missing the real world use case. The above is fiction.
>
> I don't think there is a valid use case for freeing specific vectors. It's
> true some are special, IDXD has vector#0 like that. But I expect drivers to
> acquire these special vectors  once and never free them until driver 
> tear down time.
>
> But there is a need to free on demand, for a subdevice constructed for idxd
> pass-through, when the guest is torn down, host would need to free them.
> Only growing on demand seems to only catch one part of the dynamic part.
>
> IDXD also allocates interrupts only when the WQ is enabled, and frees when it's
> disabled.

You're talking about IMS not MSI-X here, right? IMS cannot be allocated
via the PCI/MSI interfaces as we established long ago.

And if you are talking about the 8 MSI-X interrupts for IDXD then I
really do not see the point of ever releasing it.

>> If a driver would release 1 and 2 then it should explicitly reallocate
>> 1 and 2 and not let the core decide to magically allocate something.
>> 
>> If the driver wants three more after freeing 1, 2 then the core could
>> just allocate 5, 6, 7, and would still fulfil the callers request to
>> allocate three more, right?
>
> Since the core is already managing what's allocated and free, requiring
> drivers to manage each allocated entry seems hard, while the core can
> easily manage it. For the IDXD case, we don't really care which of the
> IMS entries is being allocated and freed. It just wants one of the available
> IMS entries. The assumption is that the driver would have acquired any special
> ones upfront with alloc_irqs().

For MSI-X the free vector use case does not exist today and even if it
would exist the driver has to know about the index.

If the index -> function association is hard wired, it needs to know it
obviously.

If it's not hardwired then it still needs to know the resulting index,
because it has to program that index into a device function register so
that the device knows which entry to use.

IMS is not any different. You need to know the index in order to
associate it to the queue, no? And you need the index in order to figure
out the Linux irq number.

But again, that's not a problem of this very API because this API is
about PCI/MSI and not about IMS.

>> And even if it just allocates one, then the caller still has to know the
>> index upfront. Why? Because it needs to know it in order to get the
>> Linux interrupt number via pci_irq_vector().
>
> If we were to allocate one, the new API can simply return the index
> directly to the caller, and they call pci_irq_vector() to get the IRQ
> number.

That can work, but then we need both variants:

     pci_msix_alloc_vector_at() and pci_msix_alloc_vector_any()

Why?

Because pci_msix_alloc_vector_any() cannot solve the VFIO on demand
allocation problem and it cannot be used to replace the sparse
allocations which are done via pci_enable_msix_exact() today.

If there is an MSI-X use case to allocate any vector then we can
implement that. If there is none, then we don't need it, right?
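
As a sketch (using the _at() name floated above, which does not exist yet,
and leaving out all the VFIO specific plumbing), the on demand path for a
trapped unmask would boil down to:

/*
 * Sketch only: allocate exactly the entry the guest unmasked.
 * vfio_msix_handler() and the error unwinding are omitted details.
 */
static int vfio_msix_enable_one(struct pci_dev *pdev, unsigned int index)
{
        int ret, irq;

        ret = pci_msix_alloc_vector_at(pdev, index);
        if (ret)
                return ret;

        irq = pci_irq_vector(pdev, index);
        if (irq < 0)
                return irq;

        return request_irq(irq, vfio_msix_handler, 0, "vfio-msix", pdev);
}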

>> So if the driver would free the vector for a particular functionality,
>> or not allocate it in the first place, then it exactly knows what it
>> freed and what it needs to allocate when it needs that functionality
>> (again).
>
> It doesn't *have* to be that all vectors are special. Some of them are
> special in that they are all acquired during driver load time. These are allocated
> once and never freed. The rest are for say completion interrupts or such and 
> such that go with work queues. These can dynamically be allocated and
> freed.
>
> The driver doesn't really care which index it wants or what the next index
> should be. But it has to remember the allocated ones so it can pass down
> for the free. Maybe the one we did a while back
>
> https://lore.kernel.org/lkml/1561162778-12669-1-git-send-email-megha.dey@linux.intel.com/
>
> This has a group handle, and kept adding things to it.

Was it really necessary to bring those memories back?

If we want groups, then surely not with these kinds of hacks. I still
need to see the use case for the groups. The discussion back then just
provided handwaving about internal requests which never materialized.

But talking about groups. That's very similar to the other discussion
vs. storing the IMS entries for these sliced devices, queues or
whatever. That's at least a use case.

>> What you are trying to create is a solution in search of a problem. You
>> cannot declare via a random allocation API how devices work. You neither
>> can fix the VFIO issue in a sensible way.
>> 
>> VFIO starts with vector #0 allocated. The guest then unmasks vector #50.
>> 
>> With your magic interface VFIO has to allocate 49 vectors and then free
>> 48 of them again or just keep 48 around for nothing which defeats the
>> purpose of on demand allocation completely.
>
> This use case is broken already, the VFIO case sort of assumes things are
> growing in sequence. Today it doesn't have a hint on which entry is being
> unmasked I suppose. So VFIO simply releases everything, adds N more than
> currently allocated.

VFIO exactly knows which entry is unmasked simply because the write into
the MSI-X table in the device config space is trapped so it knows
exactly which entry is unmasked, no? Guess how VFIO knows about $N more?

> Maybe for MSIx we don't have a need to shrink based on current usage. IMS
> requires both grow and shrink. But it might be odd to have 2 domains behave
> quite differently.

We are not implementing the full MSI[X] zoo for IMS either. So the
interfaces are different in the first place.

Making them artificially uniform is a horrible idea.

They are two different things, really. The only thing they have in common
is that at the end of the day the device sends a message over the bus
and they happen to share the underlying MSI code infrastructure.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
  2021-12-02 20:40         ` Thomas Gleixner
@ 2021-12-03  0:45           ` Raj, Ashok
  2021-12-03 12:29             ` Thomas Gleixner
  0 siblings, 1 reply; 31+ messages in thread
From: Raj, Ashok @ 2021-12-03  0:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dey, Megha, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson,
	Kevin Tian, Jason Gunthorpe, Michael Ellerman, Andrew Cooper,
	Juergen Gross, linux-pci, xen-devel, Ashok Raj

Hi Thomas

On Thu, Dec 02, 2021 at 09:40:08PM +0100, Thomas Gleixner wrote:
> Ashok,
> 
> On Thu, Dec 02 2021 at 11:21, Ashok Raj wrote:
> > On Thu, Dec 02, 2021 at 11:16:39AM +0100, Thomas Gleixner wrote:
> >> On Wed, Dec 01 2021 at 17:08, Megha Dey wrote:
> >> You're missing the real world use case. The above is fiction.
> >
> > I don't think there is a valid use case for freeing specific vectors. It's
> > true some are special, IDXD has vector#0 like that. But I expect drivers to
> > acquire these special vectors  once and never free them until driver 
> > tear down time.
> >
> > But there is a need to free on demand, for a subdevice constructed for idxd
> > pass-through, when the guest is torn down, host would need to free them.
> > Only growing on demand seems to only catch one part of the dynamic part.
> >
> > IDXD also allocates interrupts only when the WQ is enabled, and frees when it's
> > disabled.
> 
> You're talking about IMS not MSI-X here, right? IMS cannot be allocated
> via the PCI/MSI interfaces as we established long ago.
> 
> And if you are talking about the 8 MSI-X interrupts for IDXD then I
> really do not see the point of ever releasing it.

Not worried about MSI-x for IDXD :), I assumed the purpose of this exercise
was about 2 things.

- Fix the VFIO mask/unmask weirdness ending up disable, reenable with more
  interrupts. 
  - We are only fixing the case by not calling the disable_msi, but just
    growing on demand.

- Use this as a case to build IMS. But if we treat MSIx and IMS
  differently, IMS would be a bit different in how the dynamic parts work.

Although there is no real need for MSIx being dynamic except to avoid host
vector exhaustion, do you think we could still allocate specific entries?
Since unmask is per-vector, is there benefit to doing just that vs
allocating current+N?

> 
> >> If a driver would release 1 and 2 then it should explicitly reallocate
> >> 1 and 2 and not let the core decide to magically allocate something.
> >> 
> >> If the driver wants three more after freeing 1, 2 then the core could
> >> just allocate 5, 6, 7, and would still fulfil the callers request to
> >> allocate three more, right?
> >
> > Since the core is already managing what's allocated and free, requiring
> > drivers to manage each allocated entry seems hard, while the core can
> > easily manage it. For the IDXD case, we don't really care which of the
> > IMS entries is being allocated and freed. It just wants one of the available
> > IMS entries. The assumption is that the driver would have acquired any special
> > ones upfront with alloc_irqs().
> 
> For MSI-X the free vector use case does not exist today and even if it
> would exist the driver has to know about the index.
> 
> If the index -> function association is hard wired, it needs to know it
> obviously.
> 
> If it's not hardwired then it still needs to know the resulting index,
> because it has to program that index into a device function register so
> that the device knows which entry to use.
> 
> IMS is not any different. You need to know the index in order to
> associate it to the queue, no? And you need the index in order to figure
> out the Linux irq number.
> 
> But again, that's not a problem of this very API because this API is
> about PCI/MSI and not about IMS.

Fair enough. The thought was that even though MSIx doesn't require that, the
implementations could be consistent if we aren't breaking MSIx.

But as you said, they don't have to be the same and can differ in how they
are implemented.


> 
> >> And even if it just allocates one, then the caller still has to know the
> >> index upfront. Why? Because it needs to know it in order to get the
> >> Linux interrupt number via pci_irq_vector().
> >
> > If we were to allocate one, the new API can simply return the index
> > directly to the caller, and they call pci_irq_vector() to get the IRQ
> > number.
> 
> That can work, but then we need both variants:
> 
>      pci_msix_alloc_vector_at() and pci_msix_alloc_vector_any()
> 
> Why?
> 
> Because pci_msix_alloc_vector_any() cannot solve the VFIO on demand
> allocation problem and it cannot be used to replace the sparse
> allocations which are done via pci_enable_msix_exact() today.
> 
> If there is an MSI-X use case to allocate any vector then we can
> implement that. If there is none, then we don't need it, right?

agreed.

> 
> >> So if the driver would free the vector for a particular functionality,
> >> or not allocate it in the first place, then it exactly knows what it
> >> freed and what it needs to allocate when it needs that functionality
> >> (again).
> >
> > It doesn't *have* to be that all vectors are special. Some of them are
> > special in that they are all acquired during driver load time. These are allocated
> > once and never freed. The rest are for say completion interrupts or such and 
> > such that go with work queues. These can dynamically be allocated and
> > freed.
> >
> > The driver doesn't really care which index it wants or what the next index
> > should be. But it has to remember the allocated ones so it can pass down
> > for the free. Maybe the one we did a while back
> >
> > https://lore.kernel.org/lkml/1561162778-12669-1-git-send-email-megha.dey@linux.intel.com/
> >
> > This has a group handle, and kept adding things to it.
> 
> Was it really necessary to bring those memories back?

:-)

> 
> If we want groups, then surely not with these kinds of hacks. I still
> need to see the use case for the groups. The discussion back then just
> provided handwaving about internal requests which never materialized.

true, we didn't hear back from the groups that asked for them.
> 
> But talking about groups. That's very similar to the other discussion
> vs. storing the IMS entries for these sliced devices, queues or
> whatever. That's at least a use case.

Correct.

> 
> >> What you are trying to create is a solution in search of a problem. You
> >> cannot declare via a random allocation API how devices work. You neither
> >> can fix the VFIO issue in a sensible way.
> >> 
> >> VFIO starts with vector #0 allocated. The guest then unmasks vector #50.
> >> 
> >> With your magic interface VFIO has to allocate 49 vectors and then free
> >> 48 of them again or just keep 48 around for nothing which defeats the
> >> purpose of on demand allocation completely.
> >
> > This use case is broken already, the VFIO case sort of assumes things are
> > growing in sequence. Today it doesn't have a hint on which entry is being
> > unmasked I suppose. So VFIO simply releases everything, adds N more than
> > currently allocated.
> 
> VFIO exactly knows which entry is unmasked simply because the write into
> the MSI-X table in the device config space is trapped so it knows
> exactly which entry is unmasked, no? Guess how VFIO knows about $N more?

bah.. i missed that little fact.

When VFIO knows exactly which entry is being unmasked, is it enough to just
allocate exactly one, or do we need to add all N? I didn't see why we need to
grow by N additional vectors instead of only allocating 1 for the entry
being unmasked?

> 
> > Maybe for MSIx we don't have a need to shrink based on current usage. IMS
> > requires both grow and shrink. But it might be odd to have 2 domains behave
> > quite differently.
> 
> We are not implementing the full MSI[X] zoo for IMS either. So the
> interfaces are different in the first place.

The ones that actually differ between MSIx and IMS are:

- Discovery of where to find the message storage, unlike the PCIe standard
  mechanism. (Although in the initial implementation we have forced a common
  way to manage this, I think people already hinted this isn't going to work
  for different vendors or even between gen1/gen2 devices.)
- Managing the location of the addr/data write, mask/unmask etc.
- Might have other attributes such as PASID etc. for the IDXD case exposed
  to guest/user space.


Are there other differences between them?

> 
> Making them artificially uniform is a horrible idea.

Totally!

> 
> They are two different things, really. The only thing they have in common
> is that at the end of the day the device sends a message over the bus
> and they happen to share the underlying MSI code infrastructure.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
  2021-12-03  0:45           ` Raj, Ashok
@ 2021-12-03 12:29             ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2021-12-03 12:29 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Dey, Megha, LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson,
	Kevin Tian, Jason Gunthorpe, Michael Ellerman, Andrew Cooper,
	Juergen Gross, linux-pci, xen-devel, Ashok Raj

Ashok,

On Thu, Dec 02 2021 at 16:45, Ashok Raj wrote:
> On Thu, Dec 02, 2021 at 09:40:08PM +0100, Thomas Gleixner wrote:
> Not worried about MSI-x for IDXD :), I assumed the purpose of this exercise
> was about 2 things.
>
> - Fix the VFIO mask/unmask weirdness ending up disable, reenable with more
>   interrupts. 
>   - We are only fixing the case by not calling the disable_msi, but just
>     growing on demand.
>
> - Use this as a case to build IMS. But if we treat MSIx and IMS
>   differently, IMS would be a bit different in how the dynamic parts
>   work.

Conceptually they are not different, really. You are mixing concepts
and implementation details.

> Although there is no real need for MSIx being dynamic except to avoid host
> vector exhaustion, do you think we could still allocate specific entries?
> Since unmask is per-vector, is there benefit to doing just that vs
> allocating current+N?

This is either a rhetorical question or a trick question, right?

>> VFIO exactly knows which entry is unmasked simply because the write into
>> the MSI-X table in the device config space is trapped so it knows
>> exactly which entry is unmasked, no? Guess how VFIO knows about $N more?
>
> bah.. i missed that little fact.
>
> When VFIO knows exactly which entry is being unmasked, is it enough to just
> allocate exactly one, or do we need to add all N? I didn't see why we need to
> grow by N additional vectors instead of only allocating 1 for the entry
> being unmasked?

That's exactly the point. The current implementation does so, because
the PCI/MSI infrastructure does not provide a mechanism to allocate a
particular vector post init.
 
>> > Maybe for MSIx we don't have a need to shrink based on current usage. IMS
>> > requires both grow and shrink. But it might be odd to have 2 domains behave
>> > quite differently.
>> 
>> We are not implementing the full MSI[X] zoo for IMS either. So the
>> interfaces are different in the first place.
>
> The ones that actually differ between MSIx and IMS are:
>
> - Discovery of where to find the message storage, unlike the PCIe standard
>   mechanism. (Although in the initial implementation we have forced a common
>   way to manage this, I think people already hinted this isn't going to work
>   for different vendors or even between gen1/gen2 devices.)
> - Managing the location of the addr/data write, mask/unmask etc.
> - Might have other attributes such as PASID etc. for the IDXD case exposed
>   to guest/user space.

You are looking at this from the wrong direction, i.e. top down. Why?

Because if you look at it from bottom up, then you'll see the following:

  The base of all this is a function block which translates a write of
  message data to the message address into an interrupt raised in a
  CPU.

  The interrupt remap unit is just another translator which converts the
  write to the remap table into a write to the CPU function block, if
  the encoded protections are valid.

From a device perspective the message address and the message data are
completely opaque and all the device knows about them is that it has to
write message.data to message.address in order to raise an interrupt in
some CPU.

Of course the device needs to have some sort of storage where the OS can
save the composed message data and the message address so that the
device itself can access it when it wants to raise an interrupt.

IOW. The message storage is device specific.

For IO/APIC the message is saved in the per pin routing entry.

HPET has its own routing entry.

PCI/MSI provides standardized storage for the messages.

IMS allows the device to define where the message is stored. That's a
fundamentally new concept, right?

No. It is not. All of the above are already IMS implementations. And
each implementation has their own quirks and oddities which is why
we end up with different irqdomains and irqchips.

If you look at other architectures there are various other flavours of
devices which have their own device specific message store, IOW they
all are device specific IMS flavours.

They all have two things in common:

  - They provide storage for messages
  - To raise an interrupt they write message.data to message.address

So IMS is conceptually nothing new. It's just a marketing brand name for
something which has existed since the invention of message based interrupts,
for obvious reasons.
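
In code terms the device specific part is nothing more than where the
composed message ends up. For a made-up device with its message slots in
MMIO (names, layout and masking details invented or omitted) the whole
thing is a trivial write callback:

/*
 * Sketch only: the irq_write_msi_msg() callback of a fictional "foo"
 * device. All it does is store message.address/message.data where the
 * device can reach them when it wants to raise the interrupt.
 */
static void foo_ims_write_msg(struct irq_data *data, struct msi_msg *msg)
{
        struct foo_dev *foo = irq_data_get_irq_chip_data(data);
        void __iomem *slot = foo->ims_slots + data->hwirq * FOO_IMS_SLOT_SIZE;

        writel(msg->address_lo, slot + FOO_IMS_ADDR_LO);
        writel(msg->address_hi, slot + FOO_IMS_ADDR_HI);
        writel(msg->data,       slot + FOO_IMS_DATA);
}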

What's different for all of the above variants is the way it is
exposed to the devices. Wired interrupts which end up at the IO/APIC pin
obviously cannot expose the underlying message mechanism because the
device cannot make use of it. And a device cannot allocate a pin either
because it obviously cannot rewire the PCB.

For PCI/MSI we have a PCI/MSI specific interface which is aware of the
PCI/MSI way to store the messages and to deal with the quirks and
limitations of PCI/MSI.

For IMS we surely will model an interface which handles all IMS variants
in a uniform way too, but that interface will be different from PCI/MSI
because it does not need any of the PCI/MSI quirks at all.

That interface will look more like the underlying msi irqdomain
interface simply because pretending it is PCI specific is an artificial
exercise.

There is nothing PCI specific about it. The only connection which needs
to be made is through msi_desc::dev and perhaps some meta info so that
the IOMMU can figure out from which PCI device this message will
originate.

Adding a pci_foo() wrapper around it which relieves the programmer from
writing &pdev->dev and filling in some meta info is just an obvious
convenience add-on.

See?

That interface is really the least of all problems.
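
Roughly, all such a wrapper would ever amount to is something like the
sketch below. Neither function exists today; the names are invented purely
to show how thin the PCI specific part is:

static inline int pci_ims_alloc_irq_at(struct pci_dev *pdev, unsigned int index)
{
        /* The PCI part contributes &pdev->dev plus whatever meta info
         * the IOMMU needs; everything else is generic MSI domain code. */
        return msi_device_domain_alloc_irq_at(&pdev->dev, index);
}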

We need to settle the other, way more important question of how to
store/manage MSI descriptors and how to handle the per device IMS
irqdomain first.

You can handwave about the interface as long as you want. It won't
materialize before the underlying infrastructure is sorted out.

Once that is done then the interface is mostly defined by that
infrastructure and writing it up is not going to be rocket science.

See?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2021-12-03 12:29 UTC | newest]

Thread overview: 31+ messages
2021-11-27  1:24 [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
2021-11-27  1:24 ` [patch 01/10] genirq/msi: Add range argument to alloc/free MSI domain ops Thomas Gleixner
2021-11-27  1:24   ` Thomas Gleixner
2021-11-27  1:24 ` [patch 02/10] genirq/msi: Add range argument to msi_domain_alloc/free_descs_locked() Thomas Gleixner
2021-11-27  1:25   ` Thomas Gleixner
2021-11-27  1:24 ` [patch 03/10] genirq/msi: Make MSI descriptor alloc/free ready for range allocations Thomas Gleixner
2021-11-27  1:25   ` Thomas Gleixner
2021-11-28 15:57   ` Marc Zyngier
2021-11-28 19:17     ` Thomas Gleixner
2021-11-29 17:28       ` Thomas Gleixner
2021-11-27  1:24 ` [patch 05/10] genirq/msi: Add domain info flag MSI_FLAG_CAN_EXPAND Thomas Gleixner
2021-11-27  1:25   ` Thomas Gleixner
2021-11-27  1:24 ` [patch 06/10] PCI/MSI: Use range in allocation path Thomas Gleixner
2021-11-27  1:25   ` Thomas Gleixner
2021-11-27  1:24 ` [patch 08/10] PCI/MSI: Provide pci_msi_domain_supports_expand() Thomas Gleixner
2021-11-27  1:25   ` Thomas Gleixner
2021-11-27  1:24 ` [patch 10/10] x86/apic/msi: Support MSI-X vector expansion Thomas Gleixner
2021-11-27  1:25   ` Thomas Gleixner
2021-11-27  1:24 ` [patch 00/10] genirq/msi, PCI/MSI: Support for dynamic MSI-X vector expansion - Part 4 Thomas Gleixner
2021-11-27  1:25 ` [patch 04/10] genirq/msi: Prepare MSI domain alloc/free for range irq allocation Thomas Gleixner
2021-11-27  1:24   ` Thomas Gleixner
2021-11-27  1:25 ` [patch 07/10] PCI/MSI: Make free related functions range based Thomas Gleixner
2021-11-27  1:24   ` Thomas Gleixner
2021-11-27  1:25 ` [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]() Thomas Gleixner
2021-11-27  1:24   ` Thomas Gleixner
2021-12-02  1:08   ` Dey, Megha
2021-12-02 10:16     ` Thomas Gleixner
2021-12-02 19:21       ` Raj, Ashok
2021-12-02 20:40         ` Thomas Gleixner
2021-12-03  0:45           ` Raj, Ashok
2021-12-03 12:29             ` Thomas Gleixner
