ntb.lists.linux.dev archive mirror
* [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 3
@ 2021-11-27  1:23 Thomas Gleixner
  2021-11-27  1:23 ` [patch 01/32] genirq/msi: Move descriptor list to struct msi_device_data Thomas Gleixner
                   ` (31 more replies)
  0 siblings, 32 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

This is the third part of the [PCI]MSI refactoring, which aims to provide
the ability to expand MSI-X vectors after MSI-X has been enabled.

The first two parts of this work can be found here:

    https://lore.kernel.org/r/20211126222700.862407977@linutronix.de
    https://lore.kernel.org/r/20211126224100.303046749@linutronix.de

This third part has the following important changes:

   1) Add locking to protect the MSI descriptor storage

      Right now the MSI descriptor storage (a linked list) is not protected
      by anything, under the assumption that the list is installed before
      use and destroyed after use. As this is about to change, the storage
      has to be protected.

   2) A new set of iterators which allow filtering on the state of the
      descriptors, namely whether a descriptor is associated with a Linux
      interrupt or not.

      This cleans up a lot of use cases which have to do this filtering
      manually.

   3) A new set of MSI descriptor allocation functions which make the usage
      sites simpler and confine the storage handling to the core code.

      Trivial (non-PCI) MSI descriptors are now allocated by the core code
      automatically when the underlying irq domain requests that.

   4) Rework of sysfs handling to prepare for dynamic extension of MSI-X

      The current mechanism which creates the directory and the attributes
      for all MSI descriptors in one go is obviously not suitable for
      dynamic extension. The rework splits the directory creation out and
      lets the MSI interrupt allocation create the per descriptor
      attributes.

   5) Conversion of the MSI descriptor storage to xarray

      The linked-list based storage is suboptimal even without dynamic
      expansion, as it requires a full list walk to get to a specific
      descriptor. With dynamic expansion this gets even more
      convoluted. An xarray is way more suitable and simplifies the
      final goal of dynamic expansion of the MSI-X space.

The series is based on:

     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v1-part-2

and also available from git:

     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v1-part-3

For the curious who can't wait for the next part to arrive, the full series
is available via:

     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v1-part-4

Thanks,

	tglx
---
 .clang-format                          |    1 
 arch/powerpc/platforms/4xx/hsta_msi.c  |    7 
 arch/powerpc/platforms/cell/axon_msi.c |    7 
 arch/powerpc/platforms/pasemi/msi.c    |    9 
 arch/powerpc/sysdev/fsl_msi.c          |    8 
 arch/powerpc/sysdev/mpic_u3msi.c       |    9 
 arch/s390/pci/pci_irq.c                |    6 
 arch/x86/pci/xen.c                     |   14 
 drivers/base/core.c                    |    3 
 drivers/base/platform-msi.c            |  110 -----
 drivers/bus/fsl-mc/fsl-mc-msi.c        |   61 --
 drivers/ntb/msi.c                      |   19 
 drivers/pci/controller/pci-hyperv.c    |   15 
 drivers/pci/msi/irqdomain.c            |   11 
 drivers/pci/msi/legacy.c               |   20 
 drivers/pci/msi/msi.c                  |  255 +++++------
 drivers/pci/xen-pcifront.c             |    2 
 drivers/soc/ti/ti_sci_inta_msi.c       |   77 +--
 include/linux/device.h                 |    4 
 include/linux/msi.h                    |  135 +++++-
 include/linux/soc/ti/ti_sci_inta_msi.h |    1 
 kernel/irq/msi.c                       |  719 ++++++++++++++++++++++-----------
 22 files changed, 841 insertions(+), 652 deletions(-)




* [patch 01/32] genirq/msi: Move descriptor list to struct msi_device_data
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 3 Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27 12:19   ` Greg Kroah-Hartman
  2021-11-27  1:23 ` [patch 02/32] genirq/msi: Add mutex for MSI list protection Thomas Gleixner
                   ` (30 subsequent siblings)
  31 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

The MSI descriptor list is only required when MSI is in use, so move it
from struct device into struct msi_device_data, which is allocated on
demand by msi_setup_device_data().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/base/core.c    |    3 ---
 include/linux/device.h |    4 ----
 include/linux/msi.h    |    4 +++-
 kernel/irq/msi.c       |    5 ++++-
 4 files changed, 7 insertions(+), 9 deletions(-)

--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2874,9 +2874,6 @@ void device_initialize(struct device *de
 	INIT_LIST_HEAD(&dev->devres_head);
 	device_pm_init(dev);
 	set_dev_node(dev, NUMA_NO_NODE);
-#ifdef CONFIG_GENERIC_MSI_IRQ
-	INIT_LIST_HEAD(&dev->msi_list);
-#endif
 	INIT_LIST_HEAD(&dev->links.consumers);
 	INIT_LIST_HEAD(&dev->links.suppliers);
 	INIT_LIST_HEAD(&dev->links.defer_sync);
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -422,7 +422,6 @@ struct dev_msi_info {
  * @em_pd:	device's energy model performance domain
  * @pins:	For device pin management.
  *		See Documentation/driver-api/pin-control.rst for details.
- * @msi_list:	Hosts MSI descriptors
  * @numa_node:	NUMA node this device is close to.
  * @dma_ops:    DMA mapping operations for this device.
  * @dma_mask:	Dma mask (if dma'ble device).
@@ -518,9 +517,6 @@ struct device {
 	struct dev_pin_info	*pins;
 #endif
 	struct dev_msi_info	msi;
-#ifdef CONFIG_GENERIC_MSI_IRQ
-	struct list_head	msi_list;
-#endif
 #ifdef CONFIG_DMA_OPS
 	const struct dma_map_ops *dma_ops;
 #endif
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -145,12 +145,14 @@ struct msi_desc {
  * @properties:		MSI properties which are interesting to drivers
  * @attrs:		Pointer to the sysfs attribute group
  * @platform_data:	Platform-MSI specific data
+ * @list:		List of MSI descriptors associated to the device
  */
 struct msi_device_data {
 	raw_spinlock_t			lock;
 	unsigned long			properties;
 	const struct attribute_group    **attrs;
 	struct platform_msi_priv_data	*platform_data;
+	struct list_head		list;
 };
 
 int msi_setup_device_data(struct device *dev);
@@ -187,7 +189,7 @@ static inline unsigned int msi_get_virq(
 
 /* Helpers to hide struct msi_desc implementation details */
 #define msi_desc_to_dev(desc)		((desc)->dev)
-#define dev_to_msi_list(dev)		(&(dev)->msi_list)
+#define dev_to_msi_list(dev)		(&(dev)->msi.data->list)
 #define first_msi_entry(dev)		\
 	list_first_entry(dev_to_msi_list((dev)), struct msi_desc, list)
 #define for_each_msi_entry(desc, dev)	\
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -87,7 +87,9 @@ EXPORT_SYMBOL_GPL(get_cached_msi_msg);
 
 static void msi_device_data_release(struct device *dev, void *res)
 {
-	WARN_ON_ONCE(!list_empty(&dev->msi_list));
+	struct msi_device_data *md = res;
+
+	WARN_ON_ONCE(!list_empty(&md->list));
 	dev->msi.data = NULL;
 }
 
@@ -113,6 +115,7 @@ int msi_setup_device_data(struct device
 		return -ENOMEM;
 
 	raw_spin_lock_init(&md->lock);
+	INIT_LIST_HEAD(&md->list);
 	dev->msi.data = md;
 	devres_add(dev, md);
 	return 0;



* [patch 02/32] genirq/msi: Add mutex for MSI list protection
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 3 Thomas Gleixner
  2021-11-27  1:23 ` [patch 01/32] genirq/msi: Move descriptor list to struct msi_device_data Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:23 ` [patch 03/32] genirq/msi: Provide msi_domain_alloc/free_irqs_descs_locked() Thomas Gleixner
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

For upcoming runtime extensions of MSI-X interrupts it's required to
protect the MSI descriptor list. Add a mutex to struct msi_device_data and
provide lock/unlock functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |    6 ++++++
 kernel/irq/msi.c    |   25 +++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -3,6 +3,7 @@
 #define LINUX_MSI_H
 
 #include <linux/spinlock.h>
+#include <linux/mutex.h>
 #include <linux/list.h>
 #include <linux/bits.h>
 #include <asm/msi.h>
@@ -146,6 +147,7 @@ struct msi_desc {
  * @attrs:		Pointer to the sysfs attribute group
  * @platform_data:	Platform-MSI specific data
  * @list:		List of MSI descriptors associated to the device
+ * @mutex:		Mutex protecting the MSI list
  */
 struct msi_device_data {
 	raw_spinlock_t			lock;
@@ -153,6 +155,7 @@ struct msi_device_data {
 	const struct attribute_group    **attrs;
 	struct platform_msi_priv_data	*platform_data;
 	struct list_head		list;
+	struct mutex			mutex;
 };
 
 int msi_setup_device_data(struct device *dev);
@@ -187,6 +190,9 @@ static inline unsigned int msi_get_virq(
 	return ret < 0 ? 0 : ret;
 }
 
+void msi_lock_descs(struct device *dev);
+void msi_unlock_descs(struct device *dev);
+
 /* Helpers to hide struct msi_desc implementation details */
 #define msi_desc_to_dev(desc)		((desc)->dev)
 #define dev_to_msi_list(dev)		(&(dev)->msi.data->list)
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -116,12 +116,37 @@ int msi_setup_device_data(struct device
 
 	raw_spin_lock_init(&md->lock);
 	INIT_LIST_HEAD(&md->list);
+	mutex_init(&md->mutex);
 	dev->msi.data = md;
 	devres_add(dev, md);
 	return 0;
 }
 
 /**
+ * msi_lock_descs - Lock the MSI descriptor storage of a device
+ * @dev:	Device to operate on
+ */
+void msi_lock_descs(struct device *dev)
+{
+	if (WARN_ON_ONCE(!dev->msi.data))
+		return;
+	mutex_lock(&dev->msi.data->mutex);
+}
+EXPORT_SYMBOL_GPL(msi_lock_descs);
+
+/**
+ * msi_unlock_descs - Unlock the MSI descriptor storage of a device
+ * @dev:	Device to operate on
+ */
+void msi_unlock_descs(struct device *dev)
+{
+	if (WARN_ON_ONCE(!dev->msi.data))
+		return;
+	mutex_unlock(&dev->msi.data->mutex);
+}
+EXPORT_SYMBOL_GPL(msi_unlock_descs);
+
+/**
  * __msi_get_virq - Return Linux interrupt number of a MSI interrupt
  * @dev:	Device to operate on
  * @index:	MSI interrupt index to look for (0-based)



* [patch 03/32] genirq/msi: Provide msi_domain_alloc/free_irqs_descs_locked()
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 3 Thomas Gleixner
  2021-11-27  1:23 ` [patch 01/32] genirq/msi: Move descriptor list to struct msi_device_data Thomas Gleixner
  2021-11-27  1:23 ` [patch 02/32] genirq/msi: Add mutex for MSI list protection Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:23 ` [patch 04/32] genirq/msi: Provide a set of advanced MSI accessors and iterators Thomas Gleixner
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Usage sites which allocate the MSI descriptors before invoking
msi_domain_alloc_irqs() require the MSI descriptors to be locked across the
operation.

Provide entry points which can be called with the MSI mutex held and lock
the mutex in the existing entry points.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |    3 ++
 kernel/irq/msi.c    |   74 ++++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 61 insertions(+), 16 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -413,9 +413,12 @@ struct irq_domain *msi_create_irq_domain
 					 struct irq_domain *parent);
 int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 			    int nvec);
+int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev,
+				       int nvec);
 int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 			  int nvec);
 void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
+void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev);
 void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev);
 struct msi_domain_info *msi_get_domain_info(struct irq_domain *domain);
 
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -691,10 +691,8 @@ int __msi_domain_alloc_irqs(struct irq_d
 		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
 					       dev_to_node(dev), &arg, false,
 					       desc->affinity);
-		if (virq < 0) {
-			ret = msi_handle_pci_fail(domain, desc, allocated);
-			goto cleanup;
-		}
+		if (virq < 0)
+			return msi_handle_pci_fail(domain, desc, allocated);
 
 		for (i = 0; i < desc->nvec_used; i++) {
 			irq_set_msi_desc_off(virq, i, desc);
@@ -728,7 +726,7 @@ int __msi_domain_alloc_irqs(struct irq_d
 		}
 		ret = irq_domain_activate_irq(irq_data, can_reserve);
 		if (ret)
-			goto cleanup;
+			return ret;
 	}
 
 skip_activate:
@@ -743,38 +741,63 @@ int __msi_domain_alloc_irqs(struct irq_d
 		}
 	}
 	return 0;
-
-cleanup:
-	msi_domain_free_irqs(domain, dev);
-	return ret;
 }
 
 /**
- * msi_domain_alloc_irqs - Allocate interrupts from a MSI interrupt domain
+ * msi_domain_alloc_irqs_descs_locked - Allocate interrupts from a MSI interrupt domain
  * @domain:	The domain to allocate from
  * @dev:	Pointer to device struct of the device for which the interrupts
  *		are allocated
  * @nvec:	The number of interrupts to allocate
  *
+ * Must be invoked from within a msi_lock_descs() / msi_unlock_descs()
+ * pair. Use this for MSI irqdomains which implement their own vector
+ * allocation/free.
+ *
  * Return: %0 on success or an error code.
  */
-int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
-			  int nvec)
+int msi_domain_alloc_irqs_descs_locked(struct irq_domain *domain, struct device *dev,
+				       int nvec)
 {
 	struct msi_domain_info *info = domain->host_data;
 	struct msi_domain_ops *ops = info->ops;
 	int ret;
 
+	lockdep_assert_held(&dev->msi.data->mutex);
+
 	ret = ops->domain_alloc_irqs(domain, dev, nvec);
 	if (ret)
-		return ret;
+		goto cleanup;
 
 	if (!(info->flags & MSI_FLAG_DEV_SYSFS))
 		return 0;
 
 	ret = msi_device_populate_sysfs(dev);
 	if (ret)
-		msi_domain_free_irqs(domain, dev);
+		goto cleanup;
+	return 0;
+
+cleanup:
+	msi_domain_free_irqs_descs_locked(domain, dev);
+	return ret;
+}
+
+/**
+ * msi_domain_alloc_irqs - Allocate interrupts from a MSI interrupt domain
+ * @domain:	The domain to allocate from
+ * @dev:	Pointer to device struct of the device for which the interrupts
+ *		are allocated
+ * @nvec:	The number of interrupts to allocate
+ *
+ * Return: %0 on success or an error code.
+ */
+int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev, int nvec)
+{
+	int ret;
+
+	msi_lock_descs(dev);
+	ret = msi_domain_alloc_irqs_descs_locked(domain, dev, nvec);
+	msi_unlock_descs(dev);
 	return ret;
 }
 
@@ -804,22 +827,41 @@ void __msi_domain_free_irqs(struct irq_d
 }
 
 /**
- * msi_domain_free_irqs - Free interrupts from a MSI interrupt @domain associated to @dev
+ * msi_domain_free_irqs_descs_locked - Free interrupts from a MSI interrupt @domain associated to @dev
  * @domain:	The domain to managing the interrupts
  * @dev:	Pointer to device struct of the device for which the interrupts
  *		are free
+ *
+ * Must be invoked from within a msi_lock_descs() / msi_unlock_descs()
+ * pair. Use this for MSI irqdomains which implement their own vector
+ * allocation.
  */
-void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
+void msi_domain_free_irqs_descs_locked(struct irq_domain *domain, struct device *dev)
 {
 	struct msi_domain_info *info = domain->host_data;
 	struct msi_domain_ops *ops = info->ops;
 
+	lockdep_assert_held(&dev->msi.data->mutex);
+
 	if (info->flags & MSI_FLAG_DEV_SYSFS)
 		msi_device_destroy_sysfs(dev);
 	ops->domain_free_irqs(domain, dev);
 }
 
 /**
+ * msi_domain_free_irqs - Free interrupts from a MSI interrupt @domain associated to @dev
+ * @domain:	The domain managing the interrupts
+ * @dev:	Pointer to device struct of the device for which the interrupts
+ *		are freed
+ */
+void msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
+{
+	msi_lock_descs(dev);
+	msi_domain_free_irqs_descs_locked(domain, dev);
+	msi_unlock_descs(dev);
+}
+
+/**
  * msi_get_domain_info - Get the MSI interrupt domain info for @domain
  * @domain:	The interrupt domain to retrieve data from
  *



* [patch 04/32] genirq/msi: Provide a set of advanced MSI accessors and iterators
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 3 Thomas Gleixner
                   ` (2 preceding siblings ...)
  2021-11-27  1:23 ` [patch 03/32] genirq/msi: Provide msi_domain_alloc/free_irqs_descs_locked() Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-28  1:00   ` Jason Gunthorpe
  2021-11-27  1:23 ` [patch 05/32] genirq/msi: Provide msi_alloc_msi_desc() and a simple allocator Thomas Gleixner
                   ` (27 subsequent siblings)
  31 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

In preparation for dynamic handling of MSI-X interrupts, provide a new set
of MSI descriptor accessor functions and iterators. They are beneficial in
their own right as they allow cleaning up quite some code in various MSI
domain implementations.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |   58 ++++++++++++++++++++++++++++
 kernel/irq/msi.c    |  107 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 165 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -140,6 +140,18 @@ struct msi_desc {
 	struct pci_msi_desc		pci;
 };
 
+/*
+ * Filter values for the MSI descriptor iterators and accessor functions.
+ */
+enum msi_desc_filter {
+	/* All descriptors */
+	MSI_DESC_ALL,
+	/* Descriptors which have no interrupt associated */
+	MSI_DESC_NOTASSOCIATED,
+	/* Descriptors which have an interrupt associated */
+	MSI_DESC_ASSOCIATED,
+};
+
 /**
  * msi_device_data - MSI per device data
  * @lock:		Spinlock to protect register access
@@ -148,6 +160,8 @@ struct msi_desc {
  * @platform_data:	Platform-MSI specific data
  * @list:		List of MSI descriptors associated to the device
  * @mutex:		Mutex protecting the MSI list
+ * @__next:		Cached pointer to the next entry for iterators
+ * @__filter:		Cached descriptor filter
  */
 struct msi_device_data {
 	raw_spinlock_t			lock;
@@ -156,6 +170,8 @@ struct msi_device_data {
 	struct platform_msi_priv_data	*platform_data;
 	struct list_head		list;
 	struct mutex			mutex;
+	struct msi_desc			*__next;
+	enum msi_desc_filter		__filter;
 };
 
 int msi_setup_device_data(struct device *dev);
@@ -193,6 +209,48 @@ static inline unsigned int msi_get_virq(
 void msi_lock_descs(struct device *dev);
 void msi_unlock_descs(struct device *dev);
 
+struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter, unsigned int base_index);
+struct msi_desc *msi_next_desc(struct device *dev);
+
+/**
+ * msi_first_desc - Get the first MSI descriptor associated to the device
+ * @dev:	Device to search
+ */
+static inline struct msi_desc *msi_first_desc(struct device *dev)
+{
+	return __msi_first_desc(dev, MSI_DESC_ALL, 0);
+}
+
+
+/**
+ * msi_for_each_desc_from - Iterate the MSI descriptors from a given index
+ *
+ * @desc:	struct msi_desc pointer used as iterator
+ * @dev:	struct device pointer - device to iterate
+ * @filter:	Filter for descriptor selection
+ * @base_index:	MSI index to iterate from
+ *
+ * Notes:
+ *  - The loop must be protected with a msi_lock_descs()/msi_unlock_descs()
+ *    pair.
+ *  - It is safe to remove a retrieved MSI descriptor in the loop.
+ */
+#define msi_for_each_desc_from(desc, dev, filter, base_index)			\
+	for ((desc) = __msi_first_desc((dev), (filter), (base_index)); (desc);	\
+	     (desc) = msi_next_desc((dev)))
+
+/**
+ * msi_for_each_desc - Iterate the MSI descriptors
+ *
+ * @desc:	struct msi_desc pointer used as iterator
+ * @dev:	struct device pointer - device to iterate
+ * @filter:	Filter for descriptor selection
+ *
+ * See msi_for_each_desc_from() for further information.
+ */
+#define msi_for_each_desc(desc, dev, filter)				\
+	msi_for_each_desc_from(desc, dev, filter, 0)
+
 /* Helpers to hide struct msi_desc implementation details */
 #define msi_desc_to_dev(desc)		((desc)->dev)
 #define dev_to_msi_list(dev)		(&(dev)->msi.data->list)
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -142,10 +142,117 @@ void msi_unlock_descs(struct device *dev
 {
 	if (WARN_ON_ONCE(!dev->msi.data))
 		return;
+	/* Clear the next pointer which was cached by the iterator */
+	dev->msi.data->__next = NULL;
 	mutex_unlock(&dev->msi.data->mutex);
 }
 EXPORT_SYMBOL_GPL(msi_unlock_descs);
 
+static bool msi_desc_match(struct msi_desc *desc, enum msi_desc_filter filter)
+{
+	switch (filter) {
+	case MSI_DESC_ALL:
+		return true;
+	case MSI_DESC_NOTASSOCIATED:
+		return !desc->irq;
+	case MSI_DESC_ASSOCIATED:
+		return !!desc->irq;
+	}
+	WARN_ON_ONCE(1);
+	return false;
+}
+
+static struct msi_desc *msi_find_first_desc(struct device *dev, enum msi_desc_filter filter,
+					    unsigned int base_index)
+{
+	struct msi_desc *desc;
+
+	list_for_each_entry(desc, dev_to_msi_list(dev), list) {
+		if (desc->msi_index < base_index)
+			continue;
+		if (msi_desc_match(desc, filter))
+			return desc;
+	}
+	return NULL;
+}
+
+/**
+ * __msi_first_desc - Get the first MSI descriptor of a device
+ * @dev:	Device to operate on
+ * @filter:	Descriptor state filter
+ * @base_index:	MSI index to start from for range based operations
+ *
+ * Must be called with the MSI descriptor mutex held, i.e. msi_lock_descs()
+ * must be invoked before the call.
+ *
+ * Return: Pointer to the first MSI descriptor matching the search
+ *	   criteria, NULL if none found.
+ */
+struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter,
+				  unsigned int base_index)
+{
+	struct msi_desc *desc;
+
+	if (WARN_ON_ONCE(!dev->msi.data))
+		return NULL;
+
+	lockdep_assert_held(&dev->msi.data->mutex);
+
+	/* Invalidate a previous invocation within the same lock section */
+	dev->msi.data->__next = NULL;
+
+	desc = msi_find_first_desc(dev, filter, base_index);
+	if (desc) {
+		dev->msi.data->__next = list_next_entry(desc, list);
+		dev->msi.data->__filter = filter;
+	}
+	return desc;
+}
+EXPORT_SYMBOL_GPL(__msi_first_desc);
+
+static struct msi_desc *__msi_next_desc(struct device *dev, enum msi_desc_filter filter,
+					struct msi_desc *from)
+{
+	struct msi_desc *desc = from;
+
+	list_for_each_entry_from(desc, dev_to_msi_list(dev), list) {
+		if (msi_desc_match(desc, filter))
+			return desc;
+	}
+	return NULL;
+}
+
+/**
+ * msi_next_desc - Get the next MSI descriptor of a device
+ * @dev:	Device to operate on
+ *
+ * The first invocation of msi_next_desc() has to be preceded by a
+ * successful invocation of __msi_first_desc(). Consecutive invocations are
+ * only valid if the previous one was successful. All these operations have
+ * to be done within the same MSI mutex held region.
+ *
+ * Return: Pointer to the next MSI descriptor matching the search
+ *	   criteria, NULL if none found.
+ */
+struct msi_desc *msi_next_desc(struct device *dev)
+{
+	struct msi_device_data *data = dev->msi.data;
+	struct msi_desc *desc;
+
+	if (WARN_ON_ONCE(!data))
+		return NULL;
+
+	lockdep_assert_held(&data->mutex);
+
+	if (!data->__next)
+		return NULL;
+
+	desc = __msi_next_desc(dev, data->__filter, data->__next);
+	dev->msi.data->__next = desc ? list_next_entry(desc, list) : NULL;
+	return desc;
+}
+EXPORT_SYMBOL_GPL(msi_next_desc);
+
 /**
  * __msi_get_virq - Return Linux interrupt number of a MSI interrupt
  * @dev:	Device to operate on



* [patch 05/32] genirq/msi: Provide msi_alloc_msi_desc() and a simple allocator
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 3 Thomas Gleixner
                   ` (3 preceding siblings ...)
  2021-11-27  1:23 ` [patch 04/32] genirq/msi: Provide a set of advanced MSI accessors and iterators Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:23 ` [patch 06/32] genirq/msi: Provide domain flags to allocate/free MSI descriptors automatically Thomas Gleixner
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Provide msi_add_msi_desc(), which takes a template MSI descriptor for
initializing a newly allocated descriptor. This allows various usage sites
of alloc_msi_entry() to be simplified and moves the storage handling into
the core code.

For simple cases where only a linear vector space is required provide
msi_add_simple_msi_descs() which just allocates a linear range of MSI
descriptors and fills msi_desc::msi_index accordingly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |    2 +
 kernel/irq/msi.c    |   59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -302,6 +302,8 @@ static inline void pci_write_msi_msg(uns
 }
 #endif /* CONFIG_PCI_MSI */
 
+int msi_add_msi_desc(struct device *dev, struct msi_desc *init_desc);
+
 struct msi_desc *alloc_msi_entry(struct device *dev, int nvec,
 				 const struct irq_affinity_desc *affinity);
 void free_msi_entry(struct msi_desc *entry);
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -61,6 +61,65 @@ void free_msi_entry(struct msi_desc *ent
 }
 
 /**
+ * msi_add_msi_desc - Allocate and initialize a MSI descriptor
+ * @dev:	Pointer to the device for which the descriptor is allocated
+ * @init_desc:	Pointer to an MSI descriptor to initialize the new descriptor
+ *
+ * Return: 0 on success or an appropriate failure code.
+ */
+int msi_add_msi_desc(struct device *dev, struct msi_desc *init_desc)
+{
+	struct msi_desc *desc;
+
+	lockdep_assert_held(&dev->msi.data->mutex);
+
+	desc = alloc_msi_entry(dev, init_desc->nvec_used, init_desc->affinity);
+	if (!desc)
+		return -ENOMEM;
+
+	/* Copy the MSI index and type specific data to the new descriptor. */
+	desc->msi_index = init_desc->msi_index;
+	desc->pci = init_desc->pci;
+
+	list_add_tail(&desc->list, &dev->msi.data->list);
+	return 0;
+}
+
+/**
+ * msi_add_simple_msi_descs - Allocate and initialize MSI descriptors
+ * @dev:	Pointer to the device for which the descriptors are allocated
+ * @index:	Index for the first MSI descriptor
+ * @ndesc:	Number of descriptors to allocate
+ *
+ * Return: 0 on success or an appropriate failure code.
+ */
+static int msi_add_simple_msi_descs(struct device *dev, unsigned int index, unsigned int ndesc)
+{
+	struct msi_desc *desc, *tmp;
+	LIST_HEAD(list);
+	unsigned int i;
+
+	lockdep_assert_held(&dev->msi.data->mutex);
+
+	for (i = 0; i < ndesc; i++) {
+		desc = alloc_msi_entry(dev, 1, NULL);
+		if (!desc)
+			goto fail;
+		desc->msi_index = index + i;
+		list_add_tail(&desc->list, &list);
+	}
+	list_splice_tail(&list, &dev->msi.data->list);
+	return 0;
+
+fail:
+	list_for_each_entry_safe(desc, tmp, &list, list) {
+		list_del(&desc->list);
+		free_msi_entry(desc);
+	}
+	return -ENOMEM;
+}
+
+/**
  * msi_device_has_property - Check whether a device has a specific MSI property
  * @dev:	Pointer to the device which is queried
  * @prop:	Property to check for



* [patch 06/32] genirq/msi: Provide domain flags to allocate/free MSI descriptors automatically
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 3 Thomas Gleixner
                   ` (4 preceding siblings ...)
  2021-11-27  1:23 ` [patch 05/32] genirq/msi: Provide msi_alloc_msi_desc() and a simple allocator Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:23 ` [patch 07/32] genirq/msi: Count the allocated MSI descriptors Thomas Gleixner
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Provide domain info flags which tell the core code to allocate simple
descriptors or to free descriptors when the interrupts are freed, and
implement the required functionality.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |   15 +++++++++++++++
 kernel/irq/msi.c    |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -303,6 +303,17 @@ static inline void pci_write_msi_msg(uns
 #endif /* CONFIG_PCI_MSI */
 
 int msi_add_msi_desc(struct device *dev, struct msi_desc *init_desc);
+void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter,
+			      unsigned int base_index, unsigned int ndesc);
+
+/**
+ * msi_free_msi_descs - Free MSI descriptors of a device
+ * @dev:	Device to free the descriptors
+ */
+static inline void msi_free_msi_descs(struct device *dev)
+{
+	msi_free_msi_descs_range(dev, MSI_DESC_ALL, 0, UINT_MAX);
+}
 
 struct msi_desc *alloc_msi_entry(struct device *dev, int nvec,
 				 const struct irq_affinity_desc *affinity);
@@ -463,6 +474,10 @@ enum {
 	MSI_FLAG_DEV_SYSFS		= (1 << 7),
 	/* MSI-X entries must be contiguous */
 	MSI_FLAG_MSIX_CONTIGUOUS	= (1 << 8),
+	/* Allocate simple MSI descriptors */
+	MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS	= (1 << 9),
+	/* Free MSI descriptors */
+	MSI_FLAG_FREE_MSI_DESCS		= (1 << 10),
 };
 
 int msi_domain_set_affinity(struct irq_data *data, const struct cpumask *mask,
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -120,6 +120,32 @@ static int msi_add_simple_msi_descs(stru
 }
 
 /**
+ * msi_free_msi_descs_range - Free MSI descriptors of a device
+ * @dev:	Device to free the descriptors
+ * @filter:	Descriptor state filter
+ * @base_index:	Index to start freeing from
+ * @ndesc:	Number of descriptors to free
+ */
+void msi_free_msi_descs_range(struct device *dev, enum msi_desc_filter filter,
+			      unsigned int base_index, unsigned int ndesc)
+{
+	struct msi_desc *desc;
+
+	lockdep_assert_held(&dev->msi.data->mutex);
+
+	msi_for_each_desc(desc, dev, filter) {
+		/*
+		 * Stupid for now to handle MSI device domain until the
+		 * storage is switched over to an xarray.
+		 */
+		if (desc->msi_index < base_index || desc->msi_index >= base_index + ndesc)
+			continue;
+		list_del(&desc->list);
+		free_msi_entry(desc);
+	}
+}
+
+/**
  * msi_device_has_property - Check whether a device has a specific MSI property
  * @dev:	Pointer to the device which is queried
  * @prop:	Property to check for
@@ -905,6 +931,16 @@ int __msi_domain_alloc_irqs(struct irq_d
 	return 0;
 }
 
+static int msi_domain_add_simple_msi_descs(struct msi_domain_info *info,
+					   struct device *dev,
+					   unsigned int num_descs)
+{
+	if (!(info->flags & MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS))
+		return 0;
+
+	return msi_add_simple_msi_descs(dev, 0, num_descs);
+}
+
 /**
  * msi_domain_alloc_irqs_descs_locked - Allocate interrupts from a MSI interrupt domain
  * @domain:	The domain to allocate from
@@ -927,6 +963,10 @@ int msi_domain_alloc_irqs_descs_locked(s
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
+	ret = msi_domain_add_simple_msi_descs(info, dev, nvec);
+	if (ret)
+		return ret;
+
 	ret = ops->domain_alloc_irqs(domain, dev, nvec);
 	if (ret)
 		goto cleanup;
@@ -988,6 +1028,13 @@ void __msi_domain_free_irqs(struct irq_d
 	}
 }
 
+static void msi_domain_free_msi_descs(struct msi_domain_info *info,
+				      struct device *dev)
+{
+	if (info->flags & MSI_FLAG_FREE_MSI_DESCS)
+		msi_free_msi_descs(dev);
+}
+
 /**
  * msi_domain_free_irqs_descs_locked - Free interrupts from a MSI interrupt @domain associated to @dev
  * @domain:	The domain to managing the interrupts
@@ -1008,6 +1055,7 @@ void msi_domain_free_irqs_descs_locked(s
 	if (info->flags & MSI_FLAG_DEV_SYSFS)
 		msi_device_destroy_sysfs(dev);
 	ops->domain_free_irqs(domain, dev);
+	msi_domain_free_msi_descs(info, dev);
 }
 
 /**


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [patch 07/32] genirq/msi: Count the allocated MSI descriptors
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (5 preceding siblings ...)
  2021-11-27  1:23 ` [patch 06/32] genirq/msi: Provide domain flags to allocate/free MSI descriptors automatically Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27 12:19   ` Greg Kroah-Hartman
  2021-11-27  1:23 ` [patch 08/32] PCI/MSI: Protect MSI operations Thomas Gleixner
                   ` (24 subsequent siblings)
  31 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Keep track of the number of allocated MSI descriptors in struct
msi_device_data and provide msi_device_num_descs() to query the count.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |    3 +++
 kernel/irq/msi.c    |   18 ++++++++++++++++++
 2 files changed, 21 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -156,6 +156,7 @@ enum msi_desc_filter {
  * msi_device_data - MSI per device data
  * @lock:		Spinlock to protect register access
  * @properties:		MSI properties which are interesting to drivers
+ * @num_descs:		The number of allocated MSI descriptors for the device
  * @attrs:		Pointer to the sysfs attribute group
  * @platform_data:	Platform-MSI specific data
  * @list:		List of MSI descriptors associated to the device
@@ -166,6 +167,7 @@ enum msi_desc_filter {
 struct msi_device_data {
 	raw_spinlock_t			lock;
 	unsigned long			properties;
+	unsigned int			num_descs;
 	const struct attribute_group    **attrs;
 	struct platform_msi_priv_data	*platform_data;
 	struct list_head		list;
@@ -208,6 +210,7 @@ static inline unsigned int msi_get_virq(
 
 void msi_lock_descs(struct device *dev);
 void msi_unlock_descs(struct device *dev);
+unsigned int msi_device_num_descs(struct device *dev);
 
 struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter, unsigned int base_index);
 struct msi_desc *msi_next_desc(struct device *dev);
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -82,6 +82,7 @@ int msi_add_msi_desc(struct device *dev,
 	desc->pci = init_desc->pci;
 
 	list_add_tail(&desc->list, &dev->msi.data->list);
+	dev->msi.data->num_descs++;
 	return 0;
 }
 
@@ -109,6 +110,7 @@ int msi_add_simple_msi_descs(struct devi
 		list_add_tail(&desc->list, &list);
 	}
 	list_splice_tail(&list, &dev->msi.data->list);
+	dev->msi.data->num_descs += ndesc;
 	return 0;
 
 fail:
@@ -142,6 +144,7 @@ void msi_free_msi_descs_range(struct dev
 			continue;
 		list_del(&desc->list);
 		free_msi_entry(desc);
+		dev->msi.data->num_descs--;
 	}
 }
 
@@ -157,6 +160,21 @@ bool msi_device_has_property(struct devi
 	return !!(dev->msi.data->properties & prop);
 }
 
+/**
+ * msi_device_num_descs - Query the number of allocated MSI descriptors of a device
+ * @dev:	The device to read from
+ *
+ * Note: This is a lockless snapshot of msi_device_data::num_descs
+ *
+ * Returns the number of MSI descriptors which are allocated for @dev
+ */
+unsigned int msi_device_num_descs(struct device *dev)
+{
+	if (dev->msi.data)
+		return dev->msi.data->num_descs;
+	return 0;
+}
+
 void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 {
 	*msg = entry->msg;



* [patch 08/32] PCI/MSI: Protect MSI operations
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (6 preceding siblings ...)
  2021-11-27  1:23 ` [patch 07/32] genirq/msi: Count the allocated MSI descriptors Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:23 ` [patch 09/32] PCI/MSI: Use msi_add_msi_desc() Thomas Gleixner
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

To prepare for dynamic extension of MSI-X vectors, protect the MSI
operations for MSI and MSI-X. This requires moving the invocation of
irq_create_affinity_masks() out of the descriptor-lock section to avoid
reverse lock ordering vs. the CPU hotplug lock, as some callers of the
PCI/MSI allocation interfaces already hold it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |    4 -
 drivers/pci/msi/msi.c       |  120 ++++++++++++++++++++++++++------------------
 2 files changed, 73 insertions(+), 51 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -14,7 +14,7 @@ int pci_msi_setup_msi_irqs(struct pci_de
 
 	domain = dev_get_msi_domain(&dev->dev);
 	if (domain && irq_domain_is_hierarchy(domain))
-		return msi_domain_alloc_irqs(domain, &dev->dev, nvec);
+		return msi_domain_alloc_irqs_descs_locked(domain, &dev->dev, nvec);
 
 	return pci_msi_legacy_setup_msi_irqs(dev, nvec, type);
 }
@@ -25,7 +25,7 @@ void pci_msi_teardown_msi_irqs(struct pc
 
 	domain = dev_get_msi_domain(&dev->dev);
 	if (domain && irq_domain_is_hierarchy(domain))
-		msi_domain_free_irqs(domain, &dev->dev);
+		msi_domain_free_irqs_descs_locked(domain, &dev->dev);
 	else
 		pci_msi_legacy_teardown_msi_irqs(dev);
 }
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -322,11 +322,13 @@ static void __pci_restore_msix_state(str
 
 	write_msg = arch_restore_msi_irqs(dev);
 
+	msi_lock_descs(&dev->dev);
 	for_each_pci_msi_entry(entry, dev) {
 		if (write_msg)
 			__pci_write_msi_msg(entry, &entry->msg);
 		pci_msix_write_vector_ctrl(entry, entry->pci.msix_ctrl);
 	}
+	msi_unlock_descs(&dev->dev);
 
 	pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_MASKALL, 0);
 }
@@ -339,19 +341,15 @@ void pci_restore_msi_state(struct pci_de
 EXPORT_SYMBOL_GPL(pci_restore_msi_state);
 
 static struct msi_desc *
-msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity *affd)
+msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity_desc *masks)
 {
-	struct irq_affinity_desc *masks = NULL;
 	struct msi_desc *entry;
 	u16 control;
 
-	if (affd)
-		masks = irq_create_affinity_masks(nvec, affd);
-
 	/* MSI Entry Initialization */
 	entry = alloc_msi_entry(&dev->dev, nvec, masks);
 	if (!entry)
-		goto out;
+		return NULL;
 
 	pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &control);
 	/* Lies, damned lies, and MSIs */
@@ -377,8 +375,7 @@ msi_setup_entry(struct pci_dev *dev, int
 	dev->dev.msi.data->properties = MSI_PROP_PCI_MSI;
 	if (entry->pci.msi_attrib.is_64)
 		dev->dev.msi.data->properties |= MSI_PROP_64BIT;
-out:
-	kfree(masks);
+
 	return entry;
 }
 
@@ -414,14 +411,21 @@ static int msi_verify_entries(struct pci
 static int msi_capability_init(struct pci_dev *dev, int nvec,
 			       struct irq_affinity *affd)
 {
+	struct irq_affinity_desc *masks = NULL;
 	struct msi_desc *entry;
 	int ret;
 
 	pci_msi_set_enable(dev, 0);	/* Disable MSI during set up */
 
-	entry = msi_setup_entry(dev, nvec, affd);
-	if (!entry)
-		return -ENOMEM;
+	if (affd)
+		masks = irq_create_affinity_masks(nvec, affd);
+
+	msi_lock_descs(&dev->dev);
+	entry = msi_setup_entry(dev, nvec, masks);
+	if (!entry) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
 
 	/* All MSIs are unmasked by default; mask them all */
 	pci_msi_mask(entry, msi_multi_mask(entry));
@@ -444,11 +448,14 @@ static int msi_capability_init(struct pc
 
 	pcibios_free_irq(dev);
 	dev->irq = entry->irq;
-	return 0;
+	goto unlock;
 
 err:
 	pci_msi_unmask(entry, msi_multi_mask(entry));
 	free_msi_irqs(dev);
+unlock:
+	msi_unlock_descs(&dev->dev);
+	kfree(masks);
 	return ret;
 }
 
@@ -475,23 +482,18 @@ static void __iomem *msix_map_region(str
 
 static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
 			      struct msix_entry *entries, int nvec,
-			      struct irq_affinity *affd)
+			      struct irq_affinity_desc *masks)
 {
-	struct irq_affinity_desc *curmsk, *masks = NULL;
+	int i, vec_count = pci_msix_vec_count(dev);
+	struct irq_affinity_desc *curmsk;
 	struct msi_desc *entry;
 	void __iomem *addr;
-	int ret, i;
-	int vec_count = pci_msix_vec_count(dev);
-
-	if (affd)
-		masks = irq_create_affinity_masks(nvec, affd);
 
 	for (i = 0, curmsk = masks; i < nvec; i++) {
 		entry = alloc_msi_entry(&dev->dev, 1, curmsk);
 		if (!entry) {
 			/* No enough memory. Don't try again */
-			ret = -ENOMEM;
-			goto out;
+			return -ENOMEM;
 		}
 
 		entry->pci.msi_attrib.is_msix	= 1;
@@ -520,10 +522,7 @@ static int msix_setup_entries(struct pci
 			curmsk++;
 	}
 	dev->dev.msi.data->properties = MSI_PROP_PCI_MSIX | MSI_PROP_64BIT;
-	ret = 0;
-out:
-	kfree(masks);
-	return ret;
+	return 0;
 }
 
 static void msix_update_entries(struct pci_dev *dev, struct msix_entry *entries)
@@ -550,6 +549,41 @@ static void msix_mask_all(void __iomem *
 		writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL);
 }
 
+static int msix_setup_interrupts(struct pci_dev *dev, void __iomem *base,
+				 struct msix_entry *entries, int nvec,
+				 struct irq_affinity *affd)
+{
+	struct irq_affinity_desc *masks = NULL;
+	int ret;
+
+	if (affd)
+		masks = irq_create_affinity_masks(nvec, affd);
+
+	msi_lock_descs(&dev->dev);
+	ret = msix_setup_entries(dev, base, entries, nvec, masks);
+	if (ret)
+		goto out_free;
+
+	ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
+	if (ret)
+		goto out_free;
+
+	/* Check if all MSI entries honor device restrictions */
+	ret = msi_verify_entries(dev);
+	if (ret)
+		goto out_free;
+
+	msix_update_entries(dev, entries);
+	goto out_unlock;
+
+out_free:
+	free_msi_irqs(dev);
+out_unlock:
+	msi_unlock_descs(&dev->dev);
+	kfree(masks);
+	return ret;
+}
+
 /**
  * msix_capability_init - configure device's MSI-X capability
  * @dev: pointer to the pci_dev data structure of MSI-X device function
@@ -590,20 +624,9 @@ static int msix_capability_init(struct p
 	/* Ensure that all table entries are masked. */
 	msix_mask_all(base, tsize);
 
-	ret = msix_setup_entries(dev, base, entries, nvec, affd);
+	ret = msix_setup_interrupts(dev, base, entries, nvec, affd);
 	if (ret)
-		goto out_free;
-
-	ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
-	if (ret)
-		goto out_free;
-
-	/* Check if all MSI entries honor device restrictions */
-	ret = msi_verify_entries(dev);
-	if (ret)
-		goto out_free;
-
-	msix_update_entries(dev, entries);
+		goto out_disable;
 
 	/* Set MSI-X enabled bits and unmask the function */
 	pci_intx_for_msi(dev, 0);
@@ -613,12 +636,8 @@ static int msix_capability_init(struct p
 	pcibios_free_irq(dev);
 	return 0;
 
-out_free:
-	free_msi_irqs(dev);
-
 out_disable:
 	pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE, 0);
-
 	return ret;
 }
 
@@ -723,8 +742,10 @@ void pci_disable_msi(struct pci_dev *dev
 	if (!pci_msi_enable || !dev || !dev->msi_enabled)
 		return;
 
+	msi_lock_descs(&dev->dev);
 	pci_msi_shutdown(dev);
 	free_msi_irqs(dev);
+	msi_unlock_descs(&dev->dev);
 }
 EXPORT_SYMBOL(pci_disable_msi);
 
@@ -810,8 +831,10 @@ void pci_disable_msix(struct pci_dev *de
 	if (!pci_msi_enable || !dev || !dev->msix_enabled)
 		return;
 
+	msi_lock_descs(&dev->dev);
 	pci_msix_shutdown(dev);
 	free_msi_irqs(dev);
+	msi_unlock_descs(&dev->dev);
 }
 EXPORT_SYMBOL(pci_disable_msix);
 
@@ -872,7 +895,6 @@ int pci_enable_msi(struct pci_dev *dev)
 
 	if (!rc)
 		rc = __pci_enable_msi_range(dev, 1, 1, NULL);
-
 	return rc < 0 ? rc : 0;
 }
 EXPORT_SYMBOL(pci_enable_msi);
@@ -959,11 +981,7 @@ int pci_alloc_irq_vectors_affinity(struc
 				   struct irq_affinity *affd)
 {
 	struct irq_affinity msi_default_affd = {0};
-	int ret = msi_setup_device_data(&dev->dev);
-	int nvecs = -ENOSPC;
-
-	if (ret)
-		return ret;
+	int ret, nvecs;
 
 	if (flags & PCI_IRQ_AFFINITY) {
 		if (!affd)
@@ -973,6 +991,10 @@ int pci_alloc_irq_vectors_affinity(struc
 			affd = NULL;
 	}
 
+	ret = msi_setup_device_data(&dev->dev);
+	if (ret)
+		return ret;
+
 	if (flags & PCI_IRQ_MSIX) {
 		nvecs = __pci_enable_msix_range(dev, NULL, min_vecs, max_vecs,
 						affd, flags);
@@ -1001,7 +1023,7 @@ int pci_alloc_irq_vectors_affinity(struc
 		}
 	}
 
-	return nvecs;
+	return -ENOSPC;
 }
 EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);
 



* [patch 09/32] PCI/MSI: Use msi_add_msi_desc()
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (7 preceding siblings ...)
  2021-11-27  1:23 ` [patch 08/32] PCI/MSI: Protect MSI operations Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:23 ` [patch 10/32] PCI/MSI: Let core code free MSI descriptors Thomas Gleixner
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Simplify the allocation of MSI descriptors by using msi_add_msi_desc()
which moves the storage handling to core code and prepares for dynamic
extension of the MSI-X vector space.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/msi.c |  121 ++++++++++++++++++++++++--------------------------
 1 file changed, 59 insertions(+), 62 deletions(-)

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -340,43 +340,49 @@ void pci_restore_msi_state(struct pci_de
 }
 EXPORT_SYMBOL_GPL(pci_restore_msi_state);
 
-static struct msi_desc *
-msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity_desc *masks)
+static int msi_setup_msi_desc(struct pci_dev *dev, int nvec,
+			      struct irq_affinity_desc *masks)
 {
-	struct msi_desc *entry;
+	struct msi_desc desc;
 	u16 control;
+	int ret;
 
 	/* MSI Entry Initialization */
-	entry = alloc_msi_entry(&dev->dev, nvec, masks);
-	if (!entry)
-		return NULL;
+	memset(&desc, 0, sizeof(desc));
 
 	pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &control);
 	/* Lies, damned lies, and MSIs */
 	if (dev->dev_flags & PCI_DEV_FLAGS_HAS_MSI_MASKING)
 		control |= PCI_MSI_FLAGS_MASKBIT;
+	/* Respect XEN's mask disabling */
+	if (pci_msi_ignore_mask)
+		control &= ~PCI_MSI_FLAGS_MASKBIT;
 
-	entry->pci.msi_attrib.is_64	= !!(control & PCI_MSI_FLAGS_64BIT);
-	entry->pci.msi_attrib.can_mask	= !pci_msi_ignore_mask &&
-					  !!(control & PCI_MSI_FLAGS_MASKBIT);
-	entry->pci.msi_attrib.default_irq = dev->irq;
-	entry->pci.msi_attrib.multi_cap	= (control & PCI_MSI_FLAGS_QMASK) >> 1;
-	entry->pci.msi_attrib.multiple	= ilog2(__roundup_pow_of_two(nvec));
+	desc.nvec_used			= nvec;
+	desc.pci.msi_attrib.is_64	= !!(control & PCI_MSI_FLAGS_64BIT);
+	desc.pci.msi_attrib.can_mask	= !!(control & PCI_MSI_FLAGS_MASKBIT);
+	desc.pci.msi_attrib.default_irq	= dev->irq;
+	desc.pci.msi_attrib.multi_cap	= (control & PCI_MSI_FLAGS_QMASK) >> 1;
+	desc.pci.msi_attrib.multiple	= ilog2(__roundup_pow_of_two(nvec));
+	desc.affinity			= masks;
 
 	if (control & PCI_MSI_FLAGS_64BIT)
-		entry->pci.mask_pos = dev->msi_cap + PCI_MSI_MASK_64;
+		desc.pci.mask_pos = dev->msi_cap + PCI_MSI_MASK_64;
 	else
-		entry->pci.mask_pos = dev->msi_cap + PCI_MSI_MASK_32;
+		desc.pci.mask_pos = dev->msi_cap + PCI_MSI_MASK_32;
 
 	/* Save the initial mask status */
-	if (entry->pci.msi_attrib.can_mask)
-		pci_read_config_dword(dev, entry->pci.mask_pos, &entry->pci.msi_mask);
+	if (desc.pci.msi_attrib.can_mask)
+		pci_read_config_dword(dev, desc.pci.mask_pos, &desc.pci.msi_mask);
 
-	dev->dev.msi.data->properties = MSI_PROP_PCI_MSI;
-	if (entry->pci.msi_attrib.is_64)
-		dev->dev.msi.data->properties |= MSI_PROP_64BIT;
+	ret = msi_add_msi_desc(&dev->dev, &desc);
+	if (!ret) {
+		dev->dev.msi.data->properties = MSI_PROP_PCI_MSI;
+		if (desc.pci.msi_attrib.is_64)
+			dev->dev.msi.data->properties |= MSI_PROP_64BIT;
+	}
 
-	return entry;
+	return ret;
 }
 
 static int msi_verify_entries(struct pci_dev *dev)
@@ -421,17 +427,14 @@ static int msi_capability_init(struct pc
 		masks = irq_create_affinity_masks(nvec, affd);
 
 	msi_lock_descs(&dev->dev);
-	entry = msi_setup_entry(dev, nvec, masks);
-	if (!entry) {
-		ret = -ENOMEM;
+	ret = msi_setup_msi_desc(dev, nvec, masks);
+	if (ret)
 		goto unlock;
-	}
 
 	/* All MSIs are unmasked by default; mask them all */
+	entry = first_pci_msi_entry(dev);
 	pci_msi_mask(entry, msi_multi_mask(entry));
 
-	list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
-
 	/* Configure MSI capability structure */
 	ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
 	if (ret)
@@ -480,49 +483,41 @@ static void __iomem *msix_map_region(str
 	return ioremap(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
 }
 
-static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
-			      struct msix_entry *entries, int nvec,
-			      struct irq_affinity_desc *masks)
+static int msix_setup_msi_descs(struct pci_dev *dev, void __iomem *base,
+				struct msix_entry *entries, int nvec,
+				struct irq_affinity_desc *masks)
 {
-	int i, vec_count = pci_msix_vec_count(dev);
+	int ret, i, vec_count = pci_msix_vec_count(dev);
 	struct irq_affinity_desc *curmsk;
-	struct msi_desc *entry;
+	struct msi_desc desc;
 	void __iomem *addr;
 
-	for (i = 0, curmsk = masks; i < nvec; i++) {
-		entry = alloc_msi_entry(&dev->dev, 1, curmsk);
-		if (!entry) {
-			/* No enough memory. Don't try again */
-			return -ENOMEM;
-		}
-
-		entry->pci.msi_attrib.is_msix	= 1;
-		entry->pci.msi_attrib.is_64	= 1;
-
-		if (entries)
-			entry->msi_index = entries[i].entry;
-		else
-			entry->msi_index = i;
-
-		entry->pci.msi_attrib.is_virtual = entry->msi_index >= vec_count;
-
-		entry->pci.msi_attrib.can_mask	= !pci_msi_ignore_mask &&
-						  !entry->pci.msi_attrib.is_virtual;
-
-		entry->pci.msi_attrib.default_irq	= dev->irq;
-		entry->pci.mask_base			= base;
+	memset(&desc, 0, sizeof(desc));
 
-		if (entry->pci.msi_attrib.can_mask) {
-			addr = pci_msix_desc_addr(entry);
-			entry->pci.msix_ctrl = readl(addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
+	desc.nvec_used			= 1;
+	desc.pci.msi_attrib.is_msix	= 1;
+	desc.pci.msi_attrib.is_64	= 1;
+	desc.pci.msi_attrib.default_irq	= dev->irq;
+	desc.pci.mask_base		= base;
+
+	for (i = 0, curmsk = masks; i < nvec; i++, curmsk++) {
+		desc.msi_index = entries ? entries[i].entry : i;
+		desc.affinity = masks ? curmsk : NULL;
+		desc.pci.msi_attrib.is_virtual = desc.msi_index >= vec_count;
+		desc.pci.msi_attrib.can_mask = !pci_msi_ignore_mask &&
+					       !desc.pci.msi_attrib.is_virtual;
+
+		if (!desc.pci.msi_attrib.can_mask) {
+			addr = pci_msix_desc_addr(&desc);
+			desc.pci.msix_ctrl = readl(addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
 		}
 
-		list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
-		if (masks)
-			curmsk++;
+		ret = msi_add_msi_desc(&dev->dev, &desc);
+		if (ret)
+			break;
 	}
-	dev->dev.msi.data->properties = MSI_PROP_PCI_MSIX | MSI_PROP_64BIT;
-	return 0;
+
+	return ret;
 }
 
 static void msix_update_entries(struct pci_dev *dev, struct msix_entry *entries)
@@ -560,10 +555,12 @@ static int msix_setup_interrupts(struct
 		masks = irq_create_affinity_masks(nvec, affd);
 
 	msi_lock_descs(&dev->dev);
-	ret = msix_setup_entries(dev, base, entries, nvec, masks);
+	ret = msix_setup_msi_descs(dev, base, entries, nvec, masks);
 	if (ret)
 		goto out_free;
 
+	dev->dev.msi.data->properties = MSI_PROP_PCI_MSIX | MSI_PROP_64BIT;
+
 	ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
 	if (ret)
 		goto out_free;



* [patch 10/32] PCI/MSI: Let core code free MSI descriptors
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (8 preceding siblings ...)
  2021-11-27  1:23 ` [patch 09/32] PCI/MSI: Use msi_add_msi_desc() Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:23 ` [patch 11/32] PCI/MSI: Use msi_on_each_desc() Thomas Gleixner
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Set the domain info flag which tells the core code to free the MSI
descriptors from msi_domain_free_irqs() and add an explicit call to the
core function into the legacy code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |    3 ++-
 drivers/pci/msi/legacy.c    |    1 +
 drivers/pci/msi/msi.c       |   14 --------------
 3 files changed, 3 insertions(+), 15 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -171,7 +171,8 @@ struct irq_domain *pci_msi_create_irq_do
 	if (info->flags & MSI_FLAG_USE_DEF_CHIP_OPS)
 		pci_msi_domain_update_chip_ops(info);
 
-	info->flags |= MSI_FLAG_ACTIVATE_EARLY | MSI_FLAG_DEV_SYSFS;
+	info->flags |= MSI_FLAG_ACTIVATE_EARLY | MSI_FLAG_DEV_SYSFS |
+		       MSI_FLAG_FREE_MSI_DESCS;
 	if (IS_ENABLED(CONFIG_GENERIC_IRQ_RESERVATION_MODE))
 		info->flags |= MSI_FLAG_MUST_REACTIVATE;
 
--- a/drivers/pci/msi/legacy.c
+++ b/drivers/pci/msi/legacy.c
@@ -81,4 +81,5 @@ void pci_msi_legacy_teardown_msi_irqs(st
 {
 	msi_device_destroy_sysfs(&dev->dev);
 	arch_teardown_msi_irqs(dev);
+	msi_free_msi_descs(&dev->dev);
 }
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -224,22 +224,8 @@ EXPORT_SYMBOL_GPL(pci_write_msi_msg);
 
 static void free_msi_irqs(struct pci_dev *dev)
 {
-	struct list_head *msi_list = dev_to_msi_list(&dev->dev);
-	struct msi_desc *entry, *tmp;
-	int i;
-
-	for_each_pci_msi_entry(entry, dev)
-		if (entry->irq)
-			for (i = 0; i < entry->nvec_used; i++)
-				BUG_ON(irq_has_action(entry->irq + i));
-
 	pci_msi_teardown_msi_irqs(dev);
 
-	list_for_each_entry_safe(entry, tmp, msi_list, list) {
-		list_del(&entry->list);
-		free_msi_entry(entry);
-	}
-
 	if (dev->msix_base) {
 		iounmap(dev->msix_base);
 		dev->msix_base = NULL;



* [patch 11/32] PCI/MSI: Use msi_on_each_desc()
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (9 preceding siblings ...)
  2021-11-27  1:23 ` [patch 10/32] PCI/MSI: Let core code free MSI descriptors Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:23 ` [patch 12/32] x86/pci/xen: Use msi_for_each_desc() Thomas Gleixner
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Use the new iterator functions which pave the way for dynamically extending
MSI-X vectors.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |    4 ++--
 drivers/pci/msi/legacy.c    |   19 ++++++++-----------
 drivers/pci/msi/msi.c       |   30 ++++++++++++++----------------
 3 files changed, 24 insertions(+), 29 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -83,7 +83,7 @@ static int pci_msi_domain_check_cap(stru
 				    struct msi_domain_info *info,
 				    struct device *dev)
 {
-	struct msi_desc *desc = first_pci_msi_entry(to_pci_dev(dev));
+	struct msi_desc *desc = msi_first_desc(dev);
 
 	/* Special handling to support __pci_enable_msi_range() */
 	if (pci_msi_desc_is_multi_msi(desc) &&
@@ -98,7 +98,7 @@ static int pci_msi_domain_check_cap(stru
 			unsigned int idx = 0;
 
 			/* Check for gaps in the entry indices */
-			for_each_msi_entry(desc, dev) {
+			msi_for_each_desc(desc, dev, MSI_DESC_ALL) {
 				if (desc->msi_index != idx++)
 					return -ENOTSUPP;
 			}
--- a/drivers/pci/msi/legacy.c
+++ b/drivers/pci/msi/legacy.c
@@ -29,7 +29,7 @@ int __weak arch_setup_msi_irqs(struct pc
 	if (type == PCI_CAP_ID_MSI && nvec > 1)
 		return 1;
 
-	for_each_pci_msi_entry(desc, dev) {
+	msi_for_each_desc(desc, &dev->dev, MSI_DESC_NOTASSOCIATED) {
 		ret = arch_setup_msi_irq(dev, desc);
 		if (ret)
 			return ret < 0 ? ret : -ENOSPC;
@@ -43,27 +43,24 @@ void __weak arch_teardown_msi_irqs(struc
 	struct msi_desc *desc;
 	int i;
 
-	for_each_pci_msi_entry(desc, dev) {
-		if (desc->irq) {
-			for (i = 0; i < entry->nvec_used; i++)
-				arch_teardown_msi_irq(desc->irq + i);
-		}
+	msi_for_each_desc(desc, &dev->dev, MSI_DESC_ASSOCIATED) {
+		for (i = 0; i < desc->nvec_used; i++)
+			arch_teardown_msi_irq(desc->irq + i);
 	}
 }
 
 static int pci_msi_setup_check_result(struct pci_dev *dev, int type, int ret)
 {
-	struct msi_desc *entry;
+	struct msi_desc *desc;
 	int avail = 0;
 
 	if (type != PCI_CAP_ID_MSIX || ret >= 0)
 		return ret;
 
 	/* Scan the MSI descriptors for successfully allocated ones. */
-	for_each_pci_msi_entry(entry, dev) {
-		if (entry->irq != 0)
-			avail++;
-	}
+	msi_for_each_desc(desc, &dev->dev, MSI_DESC_ASSOCIATED)
+		avail++;
+
 	return avail ? avail : ret;
 }
 
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -299,7 +299,6 @@ static void __pci_restore_msix_state(str
 
 	if (!dev->msix_enabled)
 		return;
-	BUG_ON(list_empty(dev_to_msi_list(&dev->dev)));
 
 	/* route the table */
 	pci_intx_for_msi(dev, 0);
@@ -309,7 +308,7 @@ static void __pci_restore_msix_state(str
 	write_msg = arch_restore_msi_irqs(dev);
 
 	msi_lock_descs(&dev->dev);
-	for_each_pci_msi_entry(entry, dev) {
+	msi_for_each_desc(entry, &dev->dev, MSI_DESC_ALL) {
 		if (write_msg)
 			__pci_write_msi_msg(entry, &entry->msg);
 		pci_msix_write_vector_ctrl(entry, entry->pci.msix_ctrl);
@@ -378,14 +377,14 @@ static int msi_verify_entries(struct pci
 	if (!dev->no_64bit_msi)
 		return 0;
 
-	for_each_pci_msi_entry(entry, dev) {
+	msi_for_each_desc(entry, &dev->dev, MSI_DESC_ALL) {
 		if (entry->msg.address_hi) {
 			pci_err(dev, "arch assigned 64-bit MSI address %#x%08x but device only supports 32 bits\n",
 				entry->msg.address_hi, entry->msg.address_lo);
-			return -EIO;
+			break;
 		}
 	}
-	return 0;
+	return !entry ? 0 : -EIO;
 }
 
 /**
@@ -418,7 +417,7 @@ static int msi_capability_init(struct pc
 		goto unlock;
 
 	/* All MSIs are unmasked by default; mask them all */
-	entry = first_pci_msi_entry(dev);
+	entry = msi_first_desc(&dev->dev);
 	pci_msi_mask(entry, msi_multi_mask(entry));
 
 	/* Configure MSI capability structure */
@@ -508,11 +507,11 @@ static int msix_setup_msi_descs(struct p
 
 static void msix_update_entries(struct pci_dev *dev, struct msix_entry *entries)
 {
-	struct msi_desc *entry;
+	struct msi_desc *desc;
 
 	if (entries) {
-		for_each_pci_msi_entry(entry, dev) {
-			entries->vector = entry->irq;
+		msi_for_each_desc(desc, &dev->dev, MSI_DESC_ALL) {
+			entries->vector = desc->irq;
 			entries++;
 		}
 	}
@@ -705,15 +704,14 @@ static void pci_msi_shutdown(struct pci_
 	if (!pci_msi_enable || !dev || !dev->msi_enabled)
 		return;
 
-	BUG_ON(list_empty(dev_to_msi_list(&dev->dev)));
-	desc = first_pci_msi_entry(dev);
-
 	pci_msi_set_enable(dev, 0);
 	pci_intx_for_msi(dev, 1);
 	dev->msi_enabled = 0;
 
 	/* Return the device with MSI unmasked as initial states */
-	pci_msi_unmask(desc, msi_multi_mask(desc));
+	desc = msi_first_desc(&dev->dev);
+	if (!WARN_ON_ONCE(!desc))
+		pci_msi_unmask(desc, msi_multi_mask(desc));
 
 	/* Restore dev->irq to its default pin-assertion IRQ */
 	dev->irq = desc->pci.msi_attrib.default_irq;
@@ -789,7 +787,7 @@ static int __pci_enable_msix(struct pci_
 
 static void pci_msix_shutdown(struct pci_dev *dev)
 {
-	struct msi_desc *entry;
+	struct msi_desc *desc;
 
 	if (!pci_msi_enable || !dev || !dev->msix_enabled)
 		return;
@@ -800,8 +798,8 @@ static void pci_msix_shutdown(struct pci
 	}
 
 	/* Return the device with MSI-X masked as initial states */
-	for_each_pci_msi_entry(entry, dev)
-		pci_msix_mask(entry);
+	msi_for_each_desc(desc, &dev->dev, MSI_DESC_ALL)
+		pci_msix_mask(desc);
 
 	pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE, 0);
 	pci_intx_for_msi(dev, 1);



* [patch 12/32] x86/pci/xen: Use msi_for_each_desc()
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (10 preceding siblings ...)
  2021-11-27  1:23 ` [patch 11/32] PCI/MSI: Use msi_on_each_desc() Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:23 ` [patch 13/32] xen/pcifront: Rework MSI handling Thomas Gleixner
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Replace the about-to-vanish iterators.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/pci/xen.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -184,7 +184,7 @@ static int xen_setup_msi_irqs(struct pci
 	if (ret)
 		goto error;
 	i = 0;
-	for_each_pci_msi_entry(msidesc, dev) {
+	msi_for_each_desc(msidesc, &dev->dev, MSI_DESC_NOTASSOCIATED) {
 		irq = xen_bind_pirq_msi_to_irq(dev, msidesc, v[i],
 					       (type == PCI_CAP_ID_MSI) ? nvec : 1,
 					       (type == PCI_CAP_ID_MSIX) ?
@@ -235,7 +235,7 @@ static int xen_hvm_setup_msi_irqs(struct
 	if (type == PCI_CAP_ID_MSI && nvec > 1)
 		return 1;
 
-	for_each_pci_msi_entry(msidesc, dev) {
+	msi_for_each_desc(msidesc, &dev->dev, MSI_DESC_NOTASSOCIATED) {
 		pirq = xen_allocate_pirq_msi(dev, msidesc);
 		if (pirq < 0) {
 			irq = -ENODEV;
@@ -270,7 +270,7 @@ static int xen_initdom_setup_msi_irqs(st
 	int ret = 0;
 	struct msi_desc *msidesc;
 
-	for_each_pci_msi_entry(msidesc, dev) {
+	msi_for_each_desc(msidesc, &dev->dev, MSI_DESC_NOTASSOCIATED) {
 		struct physdev_map_pirq map_irq;
 		domid_t domid;
 
@@ -389,11 +389,9 @@ static void xen_teardown_msi_irqs(struct
 	struct msi_desc *msidesc;
 	int i;
 
-	for_each_pci_msi_entry(msidesc, dev) {
-		if (msidesc->irq) {
-			for (i = 0; i < msidesc->nvec_used; i++)
-				xen_destroy_irq(msidesc->irq + i);
-		}
+	msi_for_each_desc(msidesc, &dev->dev, MSI_DESC_ASSOCIATED) {
+		for (i = 0; i < msidesc->nvec_used; i++)
+			xen_destroy_irq(msidesc->irq + i);
 	}
 }
 



* [patch 13/32] xen/pcifront: Rework MSI handling
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (11 preceding siblings ...)
  2021-11-27  1:23 ` [patch 12/32] x86/pci/xen: Use msi_for_each_desc() Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:23 ` [patch 14/32] s390/pci: Rework MSI descriptor walk Thomas Gleixner
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Replace the about-to-vanish iterators.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/xen-pcifront.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -262,7 +262,7 @@ static int pci_frontend_enable_msix(stru
 	}
 
 	i = 0;
-	for_each_pci_msi_entry(entry, dev) {
+	msi_for_each_desc(entry, &dev->dev, MSI_DESC_NOTASSOCIATED) {
 		op.msix_entries[i].entry = entry->msi_index;
 		/* Vector is useless at this point. */
 		op.msix_entries[i].vector = -1;



* [patch 14/32] s390/pci: Rework MSI descriptor walk
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (12 preceding siblings ...)
  2021-11-27  1:23 ` [patch 13/32] xen/pcifront: Rework MSI handling Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-29 10:31   ` Niklas Schnelle
  2021-11-27  1:23 ` [patch 15/32] powerpc/4xx/hsta: Rework MSI handling Thomas Gleixner
                   ` (17 subsequent siblings)
  31 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-s390@vger.kernel.org
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
---
 arch/s390/pci/pci_irq.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/arch/s390/pci/pci_irq.c
+++ b/arch/s390/pci/pci_irq.c
@@ -303,7 +303,7 @@ int arch_setup_msi_irqs(struct pci_dev *
 
 	/* Request MSI interrupts */
 	hwirq = bit;
-	for_each_pci_msi_entry(msi, pdev) {
+	msi_for_each_desc(msi, &pdev->dev, MSI_DESC_NOTASSOCIATED) {
 		rc = -EIO;
 		if (hwirq - bit >= msi_vecs)
 			break;
@@ -362,9 +362,7 @@ void arch_teardown_msi_irqs(struct pci_d
 		return;
 
 	/* Release MSI interrupts */
-	for_each_pci_msi_entry(msi, pdev) {
-		if (!msi->irq)
-			continue;
+	msi_for_each_desc(msi, &pdev->dev, MSI_DESC_ASSOCIATED) {
 		irq_set_msi_desc(msi->irq, NULL);
 		irq_free_desc(msi->irq);
 		msi->msg.address_lo = 0;



* [patch 15/32] powerpc/4xx/hsta: Rework MSI handling
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (13 preceding siblings ...)
  2021-11-27  1:23 ` [patch 14/32] s390/pci: Rework MSI descriptor walk Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:23 ` [patch 16/32] powerpc/cell/axon_msi: Convert to msi_for_each_desc() Thomas Gleixner
                   ` (16 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/powerpc/platforms/4xx/hsta_msi.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

--- a/arch/powerpc/platforms/4xx/hsta_msi.c
+++ b/arch/powerpc/platforms/4xx/hsta_msi.c
@@ -47,7 +47,7 @@ static int hsta_setup_msi_irqs(struct pc
 		return -EINVAL;
 	}
 
-	for_each_pci_msi_entry(entry, dev) {
+	msi_for_each_desc(entry, &dev->dev, MSI_DESC_NOTASSOCIATED) {
 		irq = msi_bitmap_alloc_hwirqs(&ppc4xx_hsta_msi.bmp, 1);
 		if (irq < 0) {
 			pr_debug("%s: Failed to allocate msi interrupt\n",
@@ -105,10 +105,7 @@ static void hsta_teardown_msi_irqs(struc
 	struct msi_desc *entry;
 	int irq;
 
-	for_each_pci_msi_entry(entry, dev) {
-		if (!entry->irq)
-			continue;
-
+	msi_for_each_desc(entry, &dev->dev, MSI_DESC_ASSOCIATED) {
 		irq = hsta_find_hwirq_offset(entry->irq);
 
 		/* entry->irq should always be in irq_map */



* [patch 16/32] powerpc/cell/axon_msi: Convert to msi_for_each_desc()
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (14 preceding siblings ...)
  2021-11-27  1:23 ` [patch 15/32] powerpc/4xx/hsta: Rework MSI handling Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:23 ` [patch 17/32] powerpc/pasemi/msi: Convert to msi_for_each_desc() Thomas Gleixner
                   ` (15 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/powerpc/platforms/cell/axon_msi.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

--- a/arch/powerpc/platforms/cell/axon_msi.c
+++ b/arch/powerpc/platforms/cell/axon_msi.c
@@ -265,7 +265,7 @@ static int axon_msi_setup_msi_irqs(struc
 	if (rc)
 		return rc;
 
-	for_each_pci_msi_entry(entry, dev) {
+	msi_for_each_desc(entry, &dev->dev, MSI_DESC_NOTASSOCIATED) {
 		virq = irq_create_direct_mapping(msic->irq_domain);
 		if (!virq) {
 			dev_warn(&dev->dev,
@@ -288,10 +288,7 @@ static void axon_msi_teardown_msi_irqs(s
 
 	dev_dbg(&dev->dev, "axon_msi: tearing down msi irqs\n");
 
-	for_each_pci_msi_entry(entry, dev) {
-		if (!entry->irq)
-			continue;
-
+	msi_for_each_desc(entry, &dev->dev, MSI_DESC_ASSOCIATED) {
 		irq_set_msi_desc(entry->irq, NULL);
 		irq_dispose_mapping(entry->irq);
 	}



* [patch 17/32] powerpc/pasemi/msi: Convert to msi_for_each_desc()
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (15 preceding siblings ...)
  2021-11-27  1:23 ` [patch 16/32] powerpc/cell/axon_msi: Convert to msi_for_each_desc() Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:23 ` [patch 18/32] powerpc/fsl_msi: Use msi_for_each_desc() Thomas Gleixner
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/powerpc/platforms/pasemi/msi.c |    9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

--- a/arch/powerpc/platforms/pasemi/msi.c
+++ b/arch/powerpc/platforms/pasemi/msi.c
@@ -62,17 +62,12 @@ static void pasemi_msi_teardown_msi_irqs
 
 	pr_debug("pasemi_msi_teardown_msi_irqs, pdev %p\n", pdev);
 
-	for_each_pci_msi_entry(entry, pdev) {
-		if (!entry->irq)
-			continue;
-
+	msi_for_each_desc(entry, &pdev->dev, MSI_DESC_ASSOCIATED) {
 		hwirq = virq_to_hw(entry->irq);
 		irq_set_msi_desc(entry->irq, NULL);
 		irq_dispose_mapping(entry->irq);
 		msi_bitmap_free_hwirqs(&msi_mpic->msi_bitmap, hwirq, ALLOC_CHUNK);
 	}
-
-	return;
 }
 
 static int pasemi_msi_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
@@ -90,7 +85,7 @@ static int pasemi_msi_setup_msi_irqs(str
 	msg.address_hi = 0;
 	msg.address_lo = PASEMI_MSI_ADDR;
 
-	for_each_pci_msi_entry(entry, pdev) {
+	msi_for_each_desc(entry, &pdev->dev, MSI_DESC_NOTASSOCIATED) {
 		/* Allocate 16 interrupts for now, since that's the grouping for
 		 * affinity. This can be changed later if it turns out 32 is too
 		 * few MSIs for someone, but restrictions will apply to how the



* [patch 18/32] powerpc/fsl_msi: Use msi_for_each_desc()
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (16 preceding siblings ...)
  2021-11-27  1:23 ` [patch 17/32] powerpc/pasemi/msi: Convert to msi_for_each_desc() Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:23 ` [patch 19/32] powerpc/mpic_u3msi: Use msi_for_each_desc() Thomas Gleixner
                   ` (13 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/powerpc/sysdev/fsl_msi.c |    8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

--- a/arch/powerpc/sysdev/fsl_msi.c
+++ b/arch/powerpc/sysdev/fsl_msi.c
@@ -125,17 +125,13 @@ static void fsl_teardown_msi_irqs(struct
 	struct fsl_msi *msi_data;
 	irq_hw_number_t hwirq;
 
-	for_each_pci_msi_entry(entry, pdev) {
-		if (!entry->irq)
-			continue;
+	msi_for_each_desc(entry, &pdev->dev, MSI_DESC_ASSOCIATED) {
 		hwirq = virq_to_hw(entry->irq);
 		msi_data = irq_get_chip_data(entry->irq);
 		irq_set_msi_desc(entry->irq, NULL);
 		irq_dispose_mapping(entry->irq);
 		msi_bitmap_free_hwirqs(&msi_data->bitmap, hwirq, 1);
 	}
-
-	return;
 }
 
 static void fsl_compose_msi_msg(struct pci_dev *pdev, int hwirq,
@@ -215,7 +211,7 @@ static int fsl_setup_msi_irqs(struct pci
 		}
 	}
 
-	for_each_pci_msi_entry(entry, pdev) {
+	msi_for_each_desc(entry, &pdev->dev, MSI_DESC_NOTASSOCIATED) {
 		/*
 		 * Loop over all the MSI devices until we find one that has an
 		 * available interrupt.



* [patch 19/32] powerpc/mpic_u3msi: Use msi_for_each_desc()
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (17 preceding siblings ...)
  2021-11-27  1:23 ` [patch 18/32] powerpc/fsl_msi: Use msi_for_each_desc() Thomas Gleixner
@ 2021-11-27  1:23 ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 20/32] PCI: hv: Rework MSI handling Thomas Gleixner
                   ` (12 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:23 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Replace the about-to-vanish iterators and make use of the filtering.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/powerpc/sysdev/mpic_u3msi.c |    9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

--- a/arch/powerpc/sysdev/mpic_u3msi.c
+++ b/arch/powerpc/sysdev/mpic_u3msi.c
@@ -104,17 +104,12 @@ static void u3msi_teardown_msi_irqs(stru
 	struct msi_desc *entry;
 	irq_hw_number_t hwirq;
 
-	for_each_pci_msi_entry(entry, pdev) {
-		if (!entry->irq)
-			continue;
-
+	msi_for_each_desc(entry, &pdev->dev, MSI_DESC_ASSOCIATED) {
 		hwirq = virq_to_hw(entry->irq);
 		irq_set_msi_desc(entry->irq, NULL);
 		irq_dispose_mapping(entry->irq);
 		msi_bitmap_free_hwirqs(&msi_mpic->msi_bitmap, hwirq, 1);
 	}
-
-	return;
 }
 
 static int u3msi_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
@@ -136,7 +131,7 @@ static int u3msi_setup_msi_irqs(struct p
 		return -ENXIO;
 	}
 
-	for_each_pci_msi_entry(entry, pdev) {
+	msi_for_each_desc(entry, &pdev->dev, MSI_DESC_NOTASSOCIATED) {
 		hwirq = msi_bitmap_alloc_hwirqs(&msi_mpic->msi_bitmap, 1);
 		if (hwirq < 0) {
 			pr_debug("u3msi: failed allocating hwirq\n");



* [patch 20/32] PCI: hv: Rework MSI handling
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (18 preceding siblings ...)
  2021-11-27  1:23 ` [patch 19/32] powerpc/mpic_u3msi: Use msi_for_each_desc() Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 21/32] NTB/msi: Convert to msi_for_each_desc() Thomas Gleixner
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Replace the about-to-vanish iterators and make use of the filtering. Take
the descriptor lock around the iteration.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/controller/pci-hyperv.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3445,18 +3445,23 @@ static int hv_pci_suspend(struct hv_devi
 
 static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
 {
-	struct msi_desc *entry;
 	struct irq_data *irq_data;
+	struct msi_desc *entry;
+	int ret = 0;
 
-	for_each_pci_msi_entry(entry, pdev) {
+	msi_lock_descs(&pdev->dev);
+	msi_for_each_desc(entry, &pdev->dev, MSI_DESC_ASSOCIATED) {
 		irq_data = irq_get_irq_data(entry->irq);
-		if (WARN_ON_ONCE(!irq_data))
-			return -EINVAL;
+		if (WARN_ON_ONCE(!irq_data)) {
+			ret = -EINVAL;
+			break;
+		}
 
 		hv_compose_msi_msg(irq_data, &entry->msg);
 	}
+	msi_unlock_descs(&pdev->dev);
 
-	return 0;
+	return ret;
 }
 
 /*



* [patch 21/32] NTB/msi: Convert to msi_for_each_desc()
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (19 preceding siblings ...)
  2021-11-27  1:24 ` [patch 20/32] PCI: hv: Rework MSI handling Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-29 18:21   ` Logan Gunthorpe
  2021-11-27  1:24 ` [patch 22/32] soc: ti: ti_sci_inta_msi: Rework MSI descriptor allocation Thomas Gleixner
                   ` (10 subsequent siblings)
  31 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger

Replace the about-to-vanish iterators, make use of the filtering, and take
the descriptor lock around the iteration.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
Cc: linux-ntb@googlegroups.com
---
 drivers/ntb/msi.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

--- a/drivers/ntb/msi.c
+++ b/drivers/ntb/msi.c
@@ -108,8 +108,10 @@ int ntb_msi_setup_mws(struct ntb_dev *nt
 	if (!ntb->msi)
 		return -EINVAL;
 
-	desc = first_msi_entry(&ntb->pdev->dev);
+	msi_lock_descs(&ntb->pdev->dev);
+	desc = msi_first_desc(&ntb->pdev->dev);
 	addr = desc->msg.address_lo + ((uint64_t)desc->msg.address_hi << 32);
+	msi_unlock_descs(&ntb->pdev->dev);
 
 	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
 		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
@@ -281,13 +283,15 @@ int ntbm_msi_request_threaded_irq(struct
 				  const char *name, void *dev_id,
 				  struct ntb_msi_desc *msi_desc)
 {
+	struct device *dev = &ntb->pdev->dev;
 	struct msi_desc *entry;
 	int ret;
 
 	if (!ntb->msi)
 		return -EINVAL;
 
-	for_each_pci_msi_entry(entry, ntb->pdev) {
+	msi_lock_descs(dev);
+	msi_for_each_desc(entry, dev, MSI_DESC_ASSOCIATED) {
 		if (irq_has_action(entry->irq))
 			continue;
 
@@ -304,14 +308,17 @@ int ntbm_msi_request_threaded_irq(struct
 		ret = ntbm_msi_setup_callback(ntb, entry, msi_desc);
 		if (ret) {
 			devm_free_irq(&ntb->dev, entry->irq, dev_id);
-			return ret;
+			goto unlock;
 		}
 
-
-		return entry->irq;
+		ret = entry->irq;
+		goto unlock;
 	}
+	ret = -ENODEV;
 
-	return -ENODEV;
+unlock:
+	msi_unlock_descs(dev);
+	return ret;
 }
 EXPORT_SYMBOL(ntbm_msi_request_threaded_irq);
 



* [patch 22/32] soc: ti: ti_sci_inta_msi: Rework MSI descriptor allocation
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (20 preceding siblings ...)
  2021-11-27  1:24 ` [patch 21/32] NTB/msi: Convert to msi_for_each_desc() Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 23/32] soc: ti: ti_sci_inta_msi: Remove ti_sci_inta_msi_domain_free_irqs() Thomas Gleixner
                   ` (9 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Protect the allocation properly and use the core allocation and free
mechanism.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/soc/ti/ti_sci_inta_msi.c |   71 +++++++++++++--------------------------
 1 file changed, 25 insertions(+), 46 deletions(-)

--- a/drivers/soc/ti/ti_sci_inta_msi.c
+++ b/drivers/soc/ti/ti_sci_inta_msi.c
@@ -51,6 +51,7 @@ struct irq_domain *ti_sci_inta_msi_creat
 	struct irq_domain *domain;
 
 	ti_sci_inta_msi_update_chip_ops(info);
+	info->flags |= MSI_FLAG_FREE_MSI_DESCS;
 
 	domain = msi_create_irq_domain(fwnode, info, parent);
 	if (domain)
@@ -60,50 +61,31 @@ struct irq_domain *ti_sci_inta_msi_creat
 }
 EXPORT_SYMBOL_GPL(ti_sci_inta_msi_create_irq_domain);
 
-static void ti_sci_inta_msi_free_descs(struct device *dev)
-{
-	struct msi_desc *desc, *tmp;
-
-	list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
-		list_del(&desc->list);
-		free_msi_entry(desc);
-	}
-}
-
 static int ti_sci_inta_msi_alloc_descs(struct device *dev,
 				       struct ti_sci_resource *res)
 {
-	struct msi_desc *msi_desc;
+	struct msi_desc msi_desc;
 	int set, i, count = 0;
 
+	memset(&msi_desc, 0, sizeof(msi_desc));
+
 	for (set = 0; set < res->sets; set++) {
-		for (i = 0; i < res->desc[set].num; i++) {
-			msi_desc = alloc_msi_entry(dev, 1, NULL);
-			if (!msi_desc) {
-				ti_sci_inta_msi_free_descs(dev);
-				return -ENOMEM;
-			}
-
-			msi_desc->msi_index = res->desc[set].start + i;
-			INIT_LIST_HEAD(&msi_desc->list);
-			list_add_tail(&msi_desc->list, dev_to_msi_list(dev));
-			count++;
+		for (i = 0; i < res->desc[set].num; i++, count++) {
+			msi_desc.msi_index = res->desc[set].start + i;
+			if (msi_add_msi_desc(dev, &msi_desc))
+				goto fail;
 		}
-		for (i = 0; i < res->desc[set].num_sec; i++) {
-			msi_desc = alloc_msi_entry(dev, 1, NULL);
-			if (!msi_desc) {
-				ti_sci_inta_msi_free_descs(dev);
-				return -ENOMEM;
-			}
-
-			msi_desc->msi_index = res->desc[set].start_sec + i;
-			INIT_LIST_HEAD(&msi_desc->list);
-			list_add_tail(&msi_desc->list, dev_to_msi_list(dev));
-			count++;
+
+		for (i = 0; i < res->desc[set].num_sec; i++, count++) {
+			msi_desc.msi_index = res->desc[set].start_sec + i;
+			if (msi_add_msi_desc(dev, &msi_desc))
+				goto fail;
 		}
 	}
-
 	return count;
+fail:
+	msi_free_msi_descs(dev);
+	return -ENOMEM;
 }
 
 int ti_sci_inta_msi_domain_alloc_irqs(struct device *dev,
@@ -124,20 +106,18 @@ int ti_sci_inta_msi_domain_alloc_irqs(st
 	if (ret)
 		return ret;
 
+	msi_lock_descs(dev);
 	nvec = ti_sci_inta_msi_alloc_descs(dev, res);
-	if (nvec <= 0)
-		return nvec;
-
-	ret = msi_domain_alloc_irqs(msi_domain, dev, nvec);
-	if (ret) {
-		dev_err(dev, "Failed to allocate IRQs %d\n", ret);
-		goto cleanup;
+	if (nvec <= 0) {
+		ret = nvec;
+		goto unlock;
 	}
 
-	return 0;
-
-cleanup:
-	ti_sci_inta_msi_free_descs(&pdev->dev);
+	ret = msi_domain_alloc_irqs_descs_locked(msi_domain, dev, nvec);
+	if (ret)
+		dev_err(dev, "Failed to allocate IRQs %d\n", ret);
+unlock:
+	msi_unlock_descs(dev);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain_alloc_irqs);
@@ -145,6 +125,5 @@ EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain
 void ti_sci_inta_msi_domain_free_irqs(struct device *dev)
 {
 	msi_domain_free_irqs(dev->msi.domain, dev);
-	ti_sci_inta_msi_free_descs(dev);
 }
 EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain_free_irqs);



* [patch 23/32] soc: ti: ti_sci_inta_msi: Remove ti_sci_inta_msi_domain_free_irqs()
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (21 preceding siblings ...)
  2021-11-27  1:24 ` [patch 22/32] soc: ti: ti_sci_inta_msi: Rework MSI descriptor allocation Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 24/32] bus: fsl-mc-msi: Simplify MSI descriptor handling Thomas Gleixner
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

The function has no users and is pointless now that the core frees the MSI
descriptors, which means potential users can just use msi_domain_free_irqs().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/soc/ti/ti_sci_inta_msi.c       |    6 ------
 include/linux/soc/ti/ti_sci_inta_msi.h |    1 -
 2 files changed, 7 deletions(-)

--- a/drivers/soc/ti/ti_sci_inta_msi.c
+++ b/drivers/soc/ti/ti_sci_inta_msi.c
@@ -121,9 +121,3 @@ int ti_sci_inta_msi_domain_alloc_irqs(st
 	return ret;
 }
 EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain_alloc_irqs);
-
-void ti_sci_inta_msi_domain_free_irqs(struct device *dev)
-{
-	msi_domain_free_irqs(dev->msi.domain, dev);
-}
-EXPORT_SYMBOL_GPL(ti_sci_inta_msi_domain_free_irqs);
--- a/include/linux/soc/ti/ti_sci_inta_msi.h
+++ b/include/linux/soc/ti/ti_sci_inta_msi.h
@@ -18,5 +18,4 @@ struct irq_domain
 				   struct irq_domain *parent);
 int ti_sci_inta_msi_domain_alloc_irqs(struct device *dev,
 				      struct ti_sci_resource *res);
-void ti_sci_inta_msi_domain_free_irqs(struct device *dev);
 #endif /* __INCLUDE_LINUX_IRQCHIP_TI_SCI_INTA_H */



* [patch 24/32] bus: fsl-mc-msi: Simplify MSI descriptor handling
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (22 preceding siblings ...)
  2021-11-27  1:24 ` [patch 23/32] soc: ti: ti_sci_inta_msi: Remove ti_sci_inta_msi_domain_free_irqs() Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 25/32] platform-msi: Let core code handle MSI descriptors Thomas Gleixner
                   ` (7 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Let the MSI irq domain code handle descriptor allocation and freeing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/bus/fsl-mc/fsl-mc-msi.c |   61 ++--------------------------------------
 1 file changed, 4 insertions(+), 57 deletions(-)

--- a/drivers/bus/fsl-mc/fsl-mc-msi.c
+++ b/drivers/bus/fsl-mc/fsl-mc-msi.c
@@ -170,6 +170,7 @@ struct irq_domain *fsl_mc_msi_create_irq
 		fsl_mc_msi_update_dom_ops(info);
 	if (info->flags & MSI_FLAG_USE_DEF_CHIP_OPS)
 		fsl_mc_msi_update_chip_ops(info);
+	info->flags |= MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS | MSI_FLAG_FREE_MSI_DESCS;
 
 	domain = msi_create_irq_domain(fwnode, info, parent);
 	if (domain)
@@ -210,45 +211,7 @@ struct irq_domain *fsl_mc_find_msi_domai
 	return msi_domain;
 }
 
-static void fsl_mc_msi_free_descs(struct device *dev)
-{
-	struct msi_desc *desc, *tmp;
-
-	list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
-		list_del(&desc->list);
-		free_msi_entry(desc);
-	}
-}
-
-static int fsl_mc_msi_alloc_descs(struct device *dev, unsigned int irq_count)
-
-{
-	unsigned int i;
-	int error;
-	struct msi_desc *msi_desc;
-
-	for (i = 0; i < irq_count; i++) {
-		msi_desc = alloc_msi_entry(dev, 1, NULL);
-		if (!msi_desc) {
-			dev_err(dev, "Failed to allocate msi entry\n");
-			error = -ENOMEM;
-			goto cleanup_msi_descs;
-		}
-
-		msi_desc->msi_index = i;
-		INIT_LIST_HEAD(&msi_desc->list);
-		list_add_tail(&msi_desc->list, dev_to_msi_list(dev));
-	}
-
-	return 0;
-
-cleanup_msi_descs:
-	fsl_mc_msi_free_descs(dev);
-	return error;
-}
-
-int fsl_mc_msi_domain_alloc_irqs(struct device *dev,
-				 unsigned int irq_count)
+int fsl_mc_msi_domain_alloc_irqs(struct device *dev,  unsigned int irq_count)
 {
 	struct irq_domain *msi_domain;
 	int error;
@@ -261,28 +224,17 @@ int fsl_mc_msi_domain_alloc_irqs(struct
 	if (error)
 		return error;
 
-	if (!list_empty(dev_to_msi_list(dev)))
+	if (msi_device_num_descs(dev))
 		return -EINVAL;
 
-	error = fsl_mc_msi_alloc_descs(dev, irq_count);
-	if (error < 0)
-		return error;
-
 	/*
 	 * NOTE: Calling this function will trigger the invocation of the
 	 * its_fsl_mc_msi_prepare() callback
 	 */
 	error = msi_domain_alloc_irqs(msi_domain, dev, irq_count);
 
-	if (error) {
+	if (error)
 		dev_err(dev, "Failed to allocate IRQs\n");
-		goto cleanup_msi_descs;
-	}
-
-	return 0;
-
-cleanup_msi_descs:
-	fsl_mc_msi_free_descs(dev);
 	return error;
 }
 
@@ -295,9 +247,4 @@ void fsl_mc_msi_domain_free_irqs(struct
 		return;
 
 	msi_domain_free_irqs(msi_domain, dev);
-
-	if (list_empty(dev_to_msi_list(dev)))
-		return;
-
-	fsl_mc_msi_free_descs(dev);
 }



* [patch 25/32] platform-msi: Let core code handle MSI descriptors
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (23 preceding siblings ...)
  2021-11-27  1:24 ` [patch 24/32] bus: fsl-mc-msi: Simplify MSI descriptor handling Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 26/32] platform-msi: Simplify platform device MSI code Thomas Gleixner
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Use the core functionality for platform MSI interrupt domains. The platform
device MSI interrupt domains will be converted in a later step.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/base/platform-msi.c |  112 ++++++++++++++++++--------------------------
 1 file changed, 48 insertions(+), 64 deletions(-)

--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -107,57 +107,6 @@ static void platform_msi_update_chip_ops
 		info->flags &= ~MSI_FLAG_LEVEL_CAPABLE;
 }
 
-static void platform_msi_free_descs(struct device *dev, int base, int nvec)
-{
-	struct msi_desc *desc, *tmp;
-
-	list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
-		if (desc->msi_index >= base &&
-		    desc->msi_index < (base + nvec)) {
-			list_del(&desc->list);
-			free_msi_entry(desc);
-		}
-	}
-}
-
-static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
-					     int nvec)
-{
-	struct msi_desc *desc;
-	int i, base = 0;
-
-	if (!list_empty(dev_to_msi_list(dev))) {
-		desc = list_last_entry(dev_to_msi_list(dev),
-				       struct msi_desc, list);
-		base = desc->msi_index + 1;
-	}
-
-	for (i = 0; i < nvec; i++) {
-		desc = alloc_msi_entry(dev, 1, NULL);
-		if (!desc)
-			break;
-
-		desc->msi_index = base + i;
-		desc->irq = virq ? virq + i : 0;
-
-		list_add_tail(&desc->list, dev_to_msi_list(dev));
-	}
-
-	if (i != nvec) {
-		/* Clean up the mess */
-		platform_msi_free_descs(dev, base, nvec);
-
-		return -ENOMEM;
-	}
-
-	return 0;
-}
-
-static int platform_msi_alloc_descs(struct device *dev, int nvec)
-{
-	return platform_msi_alloc_descs_with_irq(dev, 0, nvec);
-}
-
 /**
  * platform_msi_create_irq_domain - Create a platform MSI interrupt domain
  * @fwnode:		Optional fwnode of the interrupt controller
@@ -180,7 +129,8 @@ struct irq_domain *platform_msi_create_i
 		platform_msi_update_dom_ops(info);
 	if (info->flags & MSI_FLAG_USE_DEF_CHIP_OPS)
 		platform_msi_update_chip_ops(info);
-	info->flags |= MSI_FLAG_DEV_SYSFS;
+	info->flags |= MSI_FLAG_DEV_SYSFS | MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS |
+		       MSI_FLAG_FREE_MSI_DESCS;
 
 	domain = msi_create_irq_domain(fwnode, info, parent);
 	if (domain)
@@ -262,20 +212,10 @@ int platform_msi_domain_alloc_irqs(struc
 	if (err)
 		return err;
 
-	err = platform_msi_alloc_descs(dev, nvec);
-	if (err)
-		goto out_free_priv_data;
-
 	err = msi_domain_alloc_irqs(dev->msi.domain, dev, nvec);
 	if (err)
-		goto out_free_desc;
-
-	return 0;
+		platform_msi_free_priv_data(dev);
 
-out_free_desc:
-	platform_msi_free_descs(dev, 0, nvec);
-out_free_priv_data:
-	platform_msi_free_priv_data(dev);
 	return err;
 }
 EXPORT_SYMBOL_GPL(platform_msi_domain_alloc_irqs);
@@ -287,7 +227,6 @@ EXPORT_SYMBOL_GPL(platform_msi_domain_al
 void platform_msi_domain_free_irqs(struct device *dev)
 {
 	msi_domain_free_irqs(dev->msi.domain, dev);
-	platform_msi_free_descs(dev, 0, MAX_DEV_MSIS);
 	platform_msi_free_priv_data(dev);
 }
 EXPORT_SYMBOL_GPL(platform_msi_domain_free_irqs);
@@ -361,6 +300,51 @@ struct irq_domain *
 	return NULL;
 }
 
+static void platform_msi_free_descs(struct device *dev, int base, int nvec)
+{
+	struct msi_desc *desc, *tmp;
+
+	list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
+		if (desc->msi_index >= base &&
+		    desc->msi_index < (base + nvec)) {
+			list_del(&desc->list);
+			free_msi_entry(desc);
+		}
+	}
+}
+
+static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
+					     int nvec)
+{
+	struct msi_desc *desc;
+	int i, base = 0;
+
+	if (!list_empty(dev_to_msi_list(dev))) {
+		desc = list_last_entry(dev_to_msi_list(dev),
+				       struct msi_desc, list);
+		base = desc->msi_index + 1;
+	}
+
+	for (i = 0; i < nvec; i++) {
+		desc = alloc_msi_entry(dev, 1, NULL);
+		if (!desc)
+			break;
+
+		desc->msi_index = base + i;
+		desc->irq = virq + i;
+
+		list_add_tail(&desc->list, dev_to_msi_list(dev));
+	}
+
+	if (i != nvec) {
+		/* Clean up the mess */
+		platform_msi_free_descs(dev, base, nvec);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
 /**
  * platform_msi_device_domain_free - Free interrupts associated with a platform-msi
  *				     device domain


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [patch 26/32] platform-msi: Simplify platform device MSI code
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (24 preceding siblings ...)
  2021-11-27  1:24 ` [patch 25/32] platform-msi: Let core code handle MSI descriptors Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 27/32] genirq/msi: Make interrupt allocation less convoluted Thomas Gleixner
                   ` (5 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

The allocation code is overly complex. It tries to keep the MSI index space
densely packed, which cannot be maintained once an interrupt is freed. There
is no requirement for this. The only requirement is that the MSI index is
unique.

Move the MSI descriptor allocation into msi_domain_populate_irqs() and use
the Linux interrupt number as the MSI index, which fulfils the uniqueness
requirement.

This requires locking the MSI descriptors, which reverses the lock order of
the descriptor mutex vs. the domain mutex compared to the regular MSI
alloc/free functions. Assign a separate lockdep class for these MSI device
domains.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/base/platform-msi.c |   88 +++++++++-----------------------------------
 kernel/irq/msi.c            |   46 +++++++++++------------
 2 files changed, 40 insertions(+), 94 deletions(-)

--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -246,6 +246,8 @@ void *platform_msi_get_host_data(struct
 	return data->host_data;
 }
 
+static struct lock_class_key platform_device_msi_lock_class;
+
 /**
  * __platform_msi_create_device_domain - Create a platform-msi device domain
  *
@@ -278,6 +280,13 @@ struct irq_domain *
 	if (err)
 		return NULL;
 
+	/*
+	 * Use a separate lock class for the MSI descriptor mutex on
+	 * platform MSI device domains because the descriptor mutex nests
+	 * into the domain mutex. See alloc/free below.
+	 */
+	lockdep_set_class(&dev->msi.data->mutex, &platform_device_msi_lock_class);
+
 	data = dev->msi.data->platform_data;
 	data->host_data = host_data;
 	domain = irq_domain_create_hierarchy(dev->msi.domain, 0,
@@ -300,75 +309,23 @@ struct irq_domain *
 	return NULL;
 }
 
-static void platform_msi_free_descs(struct device *dev, int base, int nvec)
-{
-	struct msi_desc *desc, *tmp;
-
-	list_for_each_entry_safe(desc, tmp, dev_to_msi_list(dev), list) {
-		if (desc->msi_index >= base &&
-		    desc->msi_index < (base + nvec)) {
-			list_del(&desc->list);
-			free_msi_entry(desc);
-		}
-	}
-}
-
-static int platform_msi_alloc_descs_with_irq(struct device *dev, int virq,
-					     int nvec)
-{
-	struct msi_desc *desc;
-	int i, base = 0;
-
-	if (!list_empty(dev_to_msi_list(dev))) {
-		desc = list_last_entry(dev_to_msi_list(dev),
-				       struct msi_desc, list);
-		base = desc->msi_index + 1;
-	}
-
-	for (i = 0; i < nvec; i++) {
-		desc = alloc_msi_entry(dev, 1, NULL);
-		if (!desc)
-			break;
-
-		desc->msi_index = base + i;
-		desc->irq = virq + i;
-
-		list_add_tail(&desc->list, dev_to_msi_list(dev));
-	}
-
-	if (i != nvec) {
-		/* Clean up the mess */
-		platform_msi_free_descs(dev, base, nvec);
-		return -ENOMEM;
-	}
-
-	return 0;
-}
-
 /**
  * platform_msi_device_domain_free - Free interrupts associated with a platform-msi
  *				     device domain
  *
  * @domain:	The platform-msi device domain
  * @virq:	The base irq from which to perform the free operation
- * @nvec:	How many interrupts to free from @virq
+ * @nr_irqs:	How many interrupts to free from @virq
  */
 void platform_msi_device_domain_free(struct irq_domain *domain, unsigned int virq,
-				     unsigned int nvec)
+				     unsigned int nr_irqs)
 {
 	struct platform_msi_priv_data *data = domain->host_data;
-	struct msi_desc *desc, *tmp;
 
-	for_each_msi_entry_safe(desc, tmp, data->dev) {
-		if (WARN_ON(!desc->irq || desc->nvec_used != 1))
-			return;
-		if (!(desc->irq >= virq && desc->irq < (virq + nvec)))
-			continue;
-
-		irq_domain_free_irqs_common(domain, desc->irq, 1);
-		list_del(&desc->list);
-		free_msi_entry(desc);
-	}
+	msi_lock_descs(data->dev);
+	irq_domain_free_irqs_common(domain, virq, nr_irqs);
+	msi_free_msi_descs_range(data->dev, MSI_DESC_ALL, virq, nr_irqs);
+	msi_unlock_descs(data->dev);
 }
 
 /**
@@ -377,7 +334,7 @@ void platform_msi_device_domain_free(str
  *
  * @domain:	The platform-msi device domain
  * @virq:	The base irq from which to perform the allocate operation
- * @nr_irqs:	How many interrupts to free from @virq
+ * @nr_irqs:	How many interrupts to allocate from @virq
  *
  * Return 0 on success, or an error code on failure. Must be called
  * with irq_domain_mutex held (which can only be done as part of a
@@ -387,16 +344,7 @@ int platform_msi_device_domain_alloc(str
 				     unsigned int nr_irqs)
 {
 	struct platform_msi_priv_data *data = domain->host_data;
-	int err;
-
-	err = platform_msi_alloc_descs_with_irq(data->dev, virq, nr_irqs);
-	if (err)
-		return err;
-
-	err = msi_domain_populate_irqs(domain->parent, data->dev,
-				       virq, nr_irqs, &data->arg);
-	if (err)
-		platform_msi_device_domain_free(domain, virq, nr_irqs);
+	struct device *dev = data->dev;
 
-	return err;
+	return msi_domain_populate_irqs(domain->parent, dev, virq, nr_irqs, &data->arg);
 }
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -775,43 +775,41 @@ int msi_domain_prepare_irqs(struct irq_d
 }
 
 int msi_domain_populate_irqs(struct irq_domain *domain, struct device *dev,
-			     int virq, int nvec, msi_alloc_info_t *arg)
+			     int virq_base, int nvec, msi_alloc_info_t *arg)
 {
 	struct msi_domain_info *info = domain->host_data;
 	struct msi_domain_ops *ops = info->ops;
 	struct msi_desc *desc;
-	int ret = 0;
+	int ret, virq;
 
-	for_each_msi_entry(desc, dev) {
-		/* Don't even try the multi-MSI brain damage. */
-		if (WARN_ON(!desc->irq || desc->nvec_used != 1)) {
-			ret = -EINVAL;
-			break;
+	msi_lock_descs(dev);
+	for (virq = virq_base; virq < virq_base + nvec; virq++) {
+		desc = alloc_msi_entry(dev, 1, NULL);
+		if (!desc) {
+			ret = -ENOMEM;
+			goto fail;
 		}
 
-		if (!(desc->irq >= virq && desc->irq < (virq + nvec)))
-			continue;
+		desc->msi_index = virq;
+		desc->irq = virq;
+		list_add_tail(&desc->list, &dev->msi.data->list);
+		dev->msi.data->num_descs++;
 
 		ops->set_desc(arg, desc);
-		/* Assumes the domain mutex is held! */
-		ret = irq_domain_alloc_irqs_hierarchy(domain, desc->irq, 1,
-						      arg);
+		ret = irq_domain_alloc_irqs_hierarchy(domain, virq, 1, arg);
 		if (ret)
-			break;
+			goto fail;
 
-		irq_set_msi_desc_off(desc->irq, 0, desc);
-	}
-
-	if (ret) {
-		/* Mop up the damage */
-		for_each_msi_entry(desc, dev) {
-			if (!(desc->irq >= virq && desc->irq < (virq + nvec)))
-				continue;
-
-			irq_domain_free_irqs_common(domain, desc->irq, 1);
-		}
+		irq_set_msi_desc(virq, desc);
 	}
+	msi_unlock_descs(dev);
+	return 0;
 
+fail:
+	for (--virq; virq >= virq_base; virq--)
+		irq_domain_free_irqs_common(domain, virq, 1);
+	msi_free_msi_descs_range(dev, MSI_DESC_ALL, virq_base, nvec);
+	msi_unlock_descs(dev);
 	return ret;
 }
 


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [patch 27/32] genirq/msi: Make interrupt allocation less convoluted
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (25 preceding siblings ...)
  2021-11-27  1:24 ` [patch 26/32] platform-msi: Simplify platform device MSI code Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 28/32] genirq/msi: Convert to new functions Thomas Gleixner
                   ` (4 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

There is no real reason to iterate over the MSI descriptors several times
instead of doing everything in a single loop. In case of an error everything
is undone anyway, so it does not matter whether the rollback is partial or
full.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 .clang-format       |    1 
 include/linux/msi.h |    7 --
 kernel/irq/msi.c    |  129 +++++++++++++++++++++++++++-------------------------
 3 files changed, 70 insertions(+), 67 deletions(-)

--- a/.clang-format
+++ b/.clang-format
@@ -216,7 +216,6 @@ ExperimentalAutoDetectBinPacking: false
   - 'for_each_migratetype_order'
   - 'for_each_msi_entry'
   - 'for_each_msi_entry_safe'
-  - 'for_each_msi_vector'
   - 'for_each_net'
   - 'for_each_net_continue_reverse'
   - 'for_each_netdev'
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -263,12 +263,7 @@ static inline struct msi_desc *msi_first
 	list_for_each_entry((desc), dev_to_msi_list((dev)), list)
 #define for_each_msi_entry_safe(desc, tmp, dev)	\
 	list_for_each_entry_safe((desc), (tmp), dev_to_msi_list((dev)), list)
-#define for_each_msi_vector(desc, __irq, dev)				\
-	for_each_msi_entry((desc), (dev))				\
-		if ((desc)->irq)					\
-			for (__irq = (desc)->irq;			\
-			     __irq < ((desc)->irq + (desc)->nvec_used);	\
-			     __irq++)
+
 #ifdef CONFIG_IRQ_MSI_IOMMU
 static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
 {
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -873,23 +873,74 @@ static int msi_handle_pci_fail(struct ir
 	return allocated ? allocated : -ENOSPC;
 }
 
+#define VIRQ_CAN_RESERVE	0x01
+#define VIRQ_ACTIVATE		0x02
+#define VIRQ_NOMASK_QUIRK	0x04
+
+static int msi_init_virq(struct irq_domain *domain, int virq, unsigned int vflags)
+{
+	struct irq_data *irqd = irq_domain_get_irq_data(domain, virq);
+	int ret;
+
+	if (!(vflags & VIRQ_CAN_RESERVE)) {
+		irqd_clr_can_reserve(irqd);
+		if (vflags & VIRQ_NOMASK_QUIRK)
+			irqd_set_msi_nomask_quirk(irqd);
+	}
+
+	if (!(vflags & VIRQ_ACTIVATE))
+		return 0;
+
+	ret = irq_domain_activate_irq(irqd, vflags & VIRQ_CAN_RESERVE);
+	if (ret)
+		return ret;
+	/*
+	 * If the interrupt uses reservation mode, clear the activated bit
+	 * so request_irq() will assign the final vector.
+	 */
+	if (vflags & VIRQ_CAN_RESERVE)
+		irqd_clr_activated(irqd);
+	return 0;
+}
+
 int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 			    int nvec)
 {
 	struct msi_domain_info *info = domain->host_data;
 	struct msi_domain_ops *ops = info->ops;
-	struct irq_data *irq_data;
-	struct msi_desc *desc;
 	msi_alloc_info_t arg = { };
+	unsigned int vflags = 0;
+	struct msi_desc *desc;
 	int allocated = 0;
 	int i, ret, virq;
-	bool can_reserve;
 
 	ret = msi_domain_prepare_irqs(domain, dev, nvec, &arg);
 	if (ret)
 		return ret;
 
-	for_each_msi_entry(desc, dev) {
+	/*
+	 * This flag is set by the PCI layer as we need to activate
+	 * the MSI entries before the PCI layer enables MSI in the
+	 * card. Otherwise the card latches a random msi message.
+	 */
+	if (info->flags & MSI_FLAG_ACTIVATE_EARLY)
+		vflags |= VIRQ_ACTIVATE;
+
+	/*
+	 * Interrupt can use a reserved vector and will not occupy
+	 * a real device vector until the interrupt is requested.
+	 */
+	if (msi_check_reservation_mode(domain, info, dev)) {
+		vflags |= VIRQ_CAN_RESERVE;
+		/*
+		 * MSI affinity setting requires a special quirk (X86) when
+		 * reservation mode is active.
+		 */
+		if (domain->flags & IRQ_DOMAIN_MSI_NOMASK_QUIRK)
+			vflags |= VIRQ_NOMASK_QUIRK;
+	}
+
+	msi_for_each_desc(desc, dev, MSI_DESC_NOTASSOCIATED) {
 		ops->set_desc(&arg, desc);
 
 		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
@@ -901,49 +952,12 @@ int __msi_domain_alloc_irqs(struct irq_d
 		for (i = 0; i < desc->nvec_used; i++) {
 			irq_set_msi_desc_off(virq, i, desc);
 			irq_debugfs_copy_devname(virq + i, dev);
+			ret = msi_init_virq(domain, virq + i, vflags);
+			if (ret)
+				return ret;
 		}
 		allocated++;
 	}
-
-	can_reserve = msi_check_reservation_mode(domain, info, dev);
-
-	/*
-	 * This flag is set by the PCI layer as we need to activate
-	 * the MSI entries before the PCI layer enables MSI in the
-	 * card. Otherwise the card latches a random msi message.
-	 */
-	if (!(info->flags & MSI_FLAG_ACTIVATE_EARLY))
-		goto skip_activate;
-
-	for_each_msi_vector(desc, i, dev) {
-		if (desc->irq == i) {
-			virq = desc->irq;
-			dev_dbg(dev, "irq [%d-%d] for MSI\n",
-				virq, virq + desc->nvec_used - 1);
-		}
-
-		irq_data = irq_domain_get_irq_data(domain, i);
-		if (!can_reserve) {
-			irqd_clr_can_reserve(irq_data);
-			if (domain->flags & IRQ_DOMAIN_MSI_NOMASK_QUIRK)
-				irqd_set_msi_nomask_quirk(irq_data);
-		}
-		ret = irq_domain_activate_irq(irq_data, can_reserve);
-		if (ret)
-			return ret;
-	}
-
-skip_activate:
-	/*
-	 * If these interrupts use reservation mode, clear the activated bit
-	 * so request_irq() will assign the final vector.
-	 */
-	if (can_reserve) {
-		for_each_msi_vector(desc, i, dev) {
-			irq_data = irq_domain_get_irq_data(domain, i);
-			irqd_clr_activated(irq_data);
-		}
-	}
 	return 0;
 }
 
@@ -1021,26 +1035,21 @@ int msi_domain_alloc_irqs(struct irq_dom
 
 void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
 {
-	struct irq_data *irq_data;
+	struct irq_data *irqd;
 	struct msi_desc *desc;
 	int i;
 
-	for_each_msi_vector(desc, i, dev) {
-		irq_data = irq_domain_get_irq_data(domain, i);
-		if (irqd_is_activated(irq_data))
-			irq_domain_deactivate_irq(irq_data);
-	}
-
-	for_each_msi_entry(desc, dev) {
-		/*
-		 * We might have failed to allocate an MSI early
-		 * enough that there is no IRQ associated to this
-		 * entry. If that's the case, don't do anything.
-		 */
-		if (desc->irq) {
-			irq_domain_free_irqs(desc->irq, desc->nvec_used);
-			desc->irq = 0;
+	/* Only handle MSI entries which have an interrupt associated */
+	msi_for_each_desc(desc, dev, MSI_DESC_ASSOCIATED) {
+		/* Make sure all interrupts are deactivated */
+		for (i = 0; i < desc->nvec_used; i++) {
+			irqd = irq_domain_get_irq_data(domain, desc->irq + i);
+			if (irqd && irqd_is_activated(irqd))
+				irq_domain_deactivate_irq(irqd);
 		}
+
+		irq_domain_free_irqs(desc->irq, desc->nvec_used);
+		desc->irq = 0;
 	}
 }
 


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [patch 28/32] genirq/msi: Convert to new functions
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (26 preceding siblings ...)
  2021-11-27  1:24 ` [patch 27/32] genirq/msi: Make interrupt allocation less convoluted Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 29/32] genirq/msi: Mop up old interfaces Thomas Gleixner
                   ` (3 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Use the new iterator functions and add locking where required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/irq/msi.c |   23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -354,6 +354,7 @@ struct msi_desc *msi_next_desc(struct de
 int __msi_get_virq(struct device *dev, unsigned int index)
 {
 	struct msi_desc *desc;
+	int ret = -ENOENT;
 	bool pcimsi;
 
 	if (!dev->msi.data)
@@ -361,11 +362,12 @@ int __msi_get_virq(struct device *dev, u
 
 	pcimsi = msi_device_has_property(dev, MSI_PROP_PCI_MSI);
 
-	for_each_msi_entry(desc, dev) {
+	msi_lock_descs(dev);
+	msi_for_each_desc_from(desc, dev, MSI_DESC_ASSOCIATED, index) {
 		/* PCI-MSI has only one descriptor for multiple interrupts. */
 		if (pcimsi) {
-			if (desc->irq && index < desc->nvec_used)
-				return desc->irq + index;
+			if (index < desc->nvec_used)
+				ret = desc->irq + index;
 			break;
 		}
 
@@ -373,10 +375,13 @@ int __msi_get_virq(struct device *dev, u
 		 * PCI-MSIX and platform MSI use a descriptor per
 		 * interrupt.
 		 */
-		if (desc->msi_index == index)
-			return desc->irq;
+		if (desc->msi_index == index) {
+			ret = desc->irq;
+			break;
+		}
 	}
-	return -ENOENT;
+	msi_unlock_descs(dev);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__msi_get_virq);
 
@@ -407,7 +412,7 @@ static const struct attribute_group **ms
 	int i;
 
 	/* Determine how many msi entries we have */
-	for_each_msi_entry(entry, dev)
+	msi_for_each_desc(entry, dev, MSI_DESC_ALL)
 		num_msi += entry->nvec_used;
 	if (!num_msi)
 		return NULL;
@@ -417,7 +422,7 @@ static const struct attribute_group **ms
 	if (!msi_attrs)
 		return ERR_PTR(-ENOMEM);
 
-	for_each_msi_entry(entry, dev) {
+	msi_for_each_desc(entry, dev, MSI_DESC_ALL) {
 		for (i = 0; i < entry->nvec_used; i++) {
 			msi_dev_attr = kzalloc(sizeof(*msi_dev_attr), GFP_KERNEL);
 			if (!msi_dev_attr)
@@ -838,7 +843,7 @@ static bool msi_check_reservation_mode(s
 	 * Checking the first MSI descriptor is sufficient. MSIX supports
 	 * masking and MSI does so when the can_mask attribute is set.
 	 */
-	desc = first_msi_entry(dev);
+	desc = msi_first_desc(dev);
 	return desc->pci.msi_attrib.is_msix || desc->pci.msi_attrib.can_mask;
 }
 


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [patch 29/32] genirq/msi: Mop up old interfaces
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (27 preceding siblings ...)
  2021-11-27  1:24 ` [patch 28/32] genirq/msi: Convert to new functions Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 30/32] genirq/msi: Add abuse prevention comment to msi header Thomas Gleixner
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Get rid of the old iterators, alloc/free functions and adjust the core code
accordingly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |   15 ---------------
 kernel/irq/msi.c    |   31 +++++++++++++++----------------
 2 files changed, 15 insertions(+), 31 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -254,15 +254,7 @@ static inline struct msi_desc *msi_first
 #define msi_for_each_desc(desc, dev, filter)				\
 	msi_for_each_desc_from(desc, dev, filter, 0)
 
-/* Helpers to hide struct msi_desc implementation details */
 #define msi_desc_to_dev(desc)		((desc)->dev)
-#define dev_to_msi_list(dev)		(&(dev)->msi.data->list)
-#define first_msi_entry(dev)		\
-	list_first_entry(dev_to_msi_list((dev)), struct msi_desc, list)
-#define for_each_msi_entry(desc, dev)	\
-	list_for_each_entry((desc), dev_to_msi_list((dev)), list)
-#define for_each_msi_entry_safe(desc, tmp, dev)	\
-	list_for_each_entry_safe((desc), (tmp), dev_to_msi_list((dev)), list)
 
 #ifdef CONFIG_IRQ_MSI_IOMMU
 static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
@@ -288,10 +280,6 @@ static inline void msi_desc_set_iommu_co
 #endif
 
 #ifdef CONFIG_PCI_MSI
-#define first_pci_msi_entry(pdev)	first_msi_entry(&(pdev)->dev)
-#define for_each_pci_msi_entry(desc, pdev)	\
-	for_each_msi_entry((desc), &(pdev)->dev)
-
 struct pci_dev *msi_desc_to_pci_dev(struct msi_desc *desc);
 void pci_write_msi_msg(unsigned int irq, struct msi_msg *msg);
 #else /* CONFIG_PCI_MSI */
@@ -314,9 +302,6 @@ static inline void msi_free_msi_descs(st
 	msi_free_msi_descs_range(dev, MSI_DESC_ALL, 0, UINT_MAX);
 }
 
-struct msi_desc *alloc_msi_entry(struct device *dev, int nvec,
-				 const struct irq_affinity_desc *affinity);
-void free_msi_entry(struct msi_desc *entry);
 void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
 void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
 
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -19,8 +19,10 @@
 
 #include "internals.h"
 
+#define dev_to_msi_list(dev)	(&(dev)->msi.data->list)
+
 /**
- * alloc_msi_entry - Allocate an initialized msi_desc
+ * msi_alloc_desc - Allocate an initialized msi_desc
  * @dev:	Pointer to the device for which this is allocated
  * @nvec:	The number of vectors used in this entry
  * @affinity:	Optional pointer to an affinity mask array size of @nvec
@@ -30,12 +32,11 @@
  *
  * Return: pointer to allocated &msi_desc on success or %NULL on failure
  */
-struct msi_desc *alloc_msi_entry(struct device *dev, int nvec,
-				 const struct irq_affinity_desc *affinity)
+static struct msi_desc *msi_alloc_desc(struct device *dev, int nvec,
+					const struct irq_affinity_desc *affinity)
 {
-	struct msi_desc *desc;
+	struct msi_desc *desc = kzalloc(sizeof(*desc), GFP_KERNEL);
 
-	desc = kzalloc(sizeof(*desc), GFP_KERNEL);
 	if (!desc)
 		return NULL;
 
@@ -43,21 +44,19 @@ struct msi_desc *alloc_msi_entry(struct
 	desc->dev = dev;
 	desc->nvec_used = nvec;
 	if (affinity) {
-		desc->affinity = kmemdup(affinity,
-			nvec * sizeof(*desc->affinity), GFP_KERNEL);
+		desc->affinity = kmemdup(affinity, nvec * sizeof(*desc->affinity), GFP_KERNEL);
 		if (!desc->affinity) {
 			kfree(desc);
 			return NULL;
 		}
 	}
-
 	return desc;
 }
 
-void free_msi_entry(struct msi_desc *entry)
+static void msi_free_desc(struct msi_desc *desc)
 {
-	kfree(entry->affinity);
-	kfree(entry);
+	kfree(desc->affinity);
+	kfree(desc);
 }
 
 /**
@@ -73,7 +72,7 @@ int msi_add_msi_desc(struct device *dev,
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	desc = alloc_msi_entry(dev, init_desc->nvec_used, init_desc->affinity);
+	desc = msi_alloc_desc(dev, init_desc->nvec_used, init_desc->affinity);
 	if (!desc)
 		return -ENOMEM;
 
@@ -103,7 +102,7 @@ int msi_add_simple_msi_descs(struct devi
 	lockdep_assert_held(&dev->msi.data->mutex);
 
 	for (i = 0; i < ndesc; i++) {
-		desc = alloc_msi_entry(dev, 1, NULL);
+		desc = msi_alloc_desc(dev, 1, NULL);
 		if (!desc)
 			goto fail;
 		desc->msi_index = index + i;
@@ -116,7 +115,7 @@ int msi_add_simple_msi_descs(struct devi
 fail:
 	list_for_each_entry_safe(desc, tmp, &list, list) {
 		list_del(&desc->list);
-		free_msi_entry(desc);
+		msi_free_desc(desc);
 	}
 	return -ENOMEM;
 }
@@ -143,7 +142,7 @@ void msi_free_msi_descs_range(struct dev
 		if (desc->msi_index < base_index || desc->msi_index >= base_index + ndesc)
 			continue;
 		list_del(&desc->list);
-		free_msi_entry(desc);
+		msi_free_desc(desc);
 		dev->msi.data->num_descs--;
 	}
 }
@@ -779,7 +778,7 @@ int msi_domain_populate_irqs(struct irq_
 
 	msi_lock_descs(dev);
 	for (virq = virq_base; virq < virq_base + nvec; virq++) {
-		desc = alloc_msi_entry(dev, 1, NULL);
+		desc = msi_alloc_desc(dev, 1, NULL);
 		if (!desc) {
 			ret = -ENOMEM;
 			goto fail;


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [patch 30/32] genirq/msi: Add abuse prevention comment to msi header
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (28 preceding siblings ...)
  2021-11-27  1:24 ` [patch 29/32] genirq/msi: Mop up old interfaces Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27  1:24 ` [patch 31/32] genirq/msi: Simplify sysfs handling Thomas Gleixner
  2021-11-27  1:24 ` [patch 32/32] genirq/msi: Convert storage to xarray Thomas Gleixner
  31 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -2,6 +2,20 @@
 #ifndef LINUX_MSI_H
 #define LINUX_MSI_H
 
+/*
+ * This header file contains MSI data structures and functions which are
+ * only relevant for:
+ *	- Interrupt core code
+ *	- PCI/MSI core code
+ *	- MSI interrupt domain implementations
+ *	- IOMMU, low level VFIO, NTB and other justified exceptions
+ *	  dealing with low level MSI details.
+ *
+ * Regular device drivers have no business with any of these functions and
+ * especially storing MSI descriptor pointers in random code is considered
+ * abuse. The only function which is relevant for drivers is msi_get_virq().
+ */
+
 #include <linux/spinlock.h>
 #include <linux/mutex.h>
 #include <linux/list.h>


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [patch 31/32] genirq/msi: Simplify sysfs handling
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (29 preceding siblings ...)
  2021-11-27  1:24 ` [patch 30/32] genirq/msi: Add abuse prevention comment to msi header Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27 12:32   ` Greg Kroah-Hartman
  2021-11-27  1:24 ` [patch 32/32] genirq/msi: Convert storage to xarray Thomas Gleixner
  31 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

The sysfs handling for MSI is a convoluted maze which stands in the way of
supporting dynamic expansion of the MSI-X vectors because it only supports
a one-off bulk population/free of the sysfs entries.

Change it to do:

   1) Create an empty sysfs attribute group when msi_device_data is
      allocated

   2) Populate the entries when the MSI descriptor is initialized

   3) Free the entries when an MSI descriptor is detached from a Linux
      interrupt.

   4) Provide functions for the legacy non-irqdomain fallback code to
      do a bulk population/free. This code won't support dynamic
      expansion.

This makes the code simpler and reduces the number of allocations as the
empty attribute group can be shared.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |    7 +
 kernel/irq/msi.c    |  196 +++++++++++++++++++++++-----------------------------
 2 files changed, 95 insertions(+), 108 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -72,6 +72,7 @@ struct irq_data;
 struct msi_desc;
 struct pci_dev;
 struct platform_msi_priv_data;
+struct device_attribute;
 
 void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
 #ifdef CONFIG_GENERIC_MSI_IRQ
@@ -127,6 +128,7 @@ struct pci_msi_desc {
  * @dev:	Pointer to the device which uses this descriptor
  * @msg:	The last set MSI message cached for reuse
  * @affinity:	Optional pointer to a cpu affinity mask for this descriptor
+ * @sysfs_attr:	Pointer to sysfs device attribute
  *
  * @write_msi_msg:	Callback that may be called when the MSI message
  *			address or data changes
@@ -146,6 +148,9 @@ struct msi_desc {
 #ifdef CONFIG_IRQ_MSI_IOMMU
 	const void			*iommu_cookie;
 #endif
+#ifdef CONFIG_SYSFS
+	struct device_attribute		*sysfs_attrs;
+#endif
 
 	void (*write_msi_msg)(struct msi_desc *entry, void *data);
 	void *write_msi_msg_data;
@@ -171,7 +176,6 @@ enum msi_desc_filter {
  * @lock:		Spinlock to protect register access
  * @properties:		MSI properties which are interesting to drivers
  * @num_descs:		The number of allocated MSI descriptors for the device
- * @attrs:		Pointer to the sysfs attribute group
  * @platform_data:	Platform-MSI specific data
  * @list:		List of MSI descriptors associated to the device
  * @mutex:		Mutex protecting the MSI list
@@ -182,7 +186,6 @@ struct msi_device_data {
 	raw_spinlock_t			lock;
 	unsigned long			properties;
 	unsigned int			num_descs;
-	const struct attribute_group    **attrs;
 	struct platform_msi_priv_data	*platform_data;
 	struct list_head		list;
 	struct mutex			mutex;
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -19,6 +19,7 @@
 
 #include "internals.h"
 
+static inline int msi_sysfs_create_group(struct device *dev);
 #define dev_to_msi_list(dev)	(&(dev)->msi.data->list)
 
 /**
@@ -208,6 +209,7 @@ static void msi_device_data_release(stru
 int msi_setup_device_data(struct device *dev)
 {
 	struct msi_device_data *md;
+	int ret;
 
 	if (dev->msi.data)
 		return 0;
@@ -216,6 +218,12 @@ int msi_setup_device_data(struct device
 	if (!md)
 		return -ENOMEM;
 
+	ret = msi_sysfs_create_group(dev);
+	if (ret) {
+		devres_free(md);
+		return ret;
+	}
+
 	raw_spin_lock_init(&md->lock);
 	INIT_LIST_HEAD(&md->list);
 	mutex_init(&md->mutex);
@@ -395,6 +403,20 @@ int __msi_get_virq(struct device *dev, u
 EXPORT_SYMBOL_GPL(__msi_get_virq);
 
 #ifdef CONFIG_SYSFS
+static struct attribute *msi_dev_attrs[] = {
+	NULL
+};
+
+static const struct attribute_group msi_irqs_group = {
+	.name	= "msi_irqs",
+	.attrs	= msi_dev_attrs,
+};
+
+static inline int msi_sysfs_create_group(struct device *dev)
+{
+	return devm_device_add_group(dev, &msi_irqs_group);
+}
+
 static ssize_t msi_mode_show(struct device *dev, struct device_attribute *attr,
 			     char *buf)
 {
@@ -404,97 +426,74 @@ static ssize_t msi_mode_show(struct devi
 	return sysfs_emit(buf, "%s\n", is_msix ? "msix" : "msi");
 }
 
-/**
- * msi_populate_sysfs - Populate msi_irqs sysfs entries for devices
- * @dev:	The device(PCI, platform etc) who will get sysfs entries
- */
-static const struct attribute_group **msi_populate_sysfs(struct device *dev)
+static void msi_sysfs_remove_desc(struct device *dev, struct msi_desc *desc)
 {
-	const struct attribute_group **msi_irq_groups;
-	struct attribute **msi_attrs, *msi_attr;
-	struct device_attribute *msi_dev_attr;
-	struct attribute_group *msi_irq_group;
-	struct msi_desc *entry;
-	int ret = -ENOMEM;
-	int num_msi = 0;
-	int count = 0;
+	struct device_attribute *attrs = desc->sysfs_attrs;
 	int i;
 
-	/* Determine how many msi entries we have */
-	msi_for_each_desc(entry, dev, MSI_DESC_ALL)
-		num_msi += entry->nvec_used;
-	if (!num_msi)
-		return NULL;
+	if (!attrs)
+		return;
 
-	/* Dynamically create the MSI attributes for the device */
-	msi_attrs = kcalloc(num_msi + 1, sizeof(void *), GFP_KERNEL);
-	if (!msi_attrs)
-		return ERR_PTR(-ENOMEM);
-
-	msi_for_each_desc(entry, dev, MSI_DESC_ALL) {
-		for (i = 0; i < entry->nvec_used; i++) {
-			msi_dev_attr = kzalloc(sizeof(*msi_dev_attr), GFP_KERNEL);
-			if (!msi_dev_attr)
-				goto error_attrs;
-			msi_attrs[count] = &msi_dev_attr->attr;
-
-			sysfs_attr_init(&msi_dev_attr->attr);
-			msi_dev_attr->attr.name = kasprintf(GFP_KERNEL, "%d",
-							    entry->irq + i);
-			if (!msi_dev_attr->attr.name)
-				goto error_attrs;
-			msi_dev_attr->attr.mode = 0444;
-			msi_dev_attr->show = msi_mode_show;
-			++count;
-		}
+	desc->sysfs_attrs = NULL;
+	for (i = 0; i < desc->nvec_used; i++) {
+		if (attrs[i].show)
+			sysfs_remove_file_from_group(&dev->kobj, &attrs[i].attr, msi_irqs_group.name);
+		kfree(attrs[i].attr.name);
 	}
+	kfree(attrs);
+}
 
-	msi_irq_group = kzalloc(sizeof(*msi_irq_group), GFP_KERNEL);
-	if (!msi_irq_group)
-		goto error_attrs;
-	msi_irq_group->name = "msi_irqs";
-	msi_irq_group->attrs = msi_attrs;
-
-	msi_irq_groups = kcalloc(2, sizeof(void *), GFP_KERNEL);
-	if (!msi_irq_groups)
-		goto error_irq_group;
-	msi_irq_groups[0] = msi_irq_group;
+static int msi_sysfs_populate_desc(struct device *dev, struct msi_desc *desc)
+{
+	struct device_attribute *attrs;
+	int ret, i;
 
-	ret = sysfs_create_groups(&dev->kobj, msi_irq_groups);
-	if (ret)
-		goto error_irq_groups;
+	attrs = kcalloc(desc->nvec_used, sizeof(*attrs), GFP_KERNEL);
+	if (!attrs)
+		return -ENOMEM;
+
+	desc->sysfs_attrs = attrs;
+	for (i = 0; i < desc->nvec_used; i++) {
+		sysfs_attr_init(&attrs[i].attr);
+		attrs[i].attr.name = kasprintf(GFP_KERNEL, "%d", desc->irq + i);
+		if (!attrs[i].attr.name) {
+			ret = -ENOMEM;
+			goto fail;
+		}
 
-	return msi_irq_groups;
+		attrs[i].attr.mode = 0444;
+		attrs[i].show = msi_mode_show;
 
-error_irq_groups:
-	kfree(msi_irq_groups);
-error_irq_group:
-	kfree(msi_irq_group);
-error_attrs:
-	count = 0;
-	msi_attr = msi_attrs[count];
-	while (msi_attr) {
-		msi_dev_attr = container_of(msi_attr, struct device_attribute, attr);
-		kfree(msi_attr->name);
-		kfree(msi_dev_attr);
-		++count;
-		msi_attr = msi_attrs[count];
+		ret = sysfs_add_file_to_group(&dev->kobj, &attrs[i].attr, msi_irqs_group.name);
+		if (ret) {
+			attrs[i].show = NULL;
+			goto fail;
+		}
 	}
-	kfree(msi_attrs);
-	return ERR_PTR(ret);
+	return 0;
+
+fail:
+	msi_sysfs_remove_desc(dev, desc);
+	return ret;
 }
 
+#ifdef CONFIG_PCI_MSI_ARCH_FALLBACK
 /**
  * msi_device_populate_sysfs - Populate msi_irqs sysfs entries for a device
  * @dev:	The device(PCI, platform etc) which will get sysfs entries
  */
 int msi_device_populate_sysfs(struct device *dev)
 {
-	const struct attribute_group **group = msi_populate_sysfs(dev);
+	struct msi_desc *desc;
+	int ret;
 
-	if (IS_ERR(group))
-		return PTR_ERR(group);
-	dev->msi.data->attrs = group;
+	msi_for_each_desc(desc, dev, MSI_DESC_ASSOCIATED) {
+		if (desc->sysfs_attrs)
+			continue;
+		ret = msi_sysfs_populate_desc(dev, desc);
+		if (ret)
+			return ret;
+	}
 	return 0;
 }
 
@@ -505,28 +504,17 @@ int msi_device_populate_sysfs(struct dev
  */
 void msi_device_destroy_sysfs(struct device *dev)
 {
-	const struct attribute_group **msi_irq_groups = dev->msi.data->attrs;
-	struct device_attribute *dev_attr;
-	struct attribute **msi_attrs;
-	int count = 0;
-
-	dev->msi.data->attrs = NULL;
-	if (!msi_irq_groups)
-		return;
+	struct msi_desc *desc;
 
-	sysfs_remove_groups(&dev->kobj, msi_irq_groups);
-	msi_attrs = msi_irq_groups[0]->attrs;
-	while (msi_attrs[count]) {
-		dev_attr = container_of(msi_attrs[count], struct device_attribute, attr);
-		kfree(dev_attr->attr.name);
-		kfree(dev_attr);
-		++count;
-	}
-	kfree(msi_attrs);
-	kfree(msi_irq_groups[0]);
-	kfree(msi_irq_groups);
+	msi_for_each_desc(desc, dev, MSI_DESC_ALL)
+		msi_sysfs_remove_desc(dev, desc);
 }
-#endif
+#endif /* CONFIG_PCI_MSI_ARCH_FALLBACK */
+#else /* CONFIG_SYSFS */
+static inline int msi_sysfs_create_group(struct device *dev) { return 0; }
+static inline int msi_sysfs_populate_desc(struct device *dev, struct msi_desc *desc) { return 0; }
+static inline void msi_sysfs_remove_desc(struct device *dev, struct msi_desc *desc) { }
+#endif /* !CONFIG_SYSFS */
 
 #ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
 static inline void irq_chip_write_msi_msg(struct irq_data *data,
@@ -959,6 +947,12 @@ int __msi_domain_alloc_irqs(struct irq_d
 			ret = msi_init_virq(domain, virq + i, vflags);
 			if (ret)
 				return ret;
+
+			if (info->flags & MSI_FLAG_DEV_SYSFS) {
+				ret = msi_sysfs_populate_desc(dev, desc);
+				if (ret)
+					return ret;
+			}
 		}
 		allocated++;
 	}
@@ -1003,18 +997,7 @@ int msi_domain_alloc_irqs_descs_locked(s
 
 	ret = ops->domain_alloc_irqs(domain, dev, nvec);
 	if (ret)
-		goto cleanup;
-
-	if (!(info->flags & MSI_FLAG_DEV_SYSFS))
-		return 0;
-
-	ret = msi_device_populate_sysfs(dev);
-	if (ret)
-		goto cleanup;
-	return 0;
-
-cleanup:
-	msi_domain_free_irqs_descs_locked(domain, dev);
+		msi_domain_free_irqs_descs_locked(domain, dev);
 	return ret;
 }
 
@@ -1039,6 +1022,7 @@ int msi_domain_alloc_irqs(struct irq_dom
 
 void __msi_domain_free_irqs(struct irq_domain *domain, struct device *dev)
 {
+	struct msi_domain_info *info = domain->host_data;
 	struct irq_data *irqd;
 	struct msi_desc *desc;
 	int i;
@@ -1053,6 +1037,8 @@ void __msi_domain_free_irqs(struct irq_d
 		}
 
 		irq_domain_free_irqs(desc->irq, desc->nvec_used);
+		if (info->flags & MSI_FLAG_DEV_SYSFS)
+			msi_sysfs_remove_desc(dev, desc);
 		desc->irq = 0;
 	}
 }
@@ -1081,8 +1067,6 @@ void msi_domain_free_irqs_descs_locked(s
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	if (info->flags & MSI_FLAG_DEV_SYSFS)
-		msi_device_destroy_sysfs(dev);
 	ops->domain_free_irqs(domain, dev);
 	msi_domain_free_msi_descs(info, dev);
 }


^ permalink raw reply	[flat|nested] 141+ messages in thread

* [patch 32/32] genirq/msi: Convert storage to xarray
  2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
                   ` (30 preceding siblings ...)
  2021-11-27  1:24 ` [patch 31/32] genirq/msi: Simplify sysfs handling Thomas Gleixner
@ 2021-11-27  1:24 ` Thomas Gleixner
  2021-11-27 12:33   ` Greg Kroah-Hartman
  31 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:24 UTC (permalink / raw)
  To: LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

The current linked list storage for MSI descriptors is suboptimal in
several ways:

  1) Looking up an MSI descriptor requires an O(n) list walk in the worst case

  2) The upcoming support of runtime expansion of MSI-X vectors would need
     to do a full list walk to figure out whether a particular index is
     already associated.

  3) Runtime expansion of sparse allocations is even more complex as the
     current implementation assumes an ordered list (increasing MSI index).

Use an xarray which solves all of the above problems nicely.
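To make the complexity argument concrete, here is a userspace sketch (not kernel code; every `toy_*` name is invented for illustration): the O(n) list walk of the old storage next to an O(1) indexed lookup, with a plain fixed-size pointer array standing in for the xarray. The real xarray additionally handles sparse, unbounded indices without preallocation, which this toy array does not.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct msi_desc; illustrative only. */
struct toy_desc {
	unsigned int msi_index;
	struct toy_desc *next;	/* used by the linked-list variant */
};

/* O(n) lookup, as with the old list-based storage. */
static struct toy_desc *list_lookup(struct toy_desc *head, unsigned int index)
{
	struct toy_desc *d;

	for (d = head; d; d = d->next) {
		if (d->msi_index == index)
			return d;
	}
	return NULL;
}

/* O(1) lookup by index: the property the xarray provides. */
#define TOY_STORE_SIZE 64
static struct toy_desc *toy_store[TOY_STORE_SIZE];

static int toy_store_insert(struct toy_desc *d, unsigned int index)
{
	/* Mirrors xa_insert() failing when the slot is already occupied. */
	if (index >= TOY_STORE_SIZE || toy_store[index])
		return -1;
	d->msi_index = index;
	toy_store[index] = d;
	return 0;
}

static struct toy_desc *toy_store_lookup(unsigned int index)
{
	return index < TOY_STORE_SIZE ? toy_store[index] : NULL;
}
```

The busy-slot failure in `toy_store_insert()` is also why the xarray helps runtime expansion: checking whether an index is already associated becomes a single lookup instead of a full list walk.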

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |   19 ++---
 kernel/irq/msi.c    |  188 ++++++++++++++++++++++------------------------------
 2 files changed, 90 insertions(+), 117 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -17,6 +17,7 @@
  */
 
 #include <linux/spinlock.h>
+#include <linux/xarray.h>
 #include <linux/mutex.h>
 #include <linux/list.h>
 #include <linux/bits.h>
@@ -122,7 +123,6 @@ struct pci_msi_desc {
 
 /**
  * struct msi_desc - Descriptor structure for MSI based interrupts
- * @list:	List head for management
  * @irq:	The base interrupt number
  * @nvec_used:	The number of vectors used
  * @dev:	Pointer to the device which uses this descriptor
@@ -139,7 +139,6 @@ struct pci_msi_desc {
  */
 struct msi_desc {
 	/* Shared device/bus type independent data */
-	struct list_head		list;
 	unsigned int			irq;
 	unsigned int			nvec_used;
 	struct device			*dev;
@@ -177,20 +176,20 @@ enum msi_desc_filter {
  * @properties:		MSI properties which are interesting to drivers
  * @num_descs:		The number of allocated MSI descriptors for the device
  * @platform_data:	Platform-MSI specific data
- * @list:		List of MSI descriptors associated to the device
- * @mutex:		Mutex protecting the MSI list
- * @__next:		Cached pointer to the next entry for iterators
- * @__filter:		Cached descriptor filter
+ * @mutex:		Mutex protecting the MSI descriptor store
+ * @store:		Xarray for storing MSI descriptor pointers
+ * @__iter_idx:		Index to search the next entry for iterators
+ * @__iter_filter:	Cached descriptor filter
  */
 struct msi_device_data {
 	raw_spinlock_t			lock;
 	unsigned long			properties;
 	unsigned int			num_descs;
 	struct platform_msi_priv_data	*platform_data;
-	struct list_head		list;
 	struct mutex			mutex;
-	struct msi_desc			*__next;
-	enum msi_desc_filter		__filter;
+	struct xarray			store;
+	unsigned long			__iter_idx;
+	enum msi_desc_filter		__iter_filter;
 };
 
 int msi_setup_device_data(struct device *dev);
@@ -266,7 +265,7 @@ static inline struct msi_desc *msi_first
  * @dev:	struct device pointer - device to iterate
  * @filter:	Filter for descriptor selection
  *
- * See msi_for_each_desc_from()for further information.
+ * See msi_for_each_desc_from() for further information.
  */
 #define msi_for_each_desc(desc, dev, filter)				\
 	msi_for_each_desc_from(desc, dev, filter, 0)
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -20,7 +20,6 @@
 #include "internals.h"
 
 static inline int msi_sysfs_create_group(struct device *dev);
-#define dev_to_msi_list(dev)	(&(dev)->msi.data->list)
 
 /**
  * msi_alloc_desc - Allocate an initialized msi_desc
@@ -41,7 +40,6 @@ static struct msi_desc *msi_alloc_desc(s
 	if (!desc)
 		return NULL;
 
-	INIT_LIST_HEAD(&desc->list);
 	desc->dev = dev;
 	desc->nvec_used = nvec;
 	if (affinity) {
@@ -60,6 +58,19 @@ static void msi_free_desc(struct msi_des
 	kfree(desc);
 }
 
+static int msi_insert_desc(struct msi_device_data *md, struct msi_desc *desc, unsigned int index)
+{
+	int ret;
+
+	desc->msi_index = index;
+	ret = xa_insert(&md->store, index, desc, GFP_KERNEL);
+	if (!ret)
+		md->num_descs++;
+	else
+		msi_free_desc(desc);
+	return ret;
+}
+
 /**
  * msi_add_msi_desc - Allocate and initialize a MSI descriptor
  * @dev:	Pointer to the device for which the descriptor is allocated
@@ -77,13 +88,9 @@ int msi_add_msi_desc(struct device *dev,
 	if (!desc)
 		return -ENOMEM;
 
-	/* Copy the MSI index and type specific data to the new descriptor. */
-	desc->msi_index = init_desc->msi_index;
+	/* Copy type specific data to the new descriptor. */
 	desc->pci = init_desc->pci;
-
-	list_add_tail(&desc->list, &dev->msi.data->list);
-	dev->msi.data->num_descs++;
-	return 0;
+	return msi_insert_desc(dev->msi.data, desc, init_desc->msi_index);
 }
 
 /**
@@ -96,29 +103,41 @@ int msi_add_msi_desc(struct device *dev,
  */
 static int msi_add_simple_msi_descs(struct device *dev, unsigned int index, unsigned int ndesc)
 {
-	struct msi_desc *desc, *tmp;
-	LIST_HEAD(list);
-	unsigned int i;
+	struct msi_desc *desc;
+	unsigned long i;
+	int ret;
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
 	for (i = 0; i < ndesc; i++) {
 		desc = msi_alloc_desc(dev, 1, NULL);
 		if (!desc)
+			goto fail_mem;
+		ret = msi_insert_desc(dev->msi.data, desc, index + i);
+		if (ret)
 			goto fail;
-		desc->msi_index = index + i;
-		list_add_tail(&desc->list, &list);
 	}
-	list_splice_tail(&list, &dev->msi.data->list);
-	dev->msi.data->num_descs += ndesc;
 	return 0;
 
+fail_mem:
+	ret = -ENOMEM;
 fail:
-	list_for_each_entry_safe(desc, tmp, &list, list) {
-		list_del(&desc->list);
-		msi_free_desc(desc);
+	msi_free_msi_descs_range(dev, MSI_DESC_NOTASSOCIATED, index, ndesc);
+	return ret;
+}
+
+static bool msi_desc_match(struct msi_desc *desc, enum msi_desc_filter filter)
+{
+	switch (filter) {
+	case MSI_DESC_ALL:
+		return true;
+	case MSI_DESC_NOTASSOCIATED:
+		return !desc->irq;
+	case MSI_DESC_ASSOCIATED:
+		return !!desc->irq;
 	}
-	return -ENOMEM;
+	WARN_ON_ONCE(1);
+	return false;
 }
 
 /**
@@ -132,19 +151,16 @@ void msi_free_msi_descs_range(struct dev
 			      unsigned int base_index, unsigned int ndesc)
 {
 	struct msi_desc *desc;
+	unsigned long idx;
 
 	lockdep_assert_held(&dev->msi.data->mutex);
 
-	msi_for_each_desc(desc, dev, filter) {
-		/*
-		 * Stupid for now to handle MSI device domain until the
-		 * storage is switched over to an xarray.
-		 */
-		if (desc->msi_index < base_index || desc->msi_index >= base_index + ndesc)
-			continue;
-		list_del(&desc->list);
-		msi_free_desc(desc);
-		dev->msi.data->num_descs--;
+	xa_for_each_range(&dev->msi.data->store, idx, desc, base_index, base_index + ndesc - 1) {
+		if (msi_desc_match(desc, filter)) {
+			xa_erase(&dev->msi.data->store, idx);
+			msi_free_desc(desc);
+			dev->msi.data->num_descs--;
+		}
 	}
 }
 
@@ -192,7 +208,8 @@ static void msi_device_data_release(stru
 {
 	struct msi_device_data *md = res;
 
-	WARN_ON_ONCE(!list_empty(&md->list));
+	WARN_ON_ONCE(!xa_empty(&md->store));
+	xa_destroy(&md->store);
 	dev->msi.data = NULL;
 }
 
@@ -225,7 +242,7 @@ int msi_setup_device_data(struct device
 	}
 
 	raw_spin_lock_init(&md->lock);
-	INIT_LIST_HEAD(&md->list);
+	xa_init(&md->store);
 	mutex_init(&md->mutex);
 	dev->msi.data = md;
 	devres_add(dev, md);
@@ -252,38 +269,21 @@ void msi_unlock_descs(struct device *dev
 {
 	if (WARN_ON_ONCE(!dev->msi.data))
 		return;
-	/* Clear the next pointer which was cached by the iterator */
-	dev->msi.data->__next = NULL;
+	/* Invalidate the index which was cached by the iterator */
+	dev->msi.data->__iter_idx = ULONG_MAX;
 	mutex_unlock(&dev->msi.data->mutex);
 }
 EXPORT_SYMBOL_GPL(msi_unlock_descs);
 
-static bool msi_desc_match(struct msi_desc *desc, enum msi_desc_filter filter)
-{
-	switch (filter) {
-	case MSI_DESC_ALL:
-		return true;
-	case MSI_DESC_NOTASSOCIATED:
-		return !desc->irq;
-	case MSI_DESC_ASSOCIATED:
-		return !!desc->irq;
-	}
-	WARN_ON_ONCE(1);
-	return false;
-}
-
-static struct msi_desc *msi_find_first_desc(struct device *dev, enum msi_desc_filter filter,
-					    unsigned int base_index)
+static struct msi_desc *msi_find_desc(struct msi_device_data *md)
 {
 	struct msi_desc *desc;
 
-	list_for_each_entry(desc, dev_to_msi_list(dev), list) {
-		if (desc->msi_index < base_index)
-			continue;
-		if (msi_desc_match(desc, filter))
-			return desc;
+	xa_for_each_start(&md->store, md->__iter_idx, desc, md->__iter_idx) {
+		if (msi_desc_match(desc, md->__iter_filter))
+			break;
 	}
-	return NULL;
+	return desc;
 }
 
 /**
@@ -301,43 +301,25 @@ static struct msi_desc *msi_find_first_d
 struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter,
 				  unsigned int base_index)
 {
-	struct msi_desc *desc;
+	struct msi_device_data *md = dev->msi.data;
 
-	if (WARN_ON_ONCE(!dev->msi.data))
+	if (WARN_ON_ONCE(!md))
 		return NULL;
 
-	lockdep_assert_held(&dev->msi.data->mutex);
+	lockdep_assert_held(&md->mutex);
 
-	/* Invalidate a previous invocation within the same lock section */
-	dev->msi.data->__next = NULL;
-
-	desc = msi_find_first_desc(dev, filter, base_index);
-	if (desc) {
-		dev->msi.data->__next = list_next_entry(desc, list);
-		dev->msi.data->__filter = filter;
-	}
-	return desc;
+	md->__iter_filter = filter;
+	md->__iter_idx = base_index;
+	return msi_find_desc(md);
 }
 EXPORT_SYMBOL_GPL(__msi_first_desc);
 
-static struct msi_desc *__msi_next_desc(struct device *dev, enum msi_desc_filter filter,
-					struct msi_desc *from)
-{
-	struct msi_desc *desc = from;
-
-	list_for_each_entry_from(desc, dev_to_msi_list(dev), list) {
-		if (msi_desc_match(desc, filter))
-			return desc;
-	}
-	return NULL;
-}
-
 /**
  * msi_next_desc - Get the next MSI descriptor of a device
  * @dev:	Device to operate on
  *
 * The first invocation of msi_next_desc() has to be preceded by a
- * successful incovation of __msi_first_desc(). Consecutive invocations are
+ * successful invocation of __msi_first_desc(). Consecutive invocations are
  * only valid if the previous one was successful. All these operations have
  * to be done within the same MSI mutex held region.
  *
@@ -346,20 +328,18 @@ static struct msi_desc *__msi_next_desc(
  */
 struct msi_desc *msi_next_desc(struct device *dev)
 {
-	struct msi_device_data *data = dev->msi.data;
-	struct msi_desc *desc;
+	struct msi_device_data *md = dev->msi.data;
 
-	if (WARN_ON_ONCE(!data))
+	if (WARN_ON_ONCE(!md))
 		return NULL;
 
-	lockdep_assert_held(&data->mutex);
+	lockdep_assert_held(&md->mutex);
 
-	if (!data->__next)
+	if (md->__iter_idx == ULONG_MAX)
 		return NULL;
 
-	desc = __msi_next_desc(dev, data->__filter, data->__next);
-	dev->msi.data->__next = desc ? list_next_entry(desc, list) : NULL;
-	return desc;
+	md->__iter_idx++;
+	return msi_find_desc(md);
 }
 EXPORT_SYMBOL_GPL(msi_next_desc);
 
@@ -384,21 +364,18 @@ int __msi_get_virq(struct device *dev, u
 	pcimsi = msi_device_has_property(dev, MSI_PROP_PCI_MSI);
 
 	msi_lock_descs(dev);
-	msi_for_each_desc_from(desc, dev, MSI_DESC_ASSOCIATED, index) {
-		/* PCI-MSI has only one descriptor for multiple interrupts. */
-		if (pcimsi) {
-			if (index < desc->nvec_used)
-				ret = desc->irq + index;
-			break;
-		}
-
+	desc = xa_load(&dev->msi.data->store, pcimsi ? 0 : index);
+	if (desc && desc->irq) {
 		/*
+		 * PCI-MSI has only one descriptor for multiple interrupts.
 		 * PCI-MSIX and platform MSI use a descriptor per
 		 * interrupt.
 		 */
-		if (desc->msi_index == index) {
+		if (pcimsi) {
+			if (index < desc->nvec_used)
+				ret = desc->irq + index;
+		} else {
 			ret = desc->irq;
-			break;
 		}
 	}
 	msi_unlock_descs(dev);
@@ -779,17 +756,13 @@ int msi_domain_populate_irqs(struct irq_
 	int ret, virq;
 
 	msi_lock_descs(dev);
-	for (virq = virq_base; virq < virq_base + nvec; virq++) {
-		desc = msi_alloc_desc(dev, 1, NULL);
-		if (!desc) {
-			ret = -ENOMEM;
-			goto fail;
-		}
+	ret = msi_add_simple_msi_descs(dev, virq_base, nvec);
+	if (ret)
+		goto unlock;
 
-		desc->msi_index = virq;
+	for (virq = virq_base; virq < virq_base + nvec; virq++) {
+		desc = xa_load(&dev->msi.data->store, virq);
 		desc->irq = virq;
-		list_add_tail(&desc->list, &dev->msi.data->list);
-		dev->msi.data->num_descs++;
 
 		ops->set_desc(arg, desc);
 		ret = irq_domain_alloc_irqs_hierarchy(domain, virq, 1, arg);
@@ -805,6 +778,7 @@ int msi_domain_populate_irqs(struct irq_
 	for (--virq; virq >= virq_base; virq--)
 		irq_domain_free_irqs_common(domain, virq, 1);
 	msi_free_msi_descs_range(dev, MSI_DESC_ALL, virq_base, nvec);
+unlock:
 	msi_unlock_descs(dev);
 	return ret;
 }



* Re: [patch 07/32] genirq/msi: Count the allocated MSI descriptors
  2021-11-27  1:23 ` [patch 07/32] genirq/msi: Count the allocated MSI descriptors Thomas Gleixner
@ 2021-11-27 12:19   ` Greg Kroah-Hartman
  2021-11-27 19:22     ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Greg Kroah-Hartman @ 2021-11-27 12:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

On Sat, Nov 27, 2021 at 02:22:38AM +0100, Thomas Gleixner wrote:
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

No changelog?

Anyway, why do we care about the number of descriptors?


> ---
>  include/linux/msi.h |    3 +++
>  kernel/irq/msi.c    |   18 ++++++++++++++++++
>  2 files changed, 21 insertions(+)
> 
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -156,6 +156,7 @@ enum msi_desc_filter {
>   * msi_device_data - MSI per device data
>   * @lock:		Spinlock to protect register access
>   * @properties:		MSI properties which are interesting to drivers
> + * @num_descs:		The number of allocated MSI descriptors for the device
>   * @attrs:		Pointer to the sysfs attribute group
>   * @platform_data:	Platform-MSI specific data
>   * @list:		List of MSI descriptors associated to the device
> @@ -166,6 +167,7 @@ enum msi_desc_filter {
>  struct msi_device_data {
>  	raw_spinlock_t			lock;
>  	unsigned long			properties;
> +	unsigned int			num_descs;
>  	const struct attribute_group    **attrs;
>  	struct platform_msi_priv_data	*platform_data;
>  	struct list_head		list;
> @@ -208,6 +210,7 @@ static inline unsigned int msi_get_virq(
>  
>  void msi_lock_descs(struct device *dev);
>  void msi_unlock_descs(struct device *dev);
> +unsigned int msi_device_num_descs(struct device *dev);
>  
>  struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter, unsigned int base_index);
>  struct msi_desc *msi_next_desc(struct device *dev);
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -82,6 +82,7 @@ int msi_add_msi_desc(struct device *dev,
>  	desc->pci = init_desc->pci;
>  
>  	list_add_tail(&desc->list, &dev->msi.data->list);
> +	dev->msi.data->num_descs++;
>  	return 0;
>  }
>  
> @@ -109,6 +110,7 @@ int msi_add_simple_msi_descs(struct devi
>  		list_add_tail(&desc->list, &list);
>  	}
>  	list_splice_tail(&list, &dev->msi.data->list);
> +	dev->msi.data->num_descs += ndesc;
>  	return 0;
>  
>  fail:
> @@ -142,6 +144,7 @@ void msi_free_msi_descs_range(struct dev
>  			continue;
>  		list_del(&desc->list);
>  		free_msi_entry(desc);
> +		dev->msi.data->num_descs--;
>  	}
>  }
>  
> @@ -157,6 +160,21 @@ bool msi_device_has_property(struct devi
>  	return !!(dev->msi.data->properties & prop);
>  }
>  
> +/**
> + * msi_device_num_descs - Query the number of allocated MSI descriptors of a device
> + * @dev:	The device to read from
> + *
> + * Note: This is a lockless snapshot of msi_device_data::num_descs
> + *
> + * Returns the number of MSI descriptors which are allocated for @dev
> + */
> +unsigned int msi_device_num_descs(struct device *dev)
> +{
> +	if (dev->msi.data)
> +		return dev->msi.data->num_descs;

As this number can change after it is read, what will callers do with
it?

thanks,

greg k-h


* Re: [patch 01/32] genirq/msi: Move descriptor list to struct msi_device_data
  2021-11-27  1:23 ` [patch 01/32] genirq/msi: Move descriptor list to struct msi_device_data Thomas Gleixner
@ 2021-11-27 12:19   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 141+ messages in thread
From: Greg Kroah-Hartman @ 2021-11-27 12:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

On Sat, Nov 27, 2021 at 02:22:29AM +0100, Thomas Gleixner wrote:
> It's only required when MSI is in use.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  drivers/base/core.c    |    3 ---
>  include/linux/device.h |    4 ----
>  include/linux/msi.h    |    4 +++-
>  kernel/irq/msi.c       |    5 ++++-
>  4 files changed, 7 insertions(+), 9 deletions(-)

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


* Re: [patch 31/32] genirq/msi: Simplify sysfs handling
  2021-11-27  1:24 ` [patch 31/32] genirq/msi: Simplify sysfs handling Thomas Gleixner
@ 2021-11-27 12:32   ` Greg Kroah-Hartman
  2021-11-27 19:31     ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Greg Kroah-Hartman @ 2021-11-27 12:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

On Sat, Nov 27, 2021 at 02:23:15AM +0100, Thomas Gleixner wrote:
> The sysfs handling for MSI is a convoluted maze and it is in the way of
> supporting dynamic expansion of the MSI-X vectors because it only supports
> a one off bulk population/free of the sysfs entries.
> 
> Change it to do:
> 
>    1) Creating an empty sysfs attribute group when msi_device_data is
>       allocated
> 
>    2) Populate the entries when the MSI descriptor is initialized

How much later does this happen?  Can it happen while the device has a
driver bound to it?

>    3) Free the entries when a MSI descriptor is detached from a Linux
>       interrupt.
> 
>    4) Provide functions for the legacy non-irqdomain fallback code to
>       do a bulk population/free. This code won't support dynamic
>       expansion.
> 
> This makes the code simpler and reduces the number of allocations as the
> empty attribute group can be shared.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/msi.h |    7 +
>  kernel/irq/msi.c    |  196 +++++++++++++++++++++++-----------------------------
>  2 files changed, 95 insertions(+), 108 deletions(-)
> 
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -72,6 +72,7 @@ struct irq_data;
>  struct msi_desc;
>  struct pci_dev;
>  struct platform_msi_priv_data;
> +struct device_attribute;
>  
>  void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
>  #ifdef CONFIG_GENERIC_MSI_IRQ
> @@ -127,6 +128,7 @@ struct pci_msi_desc {
>   * @dev:	Pointer to the device which uses this descriptor
>   * @msg:	The last set MSI message cached for reuse
>   * @affinity:	Optional pointer to a cpu affinity mask for this descriptor
> + * @sysfs_attr:	Pointer to sysfs device attribute
>   *
>   * @write_msi_msg:	Callback that may be called when the MSI message
>   *			address or data changes
> @@ -146,6 +148,9 @@ struct msi_desc {
>  #ifdef CONFIG_IRQ_MSI_IOMMU
>  	const void			*iommu_cookie;
>  #endif
> +#ifdef CONFIG_SYSFS
> +	struct device_attribute		*sysfs_attrs;
> +#endif
>  
>  	void (*write_msi_msg)(struct msi_desc *entry, void *data);
>  	void *write_msi_msg_data;
> @@ -171,7 +176,6 @@ enum msi_desc_filter {
>   * @lock:		Spinlock to protect register access
>   * @properties:		MSI properties which are interesting to drivers
>   * @num_descs:		The number of allocated MSI descriptors for the device
> - * @attrs:		Pointer to the sysfs attribute group
>   * @platform_data:	Platform-MSI specific data
>   * @list:		List of MSI descriptors associated to the device
>   * @mutex:		Mutex protecting the MSI list
> @@ -182,7 +186,6 @@ struct msi_device_data {
>  	raw_spinlock_t			lock;
>  	unsigned long			properties;
>  	unsigned int			num_descs;
> -	const struct attribute_group    **attrs;
>  	struct platform_msi_priv_data	*platform_data;
>  	struct list_head		list;
>  	struct mutex			mutex;
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -19,6 +19,7 @@
>  
>  #include "internals.h"
>  
> +static inline int msi_sysfs_create_group(struct device *dev);
>  #define dev_to_msi_list(dev)	(&(dev)->msi.data->list)
>  
>  /**
> @@ -208,6 +209,7 @@ static void msi_device_data_release(stru
>  int msi_setup_device_data(struct device *dev)
>  {
>  	struct msi_device_data *md;
> +	int ret;
>  
>  	if (dev->msi.data)
>  		return 0;
> @@ -216,6 +218,12 @@ int msi_setup_device_data(struct device
>  	if (!md)
>  		return -ENOMEM;
>  
> +	ret = msi_sysfs_create_group(dev);
> +	if (ret) {
> +		devres_free(md);
> +		return ret;
> +	}
> +
>  	raw_spin_lock_init(&md->lock);
>  	INIT_LIST_HEAD(&md->list);
>  	mutex_init(&md->mutex);
> @@ -395,6 +403,20 @@ int __msi_get_virq(struct device *dev, u
>  EXPORT_SYMBOL_GPL(__msi_get_virq);
>  
>  #ifdef CONFIG_SYSFS
> +static struct attribute *msi_dev_attrs[] = {
> +	NULL
> +};
> +
> +static const struct attribute_group msi_irqs_group = {
> +	.name	= "msi_irqs",
> +	.attrs	= msi_dev_attrs,
> +};
> +
> +static inline int msi_sysfs_create_group(struct device *dev)
> +{
> +	return devm_device_add_group(dev, &msi_irqs_group);

Much nicer, but you changed the lifetime rules of when these attributes
will be removed, is that ok?

I still worry that these attributes show up "after" the device is
registered with the driver core, but hey, it's no worse than it
currently is, so that's not caused by this patch series...

> @@ -404,97 +426,74 @@ static ssize_t msi_mode_show(struct devi
>  	return sysfs_emit(buf, "%s\n", is_msix ? "msix" : "msi");
>  }
>  
> -/**
> - * msi_populate_sysfs - Populate msi_irqs sysfs entries for devices
> - * @dev:	The device(PCI, platform etc) who will get sysfs entries
> - */
> -static const struct attribute_group **msi_populate_sysfs(struct device *dev)
> +static void msi_sysfs_remove_desc(struct device *dev, struct msi_desc *desc)
>  {
> -	const struct attribute_group **msi_irq_groups;
> -	struct attribute **msi_attrs, *msi_attr;
> -	struct device_attribute *msi_dev_attr;
> -	struct attribute_group *msi_irq_group;
> -	struct msi_desc *entry;
> -	int ret = -ENOMEM;
> -	int num_msi = 0;
> -	int count = 0;
> +	struct device_attribute *attrs = desc->sysfs_attrs;
>  	int i;
>  
> -	/* Determine how many msi entries we have */
> -	msi_for_each_desc(entry, dev, MSI_DESC_ALL)
> -		num_msi += entry->nvec_used;
> -	if (!num_msi)
> -		return NULL;
> +	if (!attrs)
> +		return;
>  
> -	/* Dynamically create the MSI attributes for the device */
> -	msi_attrs = kcalloc(num_msi + 1, sizeof(void *), GFP_KERNEL);
> -	if (!msi_attrs)
> -		return ERR_PTR(-ENOMEM);
> -
> -	msi_for_each_desc(entry, dev, MSI_DESC_ALL) {
> -		for (i = 0; i < entry->nvec_used; i++) {
> -			msi_dev_attr = kzalloc(sizeof(*msi_dev_attr), GFP_KERNEL);
> -			if (!msi_dev_attr)
> -				goto error_attrs;
> -			msi_attrs[count] = &msi_dev_attr->attr;
> -
> -			sysfs_attr_init(&msi_dev_attr->attr);
> -			msi_dev_attr->attr.name = kasprintf(GFP_KERNEL, "%d",
> -							    entry->irq + i);
> -			if (!msi_dev_attr->attr.name)
> -				goto error_attrs;
> -			msi_dev_attr->attr.mode = 0444;
> -			msi_dev_attr->show = msi_mode_show;
> -			++count;
> -		}
> +	desc->sysfs_attrs = NULL;
> +	for (i = 0; i < desc->nvec_used; i++) {
> +		if (attrs[i].show)
> +			sysfs_remove_file_from_group(&dev->kobj, &attrs[i].attr, msi_irqs_group.name);
> +		kfree(attrs[i].attr.name);

That's a cute hack, but should be documented somewhere in the code (that
if there is no show function, that means no attribute was registered
here).
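The marker trick being discussed — a NULL `show` pointer meaning "this attribute never made it into the group, so don't try to remove it" — is a general rollback pattern. A minimal userspace sketch (all `toy_*` names invented for illustration, with an `int registered` flag standing in for `attrs[i].show`):

```c
#include <assert.h>
#include <string.h>

#define TOY_NVEC 4

struct toy_attr {
	int registered;		/* stands in for attrs[i].show != NULL */
};

static int toy_registered_count;

static int toy_register(struct toy_attr *a, int should_fail)
{
	if (should_fail)
		return -1;
	a->registered = 1;
	toy_registered_count++;
	return 0;
}

/* Shared teardown: only undoes entries whose marker is set, so it is
 * safe both after full population and after a mid-loop failure. */
static void toy_remove_all(struct toy_attr *attrs, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (attrs[i].registered) {
			attrs[i].registered = 0;
			toy_registered_count--;
		}
	}
}

/* Populate; on failure the failed slot's marker stays clear and the
 * same helper used for normal teardown rolls back the rest. */
static int toy_populate(struct toy_attr *attrs, int n, int fail_at)
{
	int i;

	for (i = 0; i < n; i++) {
		if (toy_register(&attrs[i], i == fail_at)) {
			toy_remove_all(attrs, n);
			return -1;
		}
	}
	return 0;
}
```

The design benefit is exactly what the patch exploits: one cleanup routine serves both the error path and the regular free path, instead of two slightly different unwind loops.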

If you add a comment for this (either here or when you register the
attribute), feel free to add:

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


* Re: [patch 32/32] genirq/msi: Convert storage to xarray
  2021-11-27  1:24 ` [patch 32/32] genirq/msi: Convert storage to xarray Thomas Gleixner
@ 2021-11-27 12:33   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 141+ messages in thread
From: Greg Kroah-Hartman @ 2021-11-27 12:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

On Sat, Nov 27, 2021 at 02:23:17AM +0100, Thomas Gleixner wrote:
> The current linked list storage for MSI descriptors is suboptimal in
> several ways:
> 
>   1) Looking up an MSI descriptor requires an O(n) list walk in the worst case
> 
>   2) The upcoming support of runtime expansion of MSI-X vectors would need
>      to do a full list walk to figure out whether a particular index is
>      already associated.
> 
>   3) Runtime expansion of sparse allocations is even more complex as the
>      current implementation assumes an ordered list (increasing MSI index).
> 
> Use an xarray which solves all of the above problems nicely.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/msi.h |   19 ++---
>  kernel/irq/msi.c    |  188 ++++++++++++++++++++++------------------------------
>  2 files changed, 90 insertions(+), 117 deletions(-)

Much simpler code too, nice!

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


* Re: [patch 07/32] genirq/msi: Count the allocated MSI descriptors
  2021-11-27 12:19   ` Greg Kroah-Hartman
@ 2021-11-27 19:22     ` Thomas Gleixner
  2021-11-27 19:45       ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27 19:22 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

On Sat, Nov 27 2021 at 13:19, Greg Kroah-Hartman wrote:
> On Sat, Nov 27, 2021 at 02:22:38AM +0100, Thomas Gleixner wrote:
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>
> No changelog?

Bah. This one should not be there at all.

> Anyway, why do we care about the number of descriptors?
>> +/**
>> + * msi_device_num_descs - Query the number of allocated MSI descriptors of a device
>> + * @dev:	The device to read from
>> + *
>> + * Note: This is a lockless snapshot of msi_device_data::num_descs
>> + *
>> + * Returns the number of MSI descriptors which are allocated for @dev
>> + */
>> +unsigned int msi_device_num_descs(struct device *dev)
>> +{
>> +	if (dev->msi.data)
>> +		return dev->msi.data->num_descs;
>
> As this number can change after it is read, what will callers do with
> it?

I wanted to get rid of this, but then forgot. Getting old.

Thanks,

        tglx


* Re: [patch 31/32] genirq/msi: Simplify sysfs handling
  2021-11-27 12:32   ` Greg Kroah-Hartman
@ 2021-11-27 19:31     ` Thomas Gleixner
  2021-11-28 11:07       ` Greg Kroah-Hartman
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27 19:31 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

On Sat, Nov 27 2021 at 13:32, Greg Kroah-Hartman wrote:
> On Sat, Nov 27, 2021 at 02:23:15AM +0100, Thomas Gleixner wrote:
>> The sysfs handling for MSI is a convoluted maze and it is in the way of
>> supporting dynamic expansion of the MSI-X vectors because it only supports
>> a one off bulk population/free of the sysfs entries.
>> 
>> Change it to do:
>> 
>>    1) Creating an empty sysfs attribute group when msi_device_data is
>>       allocated
>> 
>>    2) Populate the entries when the MSI descriptor is initialized
>
> How much later does this happen?  Can it happen while the device has a
> driver bound to it?

That's no later than before. It's when the driver initializes the
MSI[X] interrupts, which usually happens in the probe() function.

The difference is that the group (i.e. the directory) is created
slightly earlier.

>> +
>> +static inline int msi_sysfs_create_group(struct device *dev)
>> +{
>> +	return devm_device_add_group(dev, &msi_irqs_group);
>
> Much nicer, but you changed the lifetime rules of when these attributes
> will be removed, is that ok?

The msi entries are removed at the same place as they are removed in the
current mainline code, i.e. when the device driver shuts the device
down and disables MSI[X], which happens usually during remove()

What's different now is that the empty group stays around a bit
longer. I don't see how that matters.

> I still worry that these attributes show up "after" the device is
> registered with the driver core, but hey, it's no worse than it
> currently is, so that's not caused by this patch series...

Does that registration happen before or after driver->probe()?

>> -		}
>> +	desc->sysfs_attrs = NULL;
>> +	for (i = 0; i < desc->nvec_used; i++) {
>> +		if (attrs[i].show)
>> +			sysfs_remove_file_from_group(&dev->kobj, &attrs[i].attr, msi_irqs_group.name);
>> +		kfree(attrs[i].attr.name);
>
> That's a cute hack, but should be documented somewhere in the code (that
> if there is no show function, that means no attribute was registered
> here).
>
> If you add a comment for this (either here or when you register the
> attribute), feel free to add:

Will do.

Thanks,

        tglx


* Re: [patch 07/32] genirq/msi: Count the allocated MSI descriptors
  2021-11-27 19:22     ` Thomas Gleixner
@ 2021-11-27 19:45       ` Thomas Gleixner
  2021-11-28 11:07         ` Greg Kroah-Hartman
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-27 19:45 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

On Sat, Nov 27 2021 at 20:22, Thomas Gleixner wrote:

> On Sat, Nov 27 2021 at 13:19, Greg Kroah-Hartman wrote:
>> On Sat, Nov 27, 2021 at 02:22:38AM +0100, Thomas Gleixner wrote:
>>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>>
>> No changelog?
>
> Bah. This one should not be there at all.
>
>> Anyway, why do we care about the number of descriptors?

The last part of this really cares about it for the dynamic extension
part, but that's core code which looks at the counter under the lock.

Thanks,

        tglx


* Re: [patch 04/32] genirq/msi: Provide a set of advanced MSI accessors and iterators
  2021-11-27  1:23 ` [patch 04/32] genirq/msi: Provide a set of advanced MSI accessors and iterators Thomas Gleixner
@ 2021-11-28  1:00   ` Jason Gunthorpe
  2021-11-28 19:22     ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-11-28  1:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Megha Dey, Ashok Raj, linux-pci, Greg Kroah-Hartman, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

On Sat, Nov 27, 2021 at 02:22:33AM +0100, Thomas Gleixner wrote:
> In preparation for dynamic handling of MSI-X interrupts provide a new set
> of MSI descriptor accessor functions and iterators. They are beneficial per
> se as they allow cleaning up quite some code in various MSI domain
> implementations.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>  include/linux/msi.h |   58 ++++++++++++++++++++++++++++
>  kernel/irq/msi.c    |  107 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 165 insertions(+)
> 
> +++ b/include/linux/msi.h
> @@ -140,6 +140,18 @@ struct msi_desc {
>  	struct pci_msi_desc		pci;
>  };
>  
> +/*
> + * Filter values for the MSI descriptor iterators and accessor functions.
> + */
> +enum msi_desc_filter {
> +	/* All descriptors */
> +	MSI_DESC_ALL,
> +	/* Descriptors which have no interrupt associated */
> +	MSI_DESC_NOTASSOCIATED,
> +	/* Descriptors which have an interrupt associated */
> +	MSI_DESC_ASSOCIATED,
> +};
> +
>  /**
>   * msi_device_data - MSI per device data
>   * @lock:		Spinlock to protect register access
> @@ -148,6 +160,8 @@ struct msi_desc {
>   * @platform_data:	Platform-MSI specific data
>   * @list:		List of MSI descriptors associated to the device
>   * @mutex:		Mutex protecting the MSI list
> + * @__next:		Cached pointer to the next entry for iterators
> + * @__filter:		Cached descriptor filter
>   */
>  struct msi_device_data {
>  	raw_spinlock_t			lock;
> @@ -156,6 +170,8 @@ struct msi_device_data {
>  	struct platform_msi_priv_data	*platform_data;
>  	struct list_head		list;
>  	struct mutex			mutex;
> +	struct msi_desc			*__next;
> +	enum msi_desc_filter		__filter;
>  };
>  
>  int msi_setup_device_data(struct device *dev);
> @@ -193,6 +209,48 @@ static inline unsigned int msi_get_virq(
>  void msi_lock_descs(struct device *dev);
>  void msi_unlock_descs(struct device *dev);
>  
> +struct msi_desc *__msi_first_desc(struct device *dev, enum msi_desc_filter filter, unsigned int base_index);
> +struct msi_desc *msi_next_desc(struct device *dev);
> +
> +/**
> + * msi_first_desc - Get the first MSI descriptor associated to the device
> + * @dev:	Device to search
> + */
> +static inline struct msi_desc *msi_first_desc(struct device *dev)
> +{
> +	return __msi_first_desc(dev, MSI_DESC_ALL, 0);
> +}
> +
> +
> +/**
> + * msi_for_each_desc_from - Iterate the MSI descriptors from a given index
> + *
> + * @desc:	struct msi_desc pointer used as iterator
> + * @dev:	struct device pointer - device to iterate
> + * @filter:	Filter for descriptor selection
> + * @base_index:	MSI index to iterate from
> + *
> + * Notes:
> + *  - The loop must be protected with a msi_lock_descs()/msi_unlock_descs()
> + *    pair.
> + *  - It is safe to remove a retrieved MSI descriptor in the loop.
> + */
> +#define msi_for_each_desc_from(desc, dev, filter, base_index)			\
> +	for ((desc) = __msi_first_desc((dev), (filter), (base_index)); (desc);	\
> +	     (desc) = msi_next_desc((dev)))

Given this ends up as an xarray it feels really weird that there is a
hidden shared __next/__iter_idx instead of having the caller provide
the index storage as is normal for xa operations.

I understand why that isn't desirable at this patch where the storage
would have to be a list_head pointer, but still, seems like an odd
place to end up at the end of the series.

eg add index here unused and then the last patch uses it instead of
__iter_idx.

Also, I don't understand why filter was stored in the dev and not
passed into msi_next_desc() in the macro here?

Jason


* Re: [patch 31/32] genirq/msi: Simplify sysfs handling
  2021-11-27 19:31     ` Thomas Gleixner
@ 2021-11-28 11:07       ` Greg Kroah-Hartman
  2021-11-28 19:33         ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Greg Kroah-Hartman @ 2021-11-28 11:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

On Sat, Nov 27, 2021 at 08:31:37PM +0100, Thomas Gleixner wrote:
> On Sat, Nov 27 2021 at 13:32, Greg Kroah-Hartman wrote:
> > On Sat, Nov 27, 2021 at 02:23:15AM +0100, Thomas Gleixner wrote:
> >> The sysfs handling for MSI is a convoluted maze and it is in the way of
> >> supporting dynamic expansion of the MSI-X vectors because it only supports
> >> a one off bulk population/free of the sysfs entries.
> >> 
> >> Change it to do:
> >> 
> >>    1) Creating an empty sysfs attribute group when msi_device_data is
> >>       allocated
> >> 
> >>    2) Populate the entries when the MSI descriptor is initialized
> >
> > How much later does this happen?  Can it happen while the device has a
> > driver bound to it?
> 
> That's no later than before. It's when the driver initializes the
> MSI[X] interrupts, which usually happens in the probe() function.
> 
> The difference is that the group (i.e. the directory) is created
> slightly earlier.

Ok, but that still happens when probe() is called for the driver, right?

> >> +
> >> +static inline int msi_sysfs_create_group(struct device *dev)
> >> +{
> >> +	return devm_device_add_group(dev, &msi_irqs_group);
> >
> > Much nicer, but you changed the lifetime rules of when these attributes
> > will be removed, is that ok?
> 
> The msi entries are removed at the same place as they are removed in the
> current mainline code, i.e. when the device driver shuts the device
> down and disables MSI[X], which happens usually during remove()
> 
> What's different now is that the empty group stays around a bit
> longer. I don't see how that matters.

How much longer does it stick around?

What happens if this sequence happens:
	- probe()
	- disconnect()
	- probe()
with the same device (i.e. the device is not removed from the system)?

Which can happen as userspace can trigger disconnect() or even worse, if
the driver is unloaded and then loaded again?  Will the second call to
create this directory fail as it is not cleaned up yet?

I can never remember if devm_*() stuff sticks around for the device
lifecycle, or for the driver/device lifecycle, which is one big reason
why I don't like that api...

> > I still worry that these attributes show up "after" the device is
> > registered with the driver core, but hey, it's no worse than it
> > currently is, so that's not caused by this patch series...
> 
> Does that registration happen before or after driver->probe()?

During probe is a bit too late, but we can handle that as we are used to
it.  If it happens after probe() succeeds, based on something else being
asked for in the driver (like the device being opened), then userspace
has no chance of ever noticing these attributes being added.

But again, this isn't new to your code series, so I wouldn't worry about
it.  Obviously userspace tools do not care or really notice these
attributes at all otherwise the authors of them would have complained
a long time ago :)

So again, no real objection from me here, just meta-comments, except for
the above thing with the devm_* call to ensure that the
probe/disconnect/probe sequence will still work just as well as it does
today.  Should be easy enough to test out by just unloading a module and
then loading it again with this patch series applied.

thanks,

greg k-h


* Re: [patch 07/32] genirq/msi: Count the allocated MSI descriptors
  2021-11-27 19:45       ` Thomas Gleixner
@ 2021-11-28 11:07         ` Greg Kroah-Hartman
  2021-11-28 19:23           ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Greg Kroah-Hartman @ 2021-11-28 11:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

On Sat, Nov 27, 2021 at 08:45:18PM +0100, Thomas Gleixner wrote:
> On Sat, Nov 27 2021 at 20:22, Thomas Gleixner wrote:
> 
> > On Sat, Nov 27 2021 at 13:19, Greg Kroah-Hartman wrote:
> >> On Sat, Nov 27, 2021 at 02:22:38AM +0100, Thomas Gleixner wrote:
> >>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> >>
> >> No changelog?
> >
> > Bah. This one should not be there at all.
> >
> >> Anyway, why do we care about the number of descriptors?
> 
> The last part of this really cares about it for the dynamic extension
> part, but that's core code which looks at the counter under the lock.

Ah, that should be documented well as right now you are saying "this is
done lockless" in the comment :)

thanks,

greg k-h


* Re: [patch 04/32] genirq/msi: Provide a set of advanced MSI accessors and iterators
  2021-11-28  1:00   ` Jason Gunthorpe
@ 2021-11-28 19:22     ` Thomas Gleixner
  2021-11-29  9:26       ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-28 19:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Megha Dey, Ashok Raj, linux-pci, Greg Kroah-Hartman, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

On Sat, Nov 27 2021 at 21:00, Jason Gunthorpe wrote:
> On Sat, Nov 27, 2021 at 02:22:33AM +0100, Thomas Gleixner wrote:
>> + * Notes:
>> + *  - The loop must be protected with a msi_lock_descs()/msi_unlock_descs()
>> + *    pair.
>> + *  - It is safe to remove a retrieved MSI descriptor in the loop.
>> + */
>> +#define msi_for_each_desc_from(desc, dev, filter, base_index)			\
>> +	for ((desc) = __msi_first_desc((dev), (filter), (base_index)); (desc);	\
>> +	     (desc) = msi_next_desc((dev)))
>
> Given this ends up as an xarray it feels really weird that there is a
> hidden shared __next/__iter_idx instead of having the caller provide
> the index storage as is normal for xa operations.
>
> I understand why that isn't desirable at this patch where the storage
> would have to be a list_head pointer, but still, seems like an odd
> place to end up at the end of the series.
>
> eg add index here unused and then the last patch uses it instead of
> __iter_idx.

TBH, didn't think about doing just that. OTOH, given the experience of
looking at the creative mess people create, this was probably also a
vain attempt to make it harder in the future.

> Also, I don't understand why filter was stored in the dev and not
> passed into msi_next_desc() in the macro here?

No real reason. I probably just stored it along with the rest. Lemme try
that index approach.

Thanks,

        tglx


* Re: [patch 07/32] genirq/msi: Count the allocated MSI descriptors
  2021-11-28 11:07         ` Greg Kroah-Hartman
@ 2021-11-28 19:23           ` Thomas Gleixner
  0 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-28 19:23 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

On Sun, Nov 28 2021 at 12:07, Greg Kroah-Hartman wrote:
> On Sat, Nov 27, 2021 at 08:45:18PM +0100, Thomas Gleixner wrote:
>> On Sat, Nov 27 2021 at 20:22, Thomas Gleixner wrote:
>> 
>> > On Sat, Nov 27 2021 at 13:19, Greg Kroah-Hartman wrote:
>> >> On Sat, Nov 27, 2021 at 02:22:38AM +0100, Thomas Gleixner wrote:
>> >>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> >>
>> >> No changelog?
>> >
>> > Bah. This one should not be there at all.
>> >
>> >> Anyway, why do we care about the number of descriptors?
>> 
>> The last part of this really cares about it for the dynamic extension
>> part, but that's core code which looks at the counter under the lock.
>
> Ah, that should be documented well as right now you are saying "this is
> done lockless" in the comment :)

I already zapped the whole patch as the function is not required for the
core code.

Thanks,

        tglx


* Re: [patch 31/32] genirq/msi: Simplify sysfs handling
  2021-11-28 11:07       ` Greg Kroah-Hartman
@ 2021-11-28 19:33         ` Thomas Gleixner
  0 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-28 19:33 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb, Neil Horman

On Sun, Nov 28 2021 at 12:07, Greg Kroah-Hartman wrote:
> On Sat, Nov 27, 2021 at 08:31:37PM +0100, Thomas Gleixner wrote:
>> On Sat, Nov 27 2021 at 13:32, Greg Kroah-Hartman wrote:
>> > On Sat, Nov 27, 2021 at 02:23:15AM +0100, Thomas Gleixner wrote:
>> >> The sysfs handling for MSI is a convoluted maze and it is in the way of
>> >> supporting dynamic expansion of the MSI-X vectors because it only supports
>> >> a one off bulk population/free of the sysfs entries.
>> >> 
>> >> Change it to do:
>> >> 
>> >>    1) Creating an empty sysfs attribute group when msi_device_data is
>> >>       allocated
>> >> 
>> >>    2) Populate the entries when the MSI descriptor is initialized
>> >
>> > How much later does this happen?  Can it happen while the device has a
>> > driver bound to it?
>> 
>> That's no later than before. It's when the driver initializes the
>> MSI[X] interrupts, which usually happens in the probe() function.
>> 
>> The difference is that the group (i.e. the directory) is created
>> slightly earlier.
>
> Ok, but that still happens when probe() is called for the driver,
> right?

Yes.

>> >> +static inline int msi_sysfs_create_group(struct device *dev)
>> >> +{
>> >> +	return devm_device_add_group(dev, &msi_irqs_group);
>> >
>> > Much nicer, but you changed the lifetime rules of when these attributes
>> > will be removed, is that ok?
>> 
>> The msi entries are removed at the same place as they are removed in the
>> current mainline code, i.e. when the device driver shuts the device
>> down and disables MSI[X], which happens usually during remove()
>> 
>> What's different now is that the empty group stays around a bit
>> longer. I don't see how that matters.
>
> How much longer does it stick around?
>
> What happens if this sequence happens:
> 	- probe()
> 	- disconnect()
> 	- probe()
> with the same device (i.e. the device is not removed from the system)?
>
> Which can happen as userspace can trigger disconnect() or even worse, if
> the driver is unloaded and then loaded again?  Will the second call to
> create this directory fail as it is not cleaned up yet?
>
> I can never remember if devm_*() stuff sticks around for the device
> lifecycle, or for the driver/device lifecycle, which is one big reason
> why I don't like that api...

Driver lifecycle AFAICT.

>> > I still worry that these attributes show up "after" the device is
>> > registered with the driver core, but hey, it's no worse than it
>> > currently is, so that's not caused by this patch series...
>> 
>> Does that registration happen before or after driver->probe()?
>
> During probe is a bit too late, but we can handle that as we are used to
> it.  If it happens after probe() succeeds, based on something else being
> asked for in the driver (like the device being opened), then userspace
> has no chance of ever noticing these attributes being added.
>
> But again, this isn't new to your code series, so I wouldn't worry about
> it.  Obviously userspace tools do not care or really notice these
> attributes at all otherwise the authors of them would have complained
> a long time ago :)

I have no idea how these attributes are used at all. Neil should know,
as he added them in the first place.

> So again, no real objection from me here, just meta-comments, except for
> the above thing with the devm_* call to ensure that the
> probe/disconnect/probe sequence will still work just as well as it does
> today.  Should be easy enough to test out by just unloading a module and
> then loading it again with this patch series applied.

That works just fine. Tested that already before posting. After module
removal the directory is gone.

Thanks,

        tglx


* Re: [patch 04/32] genirq/msi: Provide a set of advanced MSI accessors and iterators
  2021-11-28 19:22     ` Thomas Gleixner
@ 2021-11-29  9:26       ` Thomas Gleixner
  2021-11-29 14:01         ` Jason Gunthorpe
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-29  9:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Megha Dey, Ashok Raj, linux-pci, Greg Kroah-Hartman, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

Jason,

On Sun, Nov 28 2021 at 20:22, Thomas Gleixner wrote:
> On Sat, Nov 27 2021 at 21:00, Jason Gunthorpe wrote:
>> On Sat, Nov 27, 2021 at 02:22:33AM +0100, Thomas Gleixner wrote:
>> I understand why that isn't desirable at this patch where the storage
>> would have to be a list_head pointer, but still, seems like an odd
>> place to end up at the end of the series.
>>
>> eg add index here unused and then the last patch uses it instead of
>> __iter_idx.
>
> TBH, didn't think about doing just that. OTOH, given the experience of
> looking at the creative mess people create, this was probably also a
> vain attempt to make it harder in the future.
>
>> Also, I don't understand why filter was stored in the dev and not
>> passed into msi_next_desc() in the macro here?
>
> No real reason. I probably just stored it along with the rest. Lemme try
> that index approach.

After looking at all the call sites again, there is no real usage for
this local index variable.

If anything needs the index of a descriptor then it's available in the
descriptor itself. That won't change because the low level message write
code needs the index too and the only accessible storage there is
msi_desc.

So the "gain" would be to have a pointless 'unsigned long index;'
variable at all usage sites.

What for? The usage sites should not have to care about the storage
details of a facility they are using.

So it might look odd in the first place, but in the end it's convenient
and does not put any restrictions on changing the underlying mechanics.

Thanks,

        tglx



* Re: [patch 14/32] s390/pci: Rework MSI descriptor walk
  2021-11-27  1:23 ` [patch 14/32] s390/pci: Rework MSI descriptor walk Thomas Gleixner
@ 2021-11-29 10:31   ` Niklas Schnelle
  2021-11-29 13:04     ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Niklas Schnelle @ 2021-11-29 10:31 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

On Sat, 2021-11-27 at 02:23 +0100, Thomas Gleixner wrote:
> Replace the about to vanish iterators and make use of the filtering.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-s390@vger.kernel.org
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@de.ibm.com>
> ---
>  arch/s390/pci/pci_irq.c |    6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> --- a/arch/s390/pci/pci_irq.c
> +++ b/arch/s390/pci/pci_irq.c
> @@ -303,7 +303,7 @@ int arch_setup_msi_irqs(struct pci_dev *
>  
>  	/* Request MSI interrupts */
>  	hwirq = bit;
> -	for_each_pci_msi_entry(msi, pdev) {
> +	msi_for_each_desc(msi, &pdev->dev, MSI_DESC_NOTASSOCIATED) {
>  		rc = -EIO;
>  		if (hwirq - bit >= msi_vecs)
>  			break;
> @@ -362,9 +362,7 @@ void arch_teardown_msi_irqs(struct pci_d
>  		return;
>  
>  	/* Release MSI interrupts */
> -	for_each_pci_msi_entry(msi, pdev) {
> -		if (!msi->irq)
> -			continue;
> +	msi_for_each_desc(msi, &pdev->dev, MSI_DESC_ASSOCIATED) {
>  		irq_set_msi_desc(msi->irq, NULL);
>  		irq_free_desc(msi->irq);
>  		msi->msg.address_lo = 0;
> 

Hi Thomas,

while the change looks good to me I ran into some trouble trying to
test it. I tried with the git repository you linked in the cover
letter:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v1-part-3

But with that I get the following linker error on s390:

s390x-11.2.0-ld: drivers/pci/msi/legacy.o: in function `pci_msi_legacy_setup_msi_irqs':
/home/nschnelle/mainline/drivers/pci/msi/legacy.c:72: undefined reference to `msi_device_populate_sysfs'
s390x-11.2.0-ld: drivers/pci/msi/legacy.o: in function `pci_msi_legacy_teardown_msi_irqs':
/home/nschnelle/mainline/drivers/pci/msi/legacy.c:78: undefined reference to `msi_device_destroy_sysfs'
make: *** [Makefile:1161: vmlinux] Error 1

This is caused by a misspelling of CONFIG_PCI_MSI_ARCH_FALLBACKS
(missing the final S) in kernel/irq/msi.c. With that fixed everything
builds and MSI IRQs work fine. So with that fixed you have my

Acked-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>

Best regards,
Niklas



* Re: [patch 14/32] s390/pci: Rework MSI descriptor walk
  2021-11-29 10:31   ` Niklas Schnelle
@ 2021-11-29 13:04     ` Thomas Gleixner
  0 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-29 13:04 UTC (permalink / raw)
  To: Niklas Schnelle, LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, linux-s390, Heiko Carstens,
	Christian Borntraeger, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb

Niklas,

On Mon, Nov 29 2021 at 11:31, Niklas Schnelle wrote:
> On Sat, 2021-11-27 at 02:23 +0100, Thomas Gleixner wrote:
>
> while the change looks good to me I ran into some trouble trying to
> test it. I tried with the git repository you linked in the cover
> letter:
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git msi-v1-part-3
>
> But with that I get the following linker error on s390:
>
> s390x-11.2.0-ld: drivers/pci/msi/legacy.o: in function `pci_msi_legacy_setup_msi_irqs':
> /home/nschnelle/mainline/drivers/pci/msi/legacy.c:72: undefined reference to `msi_device_populate_sysfs'
> s390x-11.2.0-ld: drivers/pci/msi/legacy.o: in function `pci_msi_legacy_teardown_msi_irqs':
> /home/nschnelle/mainline/drivers/pci/msi/legacy.c:78: undefined reference to `msi_device_destroy_sysfs'
> make: *** [Makefile:1161: vmlinux] Error 1

Yes, that got reported before and I fixed it locally already.

> This is caused by a misspelling of CONFIG_PCI_MSI_ARCH_FALLBACKS
> (missing the final S) in kernel/irq/msi.c. With that fixed everything
> builds and MSI IRQs work fine. So with that fixed you have my
>
> Acked-by: Niklas Schnelle <schnelle@linux.ibm.com>
> Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>

Thanks for testing and dealing with my ineptness.

       tglx


* Re: [patch 04/32] genirq/msi: Provide a set of advanced MSI accessors and iterators
  2021-11-29  9:26       ` Thomas Gleixner
@ 2021-11-29 14:01         ` Jason Gunthorpe
  2021-11-29 14:46           ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-11-29 14:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Megha Dey, Ashok Raj, linux-pci, Greg Kroah-Hartman, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

On Mon, Nov 29, 2021 at 10:26:11AM +0100, Thomas Gleixner wrote:
> Jason,
> 
> On Sun, Nov 28 2021 at 20:22, Thomas Gleixner wrote:
> > On Sat, Nov 27 2021 at 21:00, Jason Gunthorpe wrote:
> >> On Sat, Nov 27, 2021 at 02:22:33AM +0100, Thomas Gleixner wrote:
> >> I understand why that isn't desirable at this patch where the storage
> >> would have to be a list_head pointer, but still, seems like an odd
> >> place to end up at the end of the series.
> >>
> >> eg add index here unused and then the last patch uses it instead of
> >> __iter_idx.
> >
> > TBH, didn't think about doing just that. OTH, given the experience of
> > looking at the creative mess people create, this was probably also a
> > vain attempt to make it harder in the future.
> >
> >> Also, I don't understand why filter was stored in the dev and not
> >> passed into msi_next_desc() in the macro here?
> >
> > No real reason. I probably just stored it along with the rest. Lemme try
> > that index approach.
> 
> After looking at all the call sites again, there is no real usage for
> this local index variable.
> 
> If anything needs the index of a descriptor then it's available in the
> descriptor itself. That won't change because the low level message write
> code needs the index too and the only accessible storage there is
> msi_desc.

Oh, that makes it simpler, just use the current desc->index as the
input to the xa_for_each_start() and then there should be no need of
hidden state?

> What for? The usage sites should not have to care about the storage
> details of a facility they are using.

Generally for_each things shouldn't have hidden state that prevents
them from being nested. It is just an unexpected design pattern..
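
To make the nesting point concrete, here is a toy C sketch (invented names; this is not the kernel's msi_for_each_desc() implementation): an iterator macro whose cursor lives in the shared device object silently terminates early when nested, while a caller-supplied cursor composes fine.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical illustration only.  'struct dev' and both macros are
 * invented for the example and do not match the kernel's MSI code.
 */
struct dev {
	int nvec;
	size_t __iter_idx;	/* hidden state shared by all users */
};

/* Hidden-state variant: every loop over the same dev shares one cursor. */
#define for_each_vec_hidden(d) \
	for ((d)->__iter_idx = 0; (d)->__iter_idx < (size_t)(d)->nvec; \
	     (d)->__iter_idx++)

/* Explicit-state variant: each loop owns its cursor, so nesting works. */
#define for_each_vec(i, d) \
	for ((i) = 0; (i) < (size_t)(d)->nvec; (i)++)

int count_pairs(struct dev *d)
{
	size_t i, j;
	int pairs = 0;

	for_each_vec(i, d)
		for_each_vec(j, d)
			pairs++;
	return pairs;		/* nvec * nvec, as expected */
}

int count_pairs_hidden(struct dev *d)
{
	int pairs = 0;

	for_each_vec_hidden(d)
		for_each_vec_hidden(d)	/* inner loop clobbers the cursor */
			pairs++;
	return pairs;		/* outer loop ends after one pass: only nvec */
}
```

With nvec = 3, the explicit-cursor version visits all 9 pairs, while the hidden-state version visits only 3 because the inner loop leaves the shared cursor at the end.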

Jason

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 04/32] genirq/msi: Provide a set of advanced MSI accessors and iterators
  2021-11-29 14:01         ` Jason Gunthorpe
@ 2021-11-29 14:46           ` Thomas Gleixner
  0 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-29 14:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Megha Dey, Ashok Raj, linux-pci, Greg Kroah-Hartman, linux-s390,
	Heiko Carstens, Christian Borntraeger, Jon Mason, Dave Jiang,
	Allen Hubbe, linux-ntb

Jason,

On Mon, Nov 29 2021 at 10:01, Jason Gunthorpe wrote:
> On Mon, Nov 29, 2021 at 10:26:11AM +0100, Thomas Gleixner wrote:
>> After looking at all the call sites again, there is no real usage for
>> this local index variable.
>> 
>> If anything needs the index of a descriptor then it's available in the
>> descriptor itself. That won't change because the low level message write
>> code needs the index too and the only accessible storage there is
>> msi_desc.
>
> Oh, that makes it simpler, just use the current desc->index as the
> input to the xa_for_each_start() and then there should be no need of
> hidden state?

That works for alloc, but on free that's going to end up badly.

>> What for? The usage sites should not have to care about the storage
>> details of a facility they are using.
>
> Generally for_each things shouldn't have hidden state that prevents
> them from being nested. It is just an unexpected design pattern..

I'm not seeing any sensible use case for:

     msi_for_each_desc(dev)
        msi_for_each_desc(dev)

If that ever comes forth, I'm happy to debate this further :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-27  1:24 ` [patch 21/32] NTB/msi: Convert to msi_on_each_desc() Thomas Gleixner
@ 2021-11-29 18:21   ` Logan Gunthorpe
  2021-11-29 20:51     ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Logan Gunthorpe @ 2021-11-29 18:21 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger



On 2021-11-26 6:23 p.m., Thomas Gleixner wrote:
> Replace the about-to-vanish iterators, make use of the filtering and take
> the descriptor lock around the iteration.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Jon Mason <jdmason@kudzu.us>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Allen Hubbe <allenbh@gmail.com>
> Cc: linux-ntb@googlegroups.com

This patch looks good to me:

Reviewed-by: Logan Gunthorpe <logang@deltatee.com>

Thanks,

Logan

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-29 18:21   ` Logan Gunthorpe
@ 2021-11-29 20:51     ` Thomas Gleixner
  2021-11-29 22:27       ` Logan Gunthorpe
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-29 20:51 UTC (permalink / raw)
  To: Logan Gunthorpe, LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger

Logan,

On Mon, Nov 29 2021 at 11:21, Logan Gunthorpe wrote:
> On 2021-11-26 6:23 p.m., Thomas Gleixner wrote:
>> Replace the about-to-vanish iterators, make use of the filtering and take
>> the descriptor lock around the iteration.
>
> This patch looks good to me:
>
> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>

thanks for having a look at this. While I have your attention, I have a
question related to NTB.

The switchtec driver is the only one which uses PCI_IRQ_VIRTUAL in order
to allocate non-hardware backed MSI-X descriptors.

AFAIU these descriptors are not MSI-X descriptors in the regular sense
of PCI/MSI-X. They are allocated via the PCI/MSI mechanism but their
usage is somewhere in NTB which has nothing to do with the way the
real MSI-X interrupts of a device work which explains why we have those
pci.msi_attrib.is_virtual checks all over the place.

I assume that there are other variants feeding into NTB which can handle
that without this PCI_IRQ_VIRTUAL quirk, but TBH, I got completely lost
in that code.

Could you please shed some light on the larger picture of this?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-29 20:51     ` Thomas Gleixner
@ 2021-11-29 22:27       ` Logan Gunthorpe
  2021-11-29 22:50         ` Dave Jiang
                           ` (2 more replies)
  0 siblings, 3 replies; 141+ messages in thread
From: Logan Gunthorpe @ 2021-11-29 22:27 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger



On 2021-11-29 1:51 p.m., Thomas Gleixner wrote:
> Logan,
> 
> On Mon, Nov 29 2021 at 11:21, Logan Gunthorpe wrote:
>> On 2021-11-26 6:23 p.m., Thomas Gleixner wrote:
>>> Replace the about-to-vanish iterators, make use of the filtering and take
>>> the descriptor lock around the iteration.
>>
>> This patch looks good to me:
>>
>> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> 
> thanks for having a look at this. While I have your attention, I have a
> question related to NTB.
> 
> The switchtec driver is the only one which uses PCI_IRQ_VIRTUAL in order
> to allocate non-hardware backed MSI-X descriptors.
> 
> AFAIU these descriptors are not MSI-X descriptors in the regular sense
> of PCI/MSI-X. They are allocated via the PCI/MSI mechanism but their
> usage is somewhere in NTB which has nothing to do with the way the
> real MSI-X interrupts of a device work which explains why we have those
> pci.msi_attrib.is_virtual checks all over the place.
> 
> I assume that there are other variants feeding into NTB which can handle
> that without this PCI_IRQ_VIRTUAL quirk, but TBH, I got completely lost
> in that code.
> 
> Could you please shed some light on the larger picture of this?

Yes, of course. I'll try to explain:

The NTB code here is trying to create an MSI interrupt that is not
triggered by the PCI device itself but from a peer behind the
Non-Transparent Bridge (or, more carefully: from the CPU's perspective
the interrupt will come from the PCI device, but nothing in the PCI
device's firmware or hardware will have triggered the interrupt).

In most cases, the NTB code needs more interrupts than the hardware
actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
for: it allows the driver to request more interrupts than the hardware
advertises (ie. pci_msix_vec_count()). These extra interrupts are
created, but get flagged with msi_attrib.is_virtual which ensures
functions that program the MSI-X table don't try to write past the end
of the hardware's table.

The NTB code in drivers/ntb/msi.c uses these virtual MSI-X interrupts.
(Or rather it can use any interrupt: it doesn't care whether it's virtual
or not, it would be fine if it is just a spare interrupt in hardware,
but in practice, it will usually be a virtual one). The code uses these
interrupts by setting up a memory window in the bridge to cover the MMIO
addresses of MSI-X interrupts. It communicates the offsets of the
interrupts (and the MSI message data) to the peer so that the peer can
trigger the interrupt simply by writing the message data to its side of
the memory window. (In the code: ntbm_msi_request_threaded_irq() is
called to request an interrupt which fills in the ntb_msi_desc with the
offset and data, which is transferred to the peer which would then use
ntb_msi_peer_trigger() to trigger the interrupt.)
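
The trigger mechanism can be modelled in a few lines of plain C (names invented for illustration; the real code is in drivers/ntb/msi.c and the "window" there is MMIO forwarded across the bridge, not a shared buffer):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the scheme described above.  One side hands out an
 * (offset, data) pair for a vector; the peer triggers the interrupt by
 * writing the data through its view of the memory window.
 */
struct toy_msi_desc {
	uint64_t addr_offset;	/* offset of the vector inside the window */
	uint32_t data;		/* MSI message data to write */
};

#define WINDOW_SIZE 4096
static uint32_t memory_window[WINDOW_SIZE / sizeof(uint32_t)];

/* Local side: publish the (offset, data) pair for one vector. */
static void toy_request_irq(unsigned int vec, struct toy_msi_desc *desc)
{
	desc->addr_offset = vec * sizeof(uint32_t);
	desc->data = 0x4000 + vec;	/* arbitrary message data */
}

/* Peer side: "trigger" the interrupt by writing through the window. */
static void toy_peer_trigger(const struct toy_msi_desc *desc)
{
	memory_window[desc->addr_offset / sizeof(uint32_t)] = desc->data;
}
```

In the real driver the (offset, data) pair travels to the peer over a side channel, and the peer's write is translated by the bridge into an MSI message that the IOMMU sees as coming from the NTB device.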

Existing NTB hardware does already have what's called a doorbell which
provides the same functionally as the above technique. However, existing
hardware implementations of doorbells have significant latency and thus
slow down performance substantially. Implementing the MSI interrupts as
described above increased the performance of ntb_transport by more than
three times[1].

There aren't really other "variants". In theory, IDT hardware would also
require the same quirk but the drivers in the tree aren't quite up to
snuff and don't even support ntb_transport (so nobody has added
support). Intel and AMD drivers could probably do this as well (provided
they have extra memory windows) but I don't know that anyone has tried.

Let me know if anything is still unclear or you have further questions.
You can also read the last posting of the patch series[2] if you'd like
more information.

Logan

[1] 2b0569b3b7e6 ("NTB: Add MSI interrupt support to ntb_transport")
[2]
https://lore.kernel.org/all/20190523223100.5526-1-logang@deltatee.com/T/#u

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-29 22:27       ` Logan Gunthorpe
@ 2021-11-29 22:50         ` Dave Jiang
  2021-11-29 23:31         ` Jason Gunthorpe
  2021-11-30  0:29         ` Thomas Gleixner
  2 siblings, 0 replies; 141+ messages in thread
From: Dave Jiang @ 2021-11-29 22:50 UTC (permalink / raw)
  To: Logan Gunthorpe, Thomas Gleixner, LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger


On 11/29/2021 3:27 PM, Logan Gunthorpe wrote:
>
> On 2021-11-29 1:51 p.m., Thomas Gleixner wrote:
>> Logan,
>>
>> On Mon, Nov 29 2021 at 11:21, Logan Gunthorpe wrote:
>>> On 2021-11-26 6:23 p.m., Thomas Gleixner wrote:
>>>> Replace the about-to-vanish iterators, make use of the filtering and take
>>>> the descriptor lock around the iteration.
>>> This patch looks good to me:
>>>
>>> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
>> thanks for having a look at this. While I have your attention, I have a
>> question related to NTB.
>>
>> The switchtec driver is the only one which uses PCI_IRQ_VIRTUAL in order
>> to allocate non-hardware backed MSI-X descriptors.
>>
>> AFAIU these descriptors are not MSI-X descriptors in the regular sense
>> of PCI/MSI-X. They are allocated via the PCI/MSI mechanism but their
>> usage is somewhere in NTB which has nothing to do with the way the
>> real MSI-X interrupts of a device work which explains why we have those
>> pci.msi_attrib.is_virtual checks all over the place.
>>
>> I assume that there are other variants feeding into NTB which can handle
>> that without this PCI_IRQ_VIRTUAL quirk, but TBH, I got completely lost
>> in that code.
>>
>> Could you please shed some light on the larger picture of this?
> Yes, of course. I'll try to explain:
>
> The NTB code here is trying to create an MSI interrupt that is not
> triggered by the PCI device itself but from a peer behind the
> Non-Transparent Bridge (or, more carefully: from the CPU's perspective
> the interrupt will come from the PCI device, but nothing in the PCI
> device's firmware or hardware will have triggered the interrupt).
>
> In most cases, the NTB code needs more interrupts than the hardware
> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
> for: it allows the driver to request more interrupts than the hardware
> advertises (ie. pci_msix_vec_count()). These extra interrupts are
> created, but get flagged with msi_attrib.is_virtual which ensures
> functions that program the MSI-X table don't try to write past the end
> of the hardware's table.
>
> The NTB code in drivers/ntb/msi.c uses these virtual MSI-X interrupts.
> (Or rather it can use any interrupt: it doesn't care whether it's virtual
> or not, it would be fine if it is just a spare interrupt in hardware,
> but in practice, it will usually be a virtual one). The code uses these
> interrupts by setting up a memory window in the bridge to cover the MMIO
> addresses of MSI-X interrupts. It communicates the offsets of the
> interrupts (and the MSI message data) to the peer so that the peer can
> trigger the interrupt simply by writing the message data to its side of
> the memory window. (In the code: ntbm_msi_request_threaded_irq() is
> called to request an interrupt which fills in the ntb_msi_desc with the
> offset and data, which is transferred to the peer which would then use
> ntb_msi_peer_trigger() to trigger the interrupt.)
>
> Existing NTB hardware does already have what's called a doorbell which
> provides the same functionality as the above technique. However, existing
> hardware implementations of doorbells have significant latency and thus
> slow down performance substantially. Implementing the MSI interrupts as
> described above increased the performance of ntb_transport by more than
> three times[1].
>
> There aren't really other "variants". In theory, IDT hardware would also
> require the same quirk but the drivers in the tree aren't quite up to
> snuff and don't even support ntb_transport (so nobody has added
> support). Intel and AMD drivers could probably do this as well (provided
> they have extra memory windows) but I don't know that anyone has tried.

The Intel driver used to do something similar, bypassing doorbell 
hardware errata on pre-Skylake Xeon hardware, though that code was never 
upstream. I'd 
like to get this working for the performance reasons you mentioned. I 
just really need to find some time to test this with the second memory 
window Intel NTB has.


>
> Let me know if anything is still unclear or you have further questions.
> You can also read the last posting of the patch series[2] if you'd like
> more information.
>
> Logan
>
> [1] 2b0569b3b7e6 ("NTB: Add MSI interrupt support to ntb_transport")
> [2]
> https://lore.kernel.org/all/20190523223100.5526-1-logang@deltatee.com/T/#u
>

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-29 22:27       ` Logan Gunthorpe
  2021-11-29 22:50         ` Dave Jiang
@ 2021-11-29 23:31         ` Jason Gunthorpe
  2021-11-29 23:52           ` Logan Gunthorpe
  2021-11-30  0:29         ` Thomas Gleixner
  2 siblings, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-11-29 23:31 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zyngier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger

On Mon, Nov 29, 2021 at 03:27:20PM -0700, Logan Gunthorpe wrote:

> In most cases, the NTB code needs more interrupts than the hardware
> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
> for: it allows the driver to request more interrupts than the hardware
> advertises (ie. pci_msix_vec_count()). These extra interrupts are
> created, but get flagged with msi_attrib.is_virtual which ensures
> functions that program the MSI-X table don't try to write past the end
> of the hardware's table.

AFAICT what you've described is what Intel is calling IMS in other
contexts.

IMS is fundamentally a way to control MSI interrupt descriptors that
are not accessed through PCI SIG compliant means. In this case the NTB
driver has to do its magic to relay the addr/data pairs to the real
MSI storage in the hidden devices.

PCI_IRQ_VIRTUAL should probably be fully replaced by the new dynamic
APIs in the fullness of time..

> Existing NTB hardware does already have what's called a doorbell which
> provides the same functionally as the above technique. However, existing
> hardware implementations of doorbells have significant latency and thus
> slow down performance substantially. Implementing the MSI interrupts as
> described above increased the performance of ntb_transport by more than
> three times[1].

Does the doorbell scheme allow as many interrupts?

Jason

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-29 23:31         ` Jason Gunthorpe
@ 2021-11-29 23:52           ` Logan Gunthorpe
  2021-11-30  0:01             ` Jason Gunthorpe
  0 siblings, 1 reply; 141+ messages in thread
From: Logan Gunthorpe @ 2021-11-29 23:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zyngier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger



On 2021-11-29 4:31 p.m., Jason Gunthorpe wrote:
> On Mon, Nov 29, 2021 at 03:27:20PM -0700, Logan Gunthorpe wrote:
> 
>> In most cases, the NTB code needs more interrupts than the hardware
>> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
>> for: it allows the driver to request more interrupts than the hardware
>> advertises (ie. pci_msix_vec_count()). These extra interrupts are
>> created, but get flagged with msi_attrib.is_virtual which ensures
>> functions that program the MSI-X table don't try to write past the end
>> of the hardware's table.
> 
> AFAICT what you've described is what Intel is calling IMS in other
> contexts.
> 
> IMS is fundamentally a way to control MSI interrupt descriptors that
> are not accessed through PCI SIG compliant means. In this case the NTB
> driver has to do its magic to relay the addr/data pairs to the real
> MSI storage in the hidden devices.

With current applications, it isn't that there is real "MSI storage"
anywhere; the device on the other side of the bridge is always another
Linux host which holds the address (or rather mw offset) and data in
memory to use when it needs to trigger the interrupt of the other
machine. There are many prototypes and proprietary messes that try to
have other PCI devices (e.g. NVMe) behind the non-transparent bridge,
but the Linux NTB subsystem has no support for this.

> PCI_IRQ_VIRTUAL should probably be fully replaced by the new dynamic
> APIs in the fullness of time..

Perhaps, I don't really know much about IMS or how close a match it is.

>> Existing NTB hardware does already have what's called a doorbell which
>> provides the same functionally as the above technique. However, existing
>> hardware implementations of doorbells have significant latency and thus
>> slow down performance substantially. Implementing the MSI interrupts as
>> described above increased the performance of ntb_transport by more than
>> three times[1].
> 
> Does the doorbell scheme allow as many interrupts?

No, but for current applications there are plenty of doorbells.

Switchtec hardware (and I think other hardware) typically have 64
doorbells for the entire network (they must be split among the number of
hosts in the network; a two host system could have 32 per host). The NTB
subsystem in Linux only currently supports 2 hosts, but switchtec
hardware supports up to 48 hosts, in which case you might only have 1
doorbell per host and that might be limiting depending on the application.
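
The doorbell budget is simple arithmetic: a fixed pool divided across the hosts in the network (numbers from the paragraph above; the helper name is invented):

```c
#include <assert.h>

/*
 * Doorbells are a fixed per-network pool split among all hosts.
 * Switchtec-class hardware: 64 total; up to 48 hosts.
 */
static unsigned int doorbells_per_host(unsigned int total, unsigned int hosts)
{
	return hosts ? total / hosts : 0;
}
```

So a two-host system gets 32 doorbells per host, while a full 48-host network is down to 1, which is where the limit starts to bite.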

Logan

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-29 23:52           ` Logan Gunthorpe
@ 2021-11-30  0:01             ` Jason Gunthorpe
  0 siblings, 0 replies; 141+ messages in thread
From: Jason Gunthorpe @ 2021-11-30  0:01 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Thomas Gleixner, LKML, Bjorn Helgaas, Marc Zyngier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger

On Mon, Nov 29, 2021 at 04:52:35PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2021-11-29 4:31 p.m., Jason Gunthorpe wrote:
> > On Mon, Nov 29, 2021 at 03:27:20PM -0700, Logan Gunthorpe wrote:
> > 
> >> In most cases, the NTB code needs more interrupts than the hardware
> >> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
> >> for: it allows the driver to request more interrupts than the hardware
> >> advertises (ie. pci_msix_vec_count()). These extra interrupts are
> >> created, but get flagged with msi_attrib.is_virtual which ensures
> >> functions that program the MSI-X table don't try to write past the end
> >> of the hardware's table.
> > 
> > AFAICT what you've described is what Intel is calling IMS in other
> > contexts.
> > 
> > IMS is fundamentally a way to control MSI interrupt descriptors that
> > are not accessed through PCI SIG compliant means. In this case the NTB
> > driver has to do its magic to relay the addr/data pairs to the real
> > MSI storage in the hidden devices.
> 
> With current applications, it isn't that there is real "MSI storage"
> anywhere; the device on the other side of the bridge is always another
> Linux host which holds the address (or rather mw offset) and data in
> memory to use when it needs to trigger the interrupt of the other
> machine. 

Sure, that is fine "MSI Storage". The triggering device only needs to
store the addr/data pair someplace to be "MSI Storage".

Jason

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-29 22:27       ` Logan Gunthorpe
  2021-11-29 22:50         ` Dave Jiang
  2021-11-29 23:31         ` Jason Gunthorpe
@ 2021-11-30  0:29         ` Thomas Gleixner
  2021-11-30 19:21           ` Logan Gunthorpe
  2 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-30  0:29 UTC (permalink / raw)
  To: Logan Gunthorpe, LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86

Logan,

On Mon, Nov 29 2021 at 15:27, Logan Gunthorpe wrote:
> On 2021-11-29 1:51 p.m., Thomas Gleixner wrote:
>> The switchtec driver is the only one which uses PCI_IRQ_VIRTUAL in order
>> to allocate non-hardware backed MSI-X descriptors.
>> 
>> AFAIU these descriptors are not MSI-X descriptors in the regular sense
>> of PCI/MSI-X. They are allocated via the PCI/MSI mechanism but their
>> usage is somewhere in NTB which has nothing to do with the way the
>> real MSI-X interrupts of a device work which explains why we have those
>> pci.msi_attrib.is_virtual checks all over the place.
>> 
>> I assume that there are other variants feeding into NTB which can handle
>> that without this PCI_IRQ_VIRTUAL quirk, but TBH, I got completely lost
>> in that code.
>> 
>> Could you please shed some light on the larger picture of this?
>
> Yes, of course. I'll try to explain:
>
> The NTB code here is trying to create an MSI interrupt that is not
> triggered by the PCI device itself but from a peer behind the
> Non-Transparent Bridge (or, more carefully: from the CPU's perspective
> the interrupt will come from the PCI device, but nothing in the PCI
> device's firmware or hardware will have triggered the interrupt).

That far I got.

> In most cases, the NTB code needs more interrupts than the hardware
> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
> for: it allows the driver to request more interrupts than the hardware
> advertises (ie. pci_msix_vec_count()). These extra interrupts are
> created, but get flagged with msi_attrib.is_virtual which ensures
> functions that program the MSI-X table don't try to write past the end
> of the hardware's table.
>
> The NTB code in drivers/ntb/msi.c uses these virtual MSI-X interrupts.
> (Or rather it can use any interrupt: it doesn't care whether it's virtual
> or not, it would be fine if it is just a spare interrupt in hardware,
> but in practice, it will usually be a virtual one). The code uses these
> interrupts by setting up a memory window in the bridge to cover the MMIO
> addresses of MSI-X interrupts. It communicates the offsets of the
> interrupts (and the MSI message data) to the peer so that the peer can
> trigger the interrupt simply by writing the message data to its side of
> the memory window. (In the code: ntbm_msi_request_threaded_irq() is
> called to request an interrupt which fills in the ntb_msi_desc with the
> offset and data, which is transferred to the peer which would then use
> ntb_msi_peer_trigger() to trigger the interrupt.)

So the whole thing looks like this:

    PCIe
     |
     |   ________________________
     |   |                       |
     |   | NTB                   |
     |   |                       |
     |   | PCI config space      |
     |   |     MSI-X space       |
     |   |_______________________|
     |---|                       |
         | Memory window A       |
         | Memory window ..      |
         | Memory window N       |
         |_______________________|

The peers can inject an interrupt through the associated memory window
like the NTB device itself does via the real MSI-X interrupts by writing
the associated MSI message data to the associated address (either
directly to the targeted APIC or to the IOMMU table).

As this message goes through the NTB device it's tagged as originating
from the NTB PCIe device as seen by the IOMMU/Interrupt remapping unit.

So far so good.

I completely understand why you went down the road to add this
PCI_IRQ_VIRTUAL "feature", but TBH this should have never happened.

Why?

These interrupts have absolutely nothing to do with PCI/MSI-X as defined
by the spec and as handled by the PCI/MSI core.

The fact that the MSI message is transported by PCIe does not change
that at all, nor does it matter that from an IOMMU/Interrupt
remapping POV these messages are tagged to come from that particular
PCIe device.

At the conceptual level these interrupts are in separate irq domains:

     |   _______________________
     |   |                      |
     |   | NTB                  |
     |   |                      |
     |   | PCI config space     |
     |   |     MSI-X space      | <- #1 Global or per IOMMU zone PCI/MSI domain
     |   |______________________|
     |---|                      |
         | Memory window A      |
         | Memory window ..     | <- #2 Per device NTB domain
         | Memory window N      |
         |______________________|

You surely can see the distinction between #1 and #2, right?

And because they are in different domains, they simply cannot share the
interrupt chip implementation taking care of the particular oddities of
the "hardware". I know, you made it 'shared' by adding these
'is_virtual' conditionals all over the place, but that pretty much
defeats the purpose of having separate domains.

This is also reflected in drivers/ntb/msi.c::ntbm_msi_request_threaded_irq():

	for_each_pci_msi_entry(entry, ntb->pdev) {
		if (irq_has_action(entry->irq))
			continue;

What on earth guarantees that this check has any relevance? Just because
an entry does not have an interrupt requested on it does not tell you
anything.

That might be an actual entry which belongs to the physical PCI NTB
device MSI-X space which is not requested yet. Just because that
switchtec device does not fall into that category does not make it any
more correct.

Seriously, infrastructure code which relies on undocumented assumptions
based on a particular hardware device is broken by definition.
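
For contrast, a toy model of filtering on explicit descriptor state rather than on irq_has_action() (invented names, loosely modelled on the MSI_DESC_* filters this series introduces):

```c
#include <assert.h>
#include <stddef.h>

/* Toy descriptor: an 'irq' of 0 means no Linux interrupt is associated. */
struct toy_desc {
	unsigned int irq;
};

enum toy_filter {
	TOY_DESC_ALL,
	TOY_DESC_ASSOCIATED,	/* has a Linux interrupt number */
	TOY_DESC_NOTASSOCIATED,	/* free descriptor */
};

static int toy_match(const struct toy_desc *d, enum toy_filter f)
{
	switch (f) {
	case TOY_DESC_ALL:		return 1;
	case TOY_DESC_ASSOCIATED:	return d->irq != 0;
	case TOY_DESC_NOTASSOCIATED:	return d->irq == 0;
	}
	return 0;
}

/*
 * Count descriptors matching a filter.  Callers state the descriptor
 * state they want instead of open-coding checks like irq_has_action().
 */
static size_t toy_count(const struct toy_desc *descs, size_t n,
			enum toy_filter f)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if (toy_match(&descs[i], f))
			hits++;
	return hits;
}
```

The point is that "free for use" becomes an explicit property of the descriptor rather than an inference from whether a handler happens to be requested at the moment.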

I'm way too tired to come up with a proper solution for that, but that
PCI_IRQ_VIRTUAL has to die ASAP.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-30  0:29         ` Thomas Gleixner
@ 2021-11-30 19:21           ` Logan Gunthorpe
  2021-11-30 19:48             ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Logan Gunthorpe @ 2021-11-30 19:21 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86



On 2021-11-29 5:29 p.m., Thomas Gleixner wrote:
> At the conceptual level these interrupts are in separate irq domains:
> 
>      |   _______________________
>      |   |                      |
>      |   | NTB                  |
>      |   |                      |
>      |   | PCI config space     |
>      |   |     MSI-X space      | <- #1 Global or per IOMMU zone PCI/MSI domain
>      |   |______________________|
>      |---|                      |
>          | Memory window A      |
>          | Memory window ..     | <- #2 Per device NTB domain
>          | Memory window N      |
>          |______________________|
> 
> You surely can see the distinction between #1 and #2, right?

I wouldn't say that's entirely obvious or even irrefutable. However, I'm
certainly open to this approach if it improves the code.

> I'm way too tired to come up with a proper solution for that, but that
> PCI_IRQ_VIRTUAL has to die ASAP.

I'm willing to volunteer a bit of my time to clean this up, but I'd need
a bit more direction on what a proper solution would look like. The MSI
domain code is far from well documented nor is it easy to understand.

Logan

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-30 19:21           ` Logan Gunthorpe
@ 2021-11-30 19:48             ` Thomas Gleixner
  2021-11-30 20:14               ` Logan Gunthorpe
  2021-11-30 20:28               ` Jason Gunthorpe
  0 siblings, 2 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-30 19:48 UTC (permalink / raw)
  To: Logan Gunthorpe, LKML
  Cc: Bjorn Helgaas, Marc Zyngier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86

Logan,

On Tue, Nov 30 2021 at 12:21, Logan Gunthorpe wrote:
> On 2021-11-29 5:29 p.m., Thomas Gleixner wrote:
>> I'm way too tired to come up with a proper solution for that, but that
>> PCI_IRQ_VIRTUAL has to die ASAP.
>
> I'm willing to volunteer a bit of my time to clean this up, but I'd need
> a bit more direction on what a proper solution would look like. The MSI
> domain code is far from well documented nor is it easy to understand.

Fair enough. I'm struggling with finding time to document that properly.

I've not yet made my mind up what the best way forward for this is, but
I have a few ideas which I want to explore deeper.

But the most important question is right now on which architectures
these switchtec virtual interrupt things are used.

If it's used on any architecture which does not use hierarchical
irqdomains for MSI (x86, arm, arm64, power64), then using irqdomains is
obviously a non-starter unless falling back to a single interrupt would
not be considered a regression :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-30 19:48             ` Thomas Gleixner
@ 2021-11-30 20:14               ` Logan Gunthorpe
  2021-11-30 20:28               ` Jason Gunthorpe
  1 sibling, 0 replies; 141+ messages in thread
From: Logan Gunthorpe @ 2021-11-30 20:14 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Jason Gunthorpe, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Doug Meyer



On 2021-11-30 12:48 p.m., Thomas Gleixner wrote:
> On Tue, Nov 30 2021 at 12:21, Logan Gunthorpe wrote:
>> On 2021-11-29 5:29 p.m., Thomas Gleixner wrote:
>>> I'm way too tired to come up with a proper solution for that, but that
>>> PCI_IRQ_VIRTUAL has to die ASAP.
>>
>> I'm willing to volunteer a bit of my time to clean this up, but I'd need
>> a bit more direction on what a proper solution would look like. The MSI
>> domain code is far from well documented, nor is it easy to understand.
> 
> Fair enough. I'm struggling with finding time to document that properly.
> 
> I've not yet made my mind up what the best way forward for this is, but
> I have a few ideas which I want to explore deeper.
> 
> But the most important question is right now on which architectures
> these switchtec virtual interrupt things are used.
> 
> If it's used on any architecture which does not use hierarchical
> irqdomains for MSI (x86, arm, arm64, power64), then using irqdomains is
> obviously a non-starter unless falling back to a single interrupt would
> not be considered a regression :)

Well that's a hard question to answer. The switchtec hardware is a very
generic PCI switch that can technically be used on any architecture that
supports PCI. However, NTB is a very specialized use case and only a
handful of companies have attempted to use it for anything, as is. I
can't say I know for sure, but my gut says the vast majority (if not
all) are using x86. Maybe some are trying with arm64 or power64, but the
only architecture not in that list that I'd conceivably think someone
might care about down the road might be riscv.

Most other NTB hardware (specifically AMD and Intel) is built into x86
CPUs, so it should be fine with this restriction.

I personally expect it would be fine to add a dependency on
CONFIG_IRQ_DOMAIN_HIERARCHY to CONFIG_NTB_MSI. However, I've copied Doug
from GigaIO who's the only user I know that might have a better informed
opinion on this.

Logan

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-30 19:48             ` Thomas Gleixner
  2021-11-30 20:14               ` Logan Gunthorpe
@ 2021-11-30 20:28               ` Jason Gunthorpe
  2021-11-30 21:23                 ` Thomas Gleixner
  1 sibling, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-11-30 20:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86

On Tue, Nov 30, 2021 at 08:48:03PM +0100, Thomas Gleixner wrote:
> Logan,
> 
> On Tue, Nov 30 2021 at 12:21, Logan Gunthorpe wrote:
> > On 2021-11-29 5:29 p.m., Thomas Gleixner wrote:
> >> I'm way too tired to come up with a proper solution for that, but that
> >> PCI_IRQ_VIRTUAL has to die ASAP.
> >
> > I'm willing to volunteer a bit of my time to clean this up, but I'd need
> > a bit more direction on what a proper solution would look like. The MSI
> > domain code is far from well documented, nor is it easy to understand.
> 
> Fair enough. I'm struggling with finding time to document that properly.
> 
> I've not yet made my mind up what the best way forward for this is, but
> I have a few ideas which I want to explore deeper.

I may have lost the plot in all of these patches, but I thought the
direction was moving toward the msi_domain_alloc_irqs() approach IDXD
demo'd here:

https://lore.kernel.org/kvm/162164243591.261970.3439987543338120797.stgit@djiang5-desk3.ch.intel.com/

I'd expect all the descriptor handling code in drivers/ntb/msi.c to
get wrapped in an irq_chip instead of inserting a single-use callback
to the pci core code's implementation:

void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
        if (entry->write_msi_msg)
                entry->write_msi_msg(entry, entry->write_msi_msg_data);

If this doesn't become an irq_chip what other way is there to properly
program the addr/data pair as drivers/ntb/msi.c is doing?

Jason

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-30 20:28               ` Jason Gunthorpe
@ 2021-11-30 21:23                 ` Thomas Gleixner
  2021-12-01  0:17                   ` Jason Gunthorpe
  2021-12-01 14:52                   ` Thomas Gleixner
  0 siblings, 2 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-11-30 21:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86

On Tue, Nov 30 2021 at 16:28, Jason Gunthorpe wrote:
> On Tue, Nov 30, 2021 at 08:48:03PM +0100, Thomas Gleixner wrote:
>> On Tue, Nov 30 2021 at 12:21, Logan Gunthorpe wrote:
>> > On 2021-11-29 5:29 p.m., Thomas Gleixner wrote:
>> >> I'm way too tired to come up with a proper solution for that, but that
>> >> PCI_IRQ_VIRTUAL has to die ASAP.
>> >
>> > I'm willing to volunteer a bit of my time to clean this up, but I'd need
>> > a bit more direction on what a proper solution would look like. The MSI
>> > domain code is far from well documented, nor is it easy to understand.
>> 
>> Fair enough. I'm struggling with finding time to document that properly.
>> 
>> I've not yet made my mind up what the best way forward for this is, but
>> I have a few ideas which I want to explore deeper.
>
> I may have lost the plot in all of these patches, but I thought the
> direction was moving toward the msi_domain_alloc_irqs() approach IDXD
> demo'd here:
>
> https://lore.kernel.org/kvm/162164243591.261970.3439987543338120797.stgit@djiang5-desk3.ch.intel.com/

Yes, that's something I have in mind. Though this patch series is not
really required to support IDXD, it makes stuff simpler.

The main point of this is to cure the VFIO issue of tearing down MSI-X
of passed through devices in order to expand the MSI-X vector space on
the host.

> I'd expect all the descriptor handling code in drivers/ntb/msi.c to
> get wrapped in an irq_chip instead of inserting a single-use callback
> to the pci core code's implementation:
>
> void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
> {
>         if (entry->write_msi_msg)
>                 entry->write_msi_msg(entry, entry->write_msi_msg_data);
>
> If this doesn't become an irq_chip what other way is there to properly
> program the addr/data pair as drivers/ntb/msi.c is doing?

That's not the question. This surely will be a separate irq chip and a
separate irqdomain.

The real problem is where to store the MSI descriptors because the PCI
device has its own real PCI/MSI-X interrupts which means it still shares
the storage space.

IDXD is different in that regard because IDXD creates subdevices which
have their own struct device and they just store the MSI descriptors in
the msi data of that device.

I'm currently tending to partition the index space in the xarray:

 0x00000000 - 0x0000ffff          PCI/MSI-X
 0x00010000 - 0x0001ffff          NTB

which is feasible now with the range modifications and way simpler to do
with xarray than with the linked list.
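
In user-space terms the proposed split is just an index encoding; a
minimal sketch (helper and macro names are made up purely for
illustration, not actual kernel API):

```c
#include <assert.h>

/* Hypothetical helpers illustrating the proposed partitioning of the
 * MSI descriptor index space: the bits above the low 16 select the
 * allocation domain (PCI/MSI-X vs. NTB), the low 16 bits are the
 * per-domain index. With this layout a single xarray keyed by the
 * combined index keeps both vector spaces apart.
 */
#define MSI_DOMAIN_SHIFT	16
#define MSI_DOMAIN_PCI		0UL	/* 0x00000000 - 0x0000ffff */
#define MSI_DOMAIN_NTB		1UL	/* 0x00010000 - 0x0001ffff */

static unsigned long msi_store_index(unsigned long domain, unsigned int idx)
{
	/* Combine domain id and per-domain index into one xarray key */
	return (domain << MSI_DOMAIN_SHIFT) | idx;
}

static unsigned long msi_store_domain(unsigned long index)
{
	/* Recover the allocation domain from a combined key */
	return index >> MSI_DOMAIN_SHIFT;
}
```

Range lookups within one domain then reduce to iterating the xarray
between `msi_store_index(dom, 0)` and `msi_store_index(dom, 0xffff)`,
which the linked list could not do without walking everything.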

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-30 21:23                 ` Thomas Gleixner
@ 2021-12-01  0:17                   ` Jason Gunthorpe
  2021-12-01 10:16                     ` Thomas Gleixner
  2021-12-01 14:52                   ` Thomas Gleixner
  1 sibling, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-01  0:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86

On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:

> > If this doesn't become an irq_chip what other way is there to properly
> > program the addr/data pair as drivers/ntb/msi.c is doing?
> 
> That's not the question. This surely will be a separate irq chip and a
> separate irqdomain.

OK
 
> The real problem is where to store the MSI descriptors because the PCI
> device has its own real PCI/MSI-X interrupts which means it still shares
> the storage space.

Er.. I never realized that just looking at the patches :|

That is relevant to all real "IMS" users. IDXD escaped this because
it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely
a messy artifact of VFIO, it should not be required to make the IRQ
layers work.

I don't think it makes sense that the msi_desc would point to a mdev,
the iommu layer consumes the msi_desc_to_dev(), it really should point
to the physical device that originates the message with a proper
iommu ops/data/etc.

> I'm currently tending to partition the index space in the xarray:
> 
>  0x00000000 - 0x0000ffff          PCI/MSI-X
>  0x00010000 - 0x0001ffff          NTB

It is OK, with some xarray work it can be range allocating & reserving
so that the msi_domain_alloc_irqs() flows can carve out chunks of the
number space..

Another view is the msi_domain_alloc_irqs() flows should have their
own xarrays..

> which is feasible now with the range modifications and way simpler to do
> with xarray than with the linked list.

Indeed!

Regards,
Jason

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01  0:17                   ` Jason Gunthorpe
@ 2021-12-01 10:16                     ` Thomas Gleixner
  2021-12-01 13:00                       ` Jason Gunthorpe
  2021-12-01 16:28                       ` Dave Jiang
  0 siblings, 2 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-01 10:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

Jason,

CC+ IOMMU folks

On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>> The real problem is where to store the MSI descriptors because the PCI
>> device has its own real PCI/MSI-X interrupts which means it still shares
>> the storage space.
>
> Er.. I never realized that just looking at the patches :|
>
> That is relevant to all real "IMS" users. IDXD escaped this because
> it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely
> a messy artifact of VFIO, it should not be required to make the IRQ
> layers work.

> I don't think it makes sense that the msi_desc would point to a mdev,
> the iommu layer consumes the msi_desc_to_dev(), it really should point
> to the physical device that originates the message with a proper
> iommu ops/data/etc.

Looking at the device slices as subdevices with their own struct device
makes a lot of sense from the conceptual level. That makes it pretty
much obvious to manage the MSIs of those devices at this level like we
do for any other device.

Whether mdev is the right encapsulation for these subdevices is an
orthogonal problem.

I surely agree that msi_desc::dev is an interesting question, but we
already have this disconnect of msi_desc::dev and DMA today due to DMA
aliasing. I haven't looked at that in detail yet, but of course the
alias handling is substantially different across the various IOMMU
implementations.

Though I fear there is also a use case for MSI-X and IMS tied to the
same device. That network card you are talking about might end up using
MSI-X for a control block and then IMS for the actual network queues
when it is used as physical function device as a whole, but that's
conceptually a different case.

>> I'm currently tending to partition the index space in the xarray:
>> 
>>  0x00000000 - 0x0000ffff          PCI/MSI-X
>>  0x00010000 - 0x0001ffff          NTB
>
> It is OK, with some xarray work it can be range allocating & reserving
> so that the msi_domain_alloc_irqs() flows can carve out chunks of the
> number space..
>
> Another view is the msi_domain_alloc_irqs() flows should have their
> own xarrays..

Yes, I was thinking about that as well. The trivial way would be:

    struct xarray     store[MSI_MAX_STORES];

and then have a store index for each allocation domain. With the
proposed encapsulation of the xarray handling that's definitely
feasible. Whether that buys much is a different question. Let me think
about it some more.

>> which is feasible now with the range modifications and way simpler to do
>> with xarray than with the linked list.
>
> Indeed!

I'm glad you like the approach.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 10:16                     ` Thomas Gleixner
@ 2021-12-01 13:00                       ` Jason Gunthorpe
  2021-12-01 17:35                         ` Thomas Gleixner
  2021-12-01 16:28                       ` Dave Jiang
  1 sibling, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-01 13:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:

> Looking at the device slices as subdevices with their own struct device
> makes a lot of sense from the conceptual level.

Except IMS is not just for subdevices, it should be usable for any
driver in any case as a general interrupt mechanism, as you alluded to
below about ethernet queues. ntb seems to be the current example of
this need..

If it works properly on the physical PCI dev there is no reason to try
to also make it work on the mdev and add complexity in iommu land.

> Though I fear there is also a use case for MSI-X and IMS tied to the
> same device. That network card you are talking about might end up using
> MSI-X for a control block and then IMS for the actual network queues
> when it is used as physical function device as a whole, but that's
> conceptually a different case.

Is it? Why?

IDXD is not so much making device "slices", but virtualizing and
sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
described and the VFIO driver is simply allocating queues from a PCI
device for specific usages and assigning them interrupts.

There is already a char dev interface that equally allocates queues
from the same IDXD device, why shouldn't it be able to access IMS
interrupt pools too?

IMHO a well designed IDXD driver should put all the PCI MMIO, queue
management, interrupts, etc in the PCI driver layer, and the VFIO
driver layer should only deal with the MMIO trapping and VFIO
interfacing.

From this view it is conceptually wrong to have the VFIO driver
directly talking to MMIO registers in the PCI device or owning the
irq_chip. It would be very odd for the PCI driver to allocate
interrupts from some random external struct device when it is creating
queues on the PCI device.

> > Another view is the msi_domain_alloc_irqs() flows should have their
> > own xarrays..
> 
> Yes, I was thinking about that as well. The trivial way would be:
> 
>     struct xarray     store[MSI_MAX_STORES];
> 
> and then have a store index for each allocation domain. With the
> proposed encapsulation of the xarray handling that's definitely
> feasible. Whether that buys much is a different question. Let me think
> about it some more.

Any possibility that the 'IMS' xarray could be outside the struct
device?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-11-30 21:23                 ` Thomas Gleixner
  2021-12-01  0:17                   ` Jason Gunthorpe
@ 2021-12-01 14:52                   ` Thomas Gleixner
  2021-12-01 15:11                     ` Jason Gunthorpe
  1 sibling, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-01 14:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86

On Tue, Nov 30 2021 at 22:23, Thomas Gleixner wrote:
> On Tue, Nov 30 2021 at 16:28, Jason Gunthorpe wrote:
>
> The real problem is where to store the MSI descriptors because the PCI
> device has its own real PCI/MSI-X interrupts which means it still shares
> the storage space.

Bah. I confused myself by staring at the existing code instead of
looking at how this NTB stuff actually works.

So if I understand it correctly then the end result looks like this:

1) PCIe device (switchtec)

   The device has 4 MSI[X] interrupts: event, dma_rpc, message,
   doorbell. The event and dma_rpc interrupts are requested by the
   switchtec PCI driver itself.

2) Switchtec character device

   The switchtec PCI driver creates a character device which is exposed
   for device specific IOCTLs

   The device belongs to the switchtec_class device class.

3) Switchtec NTB device

   The ntb_hw_switchtec driver registers the switchtec_class class
   interface.

   So when #2 is registered with the driver core the switchtec class
   interface add_dev() function is invoked. That function creates a NTB
   device, requests the message and the doorbell interrupts which have
   been allocated by the underlying PCIe device driver (#1) and
   registers the NTB device with the NTB core.

4) The NTB core then tries to use the virtual MSI vectors which have
   been allocated by the switchtec driver in #1 and requires the msg
   write intercept to actually expose it to the peers.

So we really can go and create a MSI irqdomain and stick the pointer
into stdev->dev.irqdomain. The parent domain of this irqdomain is

     stdev->pdev.dev.irqdomain->parent

which is either the irq remapping domain or the vector domain. Which is
pretty much what I proposed as a general facility for IMS/IDXD. I need to
go back and polish that up on top of the current pile.

Along with that have an irq chip implementation which exposes:

static struct irq_chip ntb_chip = {
	.name			= "ntb",
	.irq_ack		= irq_chip_ack_parent,
	.irq_write_msi_msg	= ntb_msi_write_msg,
#ifdef CONFIG_SMP
	.irq_set_affinity	= irq_chip_set_affinity_parent,
#endif
};

We just need some reasonable solution for the DMA/remap problem Jason
mentioned vs. msi_desc::dev, but that wants to be cleaned up in any
case for all the aliasing muck.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 14:52                   ` Thomas Gleixner
@ 2021-12-01 15:11                     ` Jason Gunthorpe
  2021-12-01 18:37                       ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-01 15:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86

On Wed, Dec 01, 2021 at 03:52:02PM +0100, Thomas Gleixner wrote:
> So we really can go and create a MSI irqdomain and stick the pointer
> into stdev->dev.irqdomain. The parent domain of this irqdomain is
> 
>      stdev->pdev.dev.irqdomain->parent

It can work (pending some solution to the iommu stuff), but IMHO it is
strange/hacky to put HW objects like irqdomain on what is a character
struct device with a set major/minor in dev->devt and associated
struct cdev.

Conceptually it makes no sense to me, cdevs are software constructs,
they should never go into HW areas..

Jason

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 10:16                     ` Thomas Gleixner
  2021-12-01 13:00                       ` Jason Gunthorpe
@ 2021-12-01 16:28                       ` Dave Jiang
  2021-12-01 18:41                         ` Thomas Gleixner
  1 sibling, 1 reply; 141+ messages in thread
From: Dave Jiang @ 2021-12-01 16:28 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu


On 12/1/2021 3:16 AM, Thomas Gleixner wrote:
> Jason,
>
> CC+ IOMMU folks
>
> On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
>> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>>> The real problem is where to store the MSI descriptors because the PCI
>>> device has its own real PCI/MSI-X interrupts which means it still shares
>>> the storage space.
>> Er.. I never realized that just looking at the patches :|
>>
>> That is relevant to all real "IMS" users. IDXD escaped this because
>> it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely
>> a messy artifact of VFIO, it should not be required to make the IRQ
>> layers work.
>> I don't think it makes sense that the msi_desc would point to a mdev,
>> the iommu layer consumes the msi_desc_to_dev(), it really should point
>> to the physical device that originates the message with a proper
>> iommu ops/data/etc.
> Looking at the device slices as subdevices with their own struct device
> makes a lot of sense from the conceptual level. That makes it pretty
> much obvious to manage the MSIs of those devices at this level like we
> do for any other device.
>
> Whether mdev is the right encapsulation for these subdevices is an
> orthogonal problem.
>
> I surely agree that msi_desc::dev is an interesting question, but we
> already have this disconnect of msi_desc::dev and DMA today due to DMA
> aliasing. I haven't looked at that in detail yet, but of course the
> alias handling is substantially different across the various IOMMU
> implementations.
>
> Though I fear there is also a use case for MSI-X and IMS tied to the
> same device. That network card you are talking about might end up using
> MSI-X for a control block and then IMS for the actual network queues
> when it is used as physical function device as a whole, but that's
> conceptually a different case.

Hi Thomas. This is actually the IDXD usage for a mediated device passed
to a guest kernel when we plumb the pass-through of IMS to the guest,
rather than the previous implementation of having a MSIX vector on the
guest backed by IMS. The control block for the mediated device is
emulated, and therefore an emulated MSIX vector will be surfaced as
vector 0. However, the queues will be backed by IMS vectors. So we end
up needing MSIX and IMS to coexist on the guest kernel for the same
device.

DJ

>
>>> I'm currently tending to partition the index space in the xarray:
>>>
>>>   0x00000000 - 0x0000ffff          PCI/MSI-X
>>>   0x00010000 - 0x0001ffff          NTB
>> It is OK, with some xarray work it can be range allocating & reserving
>> so that the msi_domain_alloc_irqs() flows can carve out chunks of the
>> number space..
>>
>> Another view is the msi_domain_alloc_irqs() flows should have their
>> own xarrays..
> Yes, I was thinking about that as well. The trivial way would be:
>
>      struct xarray     store[MSI_MAX_STORES];
>
> and then have a store index for each allocation domain. With the
> proposed encapsulation of the xarray handling that's definitely
> feasible. Whether that buys much is a different question. Let me think
> about it some more.
>
>>> which is feasible now with the range modifications and way simpler to do
>>> with xarray than with the linked list.
>> Indeed!
> I'm glad you like the approach.
>
> Thanks,
>
>          tglx
>
>

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 13:00                       ` Jason Gunthorpe
@ 2021-12-01 17:35                         ` Thomas Gleixner
  2021-12-01 18:14                           ` Jason Gunthorpe
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-01 17:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
> On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
>> Looking at the device slices as subdevices with their own struct device
>> makes a lot of sense from the conceptual level.
>
> Except IMS is not just for subdevices, it should be usable for any
> driver in any case as a general interrupt mechanism, as you alluded to
> below about ethernet queues. ntb seems to be the current example of
> this need..

But NTB is operating through an abstraction layer and is not a direct
PCIe device driver.

> IDXD is not so much making device "slices", but virtualizing and
> sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
> described and the VFIO driver is simply allocating queues from a PCI
> device for specific usages and assigning them interrupts.

Right.

But what is the representation for that resulting device? Some VFIO
specific homebrewn muck or something which is based on struct device?

Right now with VF passthrough, I can see the interrupts which are
associated to the device.

How is that going to be with something which is just made up? Does that
expose its own msi properties then somewhere hidden in the VFIO layer?

See below.

> There is already a char dev interface that equally allocates queues
> from the same IDXD device, why shouldn't it be able to access IMS
> interrupt pools too?

Why wouldn't it be able to do so?

> IMHO a well designed IDXD driver should put all the PCI MMIO, queue
> management, interrupts, etc in the PCI driver layer, and the VFIO
> driver layer should only deal with the MMIO trapping and VFIO
> interfacing.
>
> From this view it is conceptually wrong to have the VFIO driver
> directly talking to MMIO registers in the PCI device or owning the
> irq_chip.

The VFIO driver does not own the irq chip ever. The irq chip is of
course part of the underlying infrastructure. I never asked for that.

PCIe driver
     Owns the PCI/MSI[x] interrupts for the control block

     Has a control mechanism which handles queues or whatever the
     device is partitioned into, that's what I called slice.

     The irqdomain is part of that control mechanism and the irqchip
     implementation belongs to that as well. It has to because the
     message store is device specific.

Whether the domain and chip implementation is in a separate
     drivers/irqchip/foo.c file for sharing or directly built into the
     PCIe driver itself does not matter.

     When it allocates a slice for whatever usage then it also
     allocates the IMS interrupts (though the VFIO people want to
     have only one and do the allocations later on demand).

     That allocation cannot be part of the PCI/MSIx interrupt
     domain as we already agreed on.

We have several problems to solve. Let me look at it from both point of
views:

    1) Storage

       A) Having "subdevices" solves the storage problem nicely and
          makes everything just fall in place. Even for a purely
          physical multiqueue device one can argue that each queue is a
          "subdevice" of the physical device. The fact that we lump them
          all together today is not an argument against that.

       B) Requires extra storage in the PCIe device and extra storage
          per subdevice, queue to keep track of the interrupts which
          are associated to it.

    2) Exposure of VFIO interrupts via sysfs

       A) Just works

       B) Requires extra mechanisms to expose it

    3) On demand expansion of the vectors for VFIO

       A) Just works because the device has an irqdomain assigned.

       B) Requires extra indirections to do that

    4) IOMMU msi_desc::dev

       A) I think that's reasonably simple to address by having the
          relationship to the underlying PCIe device stored in struct
          device, which is not really adding any complexity to the IOMMU
          code.

          Quite the contrary it could be used to make the DMA aliasing a
          problem of device setup time and not a lookup per interrupt
          allocation in the IOMMU code.

       B) No change required.

    5) PASID

       While this is an Intel IDXD specific issue, it wants to be solved
       without any nasty hacks.

       A) Having a "subdevice" allows to associate the PASID with the
          underlying struct device which makes IOMMU integration trivial

       B) Needs some other custom hackery to get that solved

So both variants come with their ups and downs.

IMO A) is the right thing to do when looking at all the involved moving
pieces.

> It would be very odd for the PCI driver to allocate interrupts from
> some random external struct device when it is creating queues on the
> PCI device.

No. That's not what I want.

>> and then have a store index for each allocation domain. With the
>> proposed encapsulation of the xarray handling that's definitely
>> feasible. Whether that buys much is a different question. Let me think
>> about it some more.
>
> Any possibility that the 'IMS' xarray could be outside the struct
> device?

We could, but we really want to keep things tied to devices which is the
right thing to do.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 17:35                         ` Thomas Gleixner
@ 2021-12-01 18:14                           ` Jason Gunthorpe
  2021-12-01 18:46                             ` Logan Gunthorpe
  2021-12-01 20:21                             ` Thomas Gleixner
  0 siblings, 2 replies; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-01 18:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
> > On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
> >> Looking at the device slices as subdevices with their own struct device
> >> makes a lot of sense from the conceptual level.
> >
> > Except IMS is not just for subdevices, it should be usable for any
> > driver in any case as a general interrupt mechanism, as you alluded to
> > below about ethernet queues. ntb seems to be the current example of
> > this need.
> 
> But NTB is operating through an abstraction layer and is not a direct
> PCIe device driver.

I'm not sure exactly how NTB seems to be split between switchtec and
the ntb code, but since the ntb code seems to be doing MMIO touches,
it feels like part of a PCIe driver?

> > IDXD is not so much making device "slices", but virtualizing and
> > sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
> > described and the VFIO driver is simply allocating queues from a PCI
> > device for specific usages and assigning them interrupts.
> 
> Right.
> 
> But what is the representation for that resulting device? Some VFIO
> specific homebrewn muck or something which is based on struct device?

Why is there a device? A queue is a queue, not a device.

If the task is to make some struct device (eg mdev, or cdev, or
whatever) then queues may be allocated to do this, but the queue is
logically a resource handed out by the PCIe driver and there should
not be a requirement to have an external struct device just to create
a queue.

> Right now with VF passthrough, I can see the interrupts which are
> associated to the device.
> 
> How is that going to be with something which is just made up? Does that
> expose its own MSI properties then somewhere hidden in the VFIO layer?

For sysfs, I think all interrupts should be in the PCI directory.

> > There is already a char dev interface that equally allocates queues
> > from the same IDXD device, why shouldn't it be able to access IMS
> > interrupt pools too?
> 
> Why wouldn't it be able to do so?

The only 'struct device' there is a cdev and I really don't think
cdevs should have interrupts. It is a bit hacky as an in-kernel thing
and downright wrong as a sysfs ABI.

> The VFIO driver does not own the irq chip ever. The irq chip is of
> course part of the underlying infrastructure. I never asked for that.

That isn't quite what I meant. I meant the PCIe driver cannot create
the domain or make use of the irq_chip until the VFIO layer comes
along and provides the struct device. To me this is backwards
layering; the interrupts come from the PCIe layer and should exist
independently of VFIO.

>      When it allocates a slice for whatever usage then it also
>      allocates the IMS interrupts (though the VFIO people want to
>      have only one and do the allocations later on demand).
> 
>      That allocation cannot be part of the PCI/MSIx interrupt
>      domain as we already agreed on.

Yes, it is just an open question of where the new irq_domain needs to
reside.

>     1) Storage
> 
>        A) Having "subdevices" solves the storage problem nicely and
>           makes everything just fall in place. Even for a purely
>           physical multiqueue device one can argue that each queue is a
>           "subdevice" of the physical device. The fact that we lump them
>           all together today is not an argument against that.

I don't like the idea that a queue is a device; that is trying to force
a struct device centric world onto a queue which doesn't really want
it.
 
>        B) Requires extra storage in the PCIe device and extra storage
>           per subdevice, queue to keep track of the interrupts which
>           are associated to it.

Yes

>     2) Exposure of VFIO interrupts via sysfs
> 
>        A) Just works

I would say this is flawed: in sysfs I expect all the interrupts for
the PCIe device to be in the PCIe sysfs, not strewn over
subsystem-owned sub-directories.

For instance, today in mlx5, when a subdevice allocates a queue for a
slice (which is modeled as an aux device), the queue's assigned MSI-X
interrupt shows up in the PCIe sysfs, not the aux device's.

It should be uniform: if I assign a queue a legacy INTx, an MSI or an
IMS interrupt, it should show up in sysfs in the same way. Leaking this
kernel implementation detail as sysfs ABI does not seem good.

>     3) On demand expansion of the vectors for VFIO
> 
>        A) Just works because the device has an irqdomain assigned.
> 
>        B) Requires extra indirections to do that

Yes.
 
>     4) PASID
> 
>        While an Intel IDXD specific issue, it wants to be solved
>        without any nasty hacks.
> 
>        A) Having a "subdevice" allows associating the PASID with the
>           underlying struct device, which makes IOMMU integration trivial
> 
>        B) Needs some other custom hackery to get that solved

Yes

> > Any possibility that the 'IMS' xarray could be outside the struct
> > device?
> 
> We could, but we really want to keep things tied to devices which is the
> right thing to do.

I see the sysfs issue makes this a poor idea as well, as where would
the sysfs live if there was no struct device?

I'm inclined to think either of your ideas with the xarray are good
directions, primarily because it keeps HW data out of non-HW struct
devices and maintains a consistent sysfs representation for all the
different interrupt allocation methods.

Regards,
Jason


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 15:11                     ` Jason Gunthorpe
@ 2021-12-01 18:37                       ` Thomas Gleixner
  2021-12-01 18:47                         ` Jason Gunthorpe
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-01 18:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86

On Wed, Dec 01 2021 at 11:11, Jason Gunthorpe wrote:
> On Wed, Dec 01, 2021 at 03:52:02PM +0100, Thomas Gleixner wrote:
>> So we really can go and create a MSI irqdomain and stick the pointer
>> into stdev->dev.irqdomain. The parent domain of this irqdomain is
>> 
>>      stdev->pdev.dev.irqdomain->parent
>
> It can work (pending some solution to the iommu stuff), but IMHO it is
> strange/hacky to put HW objects like irqdomain on what is a character
> struct device with a set major/minor in dev->devt and associated
> struct cdev.
>
> Conceptually it makes no sense to me; cdevs are software constructs,
> they should never go into HW areas.

I picked that because it _is_ already used to establish the connection
to the switchtec_class NTB driver which is beyond the usual cdev muck.

Thanks,

        tglx


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 16:28                       ` Dave Jiang
@ 2021-12-01 18:41                         ` Thomas Gleixner
  2021-12-01 18:47                           ` Dave Jiang
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-01 18:41 UTC (permalink / raw)
  To: Dave Jiang, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

Dave,

please trim your replies.

On Wed, Dec 01 2021 at 09:28, Dave Jiang wrote:

> On 12/1/2021 3:16 AM, Thomas Gleixner wrote:
>> Jason,
>>
>> CC+ IOMMU folks
>>
>> On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
>>> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>>
>> Though I fear there is also a use case for MSI-X and IMS tied to the
>> same device. That network card you are talking about might end up using
>> MSI-X for a control block and then IMS for the actual network queues
>> when it is used as physical function device as a whole, but that's
>> conceptually a different case.
>
Hi Thomas. This is actually the IDXD usage for a mediated device passed
to a guest kernel when we plumb the passthrough of IMS to the guest,
rather than the previous implementation of having an MSI-X vector on the
guest backed by IMS.

Which makes a lot of sense.

The control block for the mediated device is emulated and therefore an
emulated MSI-X vector will be surfaced as vector 0. However, the queues
will be backed by IMS vectors. So we end up needing MSI-X and IMS to
coexist on the guest kernel for the same device.

Why? What's wrong with using straight MSI-X for all of them?

Thanks,

        tglx


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 18:14                           ` Jason Gunthorpe
@ 2021-12-01 18:46                             ` Logan Gunthorpe
  2021-12-01 20:21                             ` Thomas Gleixner
  1 sibling, 0 replies; 141+ messages in thread
From: Logan Gunthorpe @ 2021-12-01 18:46 UTC (permalink / raw)
  To: Jason Gunthorpe, Thomas Gleixner
  Cc: LKML, Bjorn Helgaas, Marc Zygnier, Alex Williamson, Kevin Tian,
	Megha Dey, Ashok Raj, linux-pci, Greg Kroah-Hartman, Jon Mason,
	Dave Jiang, Allen Hubbe, linux-ntb, linux-s390, Heiko Carstens,
	Christian Borntraeger, x86, Joerg Roedel, iommu



On 2021-12-01 11:14 a.m., 'Jason Gunthorpe' via linux-ntb wrote:
> On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
>>> On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
>>>> Looking at the device slices as subdevices with their own struct device
>>>> makes a lot of sense from the conceptual level.
>>>
>>> Except IMS is not just for subdevices, it should be usable for any
>>> driver in any case as a general interrupt mechanism, as you alluded to
>>> below about ethernet queues. ntb seems to be the current example of
>>> this need.
>>
>> But NTB is operating through an abstraction layer and is not a direct
>> PCIe device driver.
> 
> I'm not sure exactly how NTB seems to be split between switchtec and
> the ntb code, but since the ntb code seems to be doing MMIO touches,
> it feels like part of a PCIe driver?

Eh, sort of. NTB has lots of layers. At the top you'll find ntb_netdev
which is a network interface. Below that is ntb_transport(), which is a
generic queue pair. Below that is the hardware driver itself (ie
switchtec) through the abstraction layer. The switchtec driver is split
in two: the main driver which just allows for information and
administration of the switch itself and switchtec_ntb which is the
module that provides the NTB abstractions to twiddle its registers.

ntb_transport() doesn't directly do MMIO touches (as it doesn't know
what the underlying hardware registers are), except for the memory
windows, which are usually set up to be a specific BAR (or parts of a
BAR): ntb_transport will do MMIO writes to those specific BAR addresses,
which correspond to writing into buffers on the peer.

Logan


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 18:37                       ` Thomas Gleixner
@ 2021-12-01 18:47                         ` Jason Gunthorpe
  2021-12-01 20:26                           ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-01 18:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86

On Wed, Dec 01, 2021 at 07:37:32PM +0100, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 11:11, Jason Gunthorpe wrote:
> > On Wed, Dec 01, 2021 at 03:52:02PM +0100, Thomas Gleixner wrote:
> >> So we really can go and create a MSI irqdomain and stick the pointer
> >> into stdev->dev.irqdomain. The parent domain of this irqdomain is
> >> 
> >>      stdev->pdev.dev.irqdomain->parent
> >
> > It can work (pending some solution to the iommu stuff), but IMHO it is
> > strange/hacky to put HW objects like irqdomain on what is a character
> > struct device with a set major/minor in dev->devt and associated
> > struct cdev.
> >
> > Conceptually it makes no sense to me; cdevs are software constructs,
> > they should never go into HW areas.
> 
> I picked that because it _is_ already used to establish the connection
> to the switchtec_class NTB driver which is beyond the usual cdev muck.

IMHO that is also a misuse. These days two drivers should be hooked
together using an aux device, not a cdev and the obscure
class_interface stuff. An aux device supports auto-probing and module
auto-loading, for instance.

An interrupt on an aux device is at least somewhat conceptually
parallel to an interrupt on a mdev as both are usually representing
some slice of a device.

Thanks,
Jason


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 18:41                         ` Thomas Gleixner
@ 2021-12-01 18:47                           ` Dave Jiang
  2021-12-01 20:25                             ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Dave Jiang @ 2021-12-01 18:47 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu


On 12/1/2021 11:41 AM, Thomas Gleixner wrote:
> Dave,
>
> please trim your replies.
>
> On Wed, Dec 01 2021 at 09:28, Dave Jiang wrote:
>
>> On 12/1/2021 3:16 AM, Thomas Gleixner wrote:
>>> Jason,
>>>
>>> CC+ IOMMU folks
>>>
>>> On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
>>>> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>>> Though I fear there is also a use case for MSI-X and IMS tied to the
>>> same device. That network card you are talking about might end up using
>>> MSI-X for a control block and then IMS for the actual network queues
>>> when it is used as physical function device as a whole, but that's
>>> conceptually a different case.
>> Hi Thomas. This is actually the IDXD usage for a mediated device passed
>> to a guest kernel when we plumb the pass through of IMS to the guest
>> rather than doing previous implementation of having a MSIX vector on
>> guest backed by IMS.
> Which makes a lot of sense.
>
>> The control block for the mediated device is emulated and therefore an
>> emulated MSIX vector will be surfaced as vector 0. However the queues
>> will be backed by IMS vectors. So we end up needing MSI-X and IMS to
>> coexist on the guest kernel for the same device.
> Why? What's wrong with using straight MSI-X for all of them?

The hardware implementation does not have enough MSI-X vectors for
guests. There are only 9 MSI-X vectors total (8 for queues) and 2048 IMS
vectors. So if we are to do MSI-X for all of them, then we need to do
the IMS-backed MSI-X scheme rather than passing IMS through to guests.


>
> Thanks,
>
>          tglx
>


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 18:14                           ` Jason Gunthorpe
  2021-12-01 18:46                             ` Logan Gunthorpe
@ 2021-12-01 20:21                             ` Thomas Gleixner
  2021-12-02  0:01                               ` Thomas Gleixner
  1 sibling, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-01 20:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

Jason,

On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
>> But NTB is operating through an abstraction layer and is not a direct
>> PCIe device driver.
>
> I'm not sure exactly how NTB seems to be split between switchtec and
> the ntb code, but since the ntb code seems to be doing MMIO touches,
> it feels like part of a PCIe driver?

It's a maze of magic PCIe driver with callbacks left and right into the
underlying NTB hardware driver which itself does some stuff on its own
and also calls back into the underlying PCIe driver.

Decomposing that thing feels like being trapped in the Labyrinth of
Knossos. But let's ignore that for a moment.

>> The VFIO driver does not own the irq chip ever. The irq chip is of
>> course part of the underlying infrastructure. I never asked for that.
>
> That isn't quite what I meant. I meant the PCIe driver cannot create
> the domain or make use of the irq_chip until the VFIO layer comes
> along and provides the struct device. To me this is backwards
> layering; the interrupts come from the PCIe layer and should exist
> independently of VFIO.

See below.

>>      When it allocates a slice for whatever usage then it also
>>      allocates the IMS interrupts (though the VFIO people want to
>>      have only one and do the allocations later on demand).
>> 
>>      That allocation cannot be part of the PCI/MSIx interrupt
>>      domain as we already agreed on.
>
> Yes, it is just an open question of where the new irq_domain needs to
> reside.

The irqdomain is created by and part of the physical device. But that
does not mean that the interrupts which are allocated from that irq
domain are stored in the physical device representation.

Going by that logic, the PCI/MSI domain would store all MSI[X]
interrupts which are allocated on some root bridge or wherever.

They are obviously stored per PCI device, but the irqdomain is owned
e.g. by the underlying IOMMU zone.

The irqdomain is managing and handing out resources. Like any other
resource manager does.

See?

>>     1) Storage
>> 
>>        A) Having "subdevices" solves the storage problem nicely and
>>           makes everything just fall in place. Even for a purely
>>           physical multiqueue device one can argue that each queue is a
>>           "subdevice" of the physical device. The fact that we lump them
>>           all together today is not an argument against that.
>
> I don't like the idea that a queue is a device; that is trying to force
> a struct device centric world onto a queue which doesn't really want
> it..

Here we are at the point where we agree to disagree.

Of course a queue is a resource, but what prevents us from representing
a queue as a carved-out subdevice of the physical device?

Look at it from the VF point of view. If VFs are disabled then all
resources belong to the physical device. If VFs are enabled then the
hardware/firmware splits the resources into separate subdevices which
have their own device representation, interrupt storage etc.

As VFs have a scalability limitation due to the underlying PCIe
restrictions the step we are talking about now is to split the queues up
in software which means nothing else than creating a software
representation of finer grained and more scalable subdevices.

So why would we want to pretend that these are not devices?

They are from a conceptual and topology view a subdevice of the physical
device. Just because they are named queues does not make it any
different.

Let's look at VFIO again. If VFIO passes a VF through then it builds a
wrapper around the VF device.

So if a queue is represented as a subdevice, then VFIO can just build
a wrapper around that subdevice.

But with your model, VFIO has to create a device, request a queue,
wrestle the interrupts in place, etc. Which is exactly the opposite of
how VFs are handled.

So again, why would we want to make software managed subdevices look
exactly the opposite way like hardware/firmware managed subdevices?

Let me also look at the cdev which is exposed by the physical device.

The cdev is nothing more than a software vehicle to create a /dev/
node (ignore the switchtec case, which (ab)uses the cdev to connect to
NTB). The cdev allows user space to allocate/release a resource.

Of course it can just allocate a queue and stick the necessary
interrupts into the physical device MSI descriptor storage.

But there is no reason why the queue allocation cannot allocate a
subdevice representing the queue.

That makes the VFIO and the cdev cases use the same underlying
representation, which can expose its properties via the underlying
device.

Which in turn is consistent all over the place and does not require any
special case for anything. Neither for interrupts nor for anything else.

Thanks,

        tglx





* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 18:47                           ` Dave Jiang
@ 2021-12-01 20:25                             ` Thomas Gleixner
  2021-12-01 21:21                               ` Dave Jiang
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-01 20:25 UTC (permalink / raw)
  To: Dave Jiang, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

On Wed, Dec 01 2021 at 11:47, Dave Jiang wrote:
> On 12/1/2021 11:41 AM, Thomas Gleixner wrote:
>>> Hi Thomas. This is actually the IDXD usage for a mediated device passed
>>> to a guest kernel when we plumb the pass through of IMS to the guest
>>> rather than doing previous implementation of having a MSIX vector on
>>> guest backed by IMS.
>> Which makes a lot of sense.
>>
>>> The control block for the mediated device is emulated and therefore an
>>> emulated MSIX vector will be surfaced as vector 0. However the queues
>>> will be backed by IMS vectors. So we end up needing MSI-X and IMS to
>>> coexist on the guest kernel for the same device.
>> Why? What's wrong with using straight MSI-X for all of them?
>
> The hardware implementation does not have enough MSIX vectors for 
> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS 
> vectors. So if we are to do MSI-X for all of them, then we need to do 
> the IMS backed MSIX scheme rather than passthrough IMS to guests.

Confused. Are you talking about passing a full IDXD device to the guest
or about passing a carved out subdevice, aka. queue?

Thanks,

        tglx


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 18:47                         ` Jason Gunthorpe
@ 2021-12-01 20:26                           ` Thomas Gleixner
  0 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-01 20:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86

On Wed, Dec 01 2021 at 14:47, Jason Gunthorpe wrote:
> On Wed, Dec 01, 2021 at 07:37:32PM +0100, Thomas Gleixner wrote:
>> I picked that because it _is_ already used to establish the connection
>> to the switchtec_class NTB driver which is beyond the usual cdev muck.
>
> IMHO that is also a misuse. These days two drivers should be hooked
> together using an aux device, not a cdev and the obscure
> class_interface stuff. Aux device supports auto probing and module
> auto loading for instance.
>
> An interrupt on an aux device is at least somewhat conceptually
> parallel to an interrupt on a mdev as both are usually representing
> some slice of a device.

No argument about that.


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 20:25                             ` Thomas Gleixner
@ 2021-12-01 21:21                               ` Dave Jiang
  2021-12-01 21:44                                 ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Dave Jiang @ 2021-12-01 21:21 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu


On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 11:47, Dave Jiang wrote:
>> On 12/1/2021 11:41 AM, Thomas Gleixner wrote:
>>>> Hi Thomas. This is actually the IDXD usage for a mediated device passed
>>>> to a guest kernel when we plumb the pass through of IMS to the guest
>>>> rather than doing previous implementation of having a MSIX vector on
>>>> guest backed by IMS.
>>> Which makes a lot of sense.
>>>
>>>> The control block for the mediated device is emulated and therefore an
>>>> emulated MSIX vector will be surfaced as vector 0. However the queues
>>>> will be backed by IMS vectors. So we end up needing MSI-X and IMS to
>>>> coexist on the guest kernel for the same device.
>>> Why? What's wrong with using straight MSI-X for all of them?
>> The hardware implementation does not have enough MSIX vectors for
>> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
>> vectors. So if we are to do MSI-X for all of them, then we need to do
>> the IMS backed MSIX scheme rather than passthrough IMS to guests.
> Confused. Are you talking about passing a full IDXD device to the guest
> or about passing a carved out subdevice, aka. queue?

I'm talking about carving out a subdevice. I had the impression that you
wanted IMS passed through for all variations. But it sounds like for a
sub-device, you are OK with the implementation of MSI-X backed by IMS?


>
> Thanks,
>
>          tglx
>


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 21:21                               ` Dave Jiang
@ 2021-12-01 21:44                                 ` Thomas Gleixner
  2021-12-01 21:49                                   ` Dave Jiang
  2021-12-09  5:23                                   ` Tian, Kevin
  0 siblings, 2 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-01 21:44 UTC (permalink / raw)
  To: Dave Jiang, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:
> On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
>>> The hardware implementation does not have enough MSIX vectors for
>>> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
>>> vectors. So if we are to do MSI-X for all of them, then we need to do
>>> the IMS backed MSIX scheme rather than passthrough IMS to guests.
>> Confused. Are you talking about passing a full IDXD device to the guest
>> or about passing a carved out subdevice, aka. queue?
>
> I'm talking about carving out a subdevice. I had the impression of you 
> wanting IMS passed through for all variations. But it sounds like for a 
> sub-device, you are ok with the implementation of MSIX backed by IMS?

I don't see anything wrong with that. A subdevice is its own entity and
VFIO can obviously choose the most convenient representation of it for
the guest.

How that is backed on the host does not really matter. You can expose
MSI-X to the guest with an INTx backing as well.

I'm still failing to see the connection between the 9 MSI-X vectors and
the 2048 IMS vectors, which I assume is a limitation of the physical
device, right?

What does a subdevice need to expose?

Thanks,

        tglx





* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 21:44                                 ` Thomas Gleixner
@ 2021-12-01 21:49                                   ` Dave Jiang
  2021-12-01 22:03                                     ` Thomas Gleixner
  2021-12-09  5:23                                   ` Tian, Kevin
  1 sibling, 1 reply; 141+ messages in thread
From: Dave Jiang @ 2021-12-01 21:49 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu


On 12/1/2021 2:44 PM, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:
>> On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
>>>> The hardware implementation does not have enough MSIX vectors for
>>>> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
>>>> vectors. So if we are to do MSI-X for all of them, then we need to do
>>>> the IMS backed MSIX scheme rather than passthrough IMS to guests.
>>> Confused. Are you talking about passing a full IDXD device to the guest
>>> or about passing a carved out subdevice, aka. queue?
>> I'm talking about carving out a subdevice. I had the impression of you
>> wanting IMS passed through for all variations. But it sounds like for a
>> sub-device, you are ok with the implementation of MSIX backed by IMS?
> I don't see anything wrong with that. A subdevice is its own entity and
> VFIO can obviously choose the most convenient representation of it for
> the guest.
>
> How that is backed on the host does not really matter. You can expose
> MSI-X to the guest with an INTx backing as well.
>
> I'm still failing to see the connection between the 9 MSIX vectors and
> the 2048 IMS vectors, which I assume is a limitation of the
> physical device, right?

I think I was confused about what you were asking: I thought you were
saying we should just have MSI-X on the guest backed by the MSI-X on
the physical device, and that there would not be enough vectors to
service the many guests. I understand what your position is now
with the clarification above.


>
> What does a subdevice need to expose?
>
> Thanks,
>
>          tglx
>
>
>


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 21:49                                   ` Dave Jiang
@ 2021-12-01 22:03                                     ` Thomas Gleixner
  2021-12-01 22:53                                       ` Dave Jiang
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-01 22:03 UTC (permalink / raw)
  To: Dave Jiang, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

On Wed, Dec 01 2021 at 14:49, Dave Jiang wrote:
> On 12/1/2021 2:44 PM, Thomas Gleixner wrote:
>> How that is backed on the host does not really matter. You can expose
>> MSI-X to the guest with an INTx backing as well.
>>
>> I'm still failing to see the connection between the 9 MSIX vectors and
>> the 2048 IMS vectors, which I assume is the limitation of the
>> physical device, right?
>
> I think I was confused about what you were asking: I thought you were
> saying why can't we just have MSIX on the guest backed by the MSIX on
> the physical device, and that there would not be enough vectors to
> service the many guests. I understand your position now with the
> clarification above.

This still depends on how this overall discussion about representation
of all of this stuff is resolved.

>> What needs a subdevice to expose?

Can you answer that too please?

Thanks,

        tglx


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 22:03                                     ` Thomas Gleixner
@ 2021-12-01 22:53                                       ` Dave Jiang
  2021-12-01 23:57                                         ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Dave Jiang @ 2021-12-01 22:53 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu


On 12/1/2021 3:03 PM, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 14:49, Dave Jiang wrote:
>> On 12/1/2021 2:44 PM, Thomas Gleixner wrote:
>>> How that is backed on the host does not really matter. You can expose
>>> MSI-X to the guest with an INTx backing as well.
>>>
>>> I'm still failing to see the connection between the 9 MSIX vectors and
>>> the 2048 IMS vectors, which I assume is the limitation of the
>>> physical device, right?
>> I think I was confused about what you were asking: I thought you were
>> saying why can't we just have MSIX on the guest backed by the MSIX on
>> the physical device, and that there would not be enough vectors to
>> service the many guests. I understand your position now with the
>> clarification above.
> This still depends on how this overall discussion about representation
> of all of this stuff is resolved.
>
>>> What needs a subdevice to expose?
> Can you answer that too please?

Sorry. So the initial version of the IDXD sub-device is represented by
a single queue. It needs a command irq (emulated) and a completion irq
that is backed by a device vector (currently IMS).


>
> Thanks,
>
>          tglx


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 22:53                                       ` Dave Jiang
@ 2021-12-01 23:57                                         ` Thomas Gleixner
  0 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-01 23:57 UTC (permalink / raw)
  To: Dave Jiang, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

Dave,

On Wed, Dec 01 2021 at 15:53, Dave Jiang wrote:
> On 12/1/2021 3:03 PM, Thomas Gleixner wrote:
>> This still depends on how this overall discussion about representation
>> of all of this stuff is resolved.
>>
>>>> What needs a subdevice to expose?
>> Can you answer that too please?
>
> Sorry. So the initial version of the IDXD sub-device is represented by
> a single queue. It needs a command irq (emulated) and a completion irq
> that is backed by a device vector (currently IMS).

Thanks for the clarification! Let me think about it some more.

       tglx


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 20:21                             ` Thomas Gleixner
@ 2021-12-02  0:01                               ` Thomas Gleixner
  2021-12-02 13:55                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-02  0:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

Jason,

On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> Which in turn is consistent all over the place and does not require any
> special case for anything. Neither for interrupts nor for anything else.

that said, feel free to tell me that I'm getting it all wrong.

The reason I'm harping on this is that we are creating ABIs on several
ends and we all know that getting that wrong is a major pain.

Thanks,

        tglx


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-02  0:01                               ` Thomas Gleixner
@ 2021-12-02 13:55                                 ` Jason Gunthorpe
  2021-12-02 14:23                                   ` Greg Kroah-Hartman
                                                     ` (2 more replies)
  0 siblings, 3 replies; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-02 13:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

On Thu, Dec 02, 2021 at 01:01:42AM +0100, Thomas Gleixner wrote:
> Jason,
> 
> On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
> > On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> > Which in turn is consistent all over the place and does not require any
> > special case for anything. Neither for interrupts nor for anything else.
> 
> that said, feel free to tell me that I'm getting it all wrong.
> 
> The reason I'm harping on this is that we are creating ABIs on several
> ends and we all know that getting that wrong is a major pain.

I don't really like coupling the method to fetch IRQs with needing
special struct devices. Struct devices have a sysfs presence and it is
not always appropriate to create sysfs stuff just to allocate some
IRQs.

A queue is simply not a device, it doesn't make any sense. A queue is
more like a socket().

That said, we often have enough struct devices floating about to make
this work. Between netdev/ib_device/aux device/mdev we can use them to
do this.

I think it is conceptual nonsense to attach an IMS IRQ domain to a
netdev or a cdev, but it will solve this problem.

However, I would really prefer that there be no uAPI here. 

I looked at the msi_irqs/ stuff and could not find a user. Debian code
search found nothing, and Greg redid the entire uAPI in 2013
(1c51b50c2995), so I think it is just dead. (Maybe delete it?)

So let's not create any sysfs for IMS with the msi_irqs/ dir.  We can
revise the in-kernel mechanism someday if it turns out to be a problem.

As to your question:

> So again, why would we want to make software managed subdevices look
> exactly the opposite way like hardware/firmware managed subdevices?

That isn't my thinking at all.

Something like mlx5 has a hw/fw managed VF and there is an RPC call
from driver to device to 'create a queue'. The device has no hard
division inside along a VF, the device simply checks resource limits
and security properties and returns one of the >>10k queues. Again
think more like socket() than a hard partitioning.

It is the same as I suggest for IDXD & VFIO where the PCIe IDXD layer
takes the place of hw/fw and has a 'create a queue' API call for the
VFIO layer to use. Instead of using a VF as the security identity, it
uses a PASID.

This is a logical partitioning and it matches the partitioning we'd have
if it were a real device.

> So if a queue is represented as a subdevice, then VFIO can just build
> a wrapper around that subdevice.

I think that oversimplifies the picture.

IDXD is a multi queue device that uses PASID as a security context. It
has a cdev /dev/idxd interface where userspace can use an IOCTL and
get a queue to use. The queue is bound to a PASID that is linked to an
IO Page table that mirrors the process page table. Userspace operates
the queue and does whatever with it.

VFIO is just another interface that should logically be considered a
peer of the cdev. Using VFIO userspace can get a queue, bind it to a
PASID and operate it. The primary difference between the cdev and the
VFIO mdev is user programming API - VFIO uses IOCTLs that carry
emulated MMIO read/write operations.

I consider *neither* to be a subdevice. They are just a user API,
however convoluted, to create a queue, associate it with a PASID
security context and allow userspace to operate the queue. It is much
closer to socket() than a PCI VF subdevice.

Internally the driver should be built so that the PCI driver is doing
all the device operation and the two uAPI layers are only concerned
with translating their respective uAPIs to the internal device API.

Further, there is no reason why IMS should be reserved exclusively for
VFIO! Why shouldn't the cdev be able to use IMS vectors too? It is
just a feature of the PCI device like MSI. If the queue has a PASID it
can use IDXD's IMS.

If we really need a 2nd struct device to turn on IMS then, I'd suggest
picking the cdev, as it keeps IMS and its allocator inside the IDXD
PCIe driver and not in the VFIO world.

Regards,
Jason


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-02 13:55                                 ` Jason Gunthorpe
@ 2021-12-02 14:23                                   ` Greg Kroah-Hartman
  2021-12-02 14:45                                     ` Jason Gunthorpe
  2021-12-02 19:25                                   ` Thomas Gleixner
  2021-12-09  5:41                                   ` Tian, Kevin
  2 siblings, 1 reply; 141+ messages in thread
From: Greg Kroah-Hartman @ 2021-12-02 14:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Gleixner, Logan Gunthorpe, LKML, Bjorn Helgaas,
	Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj,
	linux-pci, Jon Mason, Dave Jiang, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

On Thu, Dec 02, 2021 at 09:55:02AM -0400, Jason Gunthorpe wrote:
> Further, there is no reason why IMS should be reserved exclusively for
> VFIO! Why shouldn't the cdev be able to use IMS vectors too? It is
> just a feature of the PCI device like MSI. If the queue has a PASID it
> can use IDXD's IMS.

No, sorry, but a cdev is not for anything resembling any real resource
at all.

It is ONLY for the /dev/NODE interface that controls the character
device api to userspace.  The struct device involved in it is ONLY for
that, nothing else.  Any attempt to add things to it will be gleefully
rejected.

The cdev api today (in the kernel) exposes too much mess and there's at
least 4 or 5 different ways to use it.  It's on my long-term TODO list
to fix this up to not even allow abuses like you are considering here,
so please don't do that.

> If we really need a 2nd struct device to turn on IMS then, I'd suggest
> picking the cdev, as it keeps IMS and its allocator inside the IDXD
> PCIe driver and not in the VFIO world.

No!  Again, a cdev is to control the lifespan/lifecycle of the /dev/NODE
only.  Anything other than that is not ok to do at all.

thanks,

greg k-h


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-02 14:23                                   ` Greg Kroah-Hartman
@ 2021-12-02 14:45                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-02 14:45 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Thomas Gleixner, Logan Gunthorpe, LKML, Bjorn Helgaas,
	Marc Zygnier, Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj,
	linux-pci, Jon Mason, Dave Jiang, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

On Thu, Dec 02, 2021 at 03:23:38PM +0100, Greg Kroah-Hartman wrote:
> On Thu, Dec 02, 2021 at 09:55:02AM -0400, Jason Gunthorpe wrote:
> > Further, there is no reason why IMS should be reserved exclusively for
> > VFIO! Why shouldn't the cdev be able to use IMS vectors too? It is
> > just a feature of the PCI device like MSI. If the queue has a PASID it
> > can use IDXD's IMS.
> 
> No, sorry, but a cdev is not for anything resembling any real resource
> at all.

My point is that when the user asks the driver to allocate a queue
through a cdev ioctl, it should be able to get the queue attached to an
IMS vector; today it can only get a queue attached to an MSI vector.

> It is ONLY for the /dev/NODE interface that controls the character
> device api to userspace.  The struct device involved in it is ONLY for
> that, nothing else.  Any attempt to add things to it will be gleefully
> rejected.

I agree with you!
 
> > If we really need a 2nd struct device to turn on IMS then, I'd suggest
> > picking the cdev, as it keeps IMS and its allocator inside the IDXD
> > PCIe driver and not in the VFIO world.
> 
> No!  Again, a cdev is to control the lifespan/lifecycle of the /dev/NODE
> only.  Anything other than that is not ok to do at all.

Said the same thing in a prior email - which is why I think the only
logical choice here is to make IMS work on the pci_device.

FWIW I feel the same way about the VFIO mdev - its *ONLY* purpose is
to control the lifecycle and we are close to stripping away all the
other abuses using it for other things.

Jason


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-02 13:55                                 ` Jason Gunthorpe
  2021-12-02 14:23                                   ` Greg Kroah-Hartman
@ 2021-12-02 19:25                                   ` Thomas Gleixner
  2021-12-02 20:00                                     ` Jason Gunthorpe
  2021-12-09  5:41                                   ` Tian, Kevin
  2 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-02 19:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

Jason,

On Thu, Dec 02 2021 at 09:55, Jason Gunthorpe wrote:
> On Thu, Dec 02, 2021 at 01:01:42AM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
>> > On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
>> > Which in turn is consistent all over the place and does not require any
>> > special case for anything. Neither for interrupts nor for anything else.
>> 
>> that said, feel free to tell me that I'm getting it all wrong.
>> 
>> The reason I'm harping on this is that we are creating ABIs on several
>> ends and we all know that getting that wrong is a major pain.
>
> I don't really like coupling the method to fetch IRQs with needing
> special struct devices. Struct devices have a sysfs presence and it is
> not always appropriate to create sysfs stuff just to allocate some
> IRQs.
>
> A queue is simply not a device, it doesn't make any sense. A queue is
> more like a socket().

Let's put the term 'device' aside for a bit please.

> That said, we often have enough struct devices floating about to make
> this work. Between netdev/ib_device/aux device/mdev we can use them to
> do this.
>
> I think it is conceptual nonsense to attach an IMS IRQ domain to a
> netdev or a cdev, but it will solve this problem.

The IMS irqdomain is _NOT_ part of the netdev or cdev or whatever. I
explained that several times now.

We seem to have a serious problem of terminology and the understanding
of topology which is why we continue to talk past each other forever.

Let's take a big step back and clarify these things first.

An irq domain is an entity which has two parts:

   1) The resource management part
   
   2) An irqchip which knows how to access the actual hardware
      implementation for the various operations related to irq chips
      (mask, unmask, ack, eoi, ....)

irq domains are used for uniform resource spaces which have a uniform
hardware interrupt chip. Modern computer systems have several resource
levels which form a hierarchy of several irqdomain levels.

The root of that hierarchy is at the CPU level:

 |---------------|
 | CPU     | BUS |
 |         |-----|
 |         | IRQ |
 |---------------|

The CPU exposes the bus and a mechanism to raise interrupts in the CPU
from external components. Whether those interrupts are raised by wire or
by writing a message to a given address is an implementation detail.

So the root IRQ domain is the one which handles the CPU level interrupt
controller and manages the associated resources: The per CPU vector
space on x86, the global vector space on ARM64.

If there's an IOMMU in the system then the IOMMU is the next level
interrupt controller:

    1) It manages the translation table resources

    2) Provides an interrupt chip which knows how to access the
       table entries

Of course there can be several IOMMUs depending on the system topology,
but from an irq domain view they are at the same level.

On top of that we have on x86 the next level irq domains:

   IO/APIC, HPET and PCI/MSI

which again do resource management and provide an irqchip which
handles the underlying hardware implementation.

So the full hierarchy so far looks like this on X86

   VECTOR --|
            |-- IOMMU --|
                        |-- IO/APIC
                        |-- HPET
                        |-- PCI/MSI

So allocating an interrupt on the outer level which is exposed to
devices requires allocation through the hierarchy down to the root
domain (VECTOR). The reason why we have this hierarchy is that this
makes it trivial to support systems with and without IOMMU because on a
system w/o IOMMU the outer domains have the vector domain as their
parent. ARM(64) has even deeper hierarchies, but operates on the same
principle.

Let's look at the PCI/MSI domain more detailed:

  It's either one irqdomain per system or one irqdomain per IOMMU domain

  The PCI/MSI irqdomain does not really do resource allocations; it's
  all delegated down the hierarchy.

  It also provides an interrupt chip which knows how to access the MSI[X]
  entries and can handle masking etc.

Now you might ask how the MSI descriptors come into play. The MSI
descriptors are allocated by the PCI/MSI interface functions in order to
enable the PCI/MSI irqdomain to actually allocate resources.

They could be allocated as part of the irqdomain allocation mechanism
itself. That's what I do now for simple irq domains which only have a
linear MSI descriptor space where the only information in the descriptor
is the index and not all the PCI/MSI tidbits.

It's kinda blurry, but that's mostly due to the fact that we still have
to support the legacy PCI/MSI implementations on architectures which do
not use hierarchical irq domains.

The MSI descriptors are stored in device->msi.data.store. That makes
sense because they are allocated on behalf of that device. But there is
no requirement to store them there. We could just store a cookie in the
device and have some central storage. It's an implementation detail,
which I kept that way so far.

But note, these MSI descriptors are not part of the PCI/MSI
irqdomain. They are just resources which the irqdomain hands out to the
device on allocation and collects on free.

From a topology view the location of the PCI/MSI domain is at the PCI
level which is correct because if you look at a PCIe card then it
consists of two parts:

    |----------------|---------------------------|
    | PCIe interface |                           |
    |----------------| Physical function         |
    | Config space   |                           |
    |----------------|---------------------------|
    |            Bus bridge                      |
    |--------------------------------------------|

The PCIe interface is _not_ managed by the device driver. It's managed
by the PCI layer including the content of the MSI-X table.

The physical function or some interrupt translation unit reads these
entries to actually raise an interrupt down through

        PCIe -> IOMMU -> CPU

by writing the message data to the message address.

If you look at the above irqdomain hierarchy then you'll notice that the
irqdomain hierarchy is modeled exactly that way.

Now ideally we should do the same modeling for IMS:

    CPU -> IOMMU -> IMS

and have that global or per IOMMU domain. But we can't do that because
IMS is not uniform in terms of the interrupt chip.

IDXD has this table in device memory and you want to store the message
in system memory and the next fancy card will do something else.

Megha tried to do that via this platform MSI indirection, but that's
just a nasty hack. The platform MSI domain is already questionable, but
bolting IMS support into it would make it a total trainwreck.

Aside of that, this would not solve the allocation problem for devices
which have an IMS table in device memory because then every device driver
would have to have its own table allocation mechanism.

Creating an IMS-IDXD irq domain per IOMMU domain would be a possible
solution, but then dealing with different incarnations of table
layouts will create a zoo of crap. Also it would not solve the problem
of IMS storage in system memory.

That's why I suggested to create a device specific IMS irq domain with a
device specific irqchip. The implementation can be purely per device or
a shared implementation in drivers/irqchip. The latter is preferable
because otherwise we won't ever get code reuse even if the device memory
table or the system memory storage mechanism is the same.

So the device driver for the PCIe device has to do:

   probe()
   ...
     mydev->ims_domain = ims_$type_create_msi_domain(....)

Now allocations from that domain for usage by the device driver do:

     ret = msi_domain_alloc_irqs(mydev->ims_domain, ...);

Now that's where the problem starts:

  The allocation creates MSI descriptors and has to store them
  somewhere.

Of course we can store them in pci_dev.dev.msi.data.store. Either with a
dedicated xarray or by partitioning the xarray space. Both have their
pro and cons.

Ok?

But what I really struggle with is how this is further represented when
the queues are allocated for VFIO, cdev, whatever usage.

The point is that what you call a queue is not a single resource entity,
it's a container for which several resource types are required to become
functional:

   1) The device queue
   2) Interrupts
   3) ....

Of those resources only one, the device queue, is managed by the
device driver.

The fact that the irqdomain is instantiated by the device driver does
not make it any different. As explained above it's just an
implementation detail which makes it possible to handle the device
specific message storage requirement in a sane way. The actual interrupt
resources come still from the underlying irqdomains as for PCI/MSI.

Now I was looking for a generic representation of such a container and
my initial reaction was to bind it to a struct device, which also makes
it trivial to store these MSI descriptors in that struct device.

I can understand your resistance against that to some extent, but I
think we really have to come up with a proper abstract representation
for these.

Why?

Simply because I want to avoid yet another non-uniform representation of
these things, which makes it hard to expose them in a topology view
and requires non-uniform wrappers for the same type of information all
over the place.

We seriously have enough of this adhoc stuff already and there is no
need to open another can of worms.

After thinking more about this, I came up with a name for it: Logical function.

See, I was carefully avoiding the term 'device' and even more so did not
fall for the Software-Defined buzzword space. :)

Such a logical function would be the entity to hand out for VFIO or
cdev.

But it also can be exported e.g. via sysfs

    bus/pci/.../lfunc/0/

to expose information about these. From a practical POV exposing the
interrupts there too seems to be genuinely useful in order to do proper
affinity settings etc. Retrieving that information from /proc/interrupts
is a complete nightmare.

While you did not find anything in debian, I use the msi_irqs/ as an
easy way to find out which interrupts belong to which device.

I have no strong opinion about those entries, as one can surely figure it
out by other means, but from a consistency view it makes sense to me.

Using such a uniform representation for storing the associated MSI
descriptors makes sense to me not only from the irq layer but also for
practical purposes.

If this is not split out, then every driver and wrapper has to come up
with its own representation of this instead of being able to do:

     request_irq(msi_get_virq(lfunc, idx=0), handler0, ...);
     request_irq(msi_get_virq(lfunc, idx=1), handler1, ...);

I went through a ton of usage sites while I was working on this and the
amount of crap which is done there differently is amazing.

Of course that needs some thoughts vs. the MSI domain interfaces, but as
we are modifying them anyway, it's a good opportunity to introduce
a common representation for these things.

Thanks,

        tglx


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-02 19:25                                   ` Thomas Gleixner
@ 2021-12-02 20:00                                     ` Jason Gunthorpe
  2021-12-02 22:31                                       ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-02 20:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

On Thu, Dec 02, 2021 at 08:25:48PM +0100, Thomas Gleixner wrote:
> Jason,
> 
> On Thu, Dec 02 2021 at 09:55, Jason Gunthorpe wrote:
> > On Thu, Dec 02, 2021 at 01:01:42AM +0100, Thomas Gleixner wrote:
> >> On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
> >> > On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> >> > Which in turn is consistent all over the place and does not require any
> >> > special case for anything. Neither for interrupts nor for anything else.
> >> 
> >> that said, feel free to tell me that I'm getting it all wrong.
> >> 
> >> The reason I'm harping on this is that we are creating ABIs on several
> >> ends and we all know that getting that wrong is a major pain.
> >
> > I don't really like coupling the method to fetch IRQs with needing
> > special struct devices. Struct devices have a sysfs presence and it is
> > not always appropriate to create sysfs stuff just to allocate some
> > IRQs.
> >
> > A queue is simply not a device, it doesn't make any sense. A queue is
> > more like a socket().
> 
> Let's put the term 'device' aside for a bit please.
> 
> > That said, we often have enough struct devices floating about to make
> > this work. Between netdev/ib_device/aux device/mdev we can use them to
> > do this.
> >
> > I think it is conceptual nonsense to attach an IMS IRQ domain to a
> > netdev or a cdev, but it will solve this problem.
> 
> The IMS irqdomain is _NOT_ part of the netdev or cdev or whatever. I
> explained that several times now.
> 
> We seem to have a serious problem of terminology and the understanding
> of topology which is why we continue to talk past each other forever.

I think I understand and agree with everything you said below.

The point we diverge is where to put the vector storage:
 
> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
> dedicated xarray or by partitioning the xarray space. Both have their
> pro and cons.

This decision seems to drive the question of how many 'struct devices'
do we need, and where do we get them..

> But what I really struggle with is how this is further represented when
> the queues are allocated for VFIO, cdev, whatever usage.

Yes, this seems to be the primary question

> The fact that the irqdomain is instantiated by the device driver does
> not make it any different. As explained above it's just an
> implementation detail which makes it possible to handle the device
> specific message storage requirement in a sane way. The actual interrupt
> resources come still from the underlying irqdomains as for PCI/MSI.

Yes! This is not under debate

> Now I was looking for a generic representation of such a container and
> my initial reaction was to bind it to a struct device, which also makes
> it trivial to store these MSI descriptors in that struct device.

Right, I've been trying to argue around just this choice.
 
> I can understand your resistance against that to some extent, but I
> think we really have to come up with a proper abstract representation
> for these.

Ok

> Such a logical function would be the entity to hand out for VFIO or
> cdev.

What is a logical function, concretely? 

Does it have struct device?

Can I instead suggest a name like 'message interrupt table'?

I.e. a device has two linearly indexed message interrupt tables - the
PCI SIG defined MSI/MSI-X one created by the PCI core and the IMS one
created by the driver.

Both start at 0 index and they have different irq_domains.

Instead of asking the driver to create a domain we ask the driver to
create a new 'message interrupt table'. The driver provides the
irq_chip to program the messages and the pci_device. The core code
manages the irq domain setup.

Using what you say below:

> If this is not split out, then every driver and wrapper has to come up
> with it's own representation of this instead of being able to do:
> 
>      request_irq(msi_get_virq(lfunc, idx=0), handler0, ...);
>      request_irq(msi_get_virq(lfunc, idx=1), handler1, ...);

We could say:
  msi_get_virq(device.pci_msi_table, index=0)

Is the 0th PCI SIG MSI vector

Something like:

 ims_table = pci_create_msi_table(pci_dev, my_irq_chip,..)
 msi_get_virq(ims_table, index=0)

Is the 0th IMS vector

Is it close to what you are thinking with lfunc?

Regards,
Jason


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-02 20:00                                     ` Jason Gunthorpe
@ 2021-12-02 22:31                                       ` Thomas Gleixner
  2021-12-03  0:37                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-02 22:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

Jason,

On Thu, Dec 02 2021 at 16:00, Jason Gunthorpe wrote:
> On Thu, Dec 02, 2021 at 08:25:48PM +0100, Thomas Gleixner wrote:
>> We seem to have a serious problem of terminology and the understanding
>> of topology which is why we continue to talk past each other forever.
>
> I think I understand and agree with everything you said below.

Good!

> The point we diverge is where to put the vector storage:

Kinda. The vector, i.e. message storage is either:

  - MSI entry in the PCI config space
  - MSI-X table in the PCI config space
  - Device specific IMS storage

The software representation aka struct msi_desc is a different
story. That's what we are debating.

>> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
>> dedicated xarray or by partitioning the xarray space. Both have their
>> pro and cons.
>
> This decision seems to drive the question of how many 'struct devices'
> do we need, and where do we get them..

Not really. There is nothing what enforces to make the MSI irqdomain
storage strictly hang off struct device. There has to be a connection to
a struct device in some way obviously to make IOMMU happy.

>> Such a logical function would be the entity to hand out for VFIO or
>> cdev.
>
> What is a logical function, concretely?

That's a name I came up with for an abstract representation of such a
queue container. It was an obvious consequence of my previous reasoning
about PF -> VF -> XF.

> Does it have struct device?

It does not necessarily have to.

> Can I instead suggest a name like 'message interrupt table' ?

Well yes, but that's not what I meant. See below.

> Ie a device has two linearly indexed message interrupt tables - the
> PCI SIG defined MSI/MSI-X one created by the PCI core and the IMS one
> created by the driver.
>
> Both start at 0 index and they have different irq_domains.
>
> Instead of asking the driver to create a domain we ask the driver to
> create a new 'message interrupt table'. The driver provides the
> irq_chip to program the messages and the pci_device. The core code
> manages the irq domain setup.
>
> Using what you say below:
>
>> If this is not split out, then every driver and wrapper has to come up
>> with it's own representation of this instead of being able to do:
>> 
>>      request_irq(msi_get_virq(lfunc, idx=0), handler0, ...);
>>      request_irq(msi_get_virq(lfunc, idx=1), handler1, ...);
>
> We could say:
>   msi_get_virq(device.pci_msi_table, index=0)
>
> Is the 0th PCI SIG MSI vector
>
> Something like:
>
>  ims_table = pci_create_msi_table(pci_dev, my_irq_chip,..)
>  msi_get_virq(ims_table, index=0)

Which is pretty much a wrapper around two different irqdomains for the
device and either partitioned index space in the xarray or two xarrays.

Just badly named because the table itself is where the resulting message
is stored, which is composed with the help of the relevant MSI
descriptor. See above.

We really should not try to make up an artificial table representation
for something which does not necessarily have a table at all, i.e. the
devices you talk about which store the message in queue specific system
memory. Pretending that this is a table is just silly.

Also I disagree that this has to be tied to a PCI specific interface,
except for creating a PCI specific wrapper for it to not make a driver
developer have to write '&pdev->dev', which is the very least of our
problems.

IMS as a technical concept is absolutely not PCI specific at all and not
invented by PCI/SIG. It's a marketing brandname for something which
existed way before they thought about it: Message signaled interrupts.

Aside of that 'my_irq_chip' does not cut it at all because of the way
how the resulting messages are stored. IDXD has IOMEM storage and a
storage space limitation while your device uses system memory storage
and has other limitations, i.e. system memory and the number of queues
the device can provide.

An irqchip is just a set of functions to talk to hardware either
directly or via some indirect transport (I2C, SPI, MLX queue management
magic...). These functions require irqdomain and/or device specific
information to function.

Trying to create a universal pci_create_foo() wrapper around this is
going to be like the 13th Herculean task.

Seriously, you cannot make something uniform which is by definition
non-uniform.

Let's not even try to pretend that it is possible.

> Is the 0th IMS vector
>
> Is it close to what you are thinking with lfunc?

Not really. I was really reasoning about an abstract representation for
a functional queue, which is more than just a queue allocated from the
PF or VF device.

I really meant a container like this:

struct logical_function {
        /* Pointer to the physical device */
        struct device		*phys_device;
        /* MSI descriptor storage */
	struct msi_data		msi;
        /* The queue number */
        unsigned int		queue_nr;
        /* Add more information which is common to these things */
};

Now the real queue, which is obviously not generic:

struct myqueue_function {
	struct logical_function lfunc;
        struct myqueue		queue;
};        
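
To make the container idea concrete, here is a compilable toy model of the
embedding plus container_of() pattern (all type names and fields here are
illustrative stand-ins I made up, not existing kernel structures):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the structures sketched above. */
struct msi_data { unsigned int nr_irqs; };

struct logical_function {
        void *phys_device;          /* pointer to the physical device */
        struct msi_data msi;        /* MSI descriptor storage */
        unsigned int queue_nr;      /* the queue number */
};

struct myqueue { int depth; };

struct myqueue_function {
        struct logical_function lfunc;
        struct myqueue queue;
};

/* Generic code sees only the embedded logical_function; the driver
 * recovers its own container, container_of() style. */
static struct myqueue_function *to_myqueue(struct logical_function *lfunc)
{
        return (struct myqueue_function *)
                ((char *)lfunc - offsetof(struct myqueue_function, lfunc));
}
```

Generic code (VFIO, cdev, sysfs glue) would only ever see the embedded
logical_function; the device driver gets its private structure back from it.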

The idea is to have a common representation for these type of things
which allows:

 1) Have common code for exposing queues to VFIO, cdev, sysfs...

    You still need myqueue specific code, but the common stuff which is
    in struct logical_function can be generic and device independent.

 2) Having the MSI storage per logical function (queue) allows to have
    a queue relative 0 based MSI index space.

    The actual index in the physical table (think IMS) would be held in
    the msi descriptor itself.

    Which then allows queue relative addressing without extra device/queue
    specific meta storage.

    i.e.

        msi_get_virq(&myqueue->lfunc.msi, idx = 0)

    v.s.

        idx = myqueue->msidx[0];
        msi_get_virq(pcidev->dev, idx);

        where the queue management code has to set up myqueue->msidx[]
        and stick the index of the underlying device storage into it.

 3) Setup and teardown would be simply per logical function for
    all of the related resources which are required.

    Interrupt teardown would look like this:

      msi_domain_free_all_irqs(irqdomain, &lfunc->msi);

    vs.

      for (i = 0; i < myqueue->nrirqs; i++)
           msi_domain_free_irq(irqdomain, &pcidev->dev, myqueue->msidx[i]);
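
The queue-relative indexing in 2) can be sketched as a self-contained toy
(msi_desc_model and queue_msi are made-up names; the point is only that
index 0 is always the queue's first vector, while the physical slot lives
in the descriptor itself):

```c
#include <assert.h>

#define QUEUE_MAX_MSI 4

/* Toy model: each queue holds its own small descriptor array; each
 * descriptor records where the message lives in device storage. */
struct msi_desc_model {
        unsigned int hw_index;   /* slot in the physical message store */
        int virq;                /* Linux interrupt number */
};

struct queue_msi {
        struct msi_desc_model desc[QUEUE_MAX_MSI];
};

/* Queue-relative lookup: idx 0 is the queue's first vector, no matter
 * which physical slot the device assigned to it. */
static int msi_get_virq_model(struct queue_msi *m, unsigned int idx)
{
        if (idx >= QUEUE_MAX_MSI)
                return -1;
        return m->desc[idx].virq;
}
```

The queue management code never needs a separate myqueue->msidx[] array;
the device-global index is hidden inside the descriptor.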


Now change struct logical_function to:

struct logical_function {
-       /* Pointer to the physical device */
-       struct device		*phys_device;

+       /* Pseudo device to allow using devres */
+       struct pseudo_device	pseudo_device;

	/* MSI descriptor storage */
	struct msi_data		msi;
        /* The queue number */
        unsigned int		queue_nr;
        /* Add more information which is common to these things */
};

where struct pseudo_device holds the phys_device pointer and then you
can utilize the devres infrastructure like you do for any other device
and do:

      pseudo_device_add(&myqueue->lfunc.pseudo_device);

at setup time and

      pseudo_device_remove(&myqueue->lfunc.pseudo_device);

on teardown and let all the resources including MSI interrupts be
released automatically.

Needs some infrastructure obviously, but to me that makes a lot of
sense.

And I named it pseudo_device on purpose as it is just a vehicle to make
existing infrastructure which is device specific usable for this kind of
thing.
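
A toy model of the devres-like mechanism this would need (pseudo_res,
pseudo_device_add_res() and release_count() are invented for illustration;
the real devres API looks different):

```c
#include <assert.h>
#include <stdlib.h>

/* Resources registered against the pseudo device are released in one
 * go on teardown, devres style. */
struct pseudo_res {
        void (*release)(void *data);
        void *data;
        struct pseudo_res *next;
};

struct pseudo_device {
        void *phys_device;
        struct pseudo_res *resources;
};

static void pseudo_device_add_res(struct pseudo_device *pdev,
                                  void (*release)(void *), void *data)
{
        struct pseudo_res *r = malloc(sizeof(*r));

        r->release = release;
        r->data = data;
        r->next = pdev->resources;
        pdev->resources = r;
}

/* Teardown: run every release callback, newest first. */
static void pseudo_device_remove(struct pseudo_device *pdev)
{
        while (pdev->resources) {
                struct pseudo_res *r = pdev->resources;

                pdev->resources = r->next;
                r->release(r->data);
                free(r);
        }
}

/* Example release callback which just counts invocations. */
static void release_count(void *data)
{
        ++*(int *)data;
}
```

MSI teardown would then just be one more release callback hanging off the
pseudo device.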

I might be completely off track. Feel free to tell me so :)

Thanks,

        tglx


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-02 22:31                                       ` Thomas Gleixner
@ 2021-12-03  0:37                                         ` Jason Gunthorpe
  2021-12-03 15:07                                           ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-03  0:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

On Thu, Dec 02, 2021 at 11:31:11PM +0100, Thomas Gleixner wrote:

> The software representation aka struct msi_desc is a different
> story. That's what we are debating.

Okay, I did mean msi_desc storage, so we are talking about the same things

> >> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
> >> dedicated xarray or by partitioning the xarray space. Both have their
> >> pro and cons.
> >
> > This decision seems to drive the question of how many 'struct devices'
> > do we need, and where do we get them..
> 
> Not really. There is nothing what enforces to make the MSI irqdomain
> storage strictly hang off struct device. There has to be a connection to
> a struct device in some way obviously to make IOMMU happy.

Yes, I thought this too, OK we agree

> Just badly named because the table itself is where the resulting message
> is stored, which is composed with the help of the relevant MSI
> descriptor. See above.

I picked the name because it looks like it will primarily contain an
xarray and the API you suggested to query it is idx based. To me that
is a table. A table of msi_desc storage to be specific.

It seems we have some agreement here as your lfunc also included the
same xarray and uses an idx as part of the API, right?

> We really should not try to make up an artifical table representation
> for something which does not necessarily have a table at all, i.e. the
> devices you talk about which store the message in queue specific system
> memory. Pretending that this is a table is just silly.

Now I'm a bit confused, what is the idx in the lfunc? 

I think this is language again, I would call idx an artificial table
representation?

> Also I disagree that this has to be tied to a PCI specific interface,
> except for creating a PCI specific wrapper for it to not make a driver
> developer have to write '&pdev->dev', which is the very least of our
> problems.

Agreed, just trying to be illustrative at a higher level
 
> Aside of that 'my_irq_chip' does not cut it at all because of the way
> how the resulting messages are stored. IDXD has IOMEM storage and a
> storage space limitation while your device uses system memory storage
> and has other limitations, i.e. system memory and the number of queues
> the device can provide.

Okay, so the device must setup the domain because it must provide
information during setup. Makes sense

> Not really. I was really reasoning about an abstract representation for
> a functional queue, which is more than just a queue allocated from the
> PF or VF device.
> 
> I really meant a container like this:
> 
> struct logical_function {
>         /* Pointer to the physical device */
>         struct device		*phys_device;
>         /* MSI descriptor storage */
> 	struct msi_data		msi;

Up to here is basically what I thought the "message table" would
contain. Possibly a pointer to the iommu_domain too?

>         /* The queue number */
>         unsigned int		queue_nr;
>         /* Add more information which is common to these things */

Not sure why we need this?

Lets imagine a simple probe function for a simple timer device. It has
no cdev, vfio, queues, etc. However, the device only supports IMS. No
MSI, MSI-X, nothing else.

What does the probe function look like to get to request_irq()?

Why does this simple driver create something called a 'logical
function' to access its only interrupt?

I think you have the right idea here, just forget about VFIO, the IDXD
use case, all of it. Provide a way to use IMS cleanly and concurrently
with MSI.

Do that and everything else should have sane solutions too.

> The idea is to have a common representation for these type of things
> which allows:
> 
>  1) Have common code for exposing queues to VFIO, cdev, sysfs...
> 
>     You still need myqueue specific code, but the common stuff which is
>     in struct logical_function can be generic and device independent.

I will quote you: "Seriously, you cannot make something uniform which
is by definition non-uniform." :)

We will find there is no common stuff here, we did this exercise
already when we designed struct auxiliary_device, and came up
empty. 

>  2) Having the MSI storage per logical function (queue) allows to have
>     a queue relative 0 based MSI index space.

Can you explain a bit what you see 0 meaning in this idx numbering?

I also don't understand what 'queue relative' means?

>     The actual index in the physical table (think IMS) would be held in
>     the msi descriptor itself.

This means that a device like IDXD would store the physical IMS table
entry # in the msi descriptor? What is idx then?

For a device like IDXD with a simple linear table, how does the driver
request a specific entry in the IMS table? eg maybe IMS table entry #0
is special like we often see in MSI?

>         msi_get_virq(&myqueue->lfunc.msi, idx = 0)
>
>     v.s.
>
>         idx = myqueue->msidx[0];
>         msi_get_virq(pcidev->dev, idx);
 
>         where the queue management code has to set up myqueue->msidx[]
>         and stick the index of the underlying device storage into it.

For the devices I know about there are two approaches for allocating
interrupts. 

Many devices have a few special interrupts at fixed table
indexes. These can only be used for their single purpose. Eg MSI 0 and
1. An example is IRQ 1 signals the device had an error, read the error
reporting MMIO registers.

Then the device has a broad pool of fungible vectors. These are all
interchangeable, any vector can be used with any queue. eg MSI 2->128

A common mode operation is to setup the special vectors first, and
then assign on demand the pool vectors as CPUs are allocated for
use. Basically each CPU gets a vector and then many device objects are
attached to that vector. This scheme gives CPU locality and doesn't
exhaust CPU vectors. As example nvme uses 1 queue per CPU and 1 vector
per queue and userspace tells it how many queues (and thus CPUs) to
consume.
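
The allocation policy described above can be sketched as follows
(vector_pool and the sizes are invented for illustration; real drivers do
this with their own bookkeeping):

```c
#include <assert.h>

#define NR_SPECIAL 2     /* e.g. error + admin interrupts at fixed slots */
#define NR_VECTORS 16    /* illustrative pool size */

/* Vectors below NR_SPECIAL are fixed-purpose; the rest form a fungible
 * pool where each CPU gets one vector and many objects share it. */
struct vector_pool {
        int cpu_vector[NR_VECTORS];  /* CPU owning this slot, -1 = free */
};

static void pool_init(struct vector_pool *p)
{
        for (int i = 0; i < NR_VECTORS; i++)
                p->cpu_vector[i] = -1;
}

/* Return the pool vector assigned to @cpu, allocating on first use. */
static int pool_vector_for_cpu(struct vector_pool *p, int cpu)
{
        for (int i = NR_SPECIAL; i < NR_VECTORS; i++)
                if (p->cpu_vector[i] == cpu)
                        return i;
        for (int i = NR_SPECIAL; i < NR_VECTORS; i++)
                if (p->cpu_vector[i] == -1) {
                        p->cpu_vector[i] = cpu;
                        return i;
                }
        return -1;  /* pool exhausted */
}
```

Repeated requests for the same CPU land on the same vector, which is what
gives the CPU locality without exhausting vectors.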

A device like mlx5 may have 1000's of queues and hundreds of sub
devices sharing a single CPU interrupt. This is desirable because we
can quickly run out of interrupts when we are talking about 100's of
concurrent subdevices.

When a "VFIO mdev" gets involved (mlx5 has this already, it is just
called VDPA, don't ask) the same allocator is used by the VFIO part
except VFIO has to request a dedicated MSI. Dedicated is because the
Intel VTx is used to deliver that MSI addr/data directly to the
guest's vAPIC.

When I imagine mlx5 using IMS, I see IMS as a simple extension of the
existing MSI-X vector pool. Every IMS vector is interchangeable and the
main PCI driver will apply exactly the same allocation algorithm we
already have today for MSI, just with even more vectors. When VFIO
wants a vector it may get a MSI or it may get an IMS, I don't want to
care.

All of this about logical functions just confuses
responsibilities. The IRQ layer should be worrying about configuring
IRQs and not dictating how the device will design its IRQ assignment
policy or subdevice scheme.

>  3) Setup and teardown would be simply per logical function for
>     all of the related resources which are required.
> 
>     Interrrupt teardown would look like this:
> 
>       msi_domain_free_all_irqs(irqdomain, &lfunc->msi);

Looks nice

> Now change struct logical_function to:
> 
> struct logical_function {
> -       /* Pointer to the physical device */
> -       struct device		*phys_device;
> 
> +       /* Pseudo device to allow using devres */
> +       struct pseudo_device	pseudo_device;
> 
> 	/* MSI descriptor storage */
> 	struct msi_data		msi;
>         /* The queue number */
>         unsigned int		queue_nr;
>         /* Add more information which is common to these things */
> };
> 
> where struct pseudo_device holds the phys_device pointer and then you
> can utilize the devres infrastructure like you do for any other device
> and do:
> 
>       pseudo_device_add(&myqueue->lfunc.pseudo_device);
> 
> at setup time and
> 
>       pseudo_device_remove(&myqueue->lfunc.pseudo_device);
> 
> on teardown and let all the resources including MSI interrupts be
> released automatically.

We are already doing this with struct auxiliary_device and other
similar things. I don't see the value to couple the msi_data into the
lifetime model?

> I might be completely off track. Feel free to tell me so :)

I think you have the right core idea, just shrink your thinking of
logical_function to only cover msi stuff and leave the device model to
the other places that already have good mechanisms. eg auxiliary device

IMHO this has become hyper focused on the special IDXD VFIO use case -
step back and think about my timer example above - a simple pci_driver
that just wants to use IMS for itself. No queues, no VFIO, no
mess. Just how does it configure and use the interrupt source.

Is it helpful feedback?

Regards,
Jason


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-03  0:37                                         ` Jason Gunthorpe
@ 2021-12-03 15:07                                           ` Thomas Gleixner
  2021-12-03 16:41                                             ` Jason Gunthorpe
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-03 15:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

Jason,

On Thu, Dec 02 2021 at 20:37, Jason Gunthorpe wrote:
> On Thu, Dec 02, 2021 at 11:31:11PM +0100, Thomas Gleixner wrote:
>> >> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
>> >> dedicated xarray or by partitioning the xarray space. Both have their
>> >> pro and cons.
>> >
>> > This decision seems to drive the question of how many 'struct devices'
>> > do we need, and where do we get them..
>> 
>> Not really. There is nothing what enforces to make the MSI irqdomain
>> storage strictly hang off struct device. There has to be a connection to
>> a struct device in some way obviously to make IOMMU happy.
>
> Yes, I thought this too, OK we agree

One thing which just occured to me and I missed so far is the
association of the irqdomain.

Right now we know for each device which irqdomain it belongs to. That's
part of device discovery (PCI, device tree or whatever) which sets up

     device->irqdomain

That obviously cannot work for that device specific IMS domain. Why?

Because the PCIe device provides MSI[x] which obviously requires that
device->irqdomain is pointing to the PCI/MSI irqdomain. Otherwise the
PCI/MSI allocation mechanism won't work.

Sure each driver can have

     mystruct->ims_irqdomain

and then do the allocation via

    msi_irq_domain_alloc(mystruct->ims_irqdomain....)

but I really need to look at all of the MSI infrastructure code whether
it makes assumptions about this:

   msi_desc->dev->irqdomain

being the correct one. Which would obviously just be correct when we'd
have a per queue struct device :)

>> Just badly named because the table itself is where the resulting message
>> is stored, which is composed with the help of the relevant MSI
>> descriptor. See above.
>
> I picked the name because it looks like it will primarily contain an
> xarray and the API you suggested to query it is idx based. To me that
> is a table. A table of msi_desc storage to be specific.
>
> It seems we have some agreement here as your lfunc also included the
> same xarray and uses an idx as part of the API, right?

It's a per lfunc xarray, yes.

>> We really should not try to make up an artifical table representation
>> for something which does not necessarily have a table at all, i.e. the
>> devices you talk about which store the message in queue specific system
>> memory. Pretending that this is a table is just silly.
>
> Now I'm a bit confused, what is the idx in the lfunc? 
>
> I think this is language again, I would call idx an artificial table
> representation?

Ok. I prefer the term indexed storage, but yeah.

>> I really meant a container like this:
>> 
>> struct logical_function {
>>         /* Pointer to the physical device */
>>         struct device		*phys_device;
>>         /* MSI descriptor storage */
>> 	struct msi_data		msi;
>
> Up to here is basically what I thought the "message table" would
> contain. Possibly a pointer to the iommu_domain too?
>
>>         /* The queue number */
>>         unsigned int		queue_nr;
>>         /* Add more information which is common to these things */
>
> Not sure why do we need this?
>
> Lets imagine a simple probe function for a simple timer device. It has
> no cdev, vfio, queues, etc. However, the device only supports IMS. No
> MSI, MSI-X, nothing else.
>
> What does the probe function look like to get to request_irq()?

The timer device would be represented by what? A struct device?

If so then something needs to set device->irqdomain and then you can
just do:

     msi_irq_domain_alloc_irqs(dev->irqdomain, dev, ...);

> Why does this simple driver create something called a 'logical
> function' to access its only interrupt?

It does not have to when it's struct device based.

> I think you have the right idea here, just forget about VFIO, the IDXD
> use case, all of it. Provide a way to use IMS cleanly and concurrently
> with MSI.
>
> Do that and everything else should have sane solutions too.

To do that, I need to understand the bigger picture and the intended
usage because otherwise we end up with an utter mess.

>> The idea is to have a common representation for these type of things
>> which allows:
>> 
>>  1) Have common code for exposing queues to VFIO, cdev, sysfs...
>> 
>>     You still need myqueue specific code, but the common stuff which is
>>     in struct logical_function can be generic and device independent.
>
> I will quote you: "Seriously, you cannot make something uniform which
> is by definition non-uniform." :)
>
> We will find there is no common stuff here, we did this exercise
> already when we designed struct auxiliary_device, and came up
> empty.

Really?

>>  2) Having the MSI storage per logical function (queue) allows to have
>>     a queue relative 0 based MSI index space.
>
> Can you explain a bit what you see 0 meaning in this idx numbering?
>
> I also don't understand what 'queue relative' means?
>
>>     The actual index in the physical table (think IMS) would be held in
>>     the msi descriptor itself.
>
> This means that a device like IDXD would store the phsical IMS table
> entry # in the msi descriptor? What is idx then?
>
> For a device like IDXD with a simple linear table, how does the driver
> request a specific entry in the IMS table? eg maybe IMS table entry #0
> is special like we often see in MSI?

If there is a hardwired entry then this hardwired entry belongs to the
controlblock (status, error or whatever), but certainly not to a
dynamically allocatable queue as that would defeat the whole purpose.

That hardwired entry could surely exist in an IDXD type setup where the
message storage is in a defined array on the device.

But with the use case you described where you store the message at some
place in per queue system memory, the MSI index is not relevant at all,
because there is no table. So it's completely meaningless for the device
and just useful for finding the related MSI descriptor again.

If the queue or the controlblock have an internal small array of
messages in their queue or controlblock specific memory, then how would
a per device global index help?

Or has each queue and controlblock and whatever access to a shared large
array where the messages are stored and the indices are handed out to
the queues and controlblocks?

If each of them have their own small array, then queue relative indexing
makes a ton of sense, no?

> For the devices I know about there are two approaches for allocating
> interrupts. 

I'm aware of that.

> When I imagine mlx5 using IMS, I see IMS as a simple extension of the
> existing MSI-X vector pool. Every IMS vector is interchangable and the
> main PCI driver will apply exactly the same allocation algorithm we
> already have today for MSI, just with even more vectors. When VFIO
> wants a vector it may get a MSI or it may get an IMS, I don't want to
> care.
>
> All of this about logical functions just confuses
> responsibilities. The IRQ layer should be worrying about configuring
> IRQs and not dictating how the device will design its IRQ assignment
> policy or subdevice scheme.

The IRQ layer _has_ to worry about it because there is a strict
relationship between device and irqdomain today.

Now with a PCIe device having PCI/MSI[X] it's a strict 1:1
relationship. Now add IMS to the picture and we end up with a 1:2
relationship, which does not work today.

That's why my initial reaction was to partition the device into
subdevices which would have solved the problem nicely.

> IMHO this has become hyper focused on the special IDXD VFIO use case -
> step back and think about my timer example above - a simple pci_driver
> that just wants to use IMS for itself. No queues, no VFIO, no
> mess. Just how does it configure and use the interrupt source.

That would mean that the PCI device would not use MSI-X at all, right?
So it could have pci_dev.dev.irqdomain = IMSdomain.

But that sucks too because it's yet another quirk as the PCIe discovery
cannot set that up simply because the device specific IMS domain does
not exist yet.

> Is it helpful feedback?

Yes, definitely. It helps me to understand the bigger picture and to
find a proper representation of these things so that we don't end up
with the usual mess in drivers/* which makes it a fricking nightmare to
change anything at the core code at all.

Thanks,

        tglx


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-03 15:07                                           ` Thomas Gleixner
@ 2021-12-03 16:41                                             ` Jason Gunthorpe
  2021-12-04 14:20                                               ` Thomas Gleixner
  2021-12-09  6:26                                               ` Tian, Kevin
  0 siblings, 2 replies; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-03 16:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

On Fri, Dec 03, 2021 at 04:07:58PM +0100, Thomas Gleixner wrote:
> Jason,
> 
> On Thu, Dec 02 2021 at 20:37, Jason Gunthorpe wrote:
> > On Thu, Dec 02, 2021 at 11:31:11PM +0100, Thomas Gleixner wrote:
> >> >> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
> >> >> dedicated xarray or by partitioning the xarray space. Both have their
> >> >> pro and cons.
> >> >
> >> > This decision seems to drive the question of how many 'struct devices'
> >> > do we need, and where do we get them..
> >> 
> >> Not really. There is nothing what enforces to make the MSI irqdomain
> >> storage strictly hang off struct device. There has to be a connection to
> >> a struct device in some way obviously to make IOMMU happy.
> >
> > Yes, I thought this too, OK we agree
> 
> One thing which just occured to me and I missed so far is the
> association of the irqdomain.

Oh? I thought this was part of what you were thinking when talking
about using the mdev struct device.

> but I really need to look at all of the MSI infrastructure code whether
> it makes assumptions about this:
> 
>    msi_desc->dev->irqdomain
> 
> being the correct one. Which would obviously just be correct when we'd
> have a per queue struct device :)

Right

I'd half expect the msi_desc to gain a pointer to the 'lfunc'

Then msi_desc->dev becomes msi_desc->origin_physical_device and is
only used by IOMMU code to figure out MSI remapping/etc. Maybe that
helps solve your aliasing problem (is this the RID aliasing we see
with iommu_groups too?)

> > Lets imagine a simple probe function for a simple timer device. It has
> > no cdev, vfio, queues, etc. However, the device only supports IMS. No
> > MSI, MSI-X, nothing else.
> >
> > What does the probe function look like to get to request_irq()?
> 
> The timer device would be represented by what? A struct device?

Lets say it is a pci_device and probe is a pci_driver probe function

> If so then something needs to set device->irqdomain and then you can
> just do:
> 
>      msi_irq_domain_alloc_irqs(dev->irqdomain, dev, ...);

Okay, but that doesn't work if it is a PCI device and dev->irqdomain
is already the PCI MSI and INTx irqdomain?

> To do that, I need to understand the bigger picture and the intended
> usage because otherwise we end up with an utter mess.

Sure
 
> >> The idea is to have a common representation for these type of things
> >> which allows:
> >> 
> >>  1) Have common code for exposing queues to VFIO, cdev, sysfs...
> >> 
> >>     You still need myqueue specific code, but the common stuff which is
> >>     in struct logical_function can be generic and device independent.
> >
> > I will quote you: "Seriously, you cannot make something uniform which
> > is by definition non-uniform." :)
> >
> > We will find there is no common stuff here, we did this exercise
> > already when we designed struct auxiliary_device, and came up
> > empty.
> 
> Really?

Yes.. It was a long drawn out affair, and auxiliary_device is just a
simple container

> >>  2) Having the MSI storage per logical function (queue) allows to have
> >>     a queue relative 0 based MSI index space.
> >
> > Can you explain a bit what you see 0 meaning in this idx numbering?
> >
> > I also don't understand what 'queue relative' means?
> >
> >>     The actual index in the physical table (think IMS) would be held in
> >>     the msi descriptor itself.
> >
> > This means that a device like IDXD would store the phsical IMS table
> > entry # in the msi descriptor? What is idx then?
> >
> > For a device like IDXD with a simple linear table, how does the driver
> > request a specific entry in the IMS table? eg maybe IMS table entry #0
> > is special like we often see in MSI?
> 
> If there is a hardwired entry then this hardwired entry belongs to the
> controlblock (status, error or whatever), but certainly not to a
> dynamically allocatable queue as that would defeat the whole purpose.
> 
> That hardwired entry could surely exist in a IDXD type setup where the
> message storage is in an defined array on the device.

Agree - so how does it get accessed?

> But with the use case you described where you store the message at some
> place in per queue system memory, the MSI index is not relevant at all,
> because there is no table. So it's completely meaningless for the device
> and just useful for finding the related MSI descriptor again.

Yes

What I don't see is how exactly we make this situation work. The big
picture sequence would be:

 - Device allocates a queue and gets a queue handle
 - Device finds a free msi_desc and associates that msi_desc with the
   queue handle
 - Device specific irq_chip uses the msi_desc to get the queue handle
   and programs the addr/data

In this regard the msi_desc is just an artifact of the API; the real
work happens through the queue handle.

> Or has each queue and controlblock and whatever access to a shared large
> array where the messages are stored and the indices are handed out to
> the queues and controlblocks?

> If each of them have their own small array, then queue relative indexing
> makes a ton of sense, no?

Okay, I see.

I don't know of any use case for more than one interrupt on a queue,
and if it did come up I'd probably approach it by making the queue
handle above also specify the 'queue relative HW index'

> > All of this about logical functions just confuses
> > responsibilities. The IRQ layer should be worrying about configuring
> > IRQs and not dictating how the device will design its IRQ assignment
> > policy or subdevice scheme.
> 
> The IRQ layer _has_ to worry about it because there is a strict
> relationship between device and irqdomain today.

Yes. Absolutely.

This is what I've been thinking for a few messages now :)

I liked your lfunc/"msi_table" (without the queue stuff) because it
breaks that 1:1 without bringing in struct device to do it.

If that seems reasonable I think it is the best way to go.

Let's do a thought experiment, let's say we forget about the current PCI
MSI API.

What if it worked more like this:

probe()
 // Access the real PCI SIG MSI/MSI-X table
 mystruct->msi_table = pci_allocate_msi_table(pci_dev);

 // Special interrupt 0
 mystruct->desc0 = msi_table_alloc_vector(mystruct->msi_table, hw_label=0);
 request_irq(mystruct->desc0->irq, ..)

 // Special interrupt 1
 mystruct->desc1 = msi_table_alloc_vector(mystruct->msi_table, hw_label=1);
 request_irq(mystruct->desc1->irq, ..)

 // Fungible queue interrupts are 3->MAX
 for (nqueues) {
   queue[i]->desc = msi_table_alloc_vector_range(mystruct->msi_table, lowest=3, highest=MAX);
   request_irq(queue[i]->desc->irq, ..)
 }

remove()
 for (nqueues)
     msi_table_free_vector(mystruct->msi_table, queue[i]->desc)
 msi_table_free_vector(mystruct->msi_table, mystruct->desc1);
 msi_table_free_vector(mystruct->msi_table, mystruct->desc0);
 msi_table_free(mystruct->msi_table);

(please take some latitude when reading this, I am just trying to
sketch)

Here I am trying to show:

 - msi_table is a general API for accessing MSIs. Each bus type
   would provide some kind of logical creation function for their
   bus standardized MSI table type. eg MSI/MSI-X/etc

   It is a logical handle for the physical resource that holds the MSI
   addr/data pairs.

 - Use a pointer instead of idx as the API for referring to a MSI
   irq. Most drivers probably only use msi_desc->irq?

 - We do not have device->msi, it is held in the driver instead.
   dev->irqdomain is always the bus/platform originated irqdomain of
   the physical device.

   Thus we break the 1:1 of the device and irqdomain. A device can
   spawn many msi_tables, but only one would be on the dev->irqdomain

 - Rely on dynamic allocation instead of the awkward batch interfaces.
   This means a msi_table starts out with 0 msi_descs and as we go
   they are allocated, installed in the xarray, and connected to the
   HW.

 - hw_label is defined by whatever type of msi_table it is:
   for PCI MSI this is the MSI table index
   for IDXD IMS this is the IMS table index
   for mlx5 memory IMS it is -1 (?) and the msi_desc is just placed
   anywhere in the xarray. msi_desc->xarray_index can store its
   location

 - 'msi_table_alloc_vector_range' allows a simple API to get any
   free entry from a group of fungible entries to make it clear
   to readers the typical special/pooled HW design pattern

Is it sane? What really needs device->msi anyhow?

IMS is a logical progression of this concept:

probe()
   mystruct->msi_table = pci_allocate_msi_table(pci_dev);
   mystruct->ims_irqdomain = <....>
   mystruct->ims_table = msi_allocate_table(pci_dev->dev,  mystruct->ims_irqdomain ..)

   // Use MSI
   mystruct->desc0 = msi_table_alloc_vector(mystruct->msi_table, hw_label=0);
   // Use IMS
   mystruct->desc1 = msi_table_alloc_vector(mystruct->ims_table, hw_label=0);

Not saying we can/should go out and just do something so radical, but
as a thought experiment, can it guide toward a direction like this?

> That's why my initial reaction was to partition the device into
> subdevices which would have solved the problem nicely.

Yes, I see that logic.

> > IMHO this has become hyper focused on the special IDXD VFIO use case -
> > step back and think about my timer example above - a simple pci_driver
> > that just wants to use IMS for itself. No queues, no VFIO, no
> > mess. Just how does it configure and use the interrupt source.
> 
> That would mean that the PCI device would not use MSI-X at all, right?
> So it could have pci_dev.dev.irqdomain = IMSdomain.

Yes, but I'm not trying to get that as a solution. I fully expect all
real PCI devices to have MSI tables to support OS's without IMS
capability (notably old Linux)

So IMS and MSI must co-exist.

> > Is it helpful feedback?
> 
> Yes, definitely. It helps me to understand the bigger picture and to
> find a proper representation of these things so that we don't end up
> with the usual mess in drivers/* which makes it a fricking nightmare to
> change anything at the core code at all.

Great!

I agree a lot with the other email you wrote that IMS == MSI - I think
that is key.

So much of these conversations have been warped by thinking of IMS as
some wonky thing for virtualization, or trying to focus on VFIO as the
only use case.

I think the other sort of tangential issue when you think on IMS ==
MSI is how pci_msi_create_irq_domain() is basically a special case for
PCI MSI by providing chip ops that only work for PCI to the irqdomain.

Not only that, but instead of cleanly having two irq_chip ops for the
very different HW programming under the MSI and MSI-X cases it is
handled with a 'if (msi_desc->msix)' on each op.

If IMS == MSI, then couldn't we conceptually have the dev->irqdomain
completely unable to handle IMS/MSI/MSI-X at all, and instead, when
the driver asks for PCI MSI access we create a new hierarchical
irqdomain and link it to a MSI chip_ops or a MSI-X chip_ops - just as
you outlined for IMS?  (again, not saying to do this, but let's ask if
that makes more sense than the current configuration)

Jason

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-03 16:41                                             ` Jason Gunthorpe
@ 2021-12-04 14:20                                               ` Thomas Gleixner
  2021-12-05 14:16                                                 ` Thomas Gleixner
  2021-12-06 14:19                                                 ` Jason Gunthorpe
  2021-12-09  6:26                                               ` Tian, Kevin
  1 sibling, 2 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-04 14:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

Jason,

On Fri, Dec 03 2021 at 12:41, Jason Gunthorpe wrote:
> On Fri, Dec 03, 2021 at 04:07:58PM +0100, Thomas Gleixner wrote:
> Let's do a thought experiment, let's say we forget about the current PCI
> MSI API.
>
> What if it worked more like this:
>
> probe()
>  // Access the real PCI SIG MSI/MSI-X table
>  mystruct->msi_table = pci_allocate_msi_table(pci_dev);
>
>  // Special interrupt 0
>  mystruct->desc0 = msi_table_alloc_vector(mystruct->msi_table, hw_label=0);
>  request_irq(mystruct->desc0->irq, ..)

A device driver should not even know about msi_desc. Period.

The code which gets the msi_desc pointer handed in is the irqdomain code
and eventually the irqchip callbacks which need to deal with it.

I just cleaned up quite some places again which fiddled with msi_desc
for the very wrong reasons. We've had subtle bugs due to that in the
past.

The time I wasted in the past to fix drivers which fiddled with msi_desc
and even with irq descriptors just because they can, makes me think hard
about encapsulations because I have absolutely no interest to do full
tree cleanups every time I need to change something in the core
code. Neither do I have interest to deal with the resulting bugs which
end up on my desk because the usual consequence is that the abuse makes
the core go belly up.

If I need to change a dozen irq chip implementations, that's a limited
scope, but chasing the abuse all over the place is insane. I'm dead
tired of it.

The driver can get some info struct which has the index and the linux
irq number in it and that is all it needs, seriously.

If C would provide access protection scopes in objects then the msi
descriptor pointer could become a cookie. But we only can document that
the driver writer has to treat it like a cookie which is equally
effective to not documenting it at all. That's unfortunately the sad
reality. So the goal is to make it harder to screw up and not to make it
easier. The only way to do that is strict encapsulation.

>  - msi_table is a general API for accessing MSIs. Each bus type
>    would provide some kind of logical creation function for their
>    bus standardized MSI table type. eg MSI/MSI-X/etc

You can't give up on that table thing, right? We just established that
for devices like yours IMS is not necessarily a table and does not even
need the index. The fact that MSI-X uses a table does not make
everything a nail^Wtable. :)

>    It is a logical handle for the physical resource that holds the MSI
>    addr/data pairs.
>
>  - Use a pointer instead of idx as the API for referring to a MSI
>    irq. Most drivers probably only use msi_desc->irq?

No. pci_irq_vector() translates the hw index to linux irq number.  The
hardware index is known by the driver when it is doing a batched
allocation, right?

There is also that magic interface which hands in an array of struct
msi_entry.  msi_entry contains the hardware index and the linux irq
number. The hardware index is filled by the driver and the linux irq
number by the core code when the allocation was successful.

Yes, it's awful from an interface point of view, but that has nothing to
do with the underlying infrastructure and the problem at hand.

I'm fine with using the info struct I described above as the reference
cookie.

>  - We do not have device->msi, it is held in the driver instead.

No. First of all drivers have no business with that, really.

It's infrastructure which the driver utilizes and that infrastructure
manages the data it requires and uses it within the infrastructure, in
this case irqdomains and the associated irq chips.

Aside from that, even if it were a good idea, someone has to
rewrite the PCI/MSI backends of powerpc, mips-octeon, s390, xen and some
more first. Along with the MSI and the PCI/MSI core.

Then create either nasty wrappers around the existing interfaces which
create those magic tables automatically and store them where? In struct
device->msi perhaps? :)

Alternatively cleanup ~360 usage sites throughout drivers and add the
same silly boilerplate code all over the place, which is wrong to begin
with. We want less boilerplate not more.

I'm sure you're volunteering or can provide the engineering capacity to
do that. Depending on the capacity this might be done in 1-10 years from
now.

>    dev->irqdomain is always the bus/platform originated irqdomain of
>    the physical device.

This is a guarantee for subtle bugs and I'm not even remotely interested
going there. See also below.

>    Thus we break the 1:1 of the device and irqdomain. A device can
>    spawn many msi_tables, but only one would be on the dev->irqdomain

Why are you trying to brute-force things into device drivers?
That's really the wrong direction. We want common infrastructure to be
managed by generic code. And all of this is about common infrastructure.

The only device specific thing which is required due to IMS is the
irqchip plus its data. Whether that irq chip implementation is part of
the device driver itself or is preferably an irq chip implementation
which can be shared and instantiated by several device types is not
making any difference here at all. It's from an encapsulation POV still
an irq chip which is handled and managed by the interrupt subsystem.

We don't need that for PCI/MSI[-X] at all, so why would we go and give
up on central managed storage and add pointless crap to drivers so that
driver writers can screw up even more?

>  - Rely on dynamic allocation instead of the awkward batch interfaces.
>    This means a msi_table starts out with 0 msi_descs and as we go
>    they are allocated, installed in the xarray, and connected to the
>    HW.

We need to preserve multi-vector batch allocations unless you have the
engineering capacity to fixup the stuff I mentioned above.

And no, you can't just make the batch allocation a wrapper loop around
single vector allocation as that will break the world as well.

There is a very good reason why I made runtime expansion of MSI-X
interrupts opt-in by the underlying architecture specific MSI
backend. IMS support has to be opt-in for the very same reason.

>  - hw_label is defined by whatever type of msi_table it is:
>    for PCI MSI this is the MSI table index
>    for IDXD IMS this is the IMS table index
>    for mlx5 memory IMS it is -1 (?) and the msi_desc is just placed
>    anywhere in the xarray. msi_desc->xarray_index can store its
>    location
>
>  - 'msi_table_alloc_vector_range' allows a simple API to get any
>    free entry from a group of fungible entries to make it clear
>    to readers the typical special/pooled HW design pattern
>
> Is it sane? What really needs device->msi anyhow?

device->msi is a container for interrupt management and that's done by
the interrupt code and not by random device drivers. Again, struct
msi_desc is a software construct which the interrupt core code and the
irqdomains and irq chip implementation use for various purposes. Nothing
outside of this realm has to even know about the fact that this data
structure exists in the first place. See below.

> IMS is a logical progression of this concept:
>
> probe()
>    mystruct->msi_table = pci_allocate_msi_table(pci_dev);
>    mystruct->ims_irqdomain = <....>
>    mystruct->ims_table = msi_allocate_table(pci_dev->dev,  mystruct->ims_irqdomain ..)
>
>    // Use MSI
>    mystruct->desc0 = msi_table_alloc_vector(mystruct->msi_table, hw_label=0);
>    // Use IMS
>    mystruct->desc1 = msi_table_alloc_vector(mystruct->ims_table, hw_label=0);
>
> Not saying we can/should go out and just do something so radical, but
> as a thought experiment, can it guide toward a direction like this?

What I agree on is that the interface should be in a way that the driver
can:

 1) Allocate vectors at a given hardware index

 2) Allocate vectors within a given hardware index range

 3) The core returns the hardware index and the Linux irq number in
    a separate info structure which is independent of the interrupt
    stack internal data representations.

But the driver interface is _not_ the problem at all.

Interrupts are fundamentally a resource pool which roots at the
CPU. There are layers on top, e.g. IOMMU, which are involved in
routing. On top of that there is the outermost interrupt chip which is
handling the device specifics.

Interrupt delivery roots at the CPU too. So the interrupt core code has
to have a consistent view of the whole chain. That requires to have
proper data representations and a consistent relationship between these
representations.

That brings me back to device->msi. This is part of the data
representation and disconnecting it makes things just harder and more
exposed to the whim of driver writers. Thanks, no.

Sure the driver can get a cookie of some sort to do allocation/free from
different resource pools, e.g. PCI-MSI[X] and IMS. But the internal data
representation management has to be done at the core level.

That means that a driver which utilizes device specific IMS should not
even try to make the irqchip/domain code mangled into the other device
code. It should create the irqdomain with the associated chip and that
creation process returns a cookie which is passed to the actual device
specific code. Allocation then can use that cookie and not the irqdomain
pointer itself.

But the driver API is the least of the problems at this point,
really. What's important is to know what kind of functionality the
drivers need.

Then you build the thing up from bottom to top, layer by layer
considering the functional requirements.

The final driver API is a convenience wrapper based on the underlying
infrastructure which hides common tasks from the driver level.

> I agree a lot with the other email you wrote that IMS == MSI - I think
> that is key.

Yes, that was clear to me from the very beginning, that all of them just
raise interrupts by writing msg.data to msg.address.

> So much of these conversations have been warped by thinking of IMS as
> some wonky thing for virtualization, or trying to focus on VFIO as the
> only use case.

Well, I was always thinking about all use cases.

The important thing about this discussion was that I had to understand
how this is used and that my naive assumption that this new approach is
just a logical consequence of splitting the devices up further into
smaller slices is not really correct.

The Physical function -> Virtual function -> Logical function view made
a lot of sense to me. And that's how some people explained it in the
earlier discussions. From that point of view having a struct device to
represent it was not that absurd, right?

So thanks for being patient in educating me here.

The fact that we will be able to support this at all is based on years
of cleanups, consolidation and restructuring of the infrastructure. The
result of doing this is that my trust in driver developers in that
regard is very close to zero. The cleanup patches I had to do for this
series just to solve the single irqdomain case did not help to elevate
the trust level either.

The common approach is: My device is special and therefore I need
special X. The core does not provide special X so I hack something up
which violates all known rules of encapsulation and layering in one go.

Reality is that this is in 99+% of the cases not true at all. If "my
device is special" would be true, then writing and maintaining an OS
which can handle this hardware zoo would be close to impossible.

That's why I'm insisting on understanding the use cases and the
underlying concepts of representation deeply. That's why I'm more
interested in data representation and the relationship of data than in
function interfaces and code.

Code and function interfaces are defined by data representation and the
relationship of data. Doing it the other way round is doomed.

> I think the other sort of tangential issue when you think on IMS ==
> MSI is how pci_msi_create_irq_domain() is basically a special case for
> PCI MSI by providing chip ops that only work for PCI to the irqdomain.
>
> Not only that, but instead of cleanly having two irq_chip ops for the
> very different HW programming under the MSI and MSI-X cases it is
> handled with a 'if (msi_desc->msix)' on each op.

I'm well aware of that. That's historical ballast which is also required
to keep the legacy architecture code alive. Same problem as explained
above.

> If IMS == MSI, then couldn't we conceptually have the dev->irqdomain
> completely unable to handle IMS/MSI/MSI-X at all, and instead, when
> the driver asks for PCI MSI access we create a new hierarchical
> irqdomain and link it to a MSI chip_ops or a MSI-X chip_ops - just as
> you outlined for IMS?  (again, not saying to do this, but let's ask if
> that makes more sense than the current configuration)

That's not really a good idea because dev->irqdomain is a generic
mechanism and not restricted to the PCI use case. Special casing it for
PCI is just wrong. Special casing it for all use cases just to please
PCI is equally wrong. There is a world outside of PCI and x86. 

But yes, dev->irqdomain _is_ part of the problem and even the root of
it. I definitely stared at it before, but for some stupid reason I
stopped staring at it. I blame it on too much email. :)

The point is that up to now we had a strict 1:1 device irqdomain
mapping, which starts at dev->irqdomain. The MSI descriptor storage is
just a consequence of that. And yes, it took a while to understand that
there is a use case for 1:2.

So I need to break that up in a way which caters for both cases, but
does neither create a special case for PCI nor for the rest of the
universe, i.e. the 1:1 case has to be a subset of the 1:2 case which
means all of it is common case. With that solved the storage question
becomes a nobrainer.

When I looked at it again yesterday while writing mail, I went back to
my notes and found the loose ends where I left off. Let me go back and
start over from there.

Thanks,

        tglx


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-04 14:20                                               ` Thomas Gleixner
@ 2021-12-05 14:16                                                 ` Thomas Gleixner
  2021-12-06 14:43                                                   ` Jason Gunthorpe
  2021-12-06 14:19                                                 ` Jason Gunthorpe
  1 sibling, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-05 14:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu, Kalle Valo

On Sat, Dec 04 2021 at 15:20, Thomas Gleixner wrote:
> On Fri, Dec 03 2021 at 12:41, Jason Gunthorpe wrote:
> So I need to break that up in a way which caters for both cases, but
> does neither create a special case for PCI nor for the rest of the
> universe, i.e. the 1:1 case has to be a subset of the 1:2 case which
> means all of it is common case. With that solved the storage question
> becomes a nobrainer.
>
> When I looked at it again yesterday while writing mail, I went back to
> my notes and found the loose ends where I left off. Let me go back and
> start over from there.

I found out why I stopped looking at it. I came from a similar point of
view what you were suggesting:

>> If IMS == MSI, then couldn't we conceptually have the dev->irqdomain
>> completely unable to handle IMS/MSI/MSI-X at all, and instead, when
>> the driver asks for PCI MSI access we create a new hierarchical
>> irqdomain and link it to a MSI chip_ops or a MSI-X chip_ops - just as
>> you outlined for IMS?  (again, not saying to do this, but let's ask if
>> that makes more sense than the current configuration)

Which I shot down with:

> That's not really a good idea because dev->irqdomain is a generic
> mechanism and not restricted to the PCI use case. Special casing it for
> PCI is just wrong. Special casing it for all use cases just to please
> PCI is equally wrong. There is a world outside of PCI and x86. 

That argument is actually only partially correct.

After studying my notes and some more code (sigh), it looks feasible
under certain assumptions, constraints and consequences.

Assumptions:

  1) The irqdomain pointer of PCI devices which is set up during device
     discovery is not used by anything else than infrastructure which
     knows how to handle it.

     Of course there is no guarantee, but I'm not that horrified about
     it anymore after chasing the exposure with yet more coccinelle
     scripts.

Constraints:

  1) This is strictly opt-in and depends on hierarchical irqdomains.

     If an architecture/subarchitecture wants to support it then it
     needs to rework their PCI/MSI backend to hierarchical irqdomains or
     make their PCI/MSI irqdomain ready for the task.

     From my inspection of 30+ PCI/MSI irqdomains, most of them should
     be trivial to convert. The hard ones are powerpc, XEN and VMD,
     where XEN is definitely the most convoluted one.

     That means that devices which depend on IMS won't work on anything
     which is not up to date.

  2) Guest support is strictly opt-in

     The underlying architecture/subarchitecture specific irqdomain has
     to detect at setup time (eventually early boot), whether the
     underlying hypervisor supports it.

     The only reasonable way to support that is the availability of
     interrupt remapping via vIOMMU, as we discussed before.

  3) IOMMU/Interrupt remapping dependency

     While IMS just works without interrupt remapping on bare metal the
     fact that there is no reliable way to figure out whether the kernel
     runs on bare metal or not, makes it pretty much mandatory, at least
     on x86.

     That's not a hardcoded constraint. It's a decision made during the
     setup of the underlying architecture/subarchitecture specific
     irqdomain.

  4) The resulting irqdomain hierarchy would ideally look like this:

     VECTOR -> [IOMMU, ROUTING, ...] -> PCI/[MSI/MSI-X/IMS] domains

     That does not work in all cases due to architecture and host
     controller constraints, so we might end up with:

           VECTOR -> IOMMU -> SHIM -> PCI/[MSI/MSI-X/IMS] domains

     The nice thing about the irqdomain hierarchy concept is that this
     does not create any runtime special cases as the base hierarchy is
     established at boot or device detection time. It's just another
     layer of indirection.

  5) The design rules for the device specific IMS irqdomains have to be
     documented and enforced to the extent possible.

     Rules which I have in my notes as of today:

       - The device specific IMS irq chip / irqdomain has to be strictly
         separated from the rest of the driver code and can only
         interact via the irq chip data which is either per interrupt or
         per device.

         I have some ideas how to enforce these things to go into
         drivers/irqchip/ so they are exposed to scrutiny and not
         burried in some "my device is special" driver code and applied
         by subsystem maintainers before anyone can even look at it. 

       - The irqchip callbacks which can be implemented by these top
         level domains are going to be restricted.

       - For the irqchip callbacks which are allowed/required the rules
         vs. following down the hierarchy need to be defined and
         enforced.

       - To achieve that the registration interface will not be based on
         struct irq_chip. This will be a new representation and the core
         will convert that into a proper irq chip which fits into the
         hierarchy. This provides one central place where the hierarchy
         requirements can be handled as they depend on the underlying
         MSI domain (IOMMU, SHIM, etc.). Otherwise any change on that
         would require to chase the IMS irqchips all over the place.

Consequences:

  1) A more radical split between legacy and hierarchical irqdomain
     code in drivers/pci/msi/ into:

       - api
       - legacy
       - irqdomain
       - shared

     That means that we are going to end up with duplicated code for
     some of the mechanisms up to the point where the stuck-in-the-mud
     parts either get converted or deleted.

  2) The device centric storage concept will stay as it does not make
     any sense to push it towards drivers and what's worse it would be a
     major pain vs. the not yet up to the task irqdomains and the legacy
     architecture backends to change that. I really have no interest in
     reworking the legacy code for that.

     It also makes sense because the interrupts are strictly tied to the
     device. They cannot originate from some disconnected layer of thin
     air.

     Sorry Jason, no tables for you. :)

How to get there:

  1) I'm going to post part 1-3 of the series once more with the fallout
     and review comments addressed.

  2) If nobody objects, I'll merge that into tip irq/msi and work on top
     of that.

     The consolidation makes sense on its own and is required anyway. I
     might need to revisit some of the already touched places, but that
     should be a reasonably limited number. I'd rather do that step by step
     on top than going back to start and mixing the new concepts in from
     the very beginning.

     But I'll drop part 4 in its current form because that's going to be
     part of the new infrastructure.

  3) I'll work on that bottom up towards a driver exposable API as that
     is going to be a result of the final requirements of the underlying
     infrastructure.

     The final driver visible interfaces can be bikeshedded on top to
     make them palatable for driver writers.

  4) Convert x86 PCI/MSI[x] to the new scheme

  5) Implement an IMS user.

     The obvious candidate which should be halfway accessible is the
     ath11k PCI driver, which falls into that category.

     It uses horrendous hackery to make it "work" by abusing MSI. It's a
     wonder that it works at all, by some definition of "works".

     I'm pretty sure how to make it fall apart without touching a single
     line of code.

     With a small code change I can make it fail hard without blowing up
     any other MSI/MSI-X user except the switchtec NTB.

     That's a prime example of how driver writers behave.

     Instead of talking to the people who are responsible for the
     interrupt subsystem, they go off and do their own thing. It does
     not explode on their test machine, but it's not even remotely close
     to the requirements for PCI drivers to work independent of the
     underlying platform.

     Of course the responsible maintainer does not even notice and waves
     it through without questioning it.

Thoughts?

Thanks,

        tglx


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-04 14:20                                               ` Thomas Gleixner
  2021-12-05 14:16                                                 ` Thomas Gleixner
@ 2021-12-06 14:19                                                 ` Jason Gunthorpe
  2021-12-06 15:06                                                   ` Thomas Gleixner
  1 sibling, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-06 14:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

On Sat, Dec 04, 2021 at 03:20:36PM +0100, Thomas Gleixner wrote:
> Jason,
> 
> On Fri, Dec 03 2021 at 12:41, Jason Gunthorpe wrote:
> > On Fri, Dec 03, 2021 at 04:07:58PM +0100, Thomas Gleixner wrote:
> > Let's do a thought experiment, let's say we forget about the current PCI
> > MSI API.

I've read both your emails, and I'll make a few small remarks on this
one, mainly to try to clearly communicate what I was thinking.

Just want to call out the above paragraph, not suggesting we do it,
but the thought exercise of what we would do if not weighed down by
endless engineering cruft is usually informative.

> > What if it worked more like this:
> >
> > probe()
> >  // Access the real PCI SIG MSI/MSI-X table
> >  mystruct->msi_table = pci_allocate_msi_table(pci_dev);
> >
> >  // Special interrupt 0
> >  mstruct->desc0 = msi_table_alloc_vector(mystruct->msi_table, hw_label=0);
> >  request_irq(mystruct->desc0, ..)
> 
> A device driver should not even know about msi_desc. Period.

Here, this wasn't really about the msi_desc, it was just about some
pointer-based handle. Call it what you will, and force it to be opaque
to the drivers.

e.g. this 'msi table' layer can just have a

struct msi_handle;

in headers, and in a C file it can be:

struct msi_handle {
   struct msi_desc desc;
};

and achieve what you've asked for.

> >  - msi_table is a general API for accessing MSIs. Each bus type
> >    would provide some kind of logical creation function for their
> >    bus standardized MSI table type. eg MSI/MSI-X/etc
> 
> You can't give up on that table thing, right? We just established that
> for devices like yours IMS is not necessarily a table and does not even
> need the index. The fact that MSI-X uses a table does not make
> everything a nail^Wtable. :)

It is just a name for the idea of a handle to a device's MSI
mechanism.

Call it 'msi_hw' or something.

My concept is that each device, integral to the device, has some ways
to operate the device's MSI storage (MSI/MSI-X/IMS/Platform), so let's
give that a name and give it a struct and an API. It is all still core
code.

Think of it as a clean separation between 'determining the addr/data
pair' and 'writing the addr/data pair to HW'.

Today you are thinking about this object as an irqdomain, irqchip, and
msi xarray - I think?

> >    It is a logical handle for the physical resource that holds the MSI
> >    addr/data pairs.
> >
> >  - Use a pointer instead of idx as the API for referring to a MSI
> >    irq. Most drivers probably only use msi_desc->irq?
> 
> No. pci_irq_vector() translates the hw index to linux irq number.  The
> hardware index is known by the driver when it is doing a batched
> allocation, right?

Yes

> I'm fine with using the info struct I described above as the reference
> cookie.

IMHO an opaque pointer to refer to the MSI is cleaner
 
> >  - We do not have device->msi, it is held in the driver instead.
> 
> No. First of all drivers have no business with that, really.

I'm reading your second email and still not entirely clear what your
thinking is for device->msi?

> >    dev->irqdomain is always the bus/platform originated irqdomain of
> >    the physical device.
> 
> This is a guarantee for subtle bugs and I'm not even remotely interested
> going there. See also below.

Not sure I follow this? My suggestion is that it is either as-is today
or NULL and we don't try to use eg mdev->irqdomain for anything.

> >    Thus we break the 1:1 of the device and irqdomain. A device can
> >    spawn many msi_tables, but only one would be on the dev->irqdomain
> 
> Why are you trying brute force to push things into device drivers?
> That's really the wrong direction. We want common infrastructure to be
> managed by generic code. And all of this is about common infrastructure.

The driver needs some kind of handle to tell the core code what MSI
'namespace' it is talking about in every API - eg MSI or IMS.

Pointers are the usual way to make such a handle.

Going along with the IMS == MSI principle, there should be a
driver-facing API with a pointer handle for the MSI and a pointer
handle for the IMS.

Yes, this means the existing API is some compat wrapper around a
pointer API and would have to store the pointer handle in the device,
but the idea would be to make that pci->dev ONLY for the compat API,
not used in the IRQ infrastructure.

> > Is it sane? What really needs device->msi anyhow?
> 
> device->msi is a container for interrupt management and that's done by
> the interrupt code and not by random device drivers. Again, struct
> msi_desc is a software construct which the interrupt core code and the
> irqdomains and irq chip implementation use for various purposes. Nothing
> outside of this realm has to even know about the fact that this data
> structure exists in the first place. See below.

I imagined that msi_desc remains part of the interrupt core code and
the 'msi_table' object is another interrupt core code object for
dealing with a device's MSI.

> > IMS is a logical progression of this concept:
> >
> > probe()
> >    mystruct->msi_table = pci_allocate_msi_table(pci_dev);
> >    mystruct->ims_irqdomain = <....>
> >    mystruct->ims_table = msi_allocate_table(pci_dev->dev,  mystruct->ims_irqdomain ..)
> >
> >    // Use MSI
> >    mystruct->desc0 = msi_table_alloc_vector(mystruct->msi_table, hw_label=0);
> >    // Use IMS
> >    mystruct->desc1 = msi_table_alloc_vector(mystruct->ims_table, hw_label=0);
> >
> > Not saying we can/should go out and just do something so radical, but
> > as a thought experiment, can it guide toward a direction like this?
> 
> What I agree on is that the interface should be in a way that the driver
> can:
> 
>  1) Allocate vectors at a given hardware index
> 
>  2) Allocate vectors within a given hardware index range
> 
>  3) The core returns the hardware index and the Linux irq number in
>     a separate info structure which is independent of the interrupt
>     stack internal data representations.

I still slightly like an opaque pointer better than a two-entry struct -
but compat is compat..

> Sure the driver can get a cookie of some sort to do allocation/free from
> different resource pools, e.g. PCI-MSI[X] and IMS. But the internal data
> representation management has to be done at the core level.

And here I still see it at the core level - this 'msi table' is a
core-level data structure and the opaque 'msi handle' is a core-level
data structure.

We are just exposing a handle that the driver can use. You are
calling it a cookie, but IMHO using a cookie instead of a pointer
seems a bit obfuscating?

> even try to make the irqchip/domain code mangled into the other device
> code. It should create the irqdomain with the associated chip and that
> creation process returns a cookie which is passed to the actual device
> specific code. Allocation then can use that cookie and not the irqdomain
> pointer itself.

Sounds like your cookie == my msi_table? Maybe we are all agreeing at
this point then?

> So thanks for being patient in educating me here.

I'm happy you got something out of all these words!
 
> The fact that we will be able to support this at all is based on years
> of cleanups, consolidation and restructuring of the infrastructure. The
> result of doing this is that my trust in driver developers in that
> regard is very close to zero. The cleanup patches I had to do for this
> series just to solve the single irqdomain case did not help to elevate
> the trust level either.

Yes, it is amazing how many patches this is at already.

Jason

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-05 14:16                                                 ` Thomas Gleixner
@ 2021-12-06 14:43                                                   ` Jason Gunthorpe
  2021-12-06 15:47                                                     ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-06 14:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu, Kalle Valo

On Sun, Dec 05, 2021 at 03:16:40PM +0100, Thomas Gleixner wrote:
> On Sat, Dec 04 2021 at 15:20, Thomas Gleixner wrote:
> > On Fri, Dec 03 2021 at 12:41, Jason Gunthorpe wrote:
> > So I need to break that up in a way which caters for both cases, but
> > does neither create a special case for PCI nor for the rest of the
> > universe, i.e. the 1:1 case has to be a subset of the 1:2 case which
> > means all of it is common case. With that solved the storage question
> > becomes a nobrainer.
> >
> > When I looked at it again yesterday while writing mail, I went back to
> > my notes and found the loose ends where I left off. Let me go back and
> > start over from there.
> 
> I found out why I stopped looking at it. I came from a similar point of
> view what you were suggesting:
> 
> >> If IMS == MSI, then couldn't we conceptually have the dev->irqdomain
> >> completely unable to handle IMS/MSI/MSI-X at all, and instead, when
> >> the driver asks for PCI MSI access we create a new hierarchical
> >> irqdomain and link it to a MSI chip_ops or a MSI-X chip_ops - just as
> >> you outlined for IMS?  (again, not saying to do this, but let's ask if
> >> that makes more sense than the current configuration)
> 
> Which I shot down with:
> 
> > That's not really a good idea because dev->irqdomain is a generic
> > mechanism and not restricted to the PCI use case. Special casing it for
> > PCI is just wrong. Special casing it for all use cases just to please
> > PCI is equally wrong. There is a world outside of PCI and x86. 
> 
> That argument is actually only partially correct.

I'm not sure I understood your reply? I think we are both agreeing
that dev->irqdomain wants to be a generic mechanism?

I'd say that today we've special cased it to handle PCI. IMHO that is
exactly what pci_msi_create_irq_domain() is doing - it replaces the
chip ops with ops that can *ONLY* do PCI MSI and so dev->irqdomain
becomes PCI only and non-generic.

> After studying my notes and some more code (sigh), it looks feasible
> under certain assumptions, constraints and consequences.
> 
> Assumptions:
> 
>   1) The irqdomain pointer of PCI devices which is set up during device
>      discovery is not used by anything else than infrastructure which
>      knows how to handle it.
> 
>      Of course there is no guarantee, but I'm not that horrified about
>      it anymore after chasing the exposure with yet more coccinelle
>      scripts.

OK


> Constraints:
> 
>   1) This is strictly opt-in and depends on hierarchical irqdomains.

OK

>      That means that devices which depend on IMS won't work on anything
>      which is not up to date.

OK - I think any pressure to get platforms to update their code is
only positive.
 
>   2) Guest support is strictly opt-in
> 
>      The underlying architecture/subarchitecture specific irqdomain has
>      to detect at setup time (eventually early boot), whether the
>      underlying hypervisor supports it.
> 
>      The only reasonable way to support that is the availability of
>      interrupt remapping via vIOMMU, as we discussed before.

This is talking about IMS specifically because of the legacy issue
where the MSI addr/data pair inside a guest is often completely fake?

>   4) The resulting irqdomain hierarchy would ideally look like this:
> 
>      VECTOR -> [IOMMU, ROUTING, ...] -> PCI/[MSI/MSI-X/IMS] domains

OK, this matches where I have come from as well
 
>      That does not work in all cases due to architecture and host
>      controller constraints, so we might end up with:
> 
>            VECTOR -> IOMMU -> SHIM -> PCI/[MSI/MSI-X/IMS] domains

OK - I don't know enough about the architecture/controller details to
imagine what SHIM is, but if it allows keeping the PCI code as purely
PCI code, then great

>   5) The design rules for the device specific IMS irqdomains have to be
>      documented and enforced to the extent possible.
> 
>      Rules which I have in my notes as of today:
> 
>        - The device specific IMS irq chip / irqdomain has to be strictly
>          separated from the rest of the driver code and can only
>          interact via the irq chip data which is either per interrupt or
>          per device.

It seems OK with the observation that IDXD and mlx5 will both need to
set 'per-interrupt' data before provisioning the IRQ

>          I have some ideas how to enforce these things to go into
>          drivers/irqchip/ so they are exposed to scrutiny and not
>          burried in some "my device is special" driver code and applied
>          by subsystem maintainers before anyone can even look at it. 

Means more modules, but OK
 
>        - The irqchip callbacks which can be implemented by these top
>          level domains are going to be restricted.

OK - I think it is great that the driver will see a special ops struct
that is 'ops for device's MSI addr/data pair storage'. It makes it
really clear what it is

>        - For the irqchip callbacks which are allowed/required the rules
>          vs. following down the hierarchy need to be defined and
>          enforced.

The driver should be the ultimate origin of the interrupt so it is
always end-point in the hierarchy, opposite the CPU?

I would hope the driver doesn't have an exposure to hierarchy?
 
>        - To achieve that the registration interface will not be based on
>          struct irq_chip. This will be a new representation and the core
>          will convert that into a proper irq chip which fits into the
>          hierarchy. This provides one central place where the hierarchy
>          requirements can be handled as they depend on the underlying
>          MSI domain (IOMMU, SHIM, etc.). Otherwise any change on that
>          would require to chase the IMS irqchips all over the place.

OK, I like this too.

So we have a new concept: 'device MSI storage ops'

Put them along with the xarray holding the msi_descs and you've got my
msi_table :)

 
>   2) The device centric storage concept will stay as it does not make
>      any sense to push it towards drivers and what's worse it would be a
>      major pain vs. the not yet up to the task irqdomains and the legacy
>      architecture backends to change that. I really have no interest to
>      make the legacy code
> 
>      It also makes sense because the interrupts are strictly tied to the
>      device. They cannot originate from some disconnected layer of thin
>      air.
> 
>      Sorry Jason, no tables for you. :)

How does the driver select which 'device MSI storage ops' it is
requesting an MSI for?

>   1) I'm going to post part 1-3 of the series once more with the fallout
>      and review comments addressed.

OK, I didn't see anything in there that was making anything harder in
this direction
 
>   5) Implement an IMS user.
> 
>      The obvious candidate which should be halfway accessible is the
>      ath11k PCI driver which falls into that category.

Aiiee:

drivers/net/wireless/ath/ath11k/pci.c:  ab_pci->msi_ep_base_data = msi_desc->msg.data;

So, we already have two in-tree PCI IMS devices!!

Agree this makes a lot of sense to focus on some first steps

Along with NTB which is in the same general camp

Thanks,
Jason

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-06 14:19                                                 ` Jason Gunthorpe
@ 2021-12-06 15:06                                                   ` Thomas Gleixner
  0 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-06 15:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

Jason,

On Mon, Dec 06 2021 at 10:19, Jason Gunthorpe wrote:
> On Sat, Dec 04, 2021 at 03:20:36PM +0100, Thomas Gleixner wrote:
>> even try to make the irqchip/domain code mangled into the other device
>> code. It should create the irqdomain with the associated chip and that
>> creation process returns a cookie which is passed to the actual device
>> specific code. Allocation then can use that cookie and not the irqdomain
>> pointer itself.
>
> Sounds like your cookie == my msi_table? Maybe we are all agreeing at
> this point then?

I think so. It's going to be something the driver can use as a
reference. Same for the actual interrupt allocated through this domain
reference.

>> So thanks for being patient in educating me here.
>
> I'm happy you got something out of all these words!

Definitely so. That's why we are having these discussions, right?

The shiny irq subsystem is not so shiny when the drivers cannot use
it. OTOH, it's often enough the case that driver folks want to use it
the wrong way just because.

> Yes, it is amazing how many patches this is at already.

Don't worry. You'll get a few more patch bombs in your inbox until IMS
is supported, unless you want to be "unsubscribed".

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-06 14:43                                                   ` Jason Gunthorpe
@ 2021-12-06 15:47                                                     ` Thomas Gleixner
  2021-12-06 17:00                                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-06 15:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu, Kalle Valo

Jason,

On Mon, Dec 06 2021 at 10:43, Jason Gunthorpe wrote:
> On Sun, Dec 05, 2021 at 03:16:40PM +0100, Thomas Gleixner wrote:
>> > That's not really a good idea because dev->irqdomain is a generic
>> > mechanism and not restricted to the PCI use case. Special casing it for
>> > PCI is just wrong. Special casing it for all use cases just to please
>> > PCI is equally wrong. There is a world outside of PCI and x86. 
>> 
>> That argument is actually only partially correct.
>
> I'm not sure I understood your reply? I think we are both agreeing
> that dev->irqdomain wants to be a generic mechanism?

Yes. I managed to confuse myself there by being too paranoid about how
to distinguish things on platforms which need to support both ways, i.e.
x86 when XEN is enabled.

> I'd say that today we've special cased it to handle PCI. IMHO that is
> exactly what pci_msi_create_irq_domain() is doing - it replaces the
> chip ops with ops that can *ONLY* do PCI MSI and so dev->irqdomain
> becomes PCI only and non-generic.

Right. See above. That's why I went back to my notes, did some more
research ...

>>   2) Guest support is strictly opt-in
>> 
>>      The underlying architecture/subarchitecture specific irqdomain has
>>      to detect at setup time (eventually early boot), whether the
>>      underlying hypervisor supports it.
>> 
>>      The only reasonable way to support that is the availability of
>>      interrupt remapping via vIOMMU, as we discussed before.
>
> This is talking about IMS specifically because of the legacy issue
> where the MSI addr/data pair inside a guest is often completely fake?

This is about IMS, right. PCI/MSI[x] is handled today because the writes
to the MSI/MSI-X message store can be trapped.

>>      That does not work in all cases due to architecture and host
>>      controller constraints, so we might end up with:
>> 
>>            VECTOR -> IOMMU -> SHIM -> PCI/[MSI/MSI-X/IMS] domains
>
> OK - I don't know enough about the architecture/controller details to
> imagine what SHIM is, but if it allows keeping the PCI code as purely
> PCI code, then great

It's today part of the arch/subarch specific PCI/MSI domain, dealing
with quirks above the IOMMU level. As we can't proliferate that into the
new endpoint domain, it needs to be done as a shim layer in between
which has no real functionality other than applying the quirks. Yes,
it's all pretty. Welcome to my wonderful world.

>>        - The irqchip callbacks which can be implemented by these top
>>          level domains are going to be restricted.
>
> OK - I think it is great that the driver will see a special ops struct
> that is 'ops for device's MSI addr/data pair storage'. It makes it
> really clear what it is

It will need some more than that, e.g. mask/unmask and as we discussed
quite some time ago something like the irq_buslock/unlock pair, so you
can handle updates to the state from thread context via a command queue
(IIRC).

>>        - For the irqchip callbacks which are allowed/required the rules
>>          vs. following down the hierarchy need to be defined and
>>          enforced.
>
> The driver should be the ultimate origin of the interrupt so it is
> always end-point in the hierarchy, opposite the CPU?
>
> I would hope the driver doesn't have an exposure to hierarchy?

No.
  
> So we have a new concept: 'device MSI storage ops'
>
> Put them along with the xarray holding the msi_descs and you've got my
> msi_table :)

Hehe.
  
>>      Sorry Jason, no tables for you. :)
>
> How does the driver select which 'device MSI storage ops' it is
> requesting an MSI for?

Via some cookie, reference whatever as discussed in the other
mail. We'll bikeshed the naming once I get there :)

>>   1) I'm going to post part 1-3 of the series once more with the fallout
>>      and review comments addressed.
>
> OK, I didn't see anything in there that was making anything harder in
> this direction

It's helping to keep the existing stuff including the !irqdomain parts
sufficiently self contained so I can actually change the inner workings
of msi domains without going back to any of these places (hopefully).
  
>>   5) Implement an IMS user.
>> 
>>      The obvious candidate which should be halfway accessible is the
>>      ath11k PCI driver which falls into that category.
>
> Aiiee:

Yes.

> drivers/net/wireless/ath/ath11k/pci.c:  ab_pci->msi_ep_base_data = msi_desc->msg.data;

That's only one part of it. Look how the address is retrieved.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-06 15:47                                                     ` Thomas Gleixner
@ 2021-12-06 17:00                                                       ` Jason Gunthorpe
  2021-12-06 20:28                                                         ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-06 17:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu, Kalle Valo

On Mon, Dec 06, 2021 at 04:47:58PM +0100, Thomas Gleixner wrote:

> >>        - The irqchip callbacks which can be implemented by these top
> >>          level domains are going to be restricted.
> >
> > OK - I think it is great that the driver will see a special ops struct
> > that is 'ops for device's MSI addr/data pair storage'. It makes it
> > really clear what it is
> 
> It will need some more than that, e.g. mask/unmask and as we discussed
> quite some time ago something like the irq_buslock/unlock pair, so you
> can handle updates to the state from thread context via a command queue
> (IIRC).

Yes, I was thinking about all of that in here.

Let me ask a slightly different question

pci_msi_create_irq_domain() hooks into the platform's irq_chip as an
alternative to hierarchical irq domains (?)

eg:

		chip->irq_write_msi_msg = pci_msi_domain_write_msg;

And we see:

void pci_msi_domain_write_msg(struct irq_data *irq_data, struct msi_msg *msg)
{
	struct msi_desc *desc = irq_data_get_msi_desc(irq_data);

Now, if we have your idea to have some:

struct msi_storage_ops {
   write_msg;
   mask;
   unmask;
   lock;
   unlock;
};

Which is how a driver plugs in its storage operations.

In almost all cases 'ops' will come along with a 'state', so lets
create one:

struct msi_storage {  // Look, I avoided the word table!
   const struct msi_storage_ops *ops;
};

ie:

struct msi_storage_ops {
   void (*write_msg)(struct msi_storage *msi, struct msi_desc *desc, struct msi_msg *msg);
   ...
};

Now, what if we made a 'generic_msi_create_irq_domain()' that hooks
the irq_chip with something like this:

void generic_msi_domain_write_msg(struct irq_data *irq_data, struct msi_msg *msg)
{
	struct msi_desc *desc = irq_data_get_msi_desc(irq_data);
	struct msi_storage *msi = desc->storage;

	msi->ops->write_msg(msi, desc, msg);
}

And then what pci_msi_domain_write_msg() did is now accomplished by
having desc->storage set to a pci_msi_storage or pci_msix_storage
whose msi_storage_ops point at pci_msi_domain_write_msg etc.

Then we can transform:

void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
	struct pci_dev *dev = msi_desc_to_pci_dev(entry);

into

struct pci_msi_storage {
        struct msi_storage storage;
        struct pci_dev *dev;
	unsigned int msi_cap_off;
};

void pci_write_msi64_msg(struct msi_storage *storage, struct msi_desc *desc, struct msi_msg *msg)
{
	struct pci_msi_storage *msi = container_of(storage, struct pci_msi_storage, storage);
	unsigned int pos = msi->msi_cap_off;
	struct pci_dev *dev = msi->dev;
	u16 msgctl;

	pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
	msgctl &= ~PCI_MSI_FLAGS_QSIZE;
	msgctl |= desc->msi_attrib.multiple << 4;
	pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl);

	pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_LO, msg->address_lo);
	pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_HI, msg->address_hi);
	pci_write_config_word(dev, pos + PCI_MSI_DATA_64, msg->data);
	/* Ensure that the writes are visible in the device */
	pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
	desc->msg = *msg; // in core code instead ?
}

And others for the different cases. Look no ifs!

OK?

Now, we have some duplication between the struct msi_storage_ops and
the struct irq_chip. Let's see what that is about:

arch/x86/kernel/apic/msi.c:     .irq_write_msi_msg      = dmar_msi_write_msg,
arch/x86/kernel/hpet.c: .irq_write_msi_msg = hpet_msi_write_msg,

Surprise! These are actually IMS. The HPET and DMAR devices both have
device-specific message storage! So these could use
msi_storage_ops. And WTF is IOMMU DMAR driver code doing in
apic/msi.c ???

arch/powerpc/platforms/pseries/msi.c:   .irq_write_msi_msg      = pseries_msi_write_msg,

AFAICT this is really like virtualization? The hypervisor is
controlling the real MSI table and what the OS sees is faked out
somewhat.

This is more of a quirk in the PCI MSI implementation (do not touch the
storage) and a block on non-PCI uses of MSI, similar to what x86 needs?

drivers/irqchip/irq-gic-v2m.c:  .irq_write_msi_msg      = pci_msi_domain_write_msg,
drivers/irqchip/irq-gic-v3-its-pci-msi.c:       .irq_write_msi_msg      = pci_msi_domain_write_msg,
drivers/irqchip/irq-gic-v3-mbi.c:       .irq_write_msi_msg      = pci_msi_domain_write_msg,

ARM seems to be replacing the 'mask at source' with 'mask at
destination' - I wonder why?

Should this really be hierarchical where we mask *both* the MSI
originating device (storage_ops->mask) and at the CPU IRQ controller?
(gicv2m_mask_msi_irq ?) if it can?

drivers/base/platform-msi.c:            chip->irq_write_msi_msg = platform_msi_write_msg;

Oh! this is doing what I kind of just suggested, just non-generically
and hacked into platform bus drivers the same as PCI does:

static void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
{
	struct msi_desc *desc = irq_data_get_msi_desc(data);
	struct platform_msi_priv_data *priv_data;

	priv_data = desc->platform.msi_priv_data;

	priv_data->write_msg(desc, msg);
}

platform_msi entirely gets deleted. Instead all platform drivers using
it will use IMS - set up a msi_storage_ops with
storage_ops->write_msg == platform_msi_priv_data::write_msg and
allocate a msi_storage someplace.

So, at least to first order, we could have a world where the
irq_chip does not overlap struct msi_storage_ops - i.e. we move the MSI
related things from irq_chip to msi_storage_ops.

Then, all the core code places calling into chip->[msi] would instead
find the msi_storage_ops from the irq_data, i.e. like (or better):

  irq_data->desc->storage->ops

Now we have a real clear split of responsibility.

The irq_domain hierarchy and irq_chip are all about classic wired
interrupts and the interrupt controller path after the device launches
the message. For MSI its job is to determine the addr/data.

msi_storage_ops is all about interrupt origin points that can hold an
addr/data pair. Its job is to store the addr/data into the device
specific registers, do mask-at-source, locking etc.

PCI, HPET, DMAR move to msi_storage_ops instead of using irq_chip

Then you get what you asked for at the start: the platform's irq_domain
is now fully generic and can be 1:N to different MSI storage providers.

For API compat every pci struct device will have to instantiate a
msi_storage someplace, but that seems easy enough.

Seems like a nice uniform solution?

Jason

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-06 17:00                                                       ` Jason Gunthorpe
@ 2021-12-06 20:28                                                         ` Thomas Gleixner
  2021-12-06 21:06                                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-06 20:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu, Kalle Valo

Jason,

On Mon, Dec 06 2021 at 13:00, Jason Gunthorpe wrote:
> On Mon, Dec 06, 2021 at 04:47:58PM +0100, Thomas Gleixner wrote:
>> It will need some more than that, e.g. mask/unmask and as we discussed
>> quite some time ago something like the irq_buslock/unlock pair, so you
>> can handle updates to the state from thread context via a command queue
>> (IIRC).
>
> pci_msi_create_irq_domain() hooks into the platforms irq_chip as an
> alternative to hierarchical irq domains (?)

It's based on hierarchical irqdomains. It's the outermost domain and irq
chip. On X86:

  [VECTOR chip=apic]->[PCI chip=PCI]
or
  [VECTOR chip=apic]->[IOMMU chip=IR]->[PCI chip=PCI-IR]

The chips are different because of quirks. See below :)

>		chip->irq_write_msi_msg = pci_msi_domain_write_msg;
> In almost all cases 'ops' will come along with a 'state', so lets
> create one:
>
> struct msi_storage {  // Look, I avoided the word table!

I already made a note that I need to smuggle a table in somewhere :)

> And others for the different cases. Look no ifs!
>
> OK?

That's already the plan in some form, but there's a long way towards
that. See below.

Also there will be a question of how many different callbacks we're
going to create just to avoid one conditional. At some point this might
become silly.

> Now, we have some duplication between the struct msi_storage_ops and
> the struct irq_chip. Let's see what that is about:
>
> arch/x86/kernel/apic/msi.c:     .irq_write_msi_msg      = dmar_msi_write_msg,
> arch/x86/kernel/hpet.c: .irq_write_msi_msg = hpet_msi_write_msg,
>
> Surprised! These are actually IMS. The HPET and DMAR devices both have
> device-specific message storage! So these could use
> msi_storage_ops. And WTF is IOMMU DMAR driver code doing in
> apic/msi.c ???

Historical reasons coming from the pre-irqdomain era. Also DMAR needs
direct access to the x86 low level composer which we didn't want to
expose. Plus DMAR is shared with ia64 to make it more interesting.

Yes, they can be converted. But that's the least of my worries. Those
are straightforward and not really relevant for the design.

> arch/powerpc/platforms/pseries/msi.c:   .irq_write_msi_msg      = pseries_msi_write_msg,
>
> AFAICT this is really like virtualization? The hypervisor is
> controlling the real MSI table and what the OS sees is faked out
> somewhat.
>
> This is more of quirk in the PCI MSI implementation (do not touch the
> storage) and a block on non-PCI uses of MSI similar to what x86 needs?

There is an underlying hypervisor of some sort, and that stuff needs to
deal with it. I leave that to the powerpc wizards to sort out.

> drivers/irqchip/irq-gic-v2m.c:  .irq_write_msi_msg      = pci_msi_domain_write_msg,
> drivers/irqchip/irq-gic-v3-its-pci-msi.c:       .irq_write_msi_msg      = pci_msi_domain_write_msg,
> drivers/irqchip/irq-gic-v3-mbi.c:       .irq_write_msi_msg      = pci_msi_domain_write_msg,
>
> ARM seems to be replacing the 'mask at source' with 'mask at
> destination' - I wonder why?

Because the majority of PCI/MSI endpoint implementations do not provide
masking...

We've been telling hardware people for 15+ years that this is a horrible
idea, but it's as effective as talking to a wall. Sure, the spec allows
them to make my life miserable...

> Should this really be hierarchical where we mask *both* the MSI
> originating device (storage_ops->mask) and at the CPU IRQ controller?
> (gicv2m_mask_msi_irq ?) if it can?

I wish I could mask underneath for some stuff on x86. Though that would
not help with the worst problem vs. affinity settings. See the horrible
dance in:

    x86/kernel/apic/msi.c::msi_set_affinity()

So this will end up with a shim as the base domain for !IOMMU systems:

		 			       |--[HPET]
  [VECTOR chip=apic]-|--[x86-msi chip=x86-msi]-|--[PCI/MSI]
		     |--[DMAR]		       |--[PCI/MSI-X]

That nonsense can't move into the apic domain set_affinity() callback as
this is not required when interrupt remapping is enabled.

With an IOMMU it then looks like this:

		 		        |--[HPET]
  [VECTOR chip=apic]-|--[IOMMU chip=IR]-|--[PCI/MSI]
		     |--[DMAR]	        |--[PCI/MSI-X]
		 		        |--[PCI/IMS]

> drivers/base/platform-msi.c:            chip->irq_write_msi_msg = platform_msi_write_msg;
>
> Oh! this is doing what I kind of just suggested, just non-generically
> and hacked into platform bus drivers the same as PCI does:

Correct. It's a hack and it's on the list of things which need to
vanish. I'd already been discussing that with Marc on the side for quite
a while.

> PCI, HPET, DMAR move to msi_storage_ops instead of using irq_chip

With different parent domains. DMAR always hangs directly off the vector
domain. HPET has its own IOMMU zone.

You forgot the IO/APIC, which is an MSI endpoint too, just more
convoluted, but it's not using MSI domains so it's not in the way. I'm
not going to touch that with a ten foot pole. :)

There's also VMD, HyperV and the XEN crime which is a horrible shim to
make device->msi_domain consistent on x86. For fixing XEN properly I'm
not masochistic enough.

> For API compat every pci struct device will have to instantiate a
> msi_storage someplace, but that seems easy enough.

That's easy to hide in the existing driver interfaces for PCI/MSI and
PCI/MSI-X.

> Seems like a nice uniform solution?

That's where I'm heading.

I have a full inventory of the various horrors, so I have a pretty good
picture of what kind of oddities are involved, where a shim domain is
required, and which underlying platforms require the MSI irq chip to do:

    irq_chip_mask()
       msi_ops->mask()
       parent->chip->mask()

and so forth. I need to figure out how the parent irq chip / irqdomain
transports that requirement.
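
That chained mask sequence can be modeled as a toy program: the MSI chip first masks at the device's message storage, then asks the parent chip (e.g. the CPU-side interrupt controller) to mask at the destination. Illustrative types only, not kernel code:

```c
#include <assert.h>

/* Observable side effects of the two mask points. */
static int dev_masked, parent_masked;

struct toy_chip {
	void (*mask)(struct toy_chip *chip);
	struct toy_chip *parent;
};

/* Parent chip: mask at the destination (CPU interrupt controller). */
static void parent_do_mask(struct toy_chip *chip)
{
	(void)chip;
	parent_masked = 1;
}

/* MSI chip: mask at the source first, then delegate up the hierarchy. */
static void msi_chip_mask(struct toy_chip *chip)
{
	dev_masked = 1;				/* msi_ops->mask() */
	if (chip->parent && chip->parent->mask)
		chip->parent->mask(chip->parent);	/* parent->chip->mask() */
}

static struct toy_chip parent_chip = { .mask = parent_do_mask };
static struct toy_chip msi_chip = {
	.mask   = msi_chip_mask,
	.parent = &parent_chip,
};
```

On a platform whose endpoints cannot mask, only the parent step would do anything; the open question in the thread is how the parent domain advertises that requirement.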

But that part is not where the real work is. I'll get there eventually
once I sorted the underlying parts:

   - Building up the infrastructure in kernel/irq/

   - Decomposing pci/msi/* further

   - Make use of the infrastructure for an alternate pci/msi
     implementation.
   
   - Have a transition mechanism to convert one part at a time to keep
     the patch sizes reviewable and the whole mess bisectable.

Doing all this while keeping the full universe of legacy/arch and
current PCI/MSI domains alive makes it interesting. I broke the world in
the past, so I'm not afraid of doing it again. Though I try to avoid it
to the extent possible. :)

I have a pretty good picture in my head and notes already, which needs
to be dumped into code. But let me post part 1-3 V2 first, so that pile
gets out of the way. Not having to juggle 90 patches makes life easier.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 141+ messages in thread

* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-06 20:28                                                         ` Thomas Gleixner
@ 2021-12-06 21:06                                                           ` Jason Gunthorpe
  2021-12-06 22:21                                                             ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-06 21:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu, Kalle Valo

On Mon, Dec 06, 2021 at 09:28:47PM +0100, Thomas Gleixner wrote:

> That's already the plan in some form, but there's a long way towards
> that. See below.

Okay, then I think we are thinking the same sorts of things; it is
good to see.
 
> Also there will be a question of how many different callbacks we're
> going to create just to avoid one conditional. At some point this might
> become silly.

Yes, but I think the overall point is that we could have ops that do
exactly what they need, and we can choose, when we get there, which
makes the most sense. I.e. the 64 vs 32 split is probably silly.

> > Surprised! These are actually IMS. The HPET and DMAR devices both
> > have device-specific message storage! So these could use
> > msi_storage_ops. And WTF is IOMMU DMAR driver code doing in
> > apic/msi.c ???
> 
> Historical reasons coming from the pre-irqdomain era. Also DMAR needs
> direct access to the x86 low level message composer, which we didn't
> want to expose. Plus DMAR is shared with ia64, which makes it more
> interesting.
>
> Yes, they can be converted. But that's the least of my worries. Those
> are straightforward and not really relevant for the design.
> 
> > arch/powerpc/platforms/pseries/msi.c:   .irq_write_msi_msg      = pseries_msi_write_msg,
> >
> > AFAICT this is really like virtualization? The hypervisor is
> > controlling the real MSI table and what the OS sees is faked out
> > somewhat.
> >
> > This is more of quirk in the PCI MSI implementation (do not touch the
> > storage) and a block on non-PCI uses of MSI similar to what x86 needs?
> 
> There is an underlying hypervisor of some sort, and that stuff needs to
> deal with it. I leave that to the powerpc wizards to sort out.
> 
> > drivers/irqchip/irq-gic-v2m.c:  .irq_write_msi_msg      = pci_msi_domain_write_msg,
> > drivers/irqchip/irq-gic-v3-its-pci-msi.c:       .irq_write_msi_msg      = pci_msi_domain_write_msg,
> > drivers/irqchip/irq-gic-v3-mbi.c:       .irq_write_msi_msg      = pci_msi_domain_write_msg,
> >
> > ARM seems to be replacing the 'mask at source' with 'mask at
> > destination' - I wonder why?
> 
> Because the majority of PCI/MSI endpoint implementations do not provide
> masking...
> 
> We've been telling hardware people for 15+ years that this is a horrible
> idea, but it's as effective as talking to a wall. Sure, the spec allows
> them to make my life miserable...
> 
> > Should this really be hierarchical where we mask *both* the MSI
> > originating device (storage_ops->mask) and at the CPU IRQ controller?
> > (gicv2m_mask_msi_irq ?) if it can?
> 
> I wish I could mask underneath for some stuff on x86. Though that would
> not help with the worst problem vs. affinity settings. See the horrible
> dance in:

My thinking here is that this stuff in ARM is one of the different
cases (i.e. not using MSI_FLAG_USE_DEF_CHIP_OPS), and I guess we can
just handle it cleanly by having the core call both the irq_chip->mask
and the msi_storage_ops->mask; then we don't need ARM to be different,
and x86 just won't provide a mask-at-destination op.

>     x86/kernel/apic/msi.c::msi_set_affinity()

Okay, so it is complicated, but it is just calling
   irq_data_get_irq_chip(irqd)->irq_write_msi_msg(irqd, msg);

So, from a msi_storage_ops perspective, things are still clean.

> > PCI, HPET, DMAR move to msi_storage_ops instead of using irq_chip
> 
> With different parent domains. DMAR always hangs directly off the vector
> domain. HPET has its own IOMMU zone.
> 
> You forgot the IO/APIC, which is an MSI endpoint too, just more
> convoluted, but it's not using MSI domains so it's not in the way. I'm
> not going to touch that with a ten foot pole. :)

I left off the IOAPIC because I view it as conceptually different. I
used the phrase "device that originated the interrupt" deliberately; the
IOAPIC is just a middle box that converts from a physical interrupt line
to the message world, so it belongs with the physical interrupt
infrastructure.

Possibly the IOAPIC considerations are what motivated some of this to
look the way it does today, because it really was trying to hide MSI
under normal PCI INTx physical pins with full compatibility. We kind
of kept doing that as MSI grew into its own thing.

> > Seems like a nice uniform solution?
> 
> That's where I'm heading.

Okay, it looks like a lot TBH, but I think the direction is clean and
general.

I'm curious to see if you end up with irq_domains and irq_chips along
with what I labeled as the msi_storage above, or if those turn out to
be unnecessary for the driver to provide MSI programming.

Also, if msi_storage_ops can be robust enough that you'd be comfortable
with it in a driver .c file and just a regex match in the MAINTAINERS
file :)

>    - Have a transition mechanism to convert one part at a time to keep
>      the patch sizes reviewable and the whole mess bisectable.

This seems difficult all on its own...

> I have a pretty good picture in my head and notes already, which needs
> to be dumped into code. But let me post part 1-3 V2 first, so that pile
> gets out of the way. Not having to juggle 90 patches makes life easier.

Okay! Thanks

Jason


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-06 21:06                                                           ` Jason Gunthorpe
@ 2021-12-06 22:21                                                             ` Thomas Gleixner
  0 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-06 22:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Kevin Tian, Megha Dey, Ashok Raj, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Dave Jiang, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu, Kalle Valo

On Mon, Dec 06 2021 at 17:06, Jason Gunthorpe wrote:
> On Mon, Dec 06, 2021 at 09:28:47PM +0100, Thomas Gleixner wrote:
>> I wish I could mask underneath for some stuff on x86. Though that would
>> not help with the worst problem vs. affinity settings. See the horrible
>> dance in:
>
> My thinking here is that this stuff in ARM is one of the different
> cases (i.e. not using MSI_FLAG_USE_DEF_CHIP_OPS), and I guess we can
> just handle it cleanly by having the core call both the irq_chip->mask
> and the msi_storage_ops->mask; then we don't need ARM to be different,
> and x86 just won't provide a mask-at-destination op.
>
>>     x86/kernel/apic/msi.c::msi_set_affinity()
>
> Okay, so it is complicated, but it is just calling
>    irq_data_get_irq_chip(irqd)->irq_write_msi_msg(irqd, msg);
>
> So, from a msi_storage_ops perspective, things are still clean.

Yes.

>> You forgot the IO/APIC, which is an MSI endpoint too, just more
>> convoluted, but it's not using MSI domains so it's not in the way. I'm
>> not going to touch that with a ten foot pole. :)
>
> I left off the IOAPIC because I view it as conceptually different. I
> used the phrase "device that originated the interrupt" deliberately; the
> IOAPIC is just a middle box that converts from a physical interrupt line
> to the message world, so it belongs with the physical interrupt
> infrastructure.

I mentioned it because there is mbigen on arm64, which is the same
thing: it translates hundreds of wire inputs into MSIs. It's even worse
than the IO/APIC. There is a horrible hack to make it "work", which Marc
and I are looking at to see whether we can kill it on the way.

> Possibly the IOAPIC considerations are what motivated some of this to
> look the way it does today, because it really was trying to hide MSI
> under normal PCI INTx physical pins with full compatibility. We kind
> of kept doing that as MSI grew into its own thing.

Not really. It was more to avoid having a completely separate
infrastructure for irqdomain-based MSI[X]. Laziness and lack of time
added the rest of the non-motivation :)

> I'm curious to see if you end up with irq_domains and irq_chips along
> with what I labeled as the msi_storage above, or if those turn out to
> be unnecessary for the driver to provide MSI programming.

I can't avoid irq chips, because from the interrupt handling point of
view they are unavoidable unless we create a duplicate zoo there. What I
have in mind is to convert the msi ops provided by the device driver
into a real chip, as that just falls into place without further changes.

The irqdomain will be real as well just to make things consistent and to
share as much code as possible.

> Also, if msi_storage_ops can be robust enough that you'd be comfortable
> with it in a driver .c file and just a regex match in the MAINTAINERS
> file :)

That might work. Let's see when we are there.

>>    - Have a transition mechanism to convert one part at a time to keep
>>      the patch sizes reviewable and the whole mess bisectable.
>
> This seems difficult all on its own..

I've done that before. It just needs some thought.

Thanks,

        tglx


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-01 21:44                                 ` Thomas Gleixner
  2021-12-01 21:49                                   ` Dave Jiang
@ 2021-12-09  5:23                                   ` Tian, Kevin
  2021-12-09  8:37                                     ` Thomas Gleixner
  1 sibling, 1 reply; 141+ messages in thread
From: Tian, Kevin @ 2021-12-09  5:23 UTC (permalink / raw)
  To: Thomas Gleixner, Jiang, Dave, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Thursday, December 2, 2021 5:45 AM
> 
> On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:
> > On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
> >>> The hardware implementation does not have enough MSIX vectors for
> >>> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
> >>> vectors. So if we are to do MSI-X for all of them, then we need to do
> >>> the IMS backed MSIX scheme rather than passthrough IMS to guests.
> >> Confused. Are you talking about passing a full IDXD device to the guest
> >> or about passing a carved out subdevice, aka. queue?
> >
> > I'm talking about carving out a subdevice. I had the impression of you
> > wanting IMS passed through for all variations. But it sounds like for a
> > sub-device, you are ok with the implementation of MSIX backed by IMS?
> 
> I don't see anything wrong with that. A subdevice is its own entity and
> VFIO can choose the most convenient representation of it to the guest,
> obviously.
> 
> How that is backed on the host does not really matter. You can expose
> MSI-X to the guest with an INTx backing as well.
> 

Agree with this point. How the interrupts are represented to the guest
is orthogonal to how the backend resource is allocated. Physically MSI-X
and IMS can be enabled simultaneously on an IDXD device. Once dynamic
allocation is allowed for both, either one can be allocated for a
subdevice (the only difference being the number of supported subdevices).

When an interrupt resource is exposed to the guest with the same type
(e.g. MSI-on-MSI or IMS-on-IMS), it can also be passed through to the
guest as long as hypercall machinery is in place to get the addr/data
pair from the host (as you suggested earlier).
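
Such a hypercall flow could be sketched as a toy model (hypothetical names throughout; the real machinery would live in the vIOMMU/IR implementation, as discussed later in the thread):

```c
#include <assert.h>
#include <errno.h>

/* Toy model: the guest cannot compose a remappable MSI message itself,
 * so it asks the host (via a hypothetical hypercall) for the addr/data
 * pair to program into the passed-through interrupt storage. */

struct toy_msg { unsigned int addr, data; };

/* Host-side budget of backing entries; shortage must be reportable. */
static int host_entries_left = 2;

/* Host side of the hypothetical hypercall. */
static int hv_get_msi_msg(unsigned int handle, struct toy_msg *msg)
{
	if (host_entries_left <= 0)
		return -ENOSPC;		/* host-side resource shortage */
	host_entries_left--;
	msg->addr = 0xfee00000u;	/* illustrative address */
	msg->data = 0x4000u + handle;	/* illustrative per-entry data */
	return 0;
}
```

The important property is the error return: the guest gets a usable addr/data pair or a definite failure, rather than silently programming a message the host cannot back.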

Thanks
Kevin


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-02 13:55                                 ` Jason Gunthorpe
  2021-12-02 14:23                                   ` Greg Kroah-Hartman
  2021-12-02 19:25                                   ` Thomas Gleixner
@ 2021-12-09  5:41                                   ` Tian, Kevin
  2021-12-09  5:47                                     ` Jason Wang
  2 siblings, 1 reply; 141+ messages in thread
From: Tian, Kevin @ 2021-12-09  5:41 UTC (permalink / raw)
  To: Jason Gunthorpe, Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Jiang, Dave, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, December 2, 2021 9:55 PM
> 
> Further, there is no reason why IMS should be reserved exclusively for
> VFIO! 

This is correct. Just as you agreed with Thomas, the only difference
between IMS and MSI is where the messages are stored. Physically it is
unreasonable for the HW to check whether an interrupt is used for a
specific software purpose (e.g. virtualization), since it doesn't have
such knowledge. At most an entry is associated with a PASID, but again
the PASID can be used anywhere.

The wording in the current IDXD spec is not clear on this part. We'll
talk to the spec owner to have it fixed.

Thanks
Kevin


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-09  5:41                                   ` Tian, Kevin
@ 2021-12-09  5:47                                     ` Jason Wang
  0 siblings, 0 replies; 141+ messages in thread
From: Jason Wang @ 2021-12-09  5:47 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Thomas Gleixner, Logan Gunthorpe, LKML,
	Bjorn Helgaas, Marc Zygnier, Alex Williamson, Dey, Megha, Raj,
	Ashok, linux-pci, Greg Kroah-Hartman, Jon Mason, Jiang, Dave,
	Allen Hubbe, linux-ntb, linux-s390, Heiko Carstens,
	Christian Borntraeger, x86, Joerg Roedel, iommu

On Thu, Dec 9, 2021 at 1:41 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, December 2, 2021 9:55 PM
> >
> > Further, there is no reason why IMS should be reserved exclusively for
> > VFIO!
>
> This is correct. Just as you agreed with Thomas, the only difference
> between IMS and MSI is where the messages are stored. Physically it is
> unreasonable for the HW to check whether an interrupt is used for a
> specific software purpose (e.g. virtualization), since it doesn't have
> such knowledge. At most an entry is associated with a PASID, but again
> the PASID can be used anywhere.

Note that vDPA had the plan to use IMS for the parent, which can have a
huge number of instances.

Thanks

>
> The wording in current IDXD spec is not clear on this part. We'll talk to
> the spec owner to have it fixed.
>
> Thanks
> Kevin
>



* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-03 16:41                                             ` Jason Gunthorpe
  2021-12-04 14:20                                               ` Thomas Gleixner
@ 2021-12-09  6:26                                               ` Tian, Kevin
  2021-12-09  9:03                                                 ` Thomas Gleixner
  1 sibling, 1 reply; 141+ messages in thread
From: Tian, Kevin @ 2021-12-09  6:26 UTC (permalink / raw)
  To: Jason Gunthorpe, Thomas Gleixner
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Jiang, Dave, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, December 4, 2021 12:41 AM
> 
> > Or has each queue and controlblock and whatever access to a shared large
> > array where the messages are stored and the indices are handed out to
> > the queues and controlblocks?
> 
> > If each of them have their own small array, then queue relative indexing
> > makes a ton of sense, no?
> 
> Okay, I see.
> 
> I don't know of any use case for more than one interrupt on a queue,
> and if it did come up I'd probably approach it by making the queue
> handle above also specify the 'queue relative HW index'
> 

We have such use case with IDXD.

Basically the IDXD queue allows software to put an interrupt handle
(the index of an MSI-X or IMS entry) in the submitted descriptor. Upon
completion of the descriptor the hardware finds the specified entry
and then generates an interrupt to notify software.

Conceptually, descriptors submitted to the same queue can use different
handles, implying that one queue can be associated with multiple
interrupts.

One example is the shared work queue usage, which allows multiple
clients to directly and simultaneously submit descriptors to the
same queue by using the ENQCMD(pasid, descriptor) instruction. In this
case each client can be allocated an interrupt entry (including
information about the client's PASID for a permission check when the
HW generates the completion interrupt) and then use this entry for
all descriptors submitted by that client.

I haven't finished reading this thread, but I'd like to point out this
usage so it is not ignored in the final rework. It basically means one
queue might be associated with multiple interrupt entries and multiple
PASIDs. 😊
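
That descriptor-relative interrupt selection can be sketched as a toy model: one shared queue, per-descriptor (PASID, handle) pairs, completion firing whichever entry the descriptor named. Purely illustrative, not the IDXD programming interface:

```c
#include <assert.h>

/* Toy interrupt entry array (stands in for MSI-X/IMS entries). */
#define NENTRIES 8
static int fired[NENTRIES];	/* completion interrupts raised per entry */

/* Each descriptor carries its own routing information, so one shared
 * queue can notify different clients via different entries. */
struct toy_desc {
	unsigned int pasid;	/* submitting client's address space */
	unsigned int handle;	/* index into the interrupt entry array */
};

/* "Hardware" completes a descriptor: it notifies via the entry the
 * descriptor named, not via a fixed per-queue vector. */
static void toy_complete(const struct toy_desc *desc)
{
	fired[desc->handle]++;
}
```

The point Kevin makes follows directly: the queue-to-interrupt association is per descriptor, so a single queue ends up tied to multiple entries and PASIDs.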

Thanks
Kevin


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-09  5:23                                   ` Tian, Kevin
@ 2021-12-09  8:37                                     ` Thomas Gleixner
  2021-12-09 12:31                                       ` Tian, Kevin
  2021-12-09 16:21                                       ` Jason Gunthorpe
  0 siblings, 2 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-09  8:37 UTC (permalink / raw)
  To: Tian, Kevin, Jiang, Dave, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

On Thu, Dec 09 2021 at 05:23, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> I don't see anything wrong with that. A subdevice is it's own entity and
>> VFIO can chose the most conveniant representation of it to the guest
>> obviously.
>> 
>> How that is backed on the host does not really matter. You can expose
>> MSI-X to the guest with a INTx backing as well.
>> 
>
> Agree with this point. How the interrupts are represented to the guest
> is orthogonal to how the backend resource is allocated. Physically MSI-X 
> and IMS can be enabled simultaneously on an IDXD device. Once 
> dynamic allocation is allowed for both, either one can be allocated for
> a subdevice (with only difference on supported #subdevices). 
>
> When an interrupt resource is exposed to the guest with the same type 
> (e.g. MSI-on-MSI or IMS-on-IMS), it can be also passed through to the 
> guest as long as a hypercall machinery is in place to get addr/data pair 
> from the host (as you suggested earlier).

As I pointed out in the conclusion of this thread, IMS is only going to
be supported with interrupt remapping in place on both host and guest.

As these devices require a vIOMMU on the guest anyway (PASID, user
IO page tables), the required hypercalls are part of the vIOMMU/IR
implementation. If you look at it from the irqdomain hierarchy view:

                         |- PCI-MSI
  VECTOR -- [v]IOMMU/IR -|- PCI-MSI-X
                         |- PCI-IMS

So host and guest use just the same representation, which makes a ton of
sense.

There are two places where this matters:

  1) The activate() callback of the IR domain

  2) The irq_set_affinity() callback of the irqchip associated with the
     IR domain

Both callbacks are allowed to fail and the error code is handed back to
the originating call site.
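
That failure propagation can be sketched as a toy model (hypothetical names; the real path runs down through the irqdomain hierarchy):

```c
#include <assert.h>
#include <errno.h>

/* Toy model: the IR domain's activate() may fail, e.g. because the
 * host-side backing resource is exhausted, and the error is returned
 * unmodified to the originating request site, identically for
 * MSI, MSI-X and IMS. */

static int host_vectors_left = 1;

static int ir_domain_activate(void)
{
	if (host_vectors_left <= 0)
		return -ENOSPC;		/* resource shortage */
	host_vectors_left--;
	return 0;
}

/* Stand-in for the originating call site (e.g. an interrupt request). */
static int request_irq_toy(void)
{
	/* Activation walks the hierarchy; the activate() result is
	 * handed straight back to the caller. */
	return ir_domain_activate();
}
```

With the vIOMMU/IR layer in the middle, a guest-side request sees the same error path as a host-side one, which is what makes the three interrupt types the common case.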

If you look at the above hierarchy view then MSI/MSI-X/IMS are all
treated in exactly the same way. It all becomes the common case.

No?

Thanks,

        tglx




* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-09  6:26                                               ` Tian, Kevin
@ 2021-12-09  9:03                                                 ` Thomas Gleixner
  2021-12-09 12:17                                                   ` Tian, Kevin
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-09  9:03 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Jiang, Dave, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

Kevin,

On Thu, Dec 09 2021 at 06:26, Kevin Tian wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> I don't know of any use case for more than one interrupt on a queue,
>> and if it did come up I'd probably approach it by making the queue
>> handle above also specify the 'queue relative HW index'
>
> We have such use case with IDXD.
>
> Basically the IDXD queue allows software to put an interrupt handle
> (the index of an MSI-X or IMS entry) in the submitted descriptor. Upon
> completion of the descriptor the hardware finds the specified entry
> and then generates an interrupt to notify software.
>
> Conceptually, descriptors submitted to the same queue can use different
> handles, implying that one queue can be associated with multiple interrupts.

I think you are looking at that from the internal implementation details
of IDXD. But you can just model it in an IDXD implementation agnostic
way:

     ENQCMD(PASID, IMS-ENTRY,.....)

implies an on demand allocation of a virtual queue, which is deallocated
when the command completes. The PASID and IMS-ENTRY act as the 'queue'
identifier.

The implementation detail of IDXD that it executes these computations on
an internal shared workqueue does not change that.

Such a workqueue can only execute one enqueued command at a time, which
means that during the execution of a particular command that IDXD
internal workqueue represents the 'virtual queue' which is identified by
the unique PASID/IMS-ENTRY pair.

No?

Thanks,

        tglx


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-09  9:03                                                 ` Thomas Gleixner
@ 2021-12-09 12:17                                                   ` Tian, Kevin
  2021-12-09 15:57                                                     ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Tian, Kevin @ 2021-12-09 12:17 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Jiang, Dave, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Thursday, December 9, 2021 5:03 PM
> 
> Kevin,
> 
> On Thu, Dec 09 2021 at 06:26, Kevin Tian wrote:
> >> From: Jason Gunthorpe <jgg@nvidia.com>
> >> I don't know of any use case for more than one interrupt on a queue,
> >> and if it did come up I'd probably approach it by making the queue
> >> handle above also specify the 'queue relative HW index'
> >
> > We have such use case with IDXD.
> >
> > Basically the IDXD queue allows software to put an interrupt handle
> > (the index of an MSI-X or IMS entry) in the submitted descriptor. Upon
> > completion of the descriptor the hardware finds the specified entry
> > and then generates an interrupt to notify software.
> >
> > Conceptually, descriptors submitted to the same queue can use different
> > handles, implying that one queue can be associated with multiple interrupts.
> 
> I think you are looking at that from the internal implementation details
> of IDXD. But you can just model it in an IDXD implementation agnostic
> way:
> 
>      ENQCMD(PASID, IMS-ENTRY,.....)

Not exactly IMS-ENTRY. MSI-ENTRY also works.

> 
> implies an on demand allocation of a virtual queue, which is deallocated
> when the command completes. The PASID and IMS-ENTRY act as the 'queue'
> identifier.
> 
> The implementation detail of IDXD that it executes these computations on
> an internal shared workqueue does not change that.
> 
> Such a workqueue can only execute one enqueued command at a time, which
> means that during the execution of a particular command that IDXD
> internal workqueue represents the 'virtual queue' which is identified by
> the unique PASID/IMS-ENTRY pair.
> 
> No?
> 

While that's one way of looking at this model, do we want to actually
create some objects to represent this 'virtual queue' concept? That
implies each ENQCMD must be mediated to create such short-lifespan
objects, and I'm not sure of the benefit of doing so.

If not, then from the driver's point of view it's still one queue
resource, and the driver needs to manage its association with multiple
interrupt entries and PASIDs when it's connected to multiple clients.

Thanks
Kevin


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-09  8:37                                     ` Thomas Gleixner
@ 2021-12-09 12:31                                       ` Tian, Kevin
  2021-12-09 16:21                                       ` Jason Gunthorpe
  1 sibling, 0 replies; 141+ messages in thread
From: Tian, Kevin @ 2021-12-09 12:31 UTC (permalink / raw)
  To: Thomas Gleixner, Jiang, Dave, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Thursday, December 9, 2021 4:37 PM
> 
> On Thu, Dec 09 2021 at 05:23, Kevin Tian wrote:
> >> From: Thomas Gleixner <tglx@linutronix.de>
> >> I don't see anything wrong with that. A subdevice is its own entity and
> >> VFIO can choose the most convenient representation of it to the guest,
> >> obviously.
> >>
> >> How that is backed on the host does not really matter. You can expose
> >> MSI-X to the guest with an INTx backing as well.
> >>
> >
> > Agree with this point. How the interrupts are represented to the guest
> > is orthogonal to how the backend resource is allocated. Physically MSI-X
> > and IMS can be enabled simultaneously on an IDXD device. Once dynamic
> > allocation is allowed for both, either one can be allocated for a
> > subdevice (the only difference being the number of supported subdevices).
> >
> > When an interrupt resource is exposed to the guest with the same type
> > (e.g. MSI-on-MSI or IMS-on-IMS), it can also be passed through to the
> > guest as long as hypercall machinery is in place to get the addr/data
> > pair from the host (as you suggested earlier).
> 
> As I pointed out in the conclusion of this thread, IMS is only going to
> be supported with interrupt remapping in place on both host and guest.

I still need to read the last few mails but thanks for pointing it out now.

> 
> As these devices are requiring a vIOMMU on the guest anyway (PASID, User
> IO page tables), the required hypercalls are part of the vIOMMU/IR
> implementation. If you look at it from the irqdomain hierarchy view:
> 
>                          |- PCI-MSI
>   VECTOR -- [v]IOMMU/IR -|- PCI-MSI-X
>                          |- PCI-IMS
> 
> So host and guest use just the same representation which makes a ton of
> sense.
> 
> There are two places where this matters:
> 
>   1) The activate() callback of the IR domain
> 
>   2) The irq_set_affinity() callback of the irqchip associated with the
>      IR domain
> 
> Both callbacks are allowed to fail and the error code is handed back to
> the originating call site.
> 
> If you look at the above hierarchy view then MSI/MSI-X/IMS are all
> treated in exactly the same way. It all becomes the common case.
> 
> No?
> 

Yes, I think the above makes sense.

For a new guest OS which supports this enlightened hierarchy, the same
machinery works for all types of interrupt storage, and we have a
failure path from host to guest in case of a host-side resource
shortage. No trap is required on guest access to the interrupt storage.

A legacy guest OS which doesn't support the enlightened hierarchy
can only use MSI/MSI-X, which is still trapped. But with the vector
reallocation support from your work the situation already improves
a lot over the current awkward way in VFIO (freeing all previous
vectors and then re-allocating).

Overall I think this is a good modeling.

Thanks
Kevin


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-09 12:17                                                   ` Tian, Kevin
@ 2021-12-09 15:57                                                     ` Thomas Gleixner
  2021-12-10  7:37                                                       ` Tian, Kevin
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-09 15:57 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Jiang, Dave, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

On Thu, Dec 09 2021 at 12:17, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> I think you are looking at that from the internal implementation details
>> of IDXD. But you can just model it in an IDXD implementation agnostic
>> way:
>> 
>>      ENQCMD(PASID, IMS-ENTRY,.....)
>
> Not exactly IMS-ENTRY. MSI-ENTRY also works.

Sure.

>> 
>> implies an on demand allocation of a virtual queue, which is deallocated
>> when the command completes. The PASID and IMS-ENTRY act as the 'queue'
>> identifier.
>> 
>> The implementation detail of IDXD that it executes these computations on
>> an internal shared workqueue does not change that.
>> 
>> Such a workqueue can only execute one enqueued command at a time,
>> which
>> means that during the execution of a particular command that IDXD
>> internal workqueue represents the 'virtual queue' which is identified by
>> the unique PASID/IMS-ENTRY pair.
>
> While it's one way of looking at this model do we want to actually
> create some objects to represent this 'virtual queue' concept? that
> implies each ENQCMD must be moderated to create such short-lifespan
> objects and I'm not sure the benefit of doing so.

You don't have to create anything. The PASID/ENTRY pair represents that
'virtual queue', right?

> If not then from driver p.o.v it's still one queue resource and driver 
> needs to manage its association with multiple interrupt entries and 
> PASIDs when it's connected to multiple clients.

That's correct, but there is nothing problematic with it. It's like
allocating multiple interrupts for any other hardware device or
subdevice, right?

What's probably more interesting is how the PASID/interrupt/RID
relations are managed.

Thanks,

        tglx



* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-09  8:37                                     ` Thomas Gleixner
  2021-12-09 12:31                                       ` Tian, Kevin
@ 2021-12-09 16:21                                       ` Jason Gunthorpe
  2021-12-09 20:32                                         ` Thomas Gleixner
  1 sibling, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-09 16:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tian, Kevin, Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas,
	Marc Zygnier, Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
> On Thu, Dec 09 2021 at 05:23, Kevin Tian wrote:
> >> From: Thomas Gleixner <tglx@linutronix.de>
> >> I don't see anything wrong with that. A subdevice is its own entity and
> >> VFIO can choose the most convenient representation of it to the guest
> >> obviously.
> >> 
> >> How that is backed on the host does not really matter. You can expose
> >> MSI-X to the guest with a INTx backing as well.
> >> 
> >
> > Agree with this point. How the interrupts are represented to the guest
> > is orthogonal to how the backend resource is allocated. Physically MSI-X
> > and IMS can be enabled simultaneously on an IDXD device. Once
> > dynamic allocation is allowed for both, either one can be allocated for
> > a subdevice (the only difference being the number of supported
> > subdevices).
> >
> > When an interrupt resource is exposed to the guest with the same type
> > (e.g. MSI-on-MSI or IMS-on-IMS), it can also be passed through to the
> > guest as long as a hypercall machinery is in place to get the addr/data
> > pair from the host (as you suggested earlier).
> 
> As I pointed out in the conclusion of this thread, IMS is only going to
> be supported with interrupt remapping in place on both host and guest.
> 
> As these devices are requiring a vIOMMU on the guest anyway (PASID, User
> IO page tables), the required hypercalls are part of the vIOMMU/IR
> implementation. If you look at it from the irqdomain hierarchy view:

It is true for IDXD, but mlx5 will work without a PASID or vIOMMU in a
guest today, and there is no reason to imagine some future IMS would
have any different device requirements from MSI-X.

Though a vIOMMU operating in bypass mode seems like it is fine if it
helps the solution.

> If you look at the above hierarchy view then MSI/MSI-X/IMS are all
> treated in exactly the same way. It all becomes the common case.

Unfortunately in a guest they are not all the same - it is like the
PPC code I saw messing with MSI today - MSI setup is a hypercall,
either explicitly or implicitly by trapping device registers.

So MSI is special compared to everything else because MSI has that
hypervisor intrusion.

My ideal outcome would be to have the guest negotiate some new
capability with the hypervisor where the guest promises to use new
hypercalls to get addr/data pairs and the hypervisor removes all the
emulation around PCI MSI. Then MSI == IMS again and we can have
everything treated in exactly the same way. The hypervisor doesn't care
how the guest tells the origin device what the addr/data pair is.

This moves the hypervisor trap to setup the interrupt remapping from
the MSI emulation to the new hypercall.

If we keep the MSI emulation in the hypervisor then MSI != IMS.  The
MSI code needs to write an addr/data pair compatible with the emulation
and the IMS code needs to write an addr/data pair from the
hypercall. Seems like this scenario is best avoided!

From this perspective I haven't connected how virtual interrupt
remapping helps in the guest? Is this a way to provide the hypercall
I'm imagining above?

Jason


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-09 16:21                                       ` Jason Gunthorpe
@ 2021-12-09 20:32                                         ` Thomas Gleixner
  2021-12-09 20:58                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-09 20:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas,
	Marc Zygnier, Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

On Thu, Dec 09 2021 at 12:21, Jason Gunthorpe wrote:
> On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
> If we keep the MSI emulation in the hypervisor then MSI != IMS.  The
> MSI code needs to write a addr/data pair compatible with the emulation
> and the IMS code needs to write an addr/data pair from the
> hypercall. Seems like this scenario is best avoided!
>
> From this perspective I haven't connected how virtual interrupt
> remapping helps in the guest? Is this a way to provide the hypercall
> I'm imagining above?

That was my thought to avoid having different mechanisms.

The address/data pair is computed in two places:

  1) Activation of an interrupt
  2) Affinity setting on an interrupt

Both configure the IRTE when interrupt remapping is in place.

In both cases a vector is allocated in the vector domain and based on
the resulting target APIC / vector number pair the IRTE is
(re)configured.

So putting the hypercall into the vIRTE update is the obvious
place. Both activation and affinity setting can fail and propagate an
error code down to the originating caller.
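
A minimal sketch of that error propagation (illustrative Python model,
not kernel code; all names invented): the hypercall sits in the vIRTE
update path, and both activation and affinity setting surface its error
code to the caller.

```python
ENOSPC = 28  # illustrative errno value

class Host:
    """Toy host side: backs a vIRTE with a real interrupt on first update."""

    def __init__(self, free_vectors):
        self.free_vectors = free_vectors

    def hypercall_update_irte(self, irte):
        if not irte.get("backed"):
            if self.free_vectors == 0:
                return -ENOSPC  # host-side shortage reported to the guest
            self.free_vectors -= 1
            irte["backed"] = True
        return 0

def activate(host, irte):
    # The first vIRTE update happens at interrupt activation.
    return host.hypercall_update_irte(irte)

def set_affinity(host, irte, cpu):
    # An affinity change is just another vIRTE update; same failure path.
    ret = host.hypercall_update_irte(irte)
    if ret == 0:
        irte["cpu"] = cpu
    return ret
```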

Hmm?

Thanks,

        tglx




* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-09 20:32                                         ` Thomas Gleixner
@ 2021-12-09 20:58                                           ` Jason Gunthorpe
  2021-12-09 22:09                                             ` Thomas Gleixner
  2021-12-10  7:36                                             ` Tian, Kevin
  0 siblings, 2 replies; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-09 20:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tian, Kevin, Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas,
	Marc Zygnier, Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

On Thu, Dec 09, 2021 at 09:32:42PM +0100, Thomas Gleixner wrote:
> On Thu, Dec 09 2021 at 12:21, Jason Gunthorpe wrote:
> > On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
> > If we keep the MSI emulation in the hypervisor then MSI != IMS.  The
> > MSI code needs to write a addr/data pair compatible with the emulation
> > and the IMS code needs to write an addr/data pair from the
> > hypercall. Seems like this scenario is best avoided!
> >
> > From this perspective I haven't connected how virtual interrupt
> > remapping helps in the guest? Is this a way to provide the hypercall
> > I'm imagining above?
> 
> That was my thought to avoid having different mechanisms.
> 
> The address/data pair is computed in two places:
> 
>   1) Activation of an interrupt
>   2) Affinity setting on an interrupt
> 
> Both configure the IRTE when interrupt remapping is in place.
> 
> In both cases a vector is allocated in the vector domain and based on
> the resulting target APIC / vector number pair the IRTE is
> (re)configured.
> 
> So putting the hypercall into the vIRTE update is the obvious
> place. Both activation and affinity setting can fail and propagate an
> error code down to the originating caller.
> 
> Hmm?

Okay, I think I get it. Would be nice to have someone from Intel
familiar with the vIOMMU protocols and qemu code remark on what the
hypervisor side can look like.

There is a bit more work here; we'd have to change VFIO to somehow
entirely disconnect the kernel IRQ logic from the MSI table and
directly pass control of it to the guest after the hypervisor IOMMU IR
secures it, i.e. directly mmap the MSI-X table into the guest.

Jason


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-09 20:58                                           ` Jason Gunthorpe
@ 2021-12-09 22:09                                             ` Thomas Gleixner
  2021-12-10  0:26                                               ` Thomas Gleixner
  2021-12-10  7:36                                             ` Tian, Kevin
  1 sibling, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-09 22:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas,
	Marc Zygnier, Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

On Thu, Dec 09 2021 at 16:58, Jason Gunthorpe wrote:
> On Thu, Dec 09, 2021 at 09:32:42PM +0100, Thomas Gleixner wrote:
>> That was my thought to avoid having different mechanisms.
>> 
>> The address/data pair is computed in two places:
>> 
>>   1) Activation of an interrupt
>>   2) Affinity setting on an interrupt
>> 
>> Both configure the IRTE when interrupt remapping is in place.
>> 
>> In both cases a vector is allocated in the vector domain and based on
>> the resulting target APIC / vector number pair the IRTE is
>> (re)configured.
>> 
>> So putting the hypercall into the vIRTE update is the obvious
>> place. Both activation and affinity setting can fail and propagate an
>> error code down to the originating caller.
>> 
>> Hmm?
>
> Okay, I think I get it. Would be nice to have someone from intel
> familiar with the vIOMMU protocols and qemu code remark what the
> hypervisor side can look like.
>
> There is a bit more work here, we'd have to change VFIO to somehow
> entirely disconnect the kernel IRQ logic from the MSI table and
> directly pass control of it to the guest after the hypervisor IOMMU IR
> secures it. ie directly mmap the msi-x table into the guest

That makes everything consistent and a clear cut on all levels, right?

Thanks,

        tglx



* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-09 22:09                                             ` Thomas Gleixner
@ 2021-12-10  0:26                                               ` Thomas Gleixner
  2021-12-10  7:29                                                 ` Tian, Kevin
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-10  0:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas,
	Marc Zygnier, Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

On Thu, Dec 09 2021 at 23:09, Thomas Gleixner wrote:
> On Thu, Dec 09 2021 at 16:58, Jason Gunthorpe wrote:
>> Okay, I think I get it. Would be nice to have someone from intel
>> familiar with the vIOMMU protocols and qemu code remark what the
>> hypervisor side can look like.
>>
>> There is a bit more work here, we'd have to change VFIO to somehow
>> entirely disconnect the kernel IRQ logic from the MSI table and
>> directly pass control of it to the guest after the hypervisor IOMMU IR
>> secures it. ie directly mmap the msi-x table into the guest
>
> That makes everything consistent and a clear cut on all levels, right?

Let me give a bit more rationale for why I think this is the right
thing to do. There are several problems with IMS both on the host and on
the guest side:

  1) Contrary to MSI/MSI-X the address/data pair is not completely
     managed by the core. It's handed off to driver writers in the
     hope they get it right.

  2) Without interrupt remapping there is a fundamental issue on x86
     for the affinity setting case, as there is no guarantee that
     the magic protocol which we came up with (see msi_set_affinity()
     in the x86 code) is correctly implemented at the driver level or
     that the update is truly atomic so that the problem does not
     arise. My interest in chasing these things is exactly zero.

     With interrupt remapping the affinity change happens at the IRTE
     level and not at the device level. It's a one time setup for the
     device.

     Just for the record:

     The ATH11 thing does not have that problem by pure luck because
     multi-vector MSI is not supported on X86 unless interrupt
     remapping is enabled. 

     The switchtec NTB thing will fall apart w/o remapping AFAICT.

  3) With remapping the message for the device is constructed at
     allocation time. It does not change after that because the affinity
     change happens at the remapping level, which eliminates #2 above.

     That has another advantage for IMS because it does not require any
     synchronization with the queue or whatever is involved. The next
     interrupt after the change at the remapping level ends up on the
     new target.

  4) For the guest side we agreed that we need a hypercall because the
     host can't trap the write to the MSI[-X] entry anymore.

     Aside of the fact that this creates a special case for IMS which is
     undesirable in my opinion, it's not really obvious where the
     hypercall should be placed to work for all scenarios so that it can
     also solve the existing issue of silent failures.

  5) It's not possible for the kernel to reliably detect whether it is
     running on bare metal or not. Yes we talked about heuristics, but
     that's something I really want to avoid.

When looking at the above I came to the conclusion that the consistent
way is to make IMS depend on IR both on the host and the guest as this
solves all of the above in one go.

How would that work? With IR the irqdomain hierarchy looks like this:

                   |--IO/APIC
                   |--MSI
    vector -- IR --|--MSI-X
                   |--IMS

There are several context where this matters:

  1) Allocation of an interrupt, e.g. pci_alloc_irq_vectors().

  2) Activation of an interrupt which happens during allocation and/or
     at request_irq() time

  3) Interrupt affinity setting

#1 Allocation

   That allocates an IRTE, which can fail

#2 Activation

   That's the step where actually a CPU vector is allocated, where the
   IRTE is updated and where the device message is composed to target
   the IRTE.

   On X86 activation is happening twice:

   1) During allocation it allocates a special CPU vector which is
      handed out to all allocated interrupts. That's called reservation
      mode. This was introduced to prevent vector exhaustion for two
      cases:
      
       - Devices allocating tons of MSI-X vectors without using
         them. That obviously needs to be fixed at the device driver
         level, but due to the fact that post probe() allocation is not
         supported, that's not always possible

       - CPU hotunplug

         All vectors targeting the outgoing CPU need to be migrated to a
         new target CPU, which can result in exhaustion of the vector
         space.

         Reservation mode avoids that because it just uses a unique
         vector for all interrupts which are allocated but not
         requested.

    2) On request_irq()

       As the vector assigned during allocation is just a placeholder
       to make the MSI hardware happy, it needs to be replaced by a
       real vector.

   Both can fail and the error is propagated through the call chain
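
   Reservation mode can be sketched like this (simplified Python model,
   invented names and values): allocation only hands out one shared
   placeholder vector, and the real, unique vector is assigned at
   request_irq() time, which is where exhaustion can be reported.

```python
ENOSPC = 28
RESERVED_VECTOR = 0xEF  # one shared placeholder (value is illustrative)

class VectorDomain:
    def __init__(self, nr_vectors):
        self.free = list(range(0x30, 0x30 + nr_vectors))

    def alloc_irq(self):
        # Allocation does not consume a real vector, so devices that
        # allocate many unused MSI-X entries can't exhaust the space.
        return {"vector": RESERVED_VECTOR}

    def request_irq(self, irq):
        # The placeholder is replaced by a real, unique vector here.
        if not self.free:
            return -ENOSPC
        irq["vector"] = self.free.pop(0)
        return 0
```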

#3 Changing the interrupt affinity

   This obviously needs to allocate a new target CPU vector and update
   the IRTE.

   Allocating a new target CPU vector can fail.
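
   To illustrate why only the IRTE needs updating (toy Python model;
   the message layout here is deliberately fake): with remapping, the
   message programmed into the device encodes only the IRTE index, so
   it stays constant across affinity changes.

```python
class Irte:
    def __init__(self, index):
        self.index = index
        self.apic_id = None
        self.vector = None

def compose_device_msg(irte):
    # The device message only points at the IRTE; it never encodes
    # the target CPU directly. (Bit layout invented for illustration.)
    return {"addr": 0xFEE0_0000 | (irte.index << 5), "data": 0}

def set_affinity(irte, apic_id, vector):
    # Retargeting rewrites the IRTE only; the device is untouched.
    irte.apic_id = apic_id
    irte.vector = vector
```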

When looking at it from the host side, then the host needs to do the
same things:

  1) Allocate an IRTE for #1

  2) Update the IRTE for #2 and #3

But that does not necessarily mean that we need two hypercalls. We can
get away with one in the code which updates the IRTE, and that would be
the point where the host side has to allocate the backing host
interrupt, replacing the allocate-on-unmask mechanism which is used
today.

It might look awkward at first sight that an IRTE update can fail, but
it's not that awkward when put into context:

  The first update happens during activation and activation can fail for
  various reasons.
  
The charm is that this works for everything from INTx to IMS because all
of them go through the same procedure, except that INTx (IO/APIC) does
not support the reservation mode dance.

Thoughts?

Thanks,

        tglx


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-10  0:26                                               ` Thomas Gleixner
@ 2021-12-10  7:29                                                 ` Tian, Kevin
  2021-12-10 12:13                                                   ` Thomas Gleixner
  2021-12-10 12:39                                                   ` Jason Gunthorpe
  0 siblings, 2 replies; 141+ messages in thread
From: Tian, Kevin @ 2021-12-10  7:29 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe
  Cc: Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Friday, December 10, 2021 8:26 AM
> 
> On Thu, Dec 09 2021 at 23:09, Thomas Gleixner wrote:
> > On Thu, Dec 09 2021 at 16:58, Jason Gunthorpe wrote:
> >> Okay, I think I get it. Would be nice to have someone from intel
> >> familiar with the vIOMMU protocols and qemu code remark what the
> >> hypervisor side can look like.
> >>
> >> There is a bit more work here, we'd have to change VFIO to somehow
> >> entirely disconnect the kernel IRQ logic from the MSI table and
> >> directly pass control of it to the guest after the hypervisor IOMMU IR
> >> secures it. ie directly mmap the msi-x table into the guest
> >
> > That makes everything consistent and a clear cut on all levels, right?
> 
> Let me give a bit more rationale here, why I think this is the right
> thing to do. There are several problems with IMS both on the host and on
> the guest side:
> 
>   1) Contrary to MSI/MSI-X the address/data pair is not completely
>      managed by the core. It's handed off to driver writers in the
>      hope they get it right.
> 
>   2) Without interrupt remapping there is a fundamental issue on x86
>      for the affinity setting case, as there is no guarantee that
>      the magic protocol which we came up with (see msi_set_affinity()
>      in the x86 code) is correctly implemented at the driver level or
>      that the update is truly atomic so that the problem does not
>      arise. My interrest in chasing these things is exactly zero.
> 
>      With interrupt remapping the affinity change happens at the IRTE
>      level and not at the device level. It's a one time setup for the
>      device.

This is a good rationale for making IMS depend on IR on native.

> 
>      Just for the record:
> 
>      The ATH11 thing does not have that problem by pure luck because
>      multi-vector MSI is not supported on X86 unless interrupt
>      remapping is enabled.
> 
>      The switchtec NTB thing will fall apart w/o remapping AFAICT.
> 
>   3) With remapping the message for the device is constructed at
>      allocation time. It does not change after that because the affinity
>      change happens at the remapping level, which eliminates #2 above.
> 
>      That has another advantage for IMS because it does not require any
>      synchronization with the queue or whatever is involved. The next
>      interrupt after the change at the remapping level ends up on the
>      new target.

Yes

> 
>   4) For the guest side we agreed that we need an hypercall because the
>      host can't trap the write to the MSI[-X] entry anymore.

To be accurate, I'd rather not call it "can't trap". The host still traps the
MSI/MSI-X entry if the hypercall is not used. That covers current guest
OSes which don't have this hypercall mechanism. For a future guest OS
which supports this machinery, a handshake from such a guest will
disable the trap for MSI-X and map the table for direct guest access on
the fly.

MSI always has to be trapped even after the guest has acquired the right
data/addr pair via the hypercall, since it's a PCI config capability.

> 
>      Aside of the fact that this creates a special case for IMS which is
>      undesirable in my opinion, it's not really obvious where the
>      hypercall should be placed to work for all scenarios so that it can
>      also solve the existing issue of silent failures.
> 
>   5) It's not possible for the kernel to reliably detect whether it is
>      running on bare metal or not. Yes we talked about heuristics, but
>      that's something I really want to avoid.

How would the hypercall mechanism avoid such heuristics?

> 
> When looking at the above I came to the conclusion that the consistent
> way is to make IMS depend on IR both on the host and the guest as this
> solves all of the above in one go.
> 
> How would that work? With IR the irqdomain hierarchy looks like this:
> 
>                    |--IO/APIC
>                    |--MSI
>     vector -- IR --|--MIX-X
>                    |--IMS
> 
> There are several context where this matters:
> 
>   1) Allocation of an interrupt, e.g. pci_alloc_irq_vectors().
> 
>   2) Activation of an interrupt which happens during allocation and/or
>      at request_irq() time
> 
>   3) Interrupt affinity setting
> 
> #1 Allocation
> 
>    That allocates an IRTE, which can fail
> 
> #2 Activation
> 
>    That's the step where actually a CPU vector is allocated, where the
>    IRTE is updated and where the device message is composed to target
>    the IRTE.
> 
>    On X86 activation is happening twice:
> 
>    1) During allocation it allocates a special CPU vector which is
>       handed out to all allocated interrupts. That's called reservation
>       mode. This was introduced to prevent vector exhaustion for two
>       cases:
> 
>        - Devices allocating tons of MSI-X vectors without using
>          them. That obviously needs to be fixed at the device driver
>          level, but due to the fact that post probe() allocation is not
>          supported, that's not always possible
> 
>        - CPU hotunplug
> 
>          All vectors targeting the outgoing CPU need to be migrated to a
>          new target CPU, which can result in exhaustion of the vector
>          space.
> 
>          Reservation mode avoids that because it just uses a unique
>          vector for all interrupts which are allocated but not
>          requested.
> 
>     2) On request_irq()
> 
>        As the vector assigned during allocation is just a place holder
>        to make the MSI hardware happy it needs to be replaced by a
>        real vector.
> 
>    Both can fail and the error is propagated through the call chain
> 
> #3 Changing the interrupt affinity
> 
>    This obviously needs to allocate a new target CPU vector and update
>    the IRTE.
> 
>    Allocating a new target CPU vector can fail.
> 
> When looking at it from the host side, then the host needs to do the
> same things:
> 
>   1) Allocate an IRTE for #1
> 
>   2) Update the IRTE for #2 and #3
> 
> But that does not necessarily mean that we need two hypercalls. We can
> get away with one in the code which updates the IRTE and that would be
> the point where the host side has to allocate the backing host
> interrupt, which would replace that allocate on unmask mechanism which
> is used today.
> 
> It might look awkward on first sight that an IRTE update can fail, but
> it's not that awkward when put into context:
> 
>   The first update happens during activation and activation can fail for
>   various reasons.
> 
> The charm is that his works for everything from INTx to IMS because all
> of them go through the same procedure, except that INTx (IO/APIC) does
> not support the reservation mode dance.
> 
> Thoughts?
> 

The above sounds like the right direction to me, and here is some more
thought down the road.

First let's look at how the interrupt is delivered to the guest today.

With IR the physical interrupt is first delivered to the host, then
converted into a virtual interrupt and finally injected into the guest.
Let's put posted interrupts aside since they are an optional platform
capability.

    HW interrupt
        vfio_msihandler(): ->irqfd
            kvm_arch_set_irq_inatomic()
                kvm_set_msi_irq()
                kvm_irq_delivery_to_apic_fast()

Virtual interrupt injection is based on the virtual routing information 
registered by Qemu, via KVM_SET_GSI_ROUTING(gsi, routing_info). 

GSI is later associated to irqfd via KVM_IRQFD(irqfd, gsi). 
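
A rough sketch of that association (Python toy model of the ioctl
semantics, not the real KVM API surface):

```python
class KvmModel:
    """Toy model of the KVM_SET_GSI_ROUTING / KVM_IRQFD interaction."""

    def __init__(self):
        self.routes = {}   # gsi -> (addr, data), via KVM_SET_GSI_ROUTING
        self.irqfds = {}   # eventfd -> gsi, via KVM_IRQFD
        self.injected = []

    def set_gsi_routing(self, gsi, addr, data):
        self.routes[gsi] = (addr, data)

    def irqfd(self, eventfd, gsi):
        self.irqfds[eventfd] = gsi

    def signal(self, eventfd):
        # vfio_msihandler() signals the eventfd; KVM resolves the GSI
        # through the routing table and injects the virtual interrupt.
        gsi = self.irqfds[eventfd]
        self.injected.append(self.routes[gsi])
```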

Qemu composes the virtual routing information based on trapping
the various interrupt storages (INTx, MSI, MSI-X, etc.). When IR is
enabled in the vIOMMU, the routing info is composed by combining the
storage entry and the vIRTE entry.

Now, adding the new hypercall machinery to the above flow, we obviously
also want the hypercall to carry the virtual routing information
(vIRTE) to the host, besides acquiring the data/addr pair from the host.
Without the trap this information is now hidden from Qemu.

Then Qemu needs to find out the GSI number for the vIRTE handle.
Again, Qemu doesn't have that information, since without the trap it
doesn't know which MSI[-X] entry points to this handle.

This implies that we may also need to carry the device ID, the MSI entry
index, etc. in the hypercall, so Qemu can associate the virtual routing
info with the right [irqfd, gsi].

In your model the hypercall is raised by IR domain. Do you see
any problem of finding those information within IR domain?

Thanks
Kevin


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-09 20:58                                           ` Jason Gunthorpe
  2021-12-09 22:09                                             ` Thomas Gleixner
@ 2021-12-10  7:36                                             ` Tian, Kevin
  2021-12-10 12:30                                               ` Jason Gunthorpe
  2021-12-12  6:44                                               ` Mika Penttilä
  1 sibling, 2 replies; 141+ messages in thread
From: Tian, Kevin @ 2021-12-10  7:36 UTC (permalink / raw)
  To: Jason Gunthorpe, Thomas Gleixner
  Cc: Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, December 10, 2021 4:59 AM
> 
> On Thu, Dec 09, 2021 at 09:32:42PM +0100, Thomas Gleixner wrote:
> > On Thu, Dec 09 2021 at 12:21, Jason Gunthorpe wrote:
> > > On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
> > > If we keep the MSI emulation in the hypervisor then MSI != IMS.  The
> > > MSI code needs to write a addr/data pair compatible with the emulation
> > > and the IMS code needs to write an addr/data pair from the
> > > hypercall. Seems like this scenario is best avoided!
> > >
> > > From this perspective I haven't connected how virtual interrupt
> > > remapping helps in the guest? Is this a way to provide the hypercall
> > > I'm imagining above?
> >
> > That was my thought to avoid having different mechanisms.
> >
> > The address/data pair is computed in two places:
> >
> >   1) Activation of an interrupt
> >   2) Affinity setting on an interrupt
> >
> > Both configure the IRTE when interrupt remapping is in place.
> >
> > In both cases a vector is allocated in the vector domain and based on
> > the resulting target APIC / vector number pair the IRTE is
> > (re)configured.
> >
> > So putting the hypercall into the vIRTE update is the obvious
> > place. Both activation and affinity setting can fail and propagate an
> > error code down to the originating caller.
> >
> > Hmm?
> 
> Okay, I think I get it. Would be nice to have someone from intel
> familiar with the vIOMMU protocols and qemu code remark what the
> hypervisor side can look like.
> 
> There is a bit more work here, we'd have to change VFIO to somehow
> entirely disconnect the kernel IRQ logic from the MSI table and
> directly pass control of it to the guest after the hypervisor IOMMU IR
> secures it. ie directly mmap the msi-x table into the guest
> 

It's supported already:

/*
 * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
 * which allows direct access to non-MSIX registers which happened to be within
 * the same system page.
 *
 * Even though the userspace gets direct access to the MSIX data, the existing
 * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX configuration.
 */
#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE      3

IIRC this was introduced for PPC when a device has MSI-X in the same BAR as
other MMIO registers. Trapping MSI-X degrades performance on accesses
to adjacent registers. MSI-X can be mapped by userspace because
PPC already uses a hypercall mechanism for interrupts. Though I'm unclear
about the details, it sounds like a usage similar to the one proposed here.

Thanks
Kevin


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-09 15:57                                                     ` Thomas Gleixner
@ 2021-12-10  7:37                                                       ` Tian, Kevin
  0 siblings, 0 replies; 141+ messages in thread
From: Tian, Kevin @ 2021-12-10  7:37 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe
  Cc: Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Jiang, Dave, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Thursday, December 9, 2021 11:58 PM
> 
> On Thu, Dec 09 2021 at 12:17, Kevin Tian wrote:
> >> From: Thomas Gleixner <tglx@linutronix.de>
> >> I think you are looking at that from the internal implementation details
> >> of IDXD. But you can just model it in an IDXD implementation agnostic
> >> way:
> >>
> >>      ENQCMD(PASID, IMS-ENTRY,.....)
> >
> > Not exactly IMS-ENTRY. MSI-ENTRY also works.
> 
> Sure.
> 
> >>
> >> implies an on demand allocation of a virtual queue, which is deallocated
> >> when the command completes. The PASID and IMS-ENTRY act as the
> 'queue'
> >> identifier.
> >>
> >> The implementation detail of IDXD that it executes these computations on
> >> an internal shared workqueue does not change that.
> >>
> >> Such a workqueue can only execute one enqueued command at a time,
> >> which
> >> means that during the execution of a particular command that IDXD
> >> internal workqueue represents the 'virtual queue' which is identified by
> >> the unique PASID/IMS-ENTRY pair.
> >
> > While it's one way of looking at this model, do we want to actually
> > create some objects to represent this 'virtual queue' concept? That
> > implies each ENQCMD must be moderated to create such short-lifespan
> > objects and I'm not sure of the benefit of doing so.
> 
> You don't have to create anything. The PASID/ENTRY pair represents that
> 'virtual queue', right?

Yes

> 
> > If not then from driver p.o.v it's still one queue resource and driver
> > needs to manage its association with multiple interrupt entries and
> > PASIDs when it's connected to multiple clients.
> 
> That's correct, but there is nothing problematic with it. It's like
> allocating multiple interrupts for any other hardware device or
> subdevice, right?

No question on this. Just want to point out such usage example 
since Jason mentioned it. 😊

> 
> What's probably more interresting is how the PASID/interrupt/RID
> relations are managed.
> 

Yes, that's something we need to think about further.

Thanks
Kevin


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-10  7:29                                                 ` Tian, Kevin
@ 2021-12-10 12:13                                                   ` Thomas Gleixner
  2021-12-11  8:06                                                     ` Tian, Kevin
  2021-12-10 12:39                                                   ` Jason Gunthorpe
  1 sibling, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-10 12:13 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

Kevin,

On Fri, Dec 10 2021 at 07:29, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>>   4) For the guest side we agreed that we need an hypercall because the
>>      host can't trap the write to the MSI[-X] entry anymore.
>
> To be accurate I'd like to not call it "can't trap". The host still traps the
> MSI/MSI-X entry if the hypercall is not used. This is for current guest
> OSes which don't have this hypercall mechanism. For a future new guest
> OS which supports this machinery, a handshake process from such a
> guest will disable the trap for MSI-X and map it for direct guest
> access on the fly.

Right. What I'm suggesting is a clear cut between the current approach,
which obviously needs to be preserved, and a consistent new approach
which handles MSI, MSI-X and IMS in the exactly same way.

> MSI always has to be trapped even though the guest has acquired the right
> data/addr pair via the hypercall, since it's a PCI config capability.
>
>> 
>>      Aside of the fact that this creates a special case for IMS which is
>>      undesirable in my opinion, it's not really obvious where the
>>      hypercall should be placed to work for all scenarios so that it can
>>      also solve the existing issue of silent failures.
>> 
>>   5) It's not possible for the kernel to reliably detect whether it is
>>      running on bare metal or not. Yes we talked about heuristics, but
>>      that's something I really want to avoid.
>
> How would the hypercall mechanism avoid such heuristics?

The availability of interrupt remapping, where the irqdomain provided
by the remapping unit signals that it supports this new scheme:

                    |--IO/APIC
                    |--MSI
     vector -- IR --|--MSI-X
                    |--IMS

while the current scheme is:

                    |--IO/APIC
     vector -- IR --|--PCI/MSI[-X]

or 

                    |--IO/APIC
     vector --------|--PCI/MSI[-X]

So in the new scheme the IR domain will advertise new features which are
not available on older kernels. The availability of these new features
is the indicator for the interrupt subsystem and subsequently for PCI
whether IMS is supported or not.

Bootup either finds an IR unit or not. In the bare metal case that's the
usual hardware/firmware detection. In the guest case it's the
availability of vIR including the required hypercall protocol.

So for the guest case the initialization would look like this:

   qemu starts guest
      Method of interrupt management defaults to current scheme
      restricted to MSI/MSI-X

      guest initializes
          older guests do not check for the hypercall so everything
          continues as of today

          a new guest initializes vIR, detects the hypercall and requests
          from the hypervisor to switch over to the new scheme.

          The hypervisor grants or rejects the request, i.e. either both
          switch to the new scheme or stay with the old one.

The new scheme means, that all variants, MSI, MSI-X, IMS are handled in
the same way. Staying on the old scheme means that IMS is not available
to the guest.

Having that clear separation is in my opinion way better than trying to
make all of that a big maze of conditionals.

I'm going to make that clear cut in the PCI/MSI management layer as
well. Trying to do the hybrid we do today for irqdomain and
non-irqdomain based backends is just not feasible. The decision which
way to go will be very close to the driver exposed API, i.e.:

   pci_alloc_ims_vector()
        if (new_scheme())
        	return new_scheme_alloc_ims();
        else
        	return -ENOTSUPP;

and new_scheme_alloc_ims() will have:

   new_scheme_alloc_ims()
        if (!system_supports_ims())
        	return -ENOTSUPP;
        ....

system_supports_ims() makes its decision based on the feature flags
exposed by the underlying base MSI irqdomain, i.e. either vector or IR
on x86.

Vector domain will not have that feature flag set, but IR will have on
bare metal and eventually in the guest when the vIR setup and hypercall
detection and negotiation succeeds.
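The decision logic described above can be sketched in plain C. The flag name, error value and types are stand-ins (the real kernel uses its own irqdomain feature flags and msi_domain_info structures); only the control flow mirrors the description:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative feature flag advertised by the base MSI irqdomain.
 * The vector domain would not set it; the IR domain would set it on
 * bare metal, and in a guest only after vIR + hypercall negotiation. */
#define MSI_DOMAIN_FLAG_SUPPORTS_IMS  (1u << 0)

struct msi_domain {
    uint32_t flags;    /* feature flags exposed by the base domain */
};

/* Decision based purely on the advertised feature flags. */
static bool system_supports_ims(const struct msi_domain *base)
{
    return base && (base->flags & MSI_DOMAIN_FLAG_SUPPORTS_IMS);
}

static int pci_alloc_ims_vector(const struct msi_domain *base)
{
    if (!system_supports_ims(base))
        return -95;    /* -EOPNOTSUPP */
    return 0;          /* would proceed with IMS allocation */
}
```

The key property is that no heuristic is involved: the driver-facing API fails cleanly unless the underlying domain has explicitly advertised the capability.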

> Then Qemu needs to find out the GSI number for the vIRTE handle. 
> Again Qemu doesn't have such information since it doesn't know 
> which MSI[-X] entry points to this handle due to no trap.
>
> This implies that we may also need carry device ID, #msi entry, etc. 
> in the hypercall, so Qemu can associate the virtual routing info
> to the right [irqfd, gsi].
>
> In your model the hypercall is raised by IR domain. Do you see
> any problem of finding those information within IR domain?

IR has the following information available:

   Interrupt type
    - MSI:   Device, index and number of vectors
    - MSI-X: Device, index
    - IMS:   Device, index

  Target APIC/vector pair

IMS: The index depends on the storage type:

     For storage in device memory, e.g. IDXD, it's the array index.

     For storage in system memory, the index is a software artifact.

Does that answer your question?

Thanks,

        tglx


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-10  7:36                                             ` Tian, Kevin
@ 2021-12-10 12:30                                               ` Jason Gunthorpe
  2021-12-12  6:44                                               ` Mika Penttilä
  1 sibling, 0 replies; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-10 12:30 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Thomas Gleixner, Jiang, Dave, Logan Gunthorpe, LKML,
	Bjorn Helgaas, Marc Zygnier, Alex Williamson, Dey, Megha, Raj,
	Ashok, linux-pci, Greg Kroah-Hartman, Jon Mason, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

On Fri, Dec 10, 2021 at 07:36:12AM +0000, Tian, Kevin wrote:

> /*
>  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>  * which allows direct access to non-MSIX registers which happened to be within
>  * the same system page.
>  *
>  * Even though the userspace gets direct access to the MSIX data, the existing
>  * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX configuration.
     ^^^^^^^^^^^^^^^^^^^^^

It is this I think we don't want; there should be no hypervisor
involvement in installing the addr/data pair.

Guessing by what I saw in PPC, there is a hypercall to install an MSI
and the guest never touches the page even though it is
mapped. Presumably there is some IR-like thing securing this.

PPC has the same basic problem; I think they have a hypercall to
install an MSI, not a hypercall to provision a generic addr/data pair.

Jason


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-10  7:29                                                 ` Tian, Kevin
  2021-12-10 12:13                                                   ` Thomas Gleixner
@ 2021-12-10 12:39                                                   ` Jason Gunthorpe
  2021-12-10 19:00                                                     ` Thomas Gleixner
  2021-12-11  7:52                                                     ` Tian, Kevin
  1 sibling, 2 replies; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-10 12:39 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Thomas Gleixner, Jiang, Dave, Logan Gunthorpe, LKML,
	Bjorn Helgaas, Marc Zygnier, Alex Williamson, Dey, Megha, Raj,
	Ashok, linux-pci, Greg Kroah-Hartman, Jon Mason, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

On Fri, Dec 10, 2021 at 07:29:01AM +0000, Tian, Kevin wrote:
> >   5) It's not possible for the kernel to reliably detect whether it is
> >      running on bare metal or not. Yes we talked about heuristics, but
> >      that's something I really want to avoid.
> 
> How would the hypercall mechanism avoid such heuristics?

It is clever: we don't have a vIOMMU that supplies vIR today, so by
definition all guests are excluded and only bare metal works.

> > The charm is that this works for everything from INTx to IMS because all
> > of them go through the same procedure, except that INTx (IO/APIC) does
> > not support the reservation mode dance.

Do we even have vIOAPIC?

> > Thoughts?

It seems reasonable - do you have any idea how this all would work on
ARM too? IMS on baremetal ARM is surely interesting. I assume they
have a similar issue with trapping the MSI

> Then Qemu needs to find out the GSI number for the vIRTE handle. 
> Again Qemu doesn't have such information since it doesn't know 
> which MSI[-X] entry points to this handle due to no trap.

No this is already going wrong. qemu *cannot* know the MSI information
because there is no MSI information for IMS.

All qemu should get is the origin device information and data about
how the guest wants the interrupt setup.

Forget about guests and all of this complexity, design how to make
VFIO work with IMS in pure userspace like DPDK.

We must have a VFIO ioctl to acquire an addr/data pair and link it to
an eventfd.
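To illustrate what such an ioctl could look like, here is a hypothetical argument layout: userspace passes in an eventfd, and the kernel returns the addr/data pair it provisioned. This ioctl does not exist in VFIO today; the struct and all field names are invented for the sake of the discussion.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical VFIO ioctl argument: acquire an addr/data pair and
 * link it to an eventfd. Not a real VFIO interface. */
struct vfio_ims_vector_alloc {
    uint32_t argsz;     /* in: sizeof(struct), VFIO convention */
    uint32_t flags;
    int32_t  eventfd;   /* in: eventfd to signal on interrupt */
    uint32_t index;     /* out: software index of the vector */
    uint64_t msi_addr;  /* out: ADDRESS the device should write to */
    uint32_t msi_data;  /* out: DATA the device should write */
};

/* Fill the request side the way VFIO ioctls conventionally expect. */
static void ims_alloc_init(struct vfio_ims_vector_alloc *a, int32_t efd)
{
    memset(a, 0, sizeof(*a));
    a->argsz   = (uint32_t)sizeof(*a);
    a->eventfd = efd;
}
```

The open question in the thread remains untouched by the sketch: what the kernel-side irq_chip does when it cannot touch any real hardware.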

I'm not sure exactly how this should be done; it is 90% of what IMS
is, except the VFIO irq_chip cannot touch any real HW and certainly
cannot do mask/unmask.

Maybe that is OK now that it requires IR?

Jason


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-10 12:39                                                   ` Jason Gunthorpe
@ 2021-12-10 19:00                                                     ` Thomas Gleixner
  2021-12-11  7:44                                                       ` Tian, Kevin
  2021-12-11  7:52                                                     ` Tian, Kevin
  1 sibling, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-10 19:00 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

Jason,

On Fri, Dec 10 2021 at 08:39, Jason Gunthorpe wrote:

> On Fri, Dec 10, 2021 at 07:29:01AM +0000, Tian, Kevin wrote:
>> >   5) It's not possible for the kernel to reliably detect whether it is
>> >      running on bare metal or not. Yes we talked about heuristics, but
>> >      that's something I really want to avoid.
>> 
>> How would the hypercall mechanism avoid such heuristics?
>
> It is clever, we don't have an vIOMMU that supplies vIR today, so by
> definition all guests are excluded and only bare metal works.

Dammit. Now you spilled the beans. :)

>> > The charm is that his works for everything from INTx to IMS because all
>> > of them go through the same procedure, except that INTx (IO/APIC) does
>> > not support the reservation mode dance.
>
> Do we even have vIOAPIC?

It does not matter much. INTx via IOAPIC is different anyway because
INTx is shared so it's unclear from which device it originates.

> It seems reasonable - do you have any idea how this all would work on
> ARM too? IMS on baremetal ARM is surely interesting. I assume they
> have a similar issue with trapping the MSI

On baremetal it should just work once ARM is converted over. No idea
about guests. Marc should know.

>> Then Qemu needs to find out the GSI number for the vIRTE handle. 
>> Again Qemu doesn't have such information since it doesn't know 
>> which MSI[-X] entry points to this handle due to no trap.
>
> No this is already going wrong. qemu *cannot* know the MSI information
> because there is no MSI information for IMS.
>
> All qemu should get is the origin device information and data about
> how the guest wants the interrupt setup.
>
> Forget about guests and all of this complexity, design how to make
> VFIO work with IMS in pure userspace like DPDK.
>
> We must have a VFIO ioctl to acquire a addr/data pair and link it to
> an event fd.
>
> I'm not sure exactly how this should be done, it is 90% of what IMS
> is, except the VFIO irq_chip cannot touch any real HW and certainly
> cannot do mask/unmask..
>
> Maybe that is OK now that it requires IR?

IR unfortunately does not allow masking, but we surely can come up with
some emergency button to press when e.g. an interrupt storm is detected.

Thanks,

        tglx


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-10 19:00                                                     ` Thomas Gleixner
@ 2021-12-11  7:44                                                       ` Tian, Kevin
  2021-12-11 13:04                                                         ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Tian, Kevin @ 2021-12-11  7:44 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe
  Cc: Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Saturday, December 11, 2021 3:00 AM
> 
> Jason,
> 
> On Fri, Dec 10 2021 at 08:39, Jason Gunthorpe wrote:
> 
> > On Fri, Dec 10, 2021 at 07:29:01AM +0000, Tian, Kevin wrote:
> >> >   5) It's not possible for the kernel to reliably detect whether it is
> >> >      running on bare metal or not. Yes we talked about heuristics, but
> >> >      that's something I really want to avoid.
> >>
> >> How would the hypercall mechanism avoid such heuristics?
> >
> > It is clever, we don't have an vIOMMU that supplies vIR today, so by
> > definition all guests are excluded and only bare metal works.
> 
> Dammit. Now you spilled the beans. :)
> 

Unfortunately we do have that today. Qemu supports IR for
both AMD and Intel vIOMMU.

Thanks
Kevin


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-10 12:39                                                   ` Jason Gunthorpe
  2021-12-10 19:00                                                     ` Thomas Gleixner
@ 2021-12-11  7:52                                                     ` Tian, Kevin
  2021-12-12  0:12                                                       ` Thomas Gleixner
  1 sibling, 1 reply; 141+ messages in thread
From: Tian, Kevin @ 2021-12-11  7:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Gleixner, Jiang, Dave, Logan Gunthorpe, LKML,
	Bjorn Helgaas, Marc Zygnier, Alex Williamson, Dey, Megha, Raj,
	Ashok, linux-pci, Greg Kroah-Hartman, Jon Mason, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, December 10, 2021 8:40 PM
> 
> 
> > Then Qemu needs to find out the GSI number for the vIRTE handle.
> > Again Qemu doesn't have such information since it doesn't know
> > which MSI[-X] entry points to this handle due to no trap.
> 
> No this is already going wrong. qemu *cannot* know the MSI information
> because there is no MSI information for IMS.

I haven't thought of IMS at this step. The IR approach applies to
all types of interrupt storages, thus I'm more interested in how it
affects the storages which are already virtualized today (MSI[-X]
in my thought exercise).

> 
> All qemu should get is the origin device information and data about
> how the guest wants the interrupt setup.

Yes, this is what I concluded in my previous mail. The hypercall should
provide that information to the host and then get the addr/data pair
back from the host.

> 
> Forget about guests and all of this complexity, design how to make
> VFIO work with IMS in pure userspace like DPDK.
> 
> We must have a VFIO ioctl to acquire a addr/data pair and link it to
> an event fd.

No question on it. Still need more thought down this road.

Thanks
Kevin


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-10 12:13                                                   ` Thomas Gleixner
@ 2021-12-11  8:06                                                     ` Tian, Kevin
  0 siblings, 0 replies; 141+ messages in thread
From: Tian, Kevin @ 2021-12-11  8:06 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe
  Cc: Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Friday, December 10, 2021 8:13 PM
> 
> >>   5) It's not possible for the kernel to reliably detect whether it is
> >>      running on bare metal or not. Yes we talked about heuristics, but
> >>      that's something I really want to avoid.
> >
> > How would the hypercall mechanism avoid such heuristics?
> 
> The availability of IR remapping where the irqdomain which is provided
> by the remapping unit signals that it supports this new scheme:
> 
>                     |--IO/APIC
>                     |--MSI
>      vector -- IR --|--MSI-X
>                     |--IMS
> 
> while the current scheme is:
> 
>                     |--IO/APIC
>      vector -- IR --|--PCI/MSI[-X]
> 
> or
> 
>                     |--IO/APIC
>      vector --------|--PCI/MSI[-X]
> 
> So in the new scheme the IR domain will advertise new features which are
> not available on older kernels. The availability of these new features
> is the indicator for the interrupt subsystem and subsequently for PCI
> whether IMS is supported or not.
> 
> Bootup either finds an IR unit or not. In the bare metal case that's the
> usual hardware/firmware detection. In the guest case it's the
> availability of vIR including the required hypercall protocol.

Given we have vIR already, there are three scenarios:

1) Bare metal: IR (no hypercall, for sure)
2) VM: vIR (no hypercall, today)
3) VM: vIR (hypercall, tomorrow)

IMS should be allowed only for 1) and 3).

But how to differentiate 2) from 1) if no guest heuristics?

btw I checked Qemu history to find vIR was introduced in 2016:

commit 1121e0afdcfa0cd40e36bd3acff56a3fac4f70fd
Author: Peter Xu <peterx@redhat.com>
Date:   Thu Jul 14 13:56:13 2016 +0800

    x86-iommu: introduce "intremap" property

    Adding one property for intel-iommu devices to specify whether we should
    support interrupt remapping. By default, IR is disabled. To enable it,
    we should use (take Intel IOMMU as example):

      -device intel_iommu,intremap=on

    This property can be shared by Intel and future AMD IOMMUs.

    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

> 
> > Then Qemu needs to find out the GSI number for the vIRTE handle.
> > Again Qemu doesn't have such information since it doesn't know
> > which MSI[-X] entry points to this handle due to no trap.
> >
> > This implies that we may also need carry device ID, #msi entry, etc.
> > in the hypercall, so Qemu can associate the virtual routing info
> > to the right [irqfd, gsi].
> >
> > In your model the hypercall is raised by IR domain. Do you see
> > any problem of finding those information within IR domain?
> 
> IR has the following information available:
> 
>    Interrupt type
>     - MSI:   Device, index and number of vectors
>     - MSI-X: Device, index
>     - IMS:   Device, index
> 
>   Target APIC/vector pair
> 
> IMS: The index depends on the storage type:
> 
>      For storage in device memory, e.g. IDXD, it's the array index.
> 
>      For storage in system memory, the index is a software artifact.
> 
> Does that answer your question?
> 

Yes.

Thanks
Kevin


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-11  7:44                                                       ` Tian, Kevin
@ 2021-12-11 13:04                                                         ` Thomas Gleixner
  2021-12-12  1:56                                                           ` Tian, Kevin
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-11 13:04 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

Kevin,

On Sat, Dec 11 2021 at 07:44, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> On Fri, Dec 10 2021 at 08:39, Jason Gunthorpe wrote:
>> > It is clever, we don't have an vIOMMU that supplies vIR today, so by
>> > definition all guests are excluded and only bare metal works.
>> 
>> Dammit. Now you spilled the beans. :)
>
> Unfortunately we do have that today. Qemu supports IR for
> both AMD and Intel vIOMMU.

can you point me to the code?

All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
vIR related there.

Thanks,

        tglx


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-11  7:52                                                     ` Tian, Kevin
@ 2021-12-12  0:12                                                       ` Thomas Gleixner
  2021-12-12  2:14                                                         ` Tian, Kevin
  2021-12-12 23:42                                                         ` Jason Gunthorpe
  0 siblings, 2 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-12  0:12 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

Kevin,

On Sat, Dec 11 2021 at 07:52, Kevin Tian wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> > Then Qemu needs to find out the GSI number for the vIRTE handle.
>> > Again Qemu doesn't have such information since it doesn't know
>> > which MSI[-X] entry points to this handle due to no trap.
>> 
>> No this is already going wrong. qemu *cannot* know the MSI information
>> because there is no MSI information for IMS.
>
> I haven't thought of IMS at this step. The IR approach applies to
> all types of interrupt storages, thus I'm more interested in how it
> affect the storages which are already virtualized today (MSI[-X] 
> in my thought practice).

They are not any different. As I explained several times now, IMS is
nothing new at all. It has existed since the invention of Message
Signaled Interrupts. Why?

The principle behind Message Signaled Interrupts is:

    Device writes DATA to ADDRESS which raises an interrupt in a CPU

Message Signaled Interrupts obviously need some place to store the
ADDRESS/DATA pair so that the device can use it for raising an
interrupt, i.e. an

   Interrupt Message Store, short IMS.

PCI/MSI was the first implementation of this and the storage was defined
to be at a specified and therefore uniform and device independent place.

PCI/MSI-X followed the same approach. While it solved quite some of the
shortcomings of PCI/MSI, it still has a specified, uniform and device
independent place to store the message (ADDRESS/DATA pair).

Now the PCI wizards figured out that PCI/MSI[-X] is no longer up to the
task for various reasons and came up with the revolutionary new concept
of IMS, aka Interrupt Message Store, where the device defines where the
message is stored.

IOW, this is coming back full circle to the original problem of where to
store the message, i.e. the ADDRESS/DATA pair so that the device can
raise an interrupt in a CPU, which requires - drum roll - an

   Interrupt Message Store, short IMS.

So you simply have to look at it from a pure MSI (not PCI/MSI) point
of view:

   MSI at the conceptual level requires storage for the ADDRESS/DATA
   pair at some place so that the device or the compute unit embedded in
   the device can write DATA to ADDRESS.

That's it. Not more, not less.

When you look at it from this perspective, then you'll realize that

     PCI/MSI and PCI/MSI-X are just implementations of IMS

Not more, not less. The fact that they have very strict rules about the
storage space and the fact that they are mutually exclusive does not
change that at all.
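For a concrete feel of what the stored "message" actually is: on x86 without interrupt remapping, the ADDRESS selects the target APIC and the DATA carries the vector, per the architectural MSI message format. A simplified sketch assuming fixed delivery mode and physical destination (other fields omitted):

```c
#include <stdint.h>

/* Compose the ADDRESS a device writes to raise an MSI on x86:
 * bits 31-20 are the fixed 0xFEE prefix, bits 19-12 carry the
 * destination APIC ID. */
static uint32_t msi_compose_addr(uint8_t dest_apic_id)
{
    return 0xFEE00000u | ((uint32_t)dest_apic_id << 12);
}

/* Compose the DATA: the low byte is the vector; bits 10-8 are the
 * delivery mode, 0 meaning "fixed". */
static uint32_t msi_compose_data(uint8_t vector)
{
    return vector;
}
```

Whether this pair ends up in the PCI/MSI capability, an MSI-X table slot, or device-defined storage is exactly the implementation detail the paragraph above abstracts away.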

That's where a lot of the confusion comes from. If you go back to all
the IDXD/IMS discussions which happened over time, then you'll figure out
that _all_ of us were coming from the same wrong assumption:

    IMS is new and it's just another exclusive variant of PCI/MSI and
    PCI/MSI-X.

It took _all_ of us quite some time to realize that we need to look at
it from the other way around.

There was surely some other conceptual confusion vs. subdevices, queues
and whatever involved which contributed to that. Water under the bridge.

Coming back to your initial question:

> I haven't thought of IMS at this step. The IR approach applies to
> all types of interrupt storages, thus I'm more interested in how it
> affect the storages which are already virtualized today (MSI[-X] 
> in my thought practice).

Stop focussing on implementation details. Focus on the general concept
instead. See above.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 141+ messages in thread

* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-11 13:04                                                         ` Thomas Gleixner
@ 2021-12-12  1:56                                                           ` Tian, Kevin
  2021-12-12 20:55                                                             ` Thomas Gleixner
  0 siblings, 1 reply; 141+ messages in thread
From: Tian, Kevin @ 2021-12-12  1:56 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe
  Cc: Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86, Rodel,
	Jorg, iommu

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Saturday, December 11, 2021 9:05 PM
> 
> Kevin,
> 
> On Sat, Dec 11 2021 at 07:44, Kevin Tian wrote:
> >> From: Thomas Gleixner <tglx@linutronix.de>
> >> On Fri, Dec 10 2021 at 08:39, Jason Gunthorpe wrote:
> >> > It is clever, we don't have a vIOMMU that supplies vIR today, so by
> >> > definition all guests are excluded and only bare metal works.
> >>
> >> Dammit. Now you spilled the beans. :)
> >
> > Unfortunately we do have that today. Qemu supports IR for
> > both AMD and Intel vIOMMU.
> 
> can you point me to the code?
> 
> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
> vIR related there.
> 

Well, virtio-iommu is a para-virtualized vIOMMU implementation.

In reality there are also fully emulated vIOMMU implementations (e.g.
Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
the IR logic in the existing iommu drivers just applies:

	drivers/iommu/intel/irq_remapping.c
	drivers/iommu/amd/iommu.c
	...

As I replied in another mail, the 1st vIR implementation was introduced
in Qemu back in 2016:

commit 1121e0afdcfa0cd40e36bd3acff56a3fac4f70fd
Author: Peter Xu <peterx@redhat.com>
Date:   Thu Jul 14 13:56:13 2016 +0800

    x86-iommu: introduce "intremap" property

    Adding one property for intel-iommu devices to specify whether we should
    support interrupt remapping. By default, IR is disabled. To enable it,
    we should use (take Intel IOMMU as example):

      -device intel_iommu,intremap=on

    This property can be shared by Intel and future AMD IOMMUs.

    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Thanks
Kevin


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-12  0:12                                                       ` Thomas Gleixner
@ 2021-12-12  2:14                                                         ` Tian, Kevin
  2021-12-12 20:50                                                           ` Thomas Gleixner
  2021-12-12 23:42                                                         ` Jason Gunthorpe
  1 sibling, 1 reply; 141+ messages in thread
From: Tian, Kevin @ 2021-12-12  2:14 UTC (permalink / raw)
  To: Thomas Gleixner, Jason Gunthorpe
  Cc: Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86, Rodel,
	Jorg, iommu

Hi, Thomas,

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Sunday, December 12, 2021 8:12 AM
> 
> Kevin,
> 
> On Sat, Dec 11 2021 at 07:52, Kevin Tian wrote:
> >> From: Jason Gunthorpe <jgg@nvidia.com>
> >> > Then Qemu needs to find out the GSI number for the vIRTE handle.
> >> > Again Qemu doesn't have such information since it doesn't know
> >> > which MSI[-X] entry points to this handle due to no trap.
> >>
> >> No this is already going wrong. qemu *cannot* know the MSI information
> >> because there is no MSI information for IMS.
> >
> > I haven't thought of IMS at this step. The IR approach applies to
> > all types of interrupt storages, thus I'm more interested in how it
> > affect the storages which are already virtualized today (MSI[-X]
> > in my thought practice).
> 
> They are not any different. As I explained several times now IMS is
> nothing new at all. It existed since the invention of Message Signaled
> interrupts. Why?
> 
> The principle behind Message Signaled Interrupts is:
> 
>     Device writes DATA to ADDRESS which raises an interrupt in a CPU
> 
> Message Signaled Interrupts obviously need some place to store the
> ADDRESS/DATA pair so that the device can use it for raising an
> interrupt, i.e. an
> 
>    Interrupt Message Store, short IMS.
> 
> PCI/MSI was the first implementation of this and the storage was defined
> to be at a specified and therefore uniform and device independent place.
> 
> PCI/MSI-X followed the same approach. While it solved quite some of the
> shortcomings of PCI/MSI, it still has a specified and therefore uniform
> and device independent place to store the message (ADDRESS/DATA pair).
> 
> Now the PCI wizards figured out that PCI/MSI[-X] is no longer up to the
> task for various reasons and came up with the revolutionary new concept
> of IMS, aka Interrupt Message Store, where the device defines where the
> message is stored.
> 
> IOW, this is coming back full circle to the original problem of where to
> store the message, i.e. the ADDRESS/DATA pair so that the device can
> raise an interrupt in a CPU, which requires - drum roll - an
> 
>    Interrupt Message Store, short IMS.
> 
> So you simply have to look at it from a pure MSI (not PCI/MSI) point
> of view:
> 
>    MSI at the conceptual level requires storage for the ADDRESS/DATA
>    pair at some place so that the device or the compute unit embedded in
>    the device can write DATA to ADDRESS.
> 
> That's it. Not more, not less.
> 
> When you look at it from this perspective, then you'll realize that
> 
>      PCI/MSI and PCI/MSI-X are just implementations of IMS
> 
> Not more, not less. The fact that they have very strict rules about the
> storage space and the fact that they are mutually exclusive does not
> change that at all.
> 
> That's where a lot of the confusion comes from. If you go back to all
> the IDXD/IMS discussions which happened over time then you'll figure out
> that _all_ of us were coming from the same wrong assumption:
> 
>     IMS is new and it's just another exclusive variant of PCI/MSI and
>     PCI/MSI-X.
> 
> It took _all_ of us quite some time to realize that we need to look at
> it from the other way around.
> 
> There was surely some other conceptual confusion vs. subdevices, queues
> and whatever involved which contributed to that. Water under the bridge.
> 
> Coming back to your initial question:
> 
> > I haven't thought of IMS at this step. The IR approach applies to
> > all types of interrupt storages, thus I'm more interested in how it
> > affect the storages which are already virtualized today (MSI[-X]
> > in my thought practice).
> 
> Stop focussing on implementation details. Focus on the general concept
> instead. See above.
> 

I have no problem understanding the above relation. This has
been acked multiple times in my previous replies.

I just continue the thought exercise along that direction to see what
the host flow will be like (step by step). Looking at the current
implementation is just one necessary step in that exercise to help
refine the picture. When I find something which may be worth aligning,
I share it to avoid following a wrong direction too far.

If both of you think it simply adds noise to this discussion, I can
surely hold back and focus on the 'concept' only.

Thanks
Kevin


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-10  7:36                                             ` Tian, Kevin
  2021-12-10 12:30                                               ` Jason Gunthorpe
@ 2021-12-12  6:44                                               ` Mika Penttilä
  2021-12-12 23:27                                                 ` Jason Gunthorpe
  1 sibling, 1 reply; 141+ messages in thread
From: Mika Penttilä @ 2021-12-12  6:44 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe, Thomas Gleixner
  Cc: Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu



On 10.12.2021 9.36, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Friday, December 10, 2021 4:59 AM
>>
>> On Thu, Dec 09, 2021 at 09:32:42PM +0100, Thomas Gleixner wrote:
>>> On Thu, Dec 09 2021 at 12:21, Jason Gunthorpe wrote:
>>>> On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
>>>> If we keep the MSI emulation in the hypervisor then MSI != IMS.  The
>>>> MSI code needs to write a addr/data pair compatible with the emulation
>>>> and the IMS code needs to write an addr/data pair from the
>>>> hypercall. Seems like this scenario is best avoided!
>>>>
>>>>  From this perspective I haven't connected how virtual interrupt
>>>> remapping helps in the guest? Is this a way to provide the hypercall
>>>> I'm imagining above?
>>> That was my thought to avoid having different mechanisms.
>>>
>>> The address/data pair is computed in two places:
>>>
>>>    1) Activation of an interrupt
>>>    2) Affinity setting on an interrupt
>>>
>>> Both configure the IRTE when interrupt remapping is in place.
>>>
>>> In both cases a vector is allocated in the vector domain and based on
>>> the resulting target APIC / vector number pair the IRTE is
>>> (re)configured.
>>>
>>> So putting the hypercall into the vIRTE update is the obvious
>>> place. Both activation and affinity setting can fail and propagate an
>>> error code down to the originating caller.
>>>
>>> Hmm?
>> Okay, I think I get it. Would be nice to have someone from intel
>> familiar with the vIOMMU protocols and qemu code remark what the
>> hypervisor side can look like.
>>
>> There is a bit more work here, we'd have to change VFIO to somehow
>> entirely disconnect the kernel IRQ logic from the MSI table and
>> directly pass control of it to the guest after the hypervisor IOMMU IR
>> secures it. ie directly mmap the msi-x table into the guest
>>
> It's supported already:
>
> /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
>   * the same system page.
>   *
>   * Even though the userspace gets direct access to the MSIX data, the existing
>   * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX configuration.
>   */
> #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE      3
>
> IIRC this was introduced for PPC when a device has MSI-X in the same BAR as
> other MMIO registers. Trapping MSI-X leads to performance downgrade on
> accesses to adjacent registers. MSI-X can be mapped by userspace because
> PPC already uses a hypercall mechanism for interrupt. Though unclear about
> the detail it sounds a similar usage as proposed here.
>
> Thanks
> Kevin

I see VFIO_REGION_INFO_CAP_MSIX_MAPPABLE is always set, so if the MSI-X
table is in its own BAR, qemu never traps/emulates the access. On the
other hand, qemu is said to depend on emulating masking. So how is this
supposed to work in case the table is not in the config BAR?

Thanks,
Mika




* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-12  2:14                                                         ` Tian, Kevin
@ 2021-12-12 20:50                                                           ` Thomas Gleixner
  0 siblings, 0 replies; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-12 20:50 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86, Rodel,
	Jorg, iommu

Kevin,

On Sun, Dec 12 2021 at 02:14, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
> I just continue the thought practice along that direction to see what
> the host flow will be like (step by step). Looking at the current 
> implementation is just one necessary step in my thought practice to 
> help refine the picture. When I found something which may be 
> worth being aligned then I shared it to avoid following a wrong direction
> too far.
>
> If both of you think it simply adds noise to this discussion, I can
> surely hold back and focus on 'concept' only.

All good. We _want_ your participation for sure. Comparing and
contrasting it to the existing flow is fine.

Thanks,

        tglx


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-12  1:56                                                           ` Tian, Kevin
@ 2021-12-12 20:55                                                             ` Thomas Gleixner
  2021-12-12 23:37                                                               ` Jason Gunthorpe
  0 siblings, 1 reply; 141+ messages in thread
From: Thomas Gleixner @ 2021-12-12 20:55 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86, Rodel,
	Jorg, iommu

Kevin,

On Sun, Dec 12 2021 at 01:56, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
>> vIR related there.
>
> Well, virtio-iommu is a para-virtualized vIOMMU implementation.
>
> In reality there are also fully emulated vIOMMU implementations (e.g.
> Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
> the IR logic in the existing iommu drivers just applies:
>
> 	drivers/iommu/intel/irq_remapping.c
> 	drivers/iommu/amd/iommu.c

thanks for the explanation. So that's a full IOMMU emulation. I was more
expecting a paravirtualized lightweight one.

Thanks,

        tglx


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-12  6:44                                               ` Mika Penttilä
@ 2021-12-12 23:27                                                 ` Jason Gunthorpe
  0 siblings, 0 replies; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-12 23:27 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Tian, Kevin, Thomas Gleixner, Jiang, Dave, Logan Gunthorpe, LKML,
	Bjorn Helgaas, Marc Zygnier, Alex Williamson, Dey, Megha, Raj,
	Ashok, linux-pci, Greg Kroah-Hartman, Jon Mason, Allen Hubbe,
	linux-ntb, linux-s390, Heiko Carstens, Christian Borntraeger,
	x86, Joerg Roedel, iommu

On Sun, Dec 12, 2021 at 08:44:46AM +0200, Mika Penttilä wrote:

> > /*
> >   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> >   * which allows direct access to non-MSIX registers which happened to be within
> >   * the same system page.
> >   *
> >   * Even though the userspace gets direct access to the MSIX data, the existing
> >   * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX configuration.
> >   */
> > #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE      3
> > 
> > IIRC this was introduced for PPC when a device has MSI-X in the same BAR as
> > other MMIO registers. Trapping MSI-X leads to performance downgrade on
> > accesses to adjacent registers. MSI-X can be mapped by userspace because
> > PPC already uses a hypercall mechanism for interrupt. Though unclear about
> > the detail it sounds a similar usage as proposed here.
> > 
> > Thanks
> > Kevin
>
> I see  VFIO_REGION_INFO_CAP_MSIX_MAPPABLE is always set so if msix table is
> in its own bar, qemu never traps/emulates the access. 

It is some backwards compat; the kernel always sets it to indicate a
new kernel, which doesn't mean qemu doesn't trap.

As the comment says, "VFIO_DEVICE_SET_IRQS interface must still be
used for MSIX configuration", so there is no way qemu can meet that
without either trapping the MSI page or using a special hypercall
(ppc).

Jason


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-12 20:55                                                             ` Thomas Gleixner
@ 2021-12-12 23:37                                                               ` Jason Gunthorpe
  2021-12-13  7:50                                                                 ` Tian, Kevin
  0 siblings, 1 reply; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-12 23:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tian, Kevin, Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas,
	Marc Zygnier, Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86, Rodel,
	Jorg, iommu

On Sun, Dec 12, 2021 at 09:55:32PM +0100, Thomas Gleixner wrote:
> Kevin,
> 
> On Sun, Dec 12 2021 at 01:56, Kevin Tian wrote:
> >> From: Thomas Gleixner <tglx@linutronix.de>
> >> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
> >> vIR related there.
> >
> > Well, virtio-iommu is a para-virtualized vIOMMU implementation.
> >
> > In reality there are also fully emulated vIOMMU implementations (e.g.
> > Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
> > the IR logic in the existing iommu drivers just applies:
> >
> > 	drivers/iommu/intel/irq_remapping.c
> > 	drivers/iommu/amd/iommu.c
> 
> thanks for the explanation. So that's a full IOMMU emulation. I was more
> expecting a paravirtualized lightweight one.

Kevin can you explain what on earth vIR is for and how does it work??

Obviously we don't expose the IR machinery to userspace, so at best
this is somehow changing what the MSI trap does?

Jason


* Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-12  0:12                                                       ` Thomas Gleixner
  2021-12-12  2:14                                                         ` Tian, Kevin
@ 2021-12-12 23:42                                                         ` Jason Gunthorpe
  1 sibling, 0 replies; 141+ messages in thread
From: Jason Gunthorpe @ 2021-12-12 23:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tian, Kevin, Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas,
	Marc Zygnier, Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86,
	Joerg Roedel, iommu

On Sun, Dec 12, 2021 at 01:12:05AM +0100, Thomas Gleixner wrote:

>      PCI/MSI and PCI/MSI-X are just implementations of IMS
> 
> Not more, not less. The fact that they have very strict rules about the
> storage space and the fact that they are mutually exclusive does not
> change that at all.

And the mess we have is that virtualization broke this
design. Virtualization made MSI/MSI-X special!

I am wondering if we just need to bite the bullet and force the
introduction of a new ACPI flag for the APIC that says one of:
- message addr/data pairs work correctly (baremetal)
- creating message addr/data pairs need to use a hypercall protocol
- property not defined so assume only MSI/MSI-X/etc work.

Intel was originally trying to do this with the 'IMS enabled' PCI
Capability block, but a per PCI device capability is in the wrong
layer.

Jason


* RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
  2021-12-12 23:37                                                               ` Jason Gunthorpe
@ 2021-12-13  7:50                                                                 ` Tian, Kevin
  0 siblings, 0 replies; 141+ messages in thread
From: Tian, Kevin @ 2021-12-13  7:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Thomas Gleixner
  Cc: Jiang, Dave, Logan Gunthorpe, LKML, Bjorn Helgaas, Marc Zygnier,
	Alex Williamson, Dey, Megha, Raj, Ashok, linux-pci,
	Greg Kroah-Hartman, Jon Mason, Allen Hubbe, linux-ntb,
	linux-s390, Heiko Carstens, Christian Borntraeger, x86, Rodel,
	Jorg, iommu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, December 13, 2021 7:37 AM
> 
> On Sun, Dec 12, 2021 at 09:55:32PM +0100, Thomas Gleixner wrote:
> > Kevin,
> >
> > On Sun, Dec 12 2021 at 01:56, Kevin Tian wrote:
> > >> From: Thomas Gleixner <tglx@linutronix.de>
> > >> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
> > >> vIR related there.
> > >
> > > Well, virtio-iommu is a para-virtualized vIOMMU implementation.
> > >
> > > In reality there are also fully emulated vIOMMU implementations (e.g.
> > > Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
> > > the IR logic in the existing iommu drivers just applies:
> > >
> > > 	drivers/iommu/intel/irq_remapping.c
> > > 	drivers/iommu/amd/iommu.c
> >
> > thanks for the explanation. So that's a full IOMMU emulation. I was more
> > expecting a paravirtualized lightweight one.
> 
> Kevin can you explain what on earth vIR is for and how does it work??
> 
> Obviously we don't expose the IR machinery to userspace, so at best
> this is somehow changing what the MSI trap does?
> 

Initially it was introduced to support more than 255 vCPUs. Due to
full emulation this capability can certainly support other vIR usages
as observed on bare metal.

vIR doesn't rely on the physical IR presence.

First, if the guest doesn't have a vfio device then the physical
capability doesn't matter.

Even with a vfio device, IR by definition is just about remapping
instead of injection (more on this part later). The interrupts are
always routed to the host handler first (vfio_msihandler() in this
case), which then triggers irqfd to call the virtual interrupt
injection handler (irqfd_wakeup()) in kvm.

This suggests a clear role split between vfio and kvm:

  - vfio is responsible for irq allocation/startup as it is the device driver;
  - kvm takes care of virtual interrupt injection, being a VMM;

The two are connected via irqfd.

Following this split, the vIR information is completely hidden in
userspace. Qemu calculates the routing information between vGSI and
vCPU (with or without vIR, and for whatever trapped interrupt storage)
and then registers it with kvm.

When kvm receives a notification via irqfd, it follows irqfd->vGSI->vCPU
and injects a virtual interrupt into the target vCPU.

Then comes an interesting scenario about IOMMU posted interrupt (PI).
This capability allows the IR engine to directly convert a physical
interrupt into a virtual one and inject it into the guest, kind of
offloading the virtual routing information into the hardware.

This is currently achieved via the IRQ bypass manager, which helps
connect vfio (IRQ producer) to kvm (IRQ consumer) around a specific
Linux irq number. Once the connection is established, kvm calls
irq_set_vcpu_affinity() to update the IRTE with virtual routing
information for that irq number.

With that design Qemu doesn't know whether IR or PI is enabled
physically. It always talks to vfio to have IRQ resources allocated
and to kvm to register the virtual routing information.

Then adding the new hypercall machinery into this picture:

  1) The hypercall needs to carry all necessary virtual routing 
     information due to no-trap;

  2) Before querying the IRTE addr/data pair, Qemu needs to complete
     the necessary operations as of today to have the IRTE ready:

	a) Register irqfd and related GSI routing info with kvm;
	b) Allocate/start up IRQs via vfio;

     When PI is enabled, the IRTE is ready only after both are completed.

  3) Qemu gets the IRTE addr/data pair from the kernel and returns it to
     the guest.

Thanks
Kevin


end of thread, other threads:[~2021-12-13  7:50 UTC | newest]

Thread overview: 141+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-27  1:23 [patch 00/32] genirq/msi, PCI/MSI: Spring cleaning - Part 2 Thomas Gleixner
2021-11-27  1:23 ` [patch 01/32] genirq/msi: Move descriptor list to struct msi_device_data Thomas Gleixner
2021-11-27 12:19   ` Greg Kroah-Hartman
2021-11-27  1:23 ` [patch 02/32] genirq/msi: Add mutex for MSI list protection Thomas Gleixner
2021-11-27  1:23 ` [patch 03/32] genirq/msi: Provide msi_domain_alloc/free_irqs_descs_locked() Thomas Gleixner
2021-11-27  1:23 ` [patch 04/32] genirq/msi: Provide a set of advanced MSI accessors and iterators Thomas Gleixner
2021-11-28  1:00   ` Jason Gunthorpe
2021-11-28 19:22     ` Thomas Gleixner
2021-11-29  9:26       ` Thomas Gleixner
2021-11-29 14:01         ` Jason Gunthorpe
2021-11-29 14:46           ` Thomas Gleixner
2021-11-27  1:23 ` [patch 05/32] genirq/msi: Provide msi_alloc_msi_desc() and a simple allocator Thomas Gleixner
2021-11-27  1:23 ` [patch 06/32] genirq/msi: Provide domain flags to allocate/free MSI descriptors automatically Thomas Gleixner
2021-11-27  1:23 ` [patch 07/32] genirq/msi: Count the allocated MSI descriptors Thomas Gleixner
2021-11-27 12:19   ` Greg Kroah-Hartman
2021-11-27 19:22     ` Thomas Gleixner
2021-11-27 19:45       ` Thomas Gleixner
2021-11-28 11:07         ` Greg Kroah-Hartman
2021-11-28 19:23           ` Thomas Gleixner
2021-11-27  1:23 ` [patch 08/32] PCI/MSI: Protect MSI operations Thomas Gleixner
2021-11-27  1:23 ` [patch 09/32] PCI/MSI: Use msi_add_msi_desc() Thomas Gleixner
2021-11-27  1:23 ` [patch 10/32] PCI/MSI: Let core code free MSI descriptors Thomas Gleixner
2021-11-27  1:23 ` [patch 11/32] PCI/MSI: Use msi_on_each_desc() Thomas Gleixner
2021-11-27  1:23 ` [patch 12/32] x86/pci/xen: Use msi_for_each_desc() Thomas Gleixner
2021-11-27  1:23 ` [patch 13/32] xen/pcifront: Rework MSI handling Thomas Gleixner
2021-11-27  1:23 ` [patch 14/32] s390/pci: Rework MSI descriptor walk Thomas Gleixner
2021-11-29 10:31   ` Niklas Schnelle
2021-11-29 13:04     ` Thomas Gleixner
2021-11-27  1:23 ` [patch 15/32] powerpc/4xx/hsta: Rework MSI handling Thomas Gleixner
2021-11-27  1:23 ` [patch 16/32] powerpc/cell/axon_msi: Convert to msi_on_each_desc() Thomas Gleixner
2021-11-27  1:23 ` [patch 17/32] powerpc/pasemi/msi: Convert to msi_on_each_dec() Thomas Gleixner
2021-11-27  1:23 ` [patch 18/32] powerpc/fsl_msi: Use msi_for_each_desc() Thomas Gleixner
2021-11-27  1:23 ` [patch 19/32] powerpc/mpic_u3msi: Use msi_for_each-desc() Thomas Gleixner
2021-11-27  1:24 ` [patch 20/32] PCI: hv: Rework MSI handling Thomas Gleixner
2021-11-27  1:24 ` [patch 21/32] NTB/msi: Convert to msi_on_each_desc() Thomas Gleixner
2021-11-29 18:21   ` Logan Gunthorpe
2021-11-29 20:51     ` Thomas Gleixner
2021-11-29 22:27       ` Logan Gunthorpe
2021-11-29 22:50         ` Dave Jiang
2021-11-29 23:31         ` Jason Gunthorpe
2021-11-29 23:52           ` Logan Gunthorpe
2021-11-30  0:01             ` Jason Gunthorpe
2021-11-30  0:29         ` Thomas Gleixner
2021-11-30 19:21           ` Logan Gunthorpe
2021-11-30 19:48             ` Thomas Gleixner
2021-11-30 20:14               ` Logan Gunthorpe
2021-11-30 20:28               ` Jason Gunthorpe
2021-11-30 21:23                 ` Thomas Gleixner
2021-12-01  0:17                   ` Jason Gunthorpe
2021-12-01 10:16                     ` Thomas Gleixner
2021-12-01 13:00                       ` Jason Gunthorpe
2021-12-01 17:35                         ` Thomas Gleixner
2021-12-01 18:14                           ` Jason Gunthorpe
2021-12-01 18:46                             ` Logan Gunthorpe
2021-12-01 20:21                             ` Thomas Gleixner
2021-12-02  0:01                               ` Thomas Gleixner
2021-12-02 13:55                                 ` Jason Gunthorpe
2021-12-02 14:23                                   ` Greg Kroah-Hartman
2021-12-02 14:45                                     ` Jason Gunthorpe
2021-12-02 19:25                                   ` Thomas Gleixner
2021-12-02 20:00                                     ` Jason Gunthorpe
2021-12-02 22:31                                       ` Thomas Gleixner
2021-12-03  0:37                                         ` Jason Gunthorpe
2021-12-03 15:07                                           ` Thomas Gleixner
2021-12-03 16:41                                             ` Jason Gunthorpe
2021-12-04 14:20                                               ` Thomas Gleixner
2021-12-05 14:16                                                 ` Thomas Gleixner
2021-12-06 14:43                                                   ` Jason Gunthorpe
2021-12-06 15:47                                                     ` Thomas Gleixner
2021-12-06 17:00                                                       ` Jason Gunthorpe
2021-12-06 20:28                                                         ` Thomas Gleixner
2021-12-06 21:06                                                           ` Jason Gunthorpe
2021-12-06 22:21                                                             ` Thomas Gleixner
2021-12-06 14:19                                                 ` Jason Gunthorpe
2021-12-06 15:06                                                   ` Thomas Gleixner
2021-12-09  6:26                                               ` Tian, Kevin
2021-12-09  9:03                                                 ` Thomas Gleixner
2021-12-09 12:17                                                   ` Tian, Kevin
2021-12-09 15:57                                                     ` Thomas Gleixner
2021-12-10  7:37                                                       ` Tian, Kevin
2021-12-09  5:41                                   ` Tian, Kevin
2021-12-09  5:47                                     ` Jason Wang
2021-12-01 16:28                       ` Dave Jiang
2021-12-01 18:41                         ` Thomas Gleixner
2021-12-01 18:47                           ` Dave Jiang
2021-12-01 20:25                             ` Thomas Gleixner
2021-12-01 21:21                               ` Dave Jiang
2021-12-01 21:44                                 ` Thomas Gleixner
2021-12-01 21:49                                   ` Dave Jiang
2021-12-01 22:03                                     ` Thomas Gleixner
2021-12-01 22:53                                       ` Dave Jiang
2021-12-01 23:57                                         ` Thomas Gleixner
2021-12-09  5:23                                   ` Tian, Kevin
2021-12-09  8:37                                     ` Thomas Gleixner
2021-12-09 12:31                                       ` Tian, Kevin
2021-12-09 16:21                                       ` Jason Gunthorpe
2021-12-09 20:32                                         ` Thomas Gleixner
2021-12-09 20:58                                           ` Jason Gunthorpe
2021-12-09 22:09                                             ` Thomas Gleixner
2021-12-10  0:26                                               ` Thomas Gleixner
2021-12-10  7:29                                                 ` Tian, Kevin
2021-12-10 12:13                                                   ` Thomas Gleixner
2021-12-11  8:06                                                     ` Tian, Kevin
2021-12-10 12:39                                                   ` Jason Gunthorpe
2021-12-10 19:00                                                     ` Thomas Gleixner
2021-12-11  7:44                                                       ` Tian, Kevin
2021-12-11 13:04                                                         ` Thomas Gleixner
2021-12-12  1:56                                                           ` Tian, Kevin
2021-12-12 20:55                                                             ` Thomas Gleixner
2021-12-12 23:37                                                               ` Jason Gunthorpe
2021-12-13  7:50                                                                 ` Tian, Kevin
2021-12-11  7:52                                                     ` Tian, Kevin
2021-12-12  0:12                                                       ` Thomas Gleixner
2021-12-12  2:14                                                         ` Tian, Kevin
2021-12-12 20:50                                                           ` Thomas Gleixner
2021-12-12 23:42                                                         ` Jason Gunthorpe
2021-12-10  7:36                                             ` Tian, Kevin
2021-12-10 12:30                                               ` Jason Gunthorpe
2021-12-12  6:44                                               ` Mika Penttilä
2021-12-12 23:27                                                 ` Jason Gunthorpe
2021-12-01 14:52                   ` Thomas Gleixner
2021-12-01 15:11                     ` Jason Gunthorpe
2021-12-01 18:37                       ` Thomas Gleixner
2021-12-01 18:47                         ` Jason Gunthorpe
2021-12-01 20:26                           ` Thomas Gleixner
2021-11-27  1:24 ` [patch 22/32] soc: ti: ti_sci_inta_msi: Rework MSI descriptor allocation Thomas Gleixner
2021-11-27  1:24 ` [patch 23/32] soc: ti: ti_sci_inta_msi: Remove ti_sci_inta_msi_domain_free_irqs() Thomas Gleixner
2021-11-27  1:24 ` [patch 24/32] bus: fsl-mc-msi: Simplify MSI descriptor handling Thomas Gleixner
2021-11-27  1:24 ` [patch 25/32] platform-msi: Let core code handle MSI descriptors Thomas Gleixner
2021-11-27  1:24 ` [patch 26/32] platform-msi: Simplify platform device MSI code Thomas Gleixner
2021-11-27  1:24 ` [patch 27/32] genirq/msi: Make interrupt allocation less convoluted Thomas Gleixner
2021-11-27  1:24 ` [patch 28/32] genirq/msi: Convert to new functions Thomas Gleixner
2021-11-27  1:24 ` [patch 29/32] genirq/msi: Mop up old interfaces Thomas Gleixner
2021-11-27  1:24 ` [patch 30/32] genirq/msi: Add abuse prevention comment to msi header Thomas Gleixner
2021-11-27  1:24 ` [patch 31/32] genirq/msi: Simplify sysfs handling Thomas Gleixner
2021-11-27 12:32   ` Greg Kroah-Hartman
2021-11-27 19:31     ` Thomas Gleixner
2021-11-28 11:07       ` Greg Kroah-Hartman
2021-11-28 19:33         ` Thomas Gleixner
2021-11-27  1:24 ` [patch 32/32] genirq/msi: Convert storage to xarray Thomas Gleixner
2021-11-27 12:33   ` Greg Kroah-Hartman