* [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation
@ 2022-11-11 13:58 Thomas Gleixner
  2022-11-11 13:58 ` [patch 01/33] genirq/msi: Rearrange MSI domain flags Thomas Gleixner
                   ` (32 more replies)
  0 siblings, 33 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Hi!

This is the third part of a three part series which provides support for
per device MSI interrupt domains. This solves conceptual problems of the
current PCI/MSI design which are in the way of providing support for
PCI/MSI[-X] and the upcoming PCI/IMS mechanism on the same device.

Part 2 can be found here:

  https://lore.kernel.org/all/20221111131813.914374272@linutronix.de

IMS (Interrupt Message Store) is a new specification which allows device
manufacturers to provide implementation defined storage for MSI messages,
contrary to the uniform and specification defined storage mechanisms for
PCI/MSI and PCI/MSI-X. IMS not only allows overcoming the size limitations
of the MSI-X table, but also gives the device manufacturer the freedom to
store the message in arbitrary places, even in host memory which is shared
with the device.

There have been several attempts to glue this into the current MSI code,
but after lengthy discussions in various threads:

  https://lore.kernel.org/all/20211126230957.239391799@linutronix.de
  https://lore.kernel.org/all/MWHPR11MB188603D0D809C1079F5817DC8C099@MWHPR11MB1886.namprd11.prod.outlook.com
  https://lore.kernel.org/all/160408357912.912050.17005584526266191420.stgit@djiang5-desk3.ch.intel.com

it turned out that there is a fundamental design problem in the current
PCI/MSI-X implementation. This needs some historical background.

When PCI/MSI[-X] support was added around 2003, interrupt management was
completely different from what we have today in the actively developed
architectures. Interrupt management was completely architecture specific,
and while there were attempts to create common infrastructure, the
commonalities were rudimentary and provided little more than shared data
structures and interfaces so that drivers could be written in an
architecture agnostic way.

The initial PCI/MSI[-X] support obviously plugged into this model, which
resulted in some basic shared infrastructure in the PCI core code for
setting up MSI descriptors, which are a pure software construct for holding
data relevant for a particular MSI interrupt, but the actual association to
Linux interrupts was completely architecture specific. This model is still
supported today to keep museum architectures and notorious stragglers
alive.

In 2013 Intel tried to add support for hotpluggable IO/APICs to the kernel,
which created yet another architecture specific mechanism and resulted in
an unholy mess on top of the existing horrors of x86 interrupt handling.
The x86 interrupt management code was already an incomprehensible maze of
indirections between the CPU vector management, interrupt remapping and the
actual IO/APIC and PCI/MSI[-X] implementation.

At roughly the same time ARM struggled with the ever growing SoC specific
extensions which were glued on top of the architected GIC interrupt
controller.

This resulted in a fundamental redesign of interrupt management and
provided the now prevailing concept of hierarchical interrupt
domains. This made it possible to disentangle the interactions between the
x86 vector domain and interrupt remapping and also allowed ARM to handle
the zoo of SoC specific interrupt components in a sane way.

The concept of hierarchical interrupt domains aims to encapsulate the
functionality of particular IP blocks which are involved in interrupt
delivery so that they become extensible and pluggable. The X86
encapsulation looks like this:

                                         |--- device 1
     [Vector]---[Remapping]---[PCI/MSI]--|...
                                         |--- device N

where the remapping domain is an optional component; if it is not
available the PCI/MSI[-X] domains have the vector domain as their
parent. This reduced the required interaction between the domains pretty
much to the initialization phase, where it is obviously required to
establish the proper parent relationship in the components of the
hierarchy.

While in most cases the model strictly represents the chain of IP blocks
and abstracts them so they can be plugged together to form a hierarchy,
the design stopped short of that for PCI/MSI[-X]. Looking at the hardware
it's clear that the actual PCI/MSI[-X] interrupt controller is not a global
entity, but strictly a per PCI device entity.

Here we took a shortcut on the hierarchical model and went for the easy
solution of providing "global" PCI/MSI domains, which was possible because
the PCI/MSI[-X] handling is uniform across the devices. This also allowed
keeping the existing PCI/MSI[-X] infrastructure mostly unchanged, which in
turn made it simple to keep the existing architecture specific management
alive.
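
For illustration, a "global" PCI/MSI domain today is set up roughly like
this (condensed sketch, not literal kernel code; the info struct and the
chip are shared by all PCI devices below that parent):

    static struct msi_domain_info pci_msi_domain_info = {
            .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
                     MSI_FLAG_PCI_MSIX,
            .ops   = &pci_msi_domain_ops,
            .chip  = &pci_msi_controller,
    };

    /* @parent is the interrupt remapping domain or the x86 vector domain */
    d = pci_msi_create_irq_domain(fwnode, &pci_msi_domain_info, parent);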

A similar problem was created in the ARM world with support for IP block
specific message storage. Instead of going all the way to stack an IP block
specific domain on top of the generic MSI domain, this ended in a construct
which provides a "global" platform MSI domain which allows overriding the
irq_write_msi_msg() callback per allocation.

In the course of the lengthy discussions we identified other abuse of the
MSI infrastructure in wireless drivers, NTB etc., where support for
implementation specific message storage was just mindlessly glued into the
existing infrastructure. Some of this just works by chance on particular
platforms but will fail in hard to diagnose ways when the driver is used
on platforms where the underlying MSI interrupt management code does not
expect the creative abuse.

Another shortcoming of today's PCI/MSI-X support is the inability to
allocate or free individual vectors after the initial enablement of
MSI-X. This results in a works-by-chance implementation of VFIO (PCI
passthrough) where interrupts on the host side are not set up upfront to
avoid resource exhaustion. They are expanded at runtime when the guest
actually tries to use them. The way this is implemented is that the
host disables MSI-X and then enables it again with a larger number of
vectors. That works by chance because most device drivers set up all
interrupts before the device actually will utilize them. But that's not
universally true because some drivers allocate a large enough number of
vectors but do not utilize them until it's actually required,
e.g. acceleration support. At that point other interrupts of the device
might be in active use and the MSI-X disable/enable dance can result in
lost interrupts and therefore hard to diagnose subtle problems.

Last but not least, the "global" PCI/MSI-X domain approach prevents
utilizing PCI/MSI[-X] and PCI/IMS on the same device because IMS no longer
provides a uniform storage and configuration model.

The solution to this is to implement the missing step and switch from
global MSI domains to per device MSI domains. The resulting hierarchy then
looks like this:

                              |--- [PCI/MSI] device 1
     [Vector]---[Remapping]---|...
                              |--- [PCI/MSI] device N

which in turn allows providing support for multiple domains per device:

                              |--- [PCI/MSI] device 1
                              |--- [PCI/IMS] device 1
     [Vector]---[Remapping]---|...
                              |--- [PCI/MSI] device N
                              |--- [PCI/IMS] device N

To achieve this and to provide solutions for the other identified issues,
e.g. VFIO, this needs a major overhaul of the affected infrastructure.

The 90+ patch series is split into three parts for submission:

  1) General cleanups of the core infrastructure and the PCI/MSI code

  2) Preparatory changes for per device (multiple) MSI domain support
     including a complete replacement of the MSI core interfaces
     switching from a domain pointer based to a domain ID based model
     and providing support for proper range based allocation/free

  3) The actual implementation for per device domains, the conversion of
     the PCI/MSI-X infrastructure, dynamic allocation/free for MSI-X,
     initial support for PCI/IMS and enablement for X86. Plus a demo
     IMS driver for IDXD.

The three parts are available from git:

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git devmsi-v1-part1
   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git devmsi-v1-part2
   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git devmsi-v1-part3

To complete the picture we went all the way and converted ARM64, including
the platform-MSI horrors, over to the new model. It's barely tested in a VM;
at least the PCI/MSI-X part can be validated that way. The rest just
compiles and we can't do much more as we lack hardware. The reason why
this conversion was done is to ensure that the design, the underlying data
structures and the resulting interfaces are correct and can handle the
requirements of ARM64. The result looks pretty good and while the initial
support does not cover some of the oddball issues of the ARM64 zoo, it
turned out that the extra functionality required just extends the provided
infrastructure and does not require any design changes. This is also
available from git for the adventurous, but be warned that it might
eat your harddisk, confuse your cat and make your kids miss the bus:

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git devmsi-v1-arm

This is not going to be posted as it's work in progress. It's provided for
reference and for Marc to play with.

We did look into NTB and other places like VFIO, but did not get around to
actually converting them over, partly because of lack of time, but also
because the code is simply incomprehensible.

We tested the creation of a secondary MSI domain with a mockup driver, but
again due to lack of hardware there is no way to validate any functionality.

The dynamic allocation/free of MSI-X interrupts post MSI-X enable was
tested by hacking up a device driver, allocating only one interrupt at
MSI-X enable time and then allocating the rest with the new interfaces.
Dynamic free was tested the same way.

Note that dynamic allocation of MSI-X interrupts requires an opt-in from
the underlying MSI parent domain. It is not supported by the legacy
architecture specific MSI mechanism, which is still in use on ia64, sparc,
PPC and a few subarchitectures of MIPS.

The reason why this cannot be supported unconditionally is that, due to the
history of PCI/MSI support in the kernel, there are many implementations
which expect that MSI[-X] enable is a one-off operation which allocates and
associates all required interrupts right at that point. Even architectures
which utilize hierarchical irq domains have such assumptions, which are in
some cases even enforced by the underlying hypervisor/firmware.

IMS is opt-in too and it requires that the architecture/platform has been
converted to the per device MSI model and the underlying interrupt domains
have the necessary support in place, which might never happen for ia64 and
some parts of MIPS, SPARC, PPC and S390.

That means driver writers have to be careful about the limitations of
this. For dynamic MSI-X allocation/free there is a query interface. For IMS
domains the indication is currently just the domain creation failing with
an error code. If necessary for driver convenience, implementing a query
interface is trivial enough.
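
As an illustration of how the dynamic MSI-X interfaces are meant to be used
by a driver (condensed sketch; the interface names are the ones introduced
by the PCI/MSI patches later in this series, my_handler and data are
placeholder driver bits):

    struct msi_map map;

    /* Enable MSI-X with a single vector upfront */
    ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSIX);

    /* Later, e.g. when acceleration support is activated */
    if (pci_msix_can_alloc_dyn(pdev)) {
            /* Allocate a vector at MSI-X table index 42 */
            map = pci_msix_alloc_irq_at(pdev, 42, NULL);
            if (map.index >= 0)
                    ret = request_irq(map.virq, my_handler, 0, "accel", data);
    }

    /* Free it again without disturbing the other active vectors */
    free_irq(map.virq, data);
    pci_msix_free_irq(pdev, map);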

Enough of history and theory. Here comes part 3:

It provides the actual per device domain implementation and related
functionality:

  1) Provide infrastructure to create and remove per device MSI domains

  2) Implement per device MSI domains in the PCI/MSI code and make
     them conditional on the availability of a suitable parent MSI
     domain. This allows converting the existing domains one by one
     and keeps both the legacy and the current "global" PCI/MSI domain
     model working.

  3) Convert the related x86 MSI domains over (vector and remapping).

  4) Provide core infrastructure for dynamic allocations

  5) Provide PCI/MSI-X interfaces for device drivers to do post
     MSI-X enable allocation/free

  6) Enable dynamic allocation support on the x86 MSI parent domains

  7) Provide infrastructure to create PCI/IMS domains

  8) Enable IMS support on the x86 MSI parent domains

  9) Provide a driver for IDXD which demonstrates what IMS domains
     look like.

The IDXD part is untested and the core IMS functionality has only been
exposed to a mockup.

The overall impact of the full series against 6.1-rc4 is:

 54 files changed, 2912 insertions(+), 1366 deletions(-)

while the subsequent work in progress conversion of ARM64 actually has a
negative diffstat:

 30 files changed, 900 insertions(+), 1274 deletions(-)

Thanks,

	tglx
---
 arch/x86/include/asm/irq_remapping.h       |    4 
 arch/x86/include/asm/msi.h                 |    6 
 arch/x86/include/asm/pci.h                 |    1 
 arch/x86/kernel/apic/msi.c                 |  208 +++++++++-------
 drivers/iommu/amd/amd_iommu_types.h        |    1 
 drivers/iommu/amd/iommu.c                  |   23 +
 drivers/iommu/intel/iommu.h                |    1 
 drivers/iommu/intel/irq_remapping.c        |   31 +-
 drivers/irqchip/Kconfig                    |    7 
 drivers/irqchip/Makefile                   |    1 
 drivers/irqchip/irq-pci-intel-idxd.c       |  143 +++++++++++
 drivers/pci/msi/api.c                      |  117 +++++++++
 drivers/pci/msi/irqdomain.c                |  278 +++++++++++++++++++--
 drivers/pci/msi/msi.c                      |  192 ++++++++------
 drivers/pci/msi/msi.h                      |    4 
 include/linux/irqchip/irq-pci-intel-idxd.h |   22 +
 include/linux/irqdomain.h                  |    9 
 include/linux/irqdomain_defs.h             |    5 
 include/linux/msi.h                        |  130 ++++++++--
 include/linux/msi_api.h                    |   38 ++
 include/linux/pci.h                        |   14 +
 kernel/irq/msi.c                           |  375 +++++++++++++++++++++++++++--
 22 files changed, 1357 insertions(+), 253 deletions(-)


* [patch 01/33] genirq/msi: Rearrange MSI domain flags
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 18:41   ` Jason Gunthorpe
  2022-11-11 13:58 ` [patch 02/33] genirq/msi: Provide struct msi_parent_ops Thomas Gleixner
                   ` (31 subsequent siblings)
  32 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

These flags got added as necessary and have no obvious structure. For
feature support checks and masking it's convenient to have two blocks of
flags:

   1) Flags to control the internal behaviour like allocating/freeing
      MSI descriptors. Those flags do not need any support from the
      underlying MSI parent domain. They are mostly under the control
      of the outermost domain which implements the actual MSI support.

   2) Flags to expose features, e.g. PCI multi-MSI or requirements
      which can depend on an underlying domain.
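
As an illustration of how the two blocks are intended to be used when a
parent domain filters the features of a to be created device domain
(illustrative sketch only, not taken verbatim from a later patch):

   static u32 msi_filter_domain_flags(u32 flags, u32 parent_supported)
   {
           /* Generic flags pass through, parent dependent ones are filtered */
           return (flags & MSI_GENERIC_FLAGS_MASK) |
                  (flags & parent_supported & MSI_DOMAIN_FLAGS_MASK);
   }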

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |   49 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 34 insertions(+), 15 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -24,6 +24,8 @@
 #include <linux/xarray.h>
 #include <linux/mutex.h>
 #include <linux/list.h>
+#include <linux/bits.h>
+
 #include <asm/msi.h>
 
 /* Dummy shadow structures if an architecture does not define them */
@@ -428,7 +430,16 @@ struct msi_domain_info {
 	void				*data;
 };
 
-/* Flags for msi_domain_info */
+/*
+ * Flags for msi_domain_info
+ *
+ * Bit 0-15:	Generic MSI functionality which is not subject to restriction
+ *		by parent domains
+ *
+ * Bit 16-31:	Functionality which depends on the underlying parent domain and
+ *		can be masked out by msi_parent_ops::init_dev_msi_info() when
+ *		a device MSI domain is initialized.
+ */
 enum {
 	/*
 	 * Init non implemented ops callbacks with default MSI domain
@@ -440,33 +451,41 @@ enum {
 	 * callbacks.
 	 */
 	MSI_FLAG_USE_DEF_CHIP_OPS	= (1 << 1),
-	/* Support multiple PCI MSI interrupts */
-	MSI_FLAG_MULTI_PCI_MSI		= (1 << 2),
-	/* Support PCI MSIX interrupts */
-	MSI_FLAG_PCI_MSIX		= (1 << 3),
 	/* Needs early activate, required for PCI */
-	MSI_FLAG_ACTIVATE_EARLY		= (1 << 4),
+	MSI_FLAG_ACTIVATE_EARLY		= (1 << 2),
 	/*
 	 * Must reactivate when irq is started even when
 	 * MSI_FLAG_ACTIVATE_EARLY has been set.
 	 */
-	MSI_FLAG_MUST_REACTIVATE	= (1 << 5),
-	/* Is level-triggered capable, using two messages */
-	MSI_FLAG_LEVEL_CAPABLE		= (1 << 6),
+	MSI_FLAG_MUST_REACTIVATE	= (1 << 3),
 	/* Populate sysfs on alloc() and destroy it on free() */
-	MSI_FLAG_DEV_SYSFS		= (1 << 7),
-	/* MSI-X entries must be contiguous */
-	MSI_FLAG_MSIX_CONTIGUOUS	= (1 << 8),
+	MSI_FLAG_DEV_SYSFS		= (1 << 4),
 	/* Allocate simple MSI descriptors */
-	MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS	= (1 << 9),
+	MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS	= (1 << 5),
 	/* Free MSI descriptors */
-	MSI_FLAG_FREE_MSI_DESCS		= (1 << 10),
+	MSI_FLAG_FREE_MSI_DESCS		= (1 << 6),
 	/*
 	 * Quirk to handle MSI implementations which do not provide
 	 * masking. Currently known to affect x86, but has to be partially
 	 * handled in the core MSI code.
 	 */
-	MSI_FLAG_NOMASK_QUIRK		= (1 << 11),
+	MSI_FLAG_NOMASK_QUIRK		= (1 << 7),
+
+	/* Mask for the generic functionality */
+	MSI_GENERIC_FLAGS_MASK		= GENMASK(15, 0),
+
+	/* Mask for the domain specific functionality */
+	MSI_DOMAIN_FLAGS_MASK		= GENMASK(31, 16),
+
+	/* Support multiple PCI MSI interrupts */
+	MSI_FLAG_MULTI_PCI_MSI		= (1 << 16),
+	/* Support PCI MSIX interrupts */
+	MSI_FLAG_PCI_MSIX		= (1 << 17),
+	/* Is level-triggered capable, using two messages */
+	MSI_FLAG_LEVEL_CAPABLE		= (1 << 18),
+	/* MSI-X entries must be contiguous */
+	MSI_FLAG_MSIX_CONTIGUOUS	= (1 << 19),
+
 };
 
 int msi_domain_set_affinity(struct irq_data *data, const struct cpumask *mask,



* [patch 02/33] genirq/msi: Provide struct msi_parent_ops
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
  2022-11-11 13:58 ` [patch 01/33] genirq/msi: Rearrange MSI domain flags Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 18:57   ` Jason Gunthorpe
  2022-11-11 13:58 ` [patch 03/33] genirq/msi: Provide data structs for per device domains Thomas Gleixner
                   ` (30 subsequent siblings)
  32 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

MSI parent domains must have some control over the MSI domains which are
built on top. On domain creation they need to fill in e.g. architecture
specific chip callbacks or MSI domain ops to make the outermost domain
parent agnostic, which is obviously required for architecture independence
etc.

The structure contains:

    1) A bitfield which exposes the supported functional features. This
       allows checking for features and is also used in the initialization
       callback to mask out unsupported features when the actual domain
       implementation requests a broader range, e.g. on x86 PCI multi-MSI
       is only supported by remapping domains but not by the underlying
       vector domain. The PCI/MSI code can then always request multi-MSI
       support, but the resulting feature set after creation might not
       have it set.

    2) An optional string prefix which is put in front of domain and chip
       names during creation of the MSI domain. That allows keeping the
       naming schemes, e.g. on x86 where PCI-MSI domains have an IR- prefix
       when interrupt remapping is enabled.

    3) An initialization callback to sanity check the domain info of
       the to be created MSI domain, to restrict features and to
       apply changes in MSI ops and interrupt chip callbacks to
       accommodate the particular MSI parent implementation and/or
       the underlying hierarchy.

Add a convenience function to delegate the initialization from the
MSI parent domain to an underlying domain in the hierarchy.
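
As an illustration, a parent domain ends up providing something along these
lines (condensed sketch modeled on the x86 vector domain changes later in
this series; the my_* names are purely illustrative):

    static bool my_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
                                     struct irq_domain *real_parent,
                                     struct msi_domain_info *info)
    {
            /* Mask out the features which this parent cannot provide */
            info->flags &= real_parent->msi_parent_ops->supported_flags;

            /* Fill in parent specific interrupt chip callbacks */
            info->chip->irq_ack = irq_chip_ack_parent;
            info->chip->irq_retrigger = irq_chip_retrigger_hierarchy;
            return true;
    }

    static const struct msi_parent_ops my_msi_parent_ops = {
            .supported_flags        = MSI_FLAG_PCI_MSIX | MSI_FLAG_MULTI_PCI_MSI,
            .prefix                 = "IR-",
            .init_dev_msi_info      = my_init_dev_msi_info,
    };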

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/irqdomain.h |    5 +++++
 include/linux/msi.h       |   20 ++++++++++++++++++++
 kernel/irq/msi.c          |   36 ++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+)

--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -46,6 +46,7 @@ struct irq_desc;
 struct cpumask;
 struct seq_file;
 struct irq_affinity_desc;
+struct msi_parent_ops;
 
 #define IRQ_DOMAIN_IRQ_SPEC_PARAMS 16
 
@@ -134,6 +135,7 @@ struct irq_domain_chip_generic;
  * @pm_dev:	Pointer to a device that can be utilized for power management
  *		purposes related to the irq domain.
  * @parent:	Pointer to parent irq_domain to support hierarchy irq_domains
+ * @msi_parent_ops: Pointer to MSI parent domain methods for per device domain init
  *
  * Revmap data, used internally by the irq domain code:
  * @revmap_size:	Size of the linear map table @revmap[]
@@ -157,6 +159,9 @@ struct irq_domain {
 #ifdef	CONFIG_IRQ_DOMAIN_HIERARCHY
 	struct irq_domain		*parent;
 #endif
+#ifdef CONFIG_GENERIC_MSI_IRQ
+	const struct msi_parent_ops	*msi_parent_ops;
+#endif
 
 	/* reverse map data. The linear map gets appended to the irq_domain */
 	irq_hw_number_t			hwirq_max;
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -488,6 +488,26 @@ enum {
 
 };
 
+/**
+ * struct msi_parent_ops - MSI parent domain callbacks and configuration info
+ *
+ * @supported_flags:	Required: The supported MSI flags of the parent domain
+ * @prefix:		Optional: Prefix for the domain and chip name
+ * @init_dev_msi_info:	Required: Callback for MSI parent domains to setup parent
+ *			domain specific domain flags, domain ops and interrupt chip
+ *			callbacks when a per device domain is created.
+ */
+struct msi_parent_ops {
+	u32		supported_flags;
+	const char	*prefix;
+	bool		(*init_dev_msi_info)(struct device *dev, struct irq_domain *domain,
+					     struct irq_domain *real_parent,
+					     struct msi_domain_info *info);
+};
+
+bool msi_parent_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
+				  struct irq_domain *real_parent, struct msi_domain_info *info);
+
 int msi_domain_set_affinity(struct irq_data *data, const struct cpumask *mask,
 			    bool force);
 
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -825,6 +825,42 @@ struct irq_domain *msi_create_irq_domain
 	return domain;
 }
 
+/**
+ * msi_parent_init_dev_msi_info - Delegate initialization of device MSI info to parent domain
+ * @dev:		The device for which the domain should be created
+ * @domain:		The domain which delegates
+ * @real_parent:	The real parent domain of the to be initialized MSI domain
+ * @info:		The MSI domain info to initialize
+ *
+ * Return: true on success, false otherwise
+ *
+ * This is the most complex problem of per device MSI domains and the
+ * underlying interrupt domain hierarchy:
+ *
+ * The device domain to be initialized requests the broadest feature set
+ * possible and the underlying domain hierarchy puts restrictions on it.
+ *
+ * That's working perfectly fine for a strict parent->device model, but it
+ * falls apart with a root_parent->real_parent->device chain because the
+ * intermediate 'real parent' can expand the capabilities which the
+ * 'root_parent' domain is providing. So that creates a classic hen and egg
+ * problem: Which entity is doing the restrictions/expansions?
+ *
+ * One solution is to let the root parent domain handle the initialization
+ * that's why there is the @domain and the @real_parent pointer.
+ */
+bool msi_parent_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
+				  struct irq_domain *real_parent, struct msi_domain_info *info)
+{
+	struct irq_domain *parent = domain->parent;
+
+	if (WARN_ON_ONCE(!parent || !parent->msi_parent_ops ||
+			 !parent->msi_parent_ops->init_dev_msi_info))
+		return false;
+
+	return parent->msi_parent_ops->init_dev_msi_info(dev, parent, real_parent, info);
+}
+
 int msi_domain_prepare_irqs(struct irq_domain *domain, struct device *dev,
 			    int nvec, msi_alloc_info_t *arg)
 {



* [patch 03/33] genirq/msi: Provide data structs for per device domains
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
  2022-11-11 13:58 ` [patch 01/33] genirq/msi: Rearrange MSI domain flags Thomas Gleixner
  2022-11-11 13:58 ` [patch 02/33] genirq/msi: Provide struct msi_parent_ops Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:58 ` [patch 04/33] genirq/msi: Add size info to struct msi_domain_info Thomas Gleixner
                   ` (29 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Provide struct msi_domain_template which contains a bundle of struct
irq_chip, struct msi_domain_ops and struct msi_domain_info and a name
field.

This template is used by MSI device domain implementations to provide the
domain specific functionality, feature bits etc.

When an MSI domain is created the template is duplicated in the core code
so that it can be modified per instance. That means templates can be
marked const in the MSI device domain code.

The template is a bundle to avoid several allocations and duplications
of the involved structures.

The name field is used to construct the final domain and chip name via:

    $(PREFIX)$(CHIPNAME)-$(DEVNAME)

where $(PREFIX) is the optional prefix of the MSI parent domain, $(CHIPNAME)
is the name provided in template::chip and $(DEVNAME) is the device name, so
that the domain is properly identified. On x86 this results for PCI/MSI in:

   PCI-MSI-0000:3d:00.1 or IR-PCI-MSIX-0000:3d:00.1

depending on the domain type and the availability of remapping.
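
As an illustration, a driver providing e.g. an IMS domain would define a
template along these lines (condensed sketch; the my_* names are purely
illustrative, the real thing is the IDXD demo driver at the end of this
series):

    static const struct msi_domain_template my_ims_template = {
            .chip = {
                    .name                   = "MY-IMS",
                    .irq_mask               = my_ims_mask_irq,
                    .irq_unmask             = my_ims_unmask_irq,
                    .irq_write_msi_msg      = my_ims_write_msg,
            },
            .info = {
                    .flags                  = MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS |
                                              MSI_FLAG_FREE_MSI_DESCS,
            },
    };

With the "IR-" prefix of a remapped parent and a device 0000:3d:00.1 the
resulting domain and chip name would be IR-MY-IMS-0000:3d:00.1.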

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -24,6 +24,7 @@
 #include <linux/xarray.h>
 #include <linux/mutex.h>
 #include <linux/list.h>
+#include <linux/irq.h>
 #include <linux/bits.h>
 
 #include <asm/msi.h>
@@ -74,7 +75,6 @@ struct msi_msg {
 
 extern int pci_msi_ignore_mask;
 /* Helper functions */
-struct irq_data;
 struct msi_desc;
 struct pci_dev;
 struct platform_msi_priv_data;
@@ -430,6 +430,20 @@ struct msi_domain_info {
 	void				*data;
 };
 
+/**
+ * struct msi_domain_template - Template for MSI device domains
+ * @name:	Storage for the resulting name. Filled in by the core.
+ * @chip:	Interrupt chip for this domain
+ * @ops:	MSI domain ops
+ * @info:	MSI domain info data
+ */
+struct msi_domain_template {
+	char			name[48];
+	struct irq_chip		chip;
+	struct msi_domain_ops	ops;
+	struct msi_domain_info	info;
+};
+
 /*
  * Flags for msi_domain_info
  *



* [patch 04/33] genirq/msi: Add size info to struct msi_domain_info
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (2 preceding siblings ...)
  2022-11-11 13:58 ` [patch 03/33] genirq/msi: Provide data structs for per device domains Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:58 ` [patch 05/33] genirq/msi: Split msi_create_irq_domain() Thomas Gleixner
                   ` (28 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

To allow proper range checking, especially for dynamic allocations, add a
size field to struct msi_domain_info. If the field is 0 then the size is
unknown or unlimited (up to MSI_MAX_INDEX) to provide backwards
compatibility.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |    2 ++
 1 file changed, 2 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -410,6 +410,7 @@ struct msi_domain_ops {
  * struct msi_domain_info - MSI interrupt domain data
  * @flags:		Flags to decribe features and capabilities
  * @bus_token:		The domain bus token
+ * @hwsize:		The hardware table size (0 if unknown/unlimited)
  * @ops:		The callback data structure
  * @chip:		Optional: associated interrupt chip
  * @chip_data:		Optional: associated interrupt chip data
@@ -421,6 +422,7 @@ struct msi_domain_ops {
 struct msi_domain_info {
 	u32				flags;
 	enum irq_domain_bus_token	bus_token;
+	unsigned int			hwsize;
 	struct msi_domain_ops		*ops;
 	struct irq_chip			*chip;
 	void				*chip_data;



* [patch 05/33] genirq/msi: Split msi_create_irq_domain()
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (3 preceding siblings ...)
  2022-11-11 13:58 ` [patch 04/33] genirq/msi: Add size info to struct msi_domain_info Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:58 ` [patch 06/33] genirq/irqdomain: Add irq_domain::dev for per device MSI domains Thomas Gleixner
                   ` (27 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Split the functionality of msi_create_irq_domain() so it can
be reused for creating per device irq domains.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/irq/msi.c |   32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -795,17 +795,10 @@ static void msi_domain_update_chip_ops(s
 		chip->irq_set_affinity = msi_domain_set_affinity;
 }
 
-/**
- * msi_create_irq_domain - Create an MSI interrupt domain
- * @fwnode:	Optional fwnode of the interrupt controller
- * @info:	MSI domain info
- * @parent:	Parent irq domain
- *
- * Return: pointer to the created &struct irq_domain or %NULL on failure
- */
-struct irq_domain *msi_create_irq_domain(struct fwnode_handle *fwnode,
-					 struct msi_domain_info *info,
-					 struct irq_domain *parent)
+static struct irq_domain *__msi_create_irq_domain(struct fwnode_handle *fwnode,
+						  struct msi_domain_info *info,
+						  unsigned int flags,
+						  struct irq_domain *parent)
 {
 	struct irq_domain *domain;
 
@@ -813,7 +806,7 @@ struct irq_domain *msi_create_irq_domain
 	if (info->flags & MSI_FLAG_USE_DEF_CHIP_OPS)
 		msi_domain_update_chip_ops(info);
 
-	domain = irq_domain_create_hierarchy(parent, IRQ_DOMAIN_FLAG_MSI, 0,
+	domain = irq_domain_create_hierarchy(parent, flags | IRQ_DOMAIN_FLAG_MSI, 0,
 					     fwnode, &msi_domain_ops, info);
 
 	if (domain) {
@@ -826,6 +819,21 @@ struct irq_domain *msi_create_irq_domain
 }
 
 /**
+ * msi_create_irq_domain - Create an MSI interrupt domain
+ * @fwnode:	Optional fwnode of the interrupt controller
+ * @info:	MSI domain info
+ * @parent:	Parent irq domain
+ *
+ * Return: pointer to the created &struct irq_domain or %NULL on failure
+ */
+struct irq_domain *msi_create_irq_domain(struct fwnode_handle *fwnode,
+					 struct msi_domain_info *info,
+					 struct irq_domain *parent)
+{
+	return __msi_create_irq_domain(fwnode, info, 0, parent);
+}
+
+/**
  * msi_parent_init_dev_msi_info - Delegate initialization of device MSI info to parent domain
  * @dev:		The device for which the domain should be created
  * @domain:		The domain which delegates



* [patch 06/33] genirq/irqdomain: Add irq_domain::dev for per device MSI domains
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (4 preceding siblings ...)
  2022-11-11 13:58 ` [patch 05/33] genirq/msi: Split msi_create_irq_domain() Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:58 ` [patch 07/33] genirq/msi: Provide msi_create/free_device_irq_domain() Thomas Gleixner
                   ` (26 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

For some purposes, per device domains require a pointer to the device which
instantiated the domain. Add the pointer to struct irq_domain. It will be
used in the next step, which provides the infrastructure to create per
device MSI domains.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/irqdomain.h |    4 ++++
 1 file changed, 4 insertions(+)

--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -132,6 +132,9 @@ struct irq_domain_chip_generic;
  * @gc:		Pointer to a list of generic chips. There is a helper function for
  *		setting up one or more generic chips for interrupt controllers
  *		drivers using the generic chip library which uses this pointer.
+ * @dev:	Pointer to the device which instantiated the irqdomain
+ *		With per device irq domains this is not necessarily the same
+ *		as @pm_dev.
  * @pm_dev:	Pointer to a device that can be utilized for power management
  *		purposes related to the irq domain.
  * @parent:	Pointer to parent irq_domain to support hierarchy irq_domains
@@ -155,6 +158,7 @@ struct irq_domain {
 	struct fwnode_handle		*fwnode;
 	enum irq_domain_bus_token	bus_token;
 	struct irq_domain_chip_generic	*gc;
+	struct device			*dev;
 	struct device			*pm_dev;
 #ifdef	CONFIG_IRQ_DOMAIN_HIERARCHY
 	struct irq_domain		*parent;



* [patch 07/33] genirq/msi: Provide msi_create/free_device_irq_domain()
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (5 preceding siblings ...)
  2022-11-11 13:58 ` [patch 06/33] genirq/irqdomain: Add irq_domain::dev for per device MSI domains Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:58 ` [patch 08/33] genirq/msi: Provide msi_match_device_domain() Thomas Gleixner
                   ` (25 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Now that all prerequisites are in place, provide the actual interfaces for
creating and removing per device interrupt domains.

MSI device interrupt domains are created from the provided
msi_domain_template which is duplicated so that it can be modified for the
particular device.

The name of the domain and the name of the interrupt chip are composed by
"$(PREFIX)$(CHIPNAME)-$(DEVNAME)"

  $PREFIX:   The optional prefix provided by the underlying MSI parent domain
             via msi_parent_ops::prefix.
  $CHIPNAME: The name of the irq_chip in the template
  $DEVNAME:  The name of the device

The domain is further initialized through a MSI parent domain callback which
fills in the required functionality for the parent domain or domains further
down the hierarchy. This initialization can fail, e.g. when the requested
feature or MSI domain type cannot be supported.

The domain pointer is stored in the pointer array inside of msi_device_data,
which is attached to the device.

The domain can be removed via the API or left for disposal via devres when
the device is torn down. The API removal is useful e.g. for PCI to have
separate domains for MSI and MSI-X, which are mutually exclusive and always
occupy the default domain id slot.
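
As an illustration of the intended usage (condensed sketch; MSI_DEFAULT_DOMAIN
is the domain id introduced in part 2, pci_msix_template stands for a
template as defined by the PCI/MSI conversion later in this series):

    /* Create the per device MSI-X domain with a 64 entry hardware table */
    if (!msi_create_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN,
                                      &pci_msix_template, 64, NULL, NULL))
            return -ENODEV;

    /* ... allocate/free interrupts via the domain id based interfaces ... */

    /* Remove it again, e.g. when switching the device from MSI-X to MSI */
    msi_remove_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN);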

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |    6 ++
 kernel/irq/msi.c    |  142 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 148 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -531,6 +531,12 @@ struct irq_domain *msi_create_irq_domain
 					 struct msi_domain_info *info,
 					 struct irq_domain *parent);
 
+bool msi_create_device_irq_domain(struct device *dev, unsigned int domid,
+				  const struct msi_domain_template *template,
+				  unsigned int hwsize, void *domain_data,
+				  void *chip_data);
+void msi_remove_device_irq_domain(struct device *dev, unsigned int domid);
+
 int msi_domain_alloc_irqs_range_locked(struct device *dev, unsigned int domid,
 				       unsigned int first, unsigned int last);
 int msi_domain_alloc_irqs_range(struct device *dev, unsigned int domid,
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -52,6 +52,14 @@ static inline void msi_setup_default_irq
 		md->__irqdomains[MSI_DEFAULT_DOMAIN] = dev->msi.domain;
 }
 
+static inline void msi_remove_device_irqdomains(struct device *dev, struct msi_device_data *md)
+{
+	int domid;
+
+	for (domid = 0; domid < MSI_MAX_DEVICE_IRQDOMAINS; domid++)
+		msi_remove_device_irq_domain(dev, domid);
+}
+
 static int msi_get_domain_base_index(struct device *dev, unsigned int domid)
 {
 	lockdep_assert_held(&dev->msi.data->mutex);
@@ -281,6 +289,7 @@ static void msi_device_data_release(stru
 {
 	struct msi_device_data *md = res;
 
+	msi_remove_device_irqdomains(dev, md);
 	WARN_ON_ONCE(!xa_empty(&md->__store));
 	xa_destroy(&md->__store);
 	dev->msi.data = NULL;
@@ -869,6 +878,139 @@ bool msi_parent_init_dev_msi_info(struct
 	return parent->msi_parent_ops->init_dev_msi_info(dev, parent, real_parent, info);
 }
 
+/**
+ * msi_create_device_irq_domain - Create a device MSI interrupt domain
+ * @dev:		Pointer to the device
+ * @domid:		Domain id
+ * @template:		MSI domain info bundle used as template
+ * @hwsize:		Maximum number of MSI table entries (0 if unknown or unlimited)
+ * @domain_data:	Optional pointer to domain specific data which is set in
+ *			msi_domain_info::data
+ * @chip_data:		Optional pointer to chip specific data which is set in
+ *			msi_domain_info::chip_data
+ *
+ * Return: True on success, false otherwise
+ *
+ * There is no firmware node required for this interface because the per
+ * device domains are software constructs which are actually closer to the
+ * hardware reality than any firmware can describe them.
+ *
+ * The domain name and the irq chip name for a MSI device domain are
+ * composed by: "$(PREFIX)$(CHIPNAME)-$(DEVNAME)"
+ *
+ * $PREFIX:   Optional prefix provided by the underlying MSI parent domain
+ *	      via msi_parent_ops::prefix. If that pointer is NULL the prefix
+ *	      is empty.
+ * $CHIPNAME: The name of the irq_chip in @template
+ * $DEVNAME:  The name of the device
+ *
+ * This results in understandable chip names and hardware interrupt numbers
+ * in e.g. /proc/interrupts
+ *
+ * PCI-MSI-0000:00:1c.0     0-edge  Parent domain has no prefix
+ * IR-PCI-MSI-0000:00:1c.4  0-edge  Same with interrupt remapping prefix 'IR-'
+ *
+ * IR-PCI-MSIX-0000:3d:00.0 0-edge  Hardware interrupt numbers reflect
+ * IR-PCI-MSIX-0000:3d:00.0 1-edge  the real MSI-X index on that device
+ * IR-PCI-MSIX-0000:3d:00.0 2-edge
+ *
+ * On IMS domains the hardware interrupt number is either a table entry
+ * index or a purely software managed index but it is guaranteed to be
+ * unique.
+ *
+ * The domain pointer is stored in @dev::msi::data::__irqdomains[]. All
+ * subsequent operations on the domain depend on the domain id.
+ *
+ * The domain is automatically freed when the device is removed via devres
+ * in the context of @dev::msi::data freeing, but it can also be
+ * independently removed via @msi_remove_device_irq_domain().
+ */
+bool msi_create_device_irq_domain(struct device *dev, unsigned int domid,
+				  const struct msi_domain_template *template,
+				  unsigned int hwsize, void *domain_data,
+				  void *chip_data)
+{
+	struct irq_domain *domain, *parent = dev->msi.domain;
+	const struct msi_parent_ops *pops;
+	struct msi_domain_template *bundle;
+	struct fwnode_handle *fwnode;
+
+	if (!irq_domain_is_msi_parent(parent))
+		return false;
+
+	if (domid >= MSI_MAX_DEVICE_IRQDOMAINS)
+		return false;
+
+	bundle = kmemdup(template, sizeof(*bundle), GFP_KERNEL);
+	if (!bundle)
+		return false;
+
+	bundle->info.hwsize = hwsize ? hwsize : MSI_MAX_INDEX;
+	bundle->info.chip = &bundle->chip;
+	bundle->info.ops = &bundle->ops;
+	bundle->info.data = domain_data;
+	bundle->info.chip_data = chip_data;
+
+	pops = parent->msi_parent_ops;
+	snprintf(bundle->name, sizeof(bundle->name), "%s%s-%s",
+		 pops->prefix ? : "", bundle->chip.name, dev_name(dev));
+	bundle->chip.name = bundle->name;
+
+	fwnode = irq_domain_alloc_named_fwnode(bundle->name);
+	if (!fwnode)
+		goto free_bundle;
+
+	msi_lock_descs(dev);
+
+	if (WARN_ON_ONCE(msi_get_device_domain(dev, domid)))
+		goto fail;
+
+	if (!pops->init_dev_msi_info(dev, parent, parent, &bundle->info))
+		goto fail;
+
+	domain = __msi_create_irq_domain(fwnode, &bundle->info, IRQ_DOMAIN_FLAG_MSI_DEVICE, parent);
+	if (!domain)
+		goto fail;
+
+	domain->dev = dev;
+	dev->msi.data->__irqdomains[domid] = domain;
+	msi_unlock_descs(dev);
+	return true;
+
+fail:
+	msi_unlock_descs(dev);
+	kfree(fwnode);
+free_bundle:
+	kfree(bundle);
+	return false;
+}
+
+/**
+ * msi_remove_device_irq_domain - Free a device MSI interrupt domain
+ * @dev:	Pointer to the device
+ * @domid:	Domain id
+ */
+void msi_remove_device_irq_domain(struct device *dev, unsigned int domid)
+{
+	struct msi_domain_info *info;
+	struct irq_domain *domain;
+
+	msi_lock_descs(dev);
+
+	domain = msi_get_device_domain(dev, domid);
+
+	if (!domain || !irq_domain_is_msi_device(domain))
+		goto unlock;
+
+	dev->msi.data->__irqdomains[domid] = NULL;
+	info = domain->host_data;
+	irq_domain_remove(domain);
+	kfree(container_of(info, struct msi_domain_template, info));
+
+unlock:
+	msi_unlock_descs(dev);
+}
+
 int msi_domain_prepare_irqs(struct irq_domain *domain, struct device *dev,
 			    int nvec, msi_alloc_info_t *arg)
 {



* [patch 08/33] genirq/msi: Provide msi_match_device_domain()
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (6 preceding siblings ...)
  2022-11-11 13:58 ` [patch 07/33] genirq/msi: Provide msi_create/free_device_irq_domain() Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:58 ` [patch 09/33] genirq/msi: Add range checking to msi_insert_desc() Thomas Gleixner
                   ` (24 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Provide an interface to match a per device domain against a bus token. This
allows querying which type of domain is installed for a particular domain
id. It will be used by PCI to avoid frequent create/remove cycles for the
MSI and MSI-X domains.
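
As an illustration of the intended usage on the PCI side (condensed sketch;
the DOMAIN_BUS_PCI_DEVICE_MSIX token is introduced a few patches later in
this series):

    /* Reuse an already installed MSI-X device domain if the token matches */
    if (msi_match_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN,
                                    DOMAIN_BUS_PCI_DEVICE_MSIX))
            return true;

    /* Otherwise remove whatever is installed and create the MSI-X domain */
    msi_remove_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN);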

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |    3 +++
 kernel/irq/msi.c    |   25 +++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -537,6 +537,9 @@ bool msi_create_device_irq_domain(struct
 				  void *chip_data);
 void msi_remove_device_irq_domain(struct device *dev, unsigned int domid);
 
+bool msi_match_device_irq_domain(struct device *dev, unsigned int domid,
+				 enum irq_domain_bus_token bus_token);
+
 int msi_domain_alloc_irqs_range_locked(struct device *dev, unsigned int domid,
 				       unsigned int first, unsigned int last);
 int msi_domain_alloc_irqs_range(struct device *dev, unsigned int domid,
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -1011,6 +1011,31 @@ void msi_remove_device_irq_domain(struct
 	msi_unlock_descs(dev);
 }
 
+/**
+ * msi_match_device_irq_domain - Match a device irq domain against a bus token
+ * @dev:	Pointer to the device
+ * @domid:	Domain id
+ * @bus_token:	Bus token to match against the domain bus token
+ *
+ * Return: True if device domain exists and bus tokens match.
+ */
+bool msi_match_device_irq_domain(struct device *dev, unsigned int domid,
+				 enum irq_domain_bus_token bus_token)
+{
+	struct msi_domain_info *info;
+	struct irq_domain *domain;
+	bool ret = false;
+
+	msi_lock_descs(dev);
+	domain = msi_get_device_domain(dev, domid);
+	if (domain && irq_domain_is_msi_device(domain)) {
+		info = domain->host_data;
+		ret = info->bus_token == bus_token;
+	}
+	msi_unlock_descs(dev);
+	return ret;
+}
+
 int msi_domain_prepare_irqs(struct irq_domain *domain, struct device *dev,
 			    int nvec, msi_alloc_info_t *arg)
 {



* [patch 09/33] genirq/msi: Add range checking to msi_insert_desc()
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (7 preceding siblings ...)
  2022-11-11 13:58 ` [patch 08/33] genirq/msi: Provide msi_match_device_domain() Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:58 ` [patch 10/33] PCI/MSI: Split __pci_write_msi_msg() Thomas Gleixner
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Per device domains provide the domain size to the core code. This allows
range checking on insertion of MSI descriptors and also paves the way for
dynamic index allocations which are required e.g. for IMS. This avoids
external mechanisms like bitmaps on the device side and just utilizes
the core internal MSI descriptor store for it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/irq/msi.c |   38 ++++++++++++++++++++++++++++++++++----
 1 file changed, 34 insertions(+), 4 deletions(-)

--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -73,6 +73,7 @@ static int msi_get_domain_base_index(str
 	return domid * MSI_XA_DOMAIN_SIZE;
 }
 
+static unsigned int msi_domain_get_hwsize(struct device *dev, unsigned int domid);
 
 /**
  * msi_alloc_desc - Allocate an initialized msi_desc
@@ -115,6 +116,7 @@ static int msi_insert_desc(struct device
 			   unsigned int domid, unsigned int index)
 {
 	struct msi_device_data *md = dev->msi.data;
+	unsigned int hwsize;
 	int baseidx, ret;
 
 	baseidx = msi_get_domain_base_index(dev, domid);
@@ -123,6 +125,12 @@ static int msi_insert_desc(struct device
 		goto fail;
 	}
 
+	hwsize = msi_domain_get_hwsize(dev, domid);
+	if (index >= hwsize) {
+		ret = -ERANGE;
+		goto fail;
+	}
+
 	desc->msi_index = index;
 	index += baseidx;
 	ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
@@ -181,9 +189,11 @@ static bool msi_desc_match(struct msi_de
 
 static bool msi_ctrl_valid(struct device *dev, struct msi_ctrl *ctrl)
 {
+	unsigned int hwsize = msi_domain_get_hwsize(dev, ctrl->domid);
+
 	if (WARN_ON_ONCE(ctrl->first > ctrl->last ||
-			 ctrl->first >= MSI_MAX_INDEX ||
-			 ctrl->last >= MSI_MAX_INDEX))
+			 ctrl->first >= hwsize ||
+			 ctrl->last >= hwsize))
 		return false;
 	return true;
 }
@@ -613,6 +623,25 @@ static struct irq_domain *msi_get_device
 	return domain;
 }
 
+static unsigned int msi_domain_get_hwsize(struct device *dev, unsigned int domid)
+{
+	struct msi_domain_info *info;
+	struct irq_domain *domain;
+
+	/*
+	 * Retrieve the MSI domain for range checking. If there is no
+	 * domain or the domain is not a per device domain, then assume
+	 * full MSI range and pray that the calling subsystem knows what it
+	 * is doing.
+	 */
+	domain = msi_get_device_domain(dev, domid);
+	if (domain && irq_domain_is_msi_device(domain)) {
+		info = domain->host_data;
+		return info->hwsize;
+	}
+	return MSI_MAX_INDEX;
+}
+
 static inline void irq_chip_write_msi_msg(struct irq_data *data,
 					  struct msi_msg *msg)
 {
@@ -1380,7 +1409,7 @@ int msi_domain_alloc_irqs_all_locked(str
 	struct msi_ctrl ctrl = {
 		.domid	= domid,
 		.first	= 0,
-		.last	= MSI_MAX_INDEX,
+		.last	= msi_domain_get_hwsize(dev, domid) - 1,
 		.nirqs	= nirqs,
 	};
 
@@ -1496,7 +1525,8 @@ void msi_domain_free_irqs_range(struct d
  */
 void msi_domain_free_irqs_all_locked(struct device *dev, unsigned int domid)
 {
-	msi_domain_free_irqs_range_locked(dev, domid, 0, MSI_MAX_INDEX);
+	msi_domain_free_irqs_range_locked(dev, domid, 0,
+					  msi_domain_get_hwsize(dev, domid) - 1);
 }
 
 /**



* [patch 10/33] PCI/MSI: Split __pci_write_msi_msg()
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (8 preceding siblings ...)
  2022-11-11 13:58 ` [patch 09/33] genirq/msi: Add range checking to msi_insert_desc() Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 20:10   ` Bjorn Helgaas
  2022-11-11 13:58 ` [patch 11/33] genirq/msi: Provide BUS_DEVICE_PCI_MSI[X] Thomas Gleixner
                   ` (22 subsequent siblings)
  32 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

The upcoming per device MSI domains will create different domains for MSI
and MSI-X. Split the write message function into MSI and MSI-X helpers so
they can be used by those new domain functions separately.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/msi.c |  104 +++++++++++++++++++++++++-------------------------
 1 file changed, 54 insertions(+), 50 deletions(-)

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -180,6 +180,58 @@ void __pci_read_msi_msg(struct msi_desc
 	}
 }
 
+static inline void pci_write_msg_msi(struct pci_dev *dev, struct msi_desc *desc,
+				     struct msi_msg *msg)
+{
+	int pos = dev->msi_cap;
+	u16 msgctl;
+
+	pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
+	msgctl &= ~PCI_MSI_FLAGS_QSIZE;
+	msgctl |= desc->pci.msi_attrib.multiple << 4;
+	pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl);
+
+	pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_LO, msg->address_lo);
+	if (desc->pci.msi_attrib.is_64) {
+		pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_HI,  msg->address_hi);
+		pci_write_config_word(dev, pos + PCI_MSI_DATA_64, msg->data);
+	} else {
+		pci_write_config_word(dev, pos + PCI_MSI_DATA_32, msg->data);
+	}
+	/* Ensure that the writes are visible in the device */
+	pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
+}
+
+static inline void pci_write_msg_msix(struct msi_desc *desc, struct msi_msg *msg)
+{
+	void __iomem *base = pci_msix_desc_addr(desc);
+	u32 ctrl = desc->pci.msix_ctrl;
+	bool unmasked = !(ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT);
+
+	if (desc->pci.msi_attrib.is_virtual)
+		return;
+	/*
+	 * The specification mandates that the entry is masked
+	 * when the message is modified:
+	 *
+	 * "If software changes the Address or Data value of an
+	 * entry while the entry is unmasked, the result is
+	 * undefined."
+	 */
+	if (unmasked)
+		pci_msix_write_vector_ctrl(desc, ctrl | PCI_MSIX_ENTRY_CTRL_MASKBIT);
+
+	writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
+	writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
+	writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
+
+	if (unmasked)
+		pci_msix_write_vector_ctrl(desc, ctrl);
+
+	/* Ensure that the writes are visible in the device */
+	readl(base + PCI_MSIX_ENTRY_DATA);
+}
+
 void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 {
 	struct pci_dev *dev = msi_desc_to_pci_dev(entry);
@@ -187,63 +239,15 @@ void __pci_write_msi_msg(struct msi_desc
 	if (dev->current_state != PCI_D0 || pci_dev_is_disconnected(dev)) {
 		/* Don't touch the hardware now */
 	} else if (entry->pci.msi_attrib.is_msix) {
-		void __iomem *base = pci_msix_desc_addr(entry);
-		u32 ctrl = entry->pci.msix_ctrl;
-		bool unmasked = !(ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT);
-
-		if (entry->pci.msi_attrib.is_virtual)
-			goto skip;
-
-		/*
-		 * The specification mandates that the entry is masked
-		 * when the message is modified:
-		 *
-		 * "If software changes the Address or Data value of an
-		 * entry while the entry is unmasked, the result is
-		 * undefined."
-		 */
-		if (unmasked)
-			pci_msix_write_vector_ctrl(entry, ctrl | PCI_MSIX_ENTRY_CTRL_MASKBIT);
-
-		writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
-		writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
-		writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
-
-		if (unmasked)
-			pci_msix_write_vector_ctrl(entry, ctrl);
-
-		/* Ensure that the writes are visible in the device */
-		readl(base + PCI_MSIX_ENTRY_DATA);
+		pci_write_msg_msix(entry, msg);
 	} else {
-		int pos = dev->msi_cap;
-		u16 msgctl;
-
-		pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
-		msgctl &= ~PCI_MSI_FLAGS_QSIZE;
-		msgctl |= entry->pci.msi_attrib.multiple << 4;
-		pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl);
-
-		pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_LO,
-				       msg->address_lo);
-		if (entry->pci.msi_attrib.is_64) {
-			pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_HI,
-					       msg->address_hi);
-			pci_write_config_word(dev, pos + PCI_MSI_DATA_64,
-					      msg->data);
-		} else {
-			pci_write_config_word(dev, pos + PCI_MSI_DATA_32,
-					      msg->data);
-		}
-		/* Ensure that the writes are visible in the device */
-		pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
+		pci_write_msg_msi(dev, entry, msg);
 	}
 
-skip:
 	entry->msg = *msg;
 
 	if (entry->write_msi_msg)
 		entry->write_msi_msg(entry, entry->write_msi_msg_data);
-
 }
 
 void pci_write_msi_msg(unsigned int irq, struct msi_msg *msg)


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 11/33] genirq/msi: Provide BUS_DEVICE_PCI_MSI[X]
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (9 preceding siblings ...)
  2022-11-11 13:58 ` [patch 10/33] PCI/MSI: Split __pci_write_msi_msg() Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:58 ` [patch 12/33] PCI/MSI: Add support for per device MSI[X] domains Thomas Gleixner
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Provide new bus tokens for the upcoming per device PCI/MSI and PCI/MSIX
interrupt domains.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/irqdomain_defs.h |    2 ++
 kernel/irq/msi.c               |    4 ++++
 2 files changed, 6 insertions(+)

--- a/include/linux/irqdomain_defs.h
+++ b/include/linux/irqdomain_defs.h
@@ -21,6 +21,8 @@ enum irq_domain_bus_token {
 	DOMAIN_BUS_TI_SCI_INTA_MSI,
 	DOMAIN_BUS_WAKEUP,
 	DOMAIN_BUS_VMD_MSI,
+	DOMAIN_BUS_PCI_DEVICE_MSI,
+	DOMAIN_BUS_PCI_DEVICE_MSIX,
 };
 
 #endif /* _LINUX_IRQDOMAIN_DEFS_H */
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -1137,6 +1137,8 @@ static bool msi_check_reservation_mode(s
 
 	switch(domain->bus_token) {
 	case DOMAIN_BUS_PCI_MSI:
+	case DOMAIN_BUS_PCI_DEVICE_MSI:
+	case DOMAIN_BUS_PCI_DEVICE_MSIX:
 	case DOMAIN_BUS_VMD_MSI:
 		break;
 	default:
@@ -1162,6 +1164,8 @@ static int msi_handle_pci_fail(struct ir
 {
 	switch(domain->bus_token) {
 	case DOMAIN_BUS_PCI_MSI:
+	case DOMAIN_BUS_PCI_DEVICE_MSI:
+	case DOMAIN_BUS_PCI_DEVICE_MSIX:
 	case DOMAIN_BUS_VMD_MSI:
 		if (IS_ENABLED(CONFIG_PCI_MSI))
 			break;


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 12/33] PCI/MSI: Add support for per device MSI[X] domains
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (10 preceding siblings ...)
  2022-11-11 13:58 ` [patch 11/33] genirq/msi: Provide BUS_DEVICE_PCI_MSI[X] Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 19:13   ` Jason Gunthorpe
  2022-11-16 20:22   ` Bjorn Helgaas
  2022-11-11 13:58 ` [patch 13/33] x86/apic/vector: Provide MSI parent domain Thomas Gleixner
                   ` (20 subsequent siblings)
  32 siblings, 2 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Provide a template and the necessary callbacks to create PCI/MSI and
PCI/MSI-X domains.

The domains are created when MSI or MSI-X is enabled. The domain's lifetime
is normally the device lifetime, but if e.g. MSI-X was tried first and
failed, then the MSI-X domain is removed and an MSI domain is created
instead, as both are mutually exclusive and reside in the default domain id
slot of the per device domain pointer array.

Also expand pci_msi_domain_supports() to handle feature checks correctly
even when the per device domain has not been created yet, by checking the
features supported by the MSI parent.

Add the necessary setup calls into the MSI and MSI-X enable code path.
These setup calls are backwards compatible. They return success when there
is no parent domain found, which means the existing global domains or the
legacy allocation path just keep working.
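
As a rough usage sketch (not part of this patch; the helper and the support
mode enum are the ones added in this series, the exact placement of the
feature check is illustrative only), the MSI-X enable path then boils down
to:

  if (!pci_msi_domain_supports(dev, MSI_FLAG_PCI_MSIX, ALLOW_LEGACY))
          return -ENOTSUPP;

  if (!pci_setup_msix_device_domain(dev, hwsize))
          return -ENODEV;

  /* ... continue with the regular vector allocation loop ... */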

Co-developed-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |  188 +++++++++++++++++++++++++++++++++++++++++++-
 drivers/pci/msi/msi.c       |   16 +++
 drivers/pci/msi/msi.h       |    2 
 3 files changed, 201 insertions(+), 5 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -139,6 +139,170 @@ struct irq_domain *pci_msi_create_irq_do
 }
 EXPORT_SYMBOL_GPL(pci_msi_create_irq_domain);
 
+/*
+ * Per device MSI[-X] domain functionality
+ */
+static void pci_device_domain_set_desc(msi_alloc_info_t *arg, struct msi_desc *desc)
+{
+	arg->desc = desc;
+	arg->hwirq = desc->msi_index;
+}
+
+static void pci_mask_msi(struct irq_data *data)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+
+	pci_msi_mask(desc, BIT(data->irq - desc->irq));
+}
+
+static void pci_unmask_msi(struct irq_data *data)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+
+	pci_msi_unmask(desc, BIT(data->irq - desc->irq));
+}
+
+#ifdef CONFIG_GENERIC_IRQ_RESERVATION_MODE
+# define MSI_REACTIVATE		MSI_FLAG_MUST_REACTIVATE
+#else
+# define MSI_REACTIVATE		0
+#endif
+
+#define MSI_COMMON_FLAGS	(MSI_FLAG_FREE_MSI_DESCS |	\
+				 MSI_FLAG_ACTIVATE_EARLY |	\
+				 MSI_FLAG_DEV_SYSFS |		\
+				 MSI_REACTIVATE)
+
+static struct msi_domain_template pci_msi_template = {
+	.chip = {
+		.name			= "PCI-MSI",
+		.irq_mask		= pci_mask_msi,
+		.irq_unmask		= pci_unmask_msi,
+		.irq_write_msi_msg	= pci_msi_domain_write_msg,
+		.flags			= IRQCHIP_ONESHOT_SAFE,
+	},
+
+	.ops = {
+		.set_desc		= pci_device_domain_set_desc,
+	},
+
+	.info = {
+		.flags			= MSI_COMMON_FLAGS | MSI_FLAG_MULTI_PCI_MSI,
+		.bus_token		= DOMAIN_BUS_PCI_DEVICE_MSI,
+	},
+};
+
+static void pci_mask_msix(struct irq_data *data)
+{
+	pci_msix_mask(irq_data_get_msi_desc(data));
+}
+
+static void pci_unmask_msix(struct irq_data *data)
+{
+	pci_msix_unmask(irq_data_get_msi_desc(data));
+}
+
+static struct msi_domain_template pci_msix_template = {
+	.chip = {
+		.name			= "PCI-MSIX",
+		.irq_mask		= pci_mask_msix,
+		.irq_unmask		= pci_unmask_msix,
+		.irq_write_msi_msg	= pci_msi_domain_write_msg,
+		.flags			= IRQCHIP_ONESHOT_SAFE,
+	},
+
+	.ops = {
+		.set_desc		= pci_device_domain_set_desc,
+	},
+
+	.info = {
+		.flags			= MSI_COMMON_FLAGS | MSI_FLAG_PCI_MSIX,
+		.bus_token		= DOMAIN_BUS_PCI_DEVICE_MSIX,
+	},
+};
+
+static bool pci_match_device_domain(struct pci_dev *pdev, enum irq_domain_bus_token bus_token)
+{
+	return msi_match_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN, bus_token);
+}
+
+static bool pci_create_device_domain(struct pci_dev *pdev, struct msi_domain_template *tmpl,
+				     unsigned int hwsize)
+{
+	struct irq_domain *domain = dev_get_msi_domain(&pdev->dev);
+
+	if (!domain || !irq_domain_is_msi_parent(domain))
+		return true;
+
+	return msi_create_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN, tmpl,
+					    hwsize, NULL, NULL);
+}
+
+/**
+ * pci_setup_msi_device_domain - Setup a device MSI interrupt domain
+ * @pdev:	The PCI device to create the domain on
+ *
+ * Return:
+ *  True when:
+ *	- The device does not have a MSI parent irq domain associated,
+ *	  which keeps the legacy architecture specific and the global
+ *	  PCI/MSI domain models working
+ *	- The MSI domain exists already
+ *	- The MSI domain was successfully allocated
+ *  False when:
+ *	- MSI-X is enabled
+ *	- The domain creation fails.
+ *
+ * The created MSI domain is preserved until:
+ *	- The device is removed
+ *	- MSI is disabled and a MSI-X domain is created
+ */
+bool pci_setup_msi_device_domain(struct pci_dev *pdev)
+{
+	if (WARN_ON_ONCE(pdev->msix_enabled))
+		return false;
+
+	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSI))
+		return true;
+	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSIX))
+		msi_remove_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN);
+
+	return pci_create_device_domain(pdev, &pci_msi_template, 1);
+}
+
+/**
+ * pci_setup_msix_device_domain - Setup a device MSI-X interrupt domain
+ * @pdev:	The PCI device to create the domain on
+ * @hwsize:	The size of the MSI-X vector table
+ *
+ * Return:
+ *  True when:
+ *	- The device does not have a MSI parent irq domain associated,
+ *	  which keeps the legacy architecture specific and the global
+ *	  PCI/MSI domain models working
+ *	- The MSI-X domain exists already
+ *	- The MSI-X domain was successfully allocated
+ *  False when:
+ *	- MSI is enabled
+ *	- The domain creation fails.
+ *
+ * The created MSI-X domain is preserved until:
+ *	- The device is removed
+ *	- MSI-X is disabled and a MSI domain is created
+ */
+bool pci_setup_msix_device_domain(struct pci_dev *pdev, unsigned int hwsize)
+{
+	if (WARN_ON_ONCE(pdev->msi_enabled))
+		return false;
+
+	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSIX))
+		return true;
+	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSI))
+		msi_remove_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN);
+
+	return pci_create_device_domain(pdev, &pci_msix_template, hwsize);
+}
+
 /**
  * pci_msi_domain_supports - Check for support of a particular feature flag
  * @pdev:		The PCI device to operate on
@@ -152,13 +316,33 @@ bool pci_msi_domain_supports(struct pci_
 {
 	struct msi_domain_info *info;
 	struct irq_domain *domain;
+	unsigned int supported;
 
 	domain = dev_get_msi_domain(&pdev->dev);
 
 	if (!domain || !irq_domain_is_hierarchy(domain))
 		return mode == ALLOW_LEGACY;
-	info = domain->host_data;
-	return (info->flags & feature_mask) == feature_mask;
+
+	if (!irq_domain_is_msi_parent(domain)) {
+		/*
+		 * For "global" PCI/MSI interrupt domains the associated
+		 * msi_domain_info::flags is the authoritative source of
+		 * information.
+		 */
+		info = domain->host_data;
+		supported = info->flags;
+	} else {
+		/*
+		 * For MSI parent domains the supported feature set
+		 * is available in the parent ops. This makes checks
+		 * possible before actually instantiating the
+		 * per device domain because the parent is never
+		 * expanding the PCI/MSI functionality.
+		 */
+		supported = domain->msi_parent_ops->supported_flags;
+	}
+
+	return (supported & feature_mask) == feature_mask;
 }
 
 /*
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -436,6 +436,9 @@ int __pci_enable_msi_range(struct pci_de
 	if (rc)
 		return rc;
 
+	if (!pci_setup_msi_device_domain(dev))
+		return -ENODEV;
+
 	for (;;) {
 		if (affd) {
 			nvec = irq_calc_affinity_vectors(minvec, nvec, affd);
@@ -787,9 +790,13 @@ int __pci_enable_msix_range(struct pci_d
 	if (!pci_msix_validate_entries(dev, entries, nvec, hwsize))
 		return -EINVAL;
 
-	/* PCI_IRQ_VIRTUAL is a horrible hack! */
-	if (nvec > hwsize && !(flags & PCI_IRQ_VIRTUAL))
-		nvec = hwsize;
+	if (hwsize < nvec) {
+		/* Keep the IRQ virtual hackery working */
+		if (flags & PCI_IRQ_VIRTUAL)
+			hwsize = nvec;
+		else
+			nvec = hwsize;
+	}
 
 	if (nvec < minvec)
 		return -ENOSPC;
@@ -798,6 +805,9 @@ int __pci_enable_msix_range(struct pci_d
 	if (rc)
 		return rc;
 
+	if (!pci_setup_msix_device_domain(dev, hwsize))
+		return -ENODEV;
+
 	for (;;) {
 		if (affd) {
 			nvec = irq_calc_affinity_vectors(minvec, nvec, affd);
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -105,6 +105,8 @@ enum support_mode {
 };
 
 bool pci_msi_domain_supports(struct pci_dev *dev, unsigned int feature_mask, enum support_mode mode);
+bool pci_setup_msi_device_domain(struct pci_dev *pdev);
+bool pci_setup_msix_device_domain(struct pci_dev *pdev, unsigned int hwsize);
 
 /* Legacy (!IRQDOMAIN) fallbacks */
 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 13/33] x86/apic/vector: Provide MSI parent domain
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (11 preceding siblings ...)
  2022-11-11 13:58 ` [patch 12/33] PCI/MSI: Add support for per device MSI[X] domains Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 19:18   ` Jason Gunthorpe
  2022-11-11 13:58 ` [patch 14/33] PCI/MSI: Remove unused pci_dev_has_special_msi_domain() Thomas Gleixner
                   ` (19 subsequent siblings)
  32 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Enable MSI parent domain support in the x86 vector domain and fix up the
checks in the iommu implementations to check whether device::msi::domain is
the default MSI parent domain. That keeps the existing logic which protects
e.g. devices behind VMD working.

The interrupt remap PCI/MSI code still works because the underlying vector
domain still provides the same functionality.

None of the other x86 PCI/MSI implementations, e.g. XEN and HyperV, are
affected either. They still work the same way, both at the low level and in
the PCI/MSI implementations they provide.
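
A worked example of the flag handling (illustrative only, values taken from
this series): the plain vector domain only advertises
X86_VECTOR_MSI_FLAGS_SUPPORTED, which does not contain
MSI_FLAG_MULTI_PCI_MSI, so a per device PCI/MSI domain sitting directly on
top of the vector domain loses multi-MSI support, while the enforced flags
are always added back. The effective child domain flags are roughly:

  /* Sketch; template_flags stands for the per device domain template flags */
  info->flags = (template_flags & pops->supported_flags) |
                X86_VECTOR_MSI_FLAGS_REQUIRED;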

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/msi.h          |    6 +
 arch/x86/include/asm/pci.h          |    1 
 arch/x86/kernel/apic/msi.c          |  176 ++++++++++++++++++++++++++----------
 drivers/iommu/amd/iommu.c           |    2 
 drivers/iommu/intel/irq_remapping.c |    2 
 5 files changed, 138 insertions(+), 49 deletions(-)

--- a/arch/x86/include/asm/msi.h
+++ b/arch/x86/include/asm/msi.h
@@ -62,4 +62,10 @@ typedef struct x86_msi_addr_hi {
 struct msi_msg;
 u32 x86_msi_msg_get_destid(struct msi_msg *msg, bool extid);
 
+#define X86_VECTOR_MSI_FLAGS_SUPPORTED					\
+	(MSI_GENERIC_FLAGS_MASK | MSI_FLAG_PCI_MSIX)
+
+#define X86_VECTOR_MSI_FLAGS_REQUIRED					\
+	(MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS)
+
 #endif /* _ASM_X86_MSI_H */
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -92,6 +92,7 @@ void pcibios_scan_root(int bus);
 struct irq_routing_table *pcibios_get_irq_routing_table(void);
 int pcibios_set_irq_routing(struct pci_dev *dev, int pin, int irq);
 
+bool pci_dev_has_default_msi_parent_domain(struct pci_dev *dev);
 
 #define HAVE_PCI_MMAP
 #define arch_can_pci_mmap_wc()	pat_enabled()
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -142,67 +142,131 @@ msi_set_affinity(struct irq_data *irqd,
 	return ret;
 }
 
-/*
- * IRQ Chip for MSI PCI/PCI-X/PCI-Express Devices,
- * which implement the MSI or MSI-X Capability Structure.
+/**
+ * pci_dev_has_default_msi_parent_domain - Check whether the device has the default
+ *					   MSI parent domain associated
+ * @dev:	Pointer to the PCI device
  */
-static struct irq_chip pci_msi_controller = {
-	.name			= "PCI-MSI",
-	.irq_unmask		= pci_msi_unmask_irq,
-	.irq_mask		= pci_msi_mask_irq,
-	.irq_ack		= irq_chip_ack_parent,
-	.irq_retrigger		= irq_chip_retrigger_hierarchy,
-	.irq_set_affinity	= msi_set_affinity,
-	.flags			= IRQCHIP_SKIP_SET_WAKE |
-				  IRQCHIP_AFFINITY_PRE_STARTUP,
-};
+bool pci_dev_has_default_msi_parent_domain(struct pci_dev *dev)
+{
+	struct irq_domain *domain = dev_get_msi_domain(&dev->dev);
 
-int pci_msi_prepare(struct irq_domain *domain, struct device *dev, int nvec,
-		    msi_alloc_info_t *arg)
+	if (!domain)
+		domain = dev_get_msi_domain(&dev->bus->dev);
+	if (!domain)
+		return false;
+
+	return domain == x86_vector_domain;
+}
+
+/**
+ * x86_msi_prepare - Setup of msi_alloc_info_t for allocations
+ * @domain:	The domain for which this setup happens
+ * @dev:	The device for which interrupts are allocated
+ * @nvec:	The number of vectors to allocate
+ * @alloc:	The allocation info structure to initialize
+ *
+ * This function is to be used for all types of MSI domains above the x86
+ * vector domain and any intermediates. It is always invoked from the
+ * top level interrupt domain. The domain specific allocation
+ * functionality is determined via the @domain's bus token, which is mapped
+ * to the X86 specific allocation type.
+ */
+static int x86_msi_prepare(struct irq_domain *domain, struct device *dev,
+			   int nvec, msi_alloc_info_t *alloc)
 {
-	init_irq_alloc_info(arg, NULL);
-	if (to_pci_dev(dev)->msix_enabled)
-		arg->type = X86_IRQ_ALLOC_TYPE_PCI_MSIX;
-	else
-		arg->type = X86_IRQ_ALLOC_TYPE_PCI_MSI;
+	struct msi_domain_info *info = domain->host_data;
 
-	return 0;
+	init_irq_alloc_info(alloc, NULL);
+
+	switch (info->bus_token) {
+	case DOMAIN_BUS_PCI_DEVICE_MSI:
+		alloc->type = X86_IRQ_ALLOC_TYPE_PCI_MSI;
+		return 0;
+	case DOMAIN_BUS_PCI_DEVICE_MSIX:
+		alloc->type = X86_IRQ_ALLOC_TYPE_PCI_MSIX;
+		return 0;
+	default:
+		return -EINVAL;
+	}
 }
-EXPORT_SYMBOL_GPL(pci_msi_prepare);
 
-static struct msi_domain_ops pci_msi_domain_ops = {
-	.msi_prepare	= pci_msi_prepare,
-};
+/**
+ * x86_init_dev_msi_info - Domain info setup for MSI domains
+ * @dev:		The device for which the domain should be created
+ * @domain:		The (root) domain providing this callback
+ * @real_parent:	The real parent domain of the domain to be initialized
+ * @info:		The domain info for the domain to be initialized
+ *
+ * This function is to be used for all types of MSI domains above the x86
+ * vector domain and any intermediates. The domain specific functionality
+ * is determined via the @real_parent.
+ */
+static bool x86_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
+				  struct irq_domain *real_parent, struct msi_domain_info *info)
+{
+	const struct msi_parent_ops *pops = real_parent->msi_parent_ops;
+
+	/* MSI parent domain specific settings */
+	switch (real_parent->bus_token) {
+	case DOMAIN_BUS_ANY:
+		/* Only the vector domain can have the ANY token */
+		if (WARN_ON_ONCE(domain != real_parent))
+			return false;
+		info->chip->irq_set_affinity = msi_set_affinity;
+		/* See msi_set_affinity() for the gory details */
+		info->flags |= MSI_FLAG_NOMASK_QUIRK;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return false;
+	}
+
+	/* Is the target supported? */
+	switch(info->bus_token) {
+	case DOMAIN_BUS_PCI_DEVICE_MSI:
+	case DOMAIN_BUS_PCI_DEVICE_MSIX:
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return false;
+	}
+
+	/*
+	 * Mask out the domain specific MSI feature flags which are not
+	 * supported by the real parent.
+	 */
+	info->flags			&= pops->supported_flags;
+	/* Enforce the required flags */
+	info->flags			|= X86_VECTOR_MSI_FLAGS_REQUIRED;
+
+	/* This is always invoked from the top level MSI domain! */
+	info->ops->msi_prepare		= x86_msi_prepare;
+
+	info->chip->irq_ack		= irq_chip_ack_parent;
+	info->chip->irq_retrigger	= irq_chip_retrigger_hierarchy;
+	info->chip->flags		|= IRQCHIP_SKIP_SET_WAKE |
+					   IRQCHIP_AFFINITY_PRE_STARTUP;
 
-static struct msi_domain_info pci_msi_domain_info = {
-	.flags		= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
-			  MSI_FLAG_PCI_MSIX | MSI_FLAG_NOMASK_QUIRK,
-
-	.ops		= &pci_msi_domain_ops,
-	.chip		= &pci_msi_controller,
-	.handler	= handle_edge_irq,
-	.handler_name	= "edge",
+	info->handler			= handle_edge_irq;
+	info->handler_name		= "edge";
+
+	return true;
+}
+
+static const struct msi_parent_ops x86_vector_msi_parent_ops = {
+	.supported_flags	= X86_VECTOR_MSI_FLAGS_SUPPORTED,
+	.init_dev_msi_info	= x86_init_dev_msi_info,
 };
 
 struct irq_domain * __init native_create_pci_msi_domain(void)
 {
-	struct fwnode_handle *fn;
-	struct irq_domain *d;
-
 	if (disable_apic)
 		return NULL;
 
-	fn = irq_domain_alloc_named_fwnode("PCI-MSI");
-	if (!fn)
-		return NULL;
-
-	d = pci_msi_create_irq_domain(fn, &pci_msi_domain_info,
-				      x86_vector_domain);
-	if (!d) {
-		irq_domain_free_fwnode(fn);
-		pr_warn("Failed to initialize PCI-MSI irqdomain.\n");
-	}
-	return d;
+	x86_vector_domain->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
+	x86_vector_domain->msi_parent_ops = &x86_vector_msi_parent_ops;
+	return x86_vector_domain;
 }
 
 void __init x86_create_pci_msi_domain(void)
@@ -210,7 +274,25 @@ void __init x86_create_pci_msi_domain(vo
 	x86_pci_msi_default_domain = x86_init.irqs.create_pci_msi_domain();
 }
 
+/* Keep around for hyperV and the remap code below */
+int pci_msi_prepare(struct irq_domain *domain, struct device *dev, int nvec,
+		    msi_alloc_info_t *arg)
+{
+	init_irq_alloc_info(arg, NULL);
+
+	if (to_pci_dev(dev)->msix_enabled)
+		arg->type = X86_IRQ_ALLOC_TYPE_PCI_MSIX;
+	else
+		arg->type = X86_IRQ_ALLOC_TYPE_PCI_MSI;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pci_msi_prepare);
+
 #ifdef CONFIG_IRQ_REMAP
+static struct msi_domain_ops pci_msi_domain_ops = {
+	.msi_prepare	= pci_msi_prepare,
+};
+
 static struct irq_chip pci_msi_ir_controller = {
 	.name			= "IR-PCI-MSI",
 	.irq_unmask		= pci_msi_unmask_irq,
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -812,7 +812,7 @@ static void
 amd_iommu_set_pci_msi_domain(struct device *dev, struct amd_iommu *iommu)
 {
 	if (!irq_remapping_enabled || !dev_is_pci(dev) ||
-	    pci_dev_has_special_msi_domain(to_pci_dev(dev)))
+	    !pci_dev_has_default_msi_parent_domain(to_pci_dev(dev)))
 		return;
 
 	dev_set_msi_domain(dev, iommu->msi_domain);
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1107,7 +1107,7 @@ static int reenable_irq_remapping(int ei
  */
 void intel_irq_remap_add_device(struct dmar_pci_notify_info *info)
 {
-	if (!irq_remapping_enabled || pci_dev_has_special_msi_domain(info->dev))
+	if (!irq_remapping_enabled || !pci_dev_has_default_msi_parent_domain(info->dev))
 		return;
 
 	dev_set_msi_domain(&info->dev->dev, map_dev_to_ir(info->dev));


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 14/33] PCI/MSI: Remove unused pci_dev_has_special_msi_domain()
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (12 preceding siblings ...)
  2022-11-11 13:58 ` [patch 13/33] x86/apic/vector: Provide MSI parent domain Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 20:13   ` Bjorn Helgaas
  2022-11-11 13:58 ` [patch 15/33] iommu/vt-d: Switch to MSI parent domains Thomas Gleixner
                   ` (18 subsequent siblings)
  32 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

The check for special MSI domains like VMD, which prevents the interrupt
remapping code from overwriting device::msi::domain, is no longer required
and has been replaced by an x86 specific version which is aware of MSI
parent domains.

Remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |   21 ---------------------
 include/linux/msi.h         |    2 --
 2 files changed, 23 deletions(-)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -414,24 +414,3 @@ struct irq_domain *pci_msi_get_device_do
 					     DOMAIN_BUS_PCI_MSI);
 	return dom;
 }
-
-/**
- * pci_dev_has_special_msi_domain - Check whether the device is handled by
- *				    a non-standard PCI-MSI domain
- * @pdev:	The PCI device to check.
- *
- * Returns: True if the device irqdomain or the bus irqdomain is
- * non-standard PCI/MSI.
- */
-bool pci_dev_has_special_msi_domain(struct pci_dev *pdev)
-{
-	struct irq_domain *dom = dev_get_msi_domain(&pdev->dev);
-
-	if (!dom)
-		dom = dev_get_msi_domain(&pdev->bus->dev);
-
-	if (!dom)
-		return true;
-
-	return dom->bus_token != DOMAIN_BUS_PCI_MSI;
-}
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -599,8 +599,6 @@ struct irq_domain *pci_msi_create_irq_do
 					     struct irq_domain *parent);
 u32 pci_msi_domain_get_msi_rid(struct irq_domain *domain, struct pci_dev *pdev);
 struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev);
-bool pci_dev_has_special_msi_domain(struct pci_dev *pdev);
-
 #endif /* CONFIG_GENERIC_MSI_IRQ */
 
 #endif /* LINUX_MSI_H */


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 15/33] iommu/vt-d: Switch to MSI parent domains
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (13 preceding siblings ...)
  2022-11-11 13:58 ` [patch 14/33] PCI/MSI: Remove unused pci_dev_has_special_msi_domain() Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:58 ` [patch 16/33] iommu/amd: Switch to MSI base domains Thomas Gleixner
                   ` (17 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Remove the global PCI/MSI irqdomain implementation and provide the required
MSI parent ops so the PCI/MSI code can detect the new parent and set up per
device domains.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/msi.c          |    2 ++
 drivers/iommu/intel/iommu.h         |    1 -
 drivers/iommu/intel/irq_remapping.c |   27 ++++++++++++---------------
 include/linux/irqdomain_defs.h      |    1 +
 4 files changed, 15 insertions(+), 16 deletions(-)

--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -217,6 +217,8 @@ static bool x86_init_dev_msi_info(struct
 		/* See msi_set_affinity() for the gory details */
 		info->flags |= MSI_FLAG_NOMASK_QUIRK;
 		break;
+	case DOMAIN_BUS_DMAR:
+		break;
 	default:
 		WARN_ON_ONCE(1);
 		return false;
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -600,7 +600,6 @@ struct intel_iommu {
 #ifdef CONFIG_IRQ_REMAP
 	struct ir_table *ir_table;	/* Interrupt remapping info */
 	struct irq_domain *ir_domain;
-	struct irq_domain *ir_msi_domain;
 #endif
 	struct iommu_device iommu;  /* IOMMU core code handle */
 	int		node;
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -82,6 +82,7 @@ static const struct irq_domain_ops intel
 
 static void iommu_disable_irq_remapping(struct intel_iommu *iommu);
 static int __init parse_ioapics_under_ir(void);
+static const struct msi_parent_ops dmar_msi_parent_ops;
 
 static bool ir_pre_enabled(struct intel_iommu *iommu)
 {
@@ -230,7 +231,7 @@ static struct irq_domain *map_dev_to_ir(
 {
 	struct dmar_drhd_unit *drhd = dmar_find_matched_drhd_unit(dev);
 
-	return drhd ? drhd->iommu->ir_msi_domain : NULL;
+	return drhd ? drhd->iommu->ir_domain : NULL;
 }
 
 static int clear_entries(struct irq_2_iommu *irq_iommu)
@@ -573,10 +574,10 @@ static int intel_setup_irq_remapping(str
 		pr_err("IR%d: failed to allocate irqdomain\n", iommu->seq_id);
 		goto out_free_fwnode;
 	}
-	iommu->ir_msi_domain =
-		arch_create_remap_msi_irq_domain(iommu->ir_domain,
-						 "INTEL-IR-MSI",
-						 iommu->seq_id);
+
+	irq_domain_update_bus_token(iommu->ir_domain,  DOMAIN_BUS_DMAR);
+	iommu->ir_domain->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
+	iommu->ir_domain->msi_parent_ops = &dmar_msi_parent_ops;
 
 	ir_table->base = page_address(pages);
 	ir_table->bitmap = bitmap;
@@ -620,9 +621,6 @@ static int intel_setup_irq_remapping(str
 	return 0;
 
 out_free_ir_domain:
-	if (iommu->ir_msi_domain)
-		irq_domain_remove(iommu->ir_msi_domain);
-	iommu->ir_msi_domain = NULL;
 	irq_domain_remove(iommu->ir_domain);
 	iommu->ir_domain = NULL;
 out_free_fwnode:
@@ -644,13 +642,6 @@ static void intel_teardown_irq_remapping
 	struct fwnode_handle *fn;
 
 	if (iommu && iommu->ir_table) {
-		if (iommu->ir_msi_domain) {
-			fn = iommu->ir_msi_domain->fwnode;
-
-			irq_domain_remove(iommu->ir_msi_domain);
-			irq_domain_free_fwnode(fn);
-			iommu->ir_msi_domain = NULL;
-		}
 		if (iommu->ir_domain) {
 			fn = iommu->ir_domain->fwnode;
 
@@ -1437,6 +1428,12 @@ static const struct irq_domain_ops intel
 	.deactivate = intel_irq_remapping_deactivate,
 };
 
+static const struct msi_parent_ops dmar_msi_parent_ops = {
+	.supported_flags	= X86_VECTOR_MSI_FLAGS_SUPPORTED | MSI_FLAG_MULTI_PCI_MSI,
+	.prefix			= "IR-",
+	.init_dev_msi_info	= msi_parent_init_dev_msi_info,
+};
+
 /*
  * Support of Interrupt Remapping Unit Hotplug
  */
--- a/include/linux/irqdomain_defs.h
+++ b/include/linux/irqdomain_defs.h
@@ -23,6 +23,7 @@ enum irq_domain_bus_token {
 	DOMAIN_BUS_VMD_MSI,
 	DOMAIN_BUS_PCI_DEVICE_MSI,
 	DOMAIN_BUS_PCI_DEVICE_MSIX,
+	DOMAIN_BUS_DMAR,
 };
 
 #endif /* _LINUX_IRQDOMAIN_DEFS_H */


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 16/33] iommu/amd: Switch to MSI base domains
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (14 preceding siblings ...)
  2022-11-11 13:58 ` [patch 15/33] iommu/vt-d: Switch to MSI parent domains Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:58 ` [patch 17/33] x86/apic/msi: Remove arch_create_remap_msi_irq_domain() Thomas Gleixner
                   ` (16 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Remove the global PCI/MSI irqdomain implementation and provide the required
MSI parent ops so the PCI/MSI code can detect the new parent and set up per
device domains.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/msi.c          |    1 +
 drivers/iommu/amd/amd_iommu_types.h |    1 -
 drivers/iommu/amd/iommu.c           |   19 +++++++++++++------
 include/linux/irqdomain_defs.h      |    1 +
 4 files changed, 15 insertions(+), 7 deletions(-)

--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -218,6 +218,7 @@ static bool x86_init_dev_msi_info(struct
 		info->flags |= MSI_FLAG_NOMASK_QUIRK;
 		break;
 	case DOMAIN_BUS_DMAR:
+	case DOMAIN_BUS_AMDVI:
 		break;
 	default:
 		WARN_ON_ONCE(1);
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -734,7 +734,6 @@ struct amd_iommu {
 	u8 max_counters;
 #ifdef CONFIG_IRQ_REMAP
 	struct irq_domain *ir_domain;
-	struct irq_domain *msi_domain;
 
 	struct amd_irte_ops *irte_ops;
 #endif
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -815,7 +815,7 @@ amd_iommu_set_pci_msi_domain(struct devi
 	    !pci_dev_has_default_msi_parent_domain(to_pci_dev(dev)))
 		return;
 
-	dev_set_msi_domain(dev, iommu->msi_domain);
+	dev_set_msi_domain(dev, iommu->ir_domain);
 }
 
 #else /* CONFIG_IRQ_REMAP */
@@ -3648,6 +3648,12 @@ static struct irq_chip amd_ir_chip = {
 	.irq_compose_msi_msg	= ir_compose_msi_msg,
 };
 
+static const struct msi_parent_ops amdvi_msi_parent_ops = {
+	.supported_flags	= X86_VECTOR_MSI_FLAGS_SUPPORTED | MSI_FLAG_MULTI_PCI_MSI,
+	.prefix			= "IR-",
+	.init_dev_msi_info	= msi_parent_init_dev_msi_info,
+};
+
 int amd_iommu_create_irq_domain(struct amd_iommu *iommu)
 {
 	struct fwnode_handle *fn;
@@ -3655,16 +3661,17 @@ int amd_iommu_create_irq_domain(struct a
 	fn = irq_domain_alloc_named_id_fwnode("AMD-IR", iommu->index);
 	if (!fn)
 		return -ENOMEM;
-	iommu->ir_domain = irq_domain_create_tree(fn, &amd_ir_domain_ops, iommu);
+	iommu->ir_domain = irq_domain_create_hierarchy(arch_get_ir_parent_domain(), 0, 0,
+						       fn, &amd_ir_domain_ops, iommu);
 	if (!iommu->ir_domain) {
 		irq_domain_free_fwnode(fn);
 		return -ENOMEM;
 	}
 
-	iommu->ir_domain->parent = arch_get_ir_parent_domain();
-	iommu->msi_domain = arch_create_remap_msi_irq_domain(iommu->ir_domain,
-							     "AMD-IR-MSI",
-							     iommu->index);
+	irq_domain_update_bus_token(iommu->ir_domain,  DOMAIN_BUS_AMDVI);
+	iommu->ir_domain->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
+	iommu->ir_domain->msi_parent_ops = &amdvi_msi_parent_ops;
+
 	return 0;
 }
 
--- a/include/linux/irqdomain_defs.h
+++ b/include/linux/irqdomain_defs.h
@@ -24,6 +24,7 @@ enum irq_domain_bus_token {
 	DOMAIN_BUS_PCI_DEVICE_MSI,
 	DOMAIN_BUS_PCI_DEVICE_MSIX,
 	DOMAIN_BUS_DMAR,
+	DOMAIN_BUS_AMDVI,
 };
 
 #endif /* _LINUX_IRQDOMAIN_DEFS_H */


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 17/33] x86/apic/msi: Remove arch_create_remap_msi_irq_domain()
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (15 preceding siblings ...)
  2022-11-11 13:58 ` [patch 16/33] iommu/amd: Switch to MSI base domains Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:58 ` [patch 18/33] genirq/msi: Provide struct msi_map Thomas Gleixner
                   ` (15 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

and related code which is no longer required now that the interrupt remap
code has been converted to MSI parent domains.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/irq_remapping.h |    4 ---
 arch/x86/kernel/apic/msi.c           |   42 -----------------------------------
 2 files changed, 1 insertion(+), 45 deletions(-)

--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -44,10 +44,6 @@ extern int irq_remapping_reenable(int);
 extern int irq_remap_enable_fault_handling(void);
 extern void panic_if_irq_remap(const char *msg);
 
-/* Create PCI MSI/MSIx irqdomain, use @parent as the parent irqdomain. */
-extern struct irq_domain *
-arch_create_remap_msi_irq_domain(struct irq_domain *par, const char *n, int id);
-
 /* Get parent irqdomain for interrupt remapping irqdomain */
 static inline struct irq_domain *arch_get_ir_parent_domain(void)
 {
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -277,7 +277,7 @@ void __init x86_create_pci_msi_domain(vo
 	x86_pci_msi_default_domain = x86_init.irqs.create_pci_msi_domain();
 }
 
-/* Keep around for hyperV and the remap code below */
+/* Keep around for hyperV */
 int pci_msi_prepare(struct irq_domain *domain, struct device *dev, int nvec,
 		    msi_alloc_info_t *arg)
 {
@@ -291,46 +291,6 @@ int pci_msi_prepare(struct irq_domain *d
 }
 EXPORT_SYMBOL_GPL(pci_msi_prepare);
 
-#ifdef CONFIG_IRQ_REMAP
-static struct msi_domain_ops pci_msi_domain_ops = {
-	.msi_prepare	= pci_msi_prepare,
-};
-
-static struct irq_chip pci_msi_ir_controller = {
-	.name			= "IR-PCI-MSI",
-	.irq_unmask		= pci_msi_unmask_irq,
-	.irq_mask		= pci_msi_mask_irq,
-	.irq_ack		= irq_chip_ack_parent,
-	.irq_retrigger		= irq_chip_retrigger_hierarchy,
-	.flags			= IRQCHIP_SKIP_SET_WAKE |
-				  IRQCHIP_AFFINITY_PRE_STARTUP,
-};
-
-static struct msi_domain_info pci_msi_ir_domain_info = {
-	.flags		= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
-			  MSI_FLAG_MULTI_PCI_MSI | MSI_FLAG_PCI_MSIX,
-	.ops		= &pci_msi_domain_ops,
-	.chip		= &pci_msi_ir_controller,
-	.handler	= handle_edge_irq,
-	.handler_name	= "edge",
-};
-
-struct irq_domain *arch_create_remap_msi_irq_domain(struct irq_domain *parent,
-						    const char *name, int id)
-{
-	struct fwnode_handle *fn;
-	struct irq_domain *d;
-
-	fn = irq_domain_alloc_named_id_fwnode(name, id);
-	if (!fn)
-		return NULL;
-	d = pci_msi_create_irq_domain(fn, &pci_msi_ir_domain_info, parent);
-	if (!d)
-		irq_domain_free_fwnode(fn);
-	return d;
-}
-#endif
-
 #ifdef CONFIG_DMAR_TABLE
 /*
  * The Intel IOMMU (ab)uses the high bits of the MSI address to contain the


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 18/33] genirq/msi: Provide struct msi_map
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (16 preceding siblings ...)
  2022-11-11 13:58 ` [patch 17/33] x86/apic/msi: Remove arch_create_remap_msi_irq_domain() Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:58 ` [patch 19/33] genirq/msi: Provide msi_desc::msi_data Thomas Gleixner
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

A simple struct to hold a MSI index / Linux interrupt number pair. It will
be returned from the dynamic vector allocation function and handed back to
the corresponding free() function.
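
A minimal sketch of the resulting error convention (illustrative only; the
allocation call site is hypothetical, the function itself comes later in
this series):

  struct msi_map map = msi_allocation_function(...);	/* hypothetical call */

  if (map.index < 0)
          return map.index;	/* failure: index holds the error, virq is 0 */

  /* success: map.index is the MSI index, map.virq the Linux interrupt number */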

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi_api.h |   13 +++++++++++++
 1 file changed, 13 insertions(+)

--- a/include/linux/msi_api.h
+++ b/include/linux/msi_api.h
@@ -18,6 +18,19 @@ enum msi_domain_ids {
 	MSI_MAX_DEVICE_IRQDOMAINS,
 };
 
+/**
+ * msi_map - Mapping between MSI index and Linux interrupt number
+ * @index:	The MSI index, e.g. slot in the MSI-X table or
+ *		a software managed index if >= 0. If negative
+ *		the allocation function failed and it contains
+ *		the error code.
+ * @virq:	The associated Linux interrupt number
+ */
+struct msi_map {
+	int	index;
+	int	virq;
+};
+
 unsigned int msi_domain_get_virq(struct device *dev, unsigned int domid, unsigned int index);
 
 /**


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 19/33] genirq/msi: Provide msi_desc::msi_data
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (17 preceding siblings ...)
  2022-11-11 13:58 ` [patch 18/33] genirq/msi: Provide struct msi_map Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 19:28   ` Jason Gunthorpe
  2022-11-11 13:58 ` [patch 20/33] genirq/msi: Provide msi_domain_ops::prepare_desc() Thomas Gleixner
                   ` (13 subsequent siblings)
  32 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

The upcoming support for PCI/IMS requires storing some information related
to the message handling in the MSI descriptor, e.g. a PASID or a pointer to
a queue.

Provide a generic storage struct which maps over the existing PCI specific
storage which means the size of struct msi_desc is not getting bigger.

It contains an iomem pointer for device memory based IMS and a union of a
u64 and a void pointer which allows the device specific IMS implementations
to store the necessary information.

The iomem pointer is set up by the domain allocation functions.

The data union msi_dev_cookie is going to be handed in when allocating an
interrupt on an IMS domain so the irq chip callbacks of the IMS domain have
the necessary per vector information available. It also comes in handy when
cleaning up the platform MSI code for wire to MSI bridges which need to
hand down the type information to the underlying interrupt domain.

For the core code the cookie is opaque and meaningless. It is just stored
during an allocation through the upcoming interfaces for IMS and wire to
MSI bridges.
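
A hypothetical IMS irq chip callback can then pick the cookie straight out
of the descriptor; a minimal sketch, where struct my_queue and
my_queue_write_msg() are made up for illustration:

  static void my_ims_write_msg(struct irq_data *data, struct msi_msg *msg)
  {
          struct msi_desc *desc = irq_data_get_msi_desc(data);
          struct my_queue *q = desc->data.cookie.ptr;

          /* Store the message in the queue specific slot in host memory */
          my_queue_write_msg(q, msg);
  }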

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h     |   19 ++++++++++++++++++-
 include/linux/msi_api.h |   17 +++++++++++++++++
 2 files changed, 35 insertions(+), 1 deletion(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -121,6 +121,19 @@ struct pci_msi_desc {
 	};
 };
 
+/**
+ * struct msi_desc_data - Generic MSI descriptor data
+ * @iobase:     Pointer to the IOMEM base address for interrupt callbacks
+ * @cookie:	Device cookie provided at allocation time
+ *
+ * The content of this data is implementation defined, e.g. PCI/IMS
+ * implementations will define the meaning of the data.
+ */
+struct msi_desc_data {
+	void			__iomem *iobase;
+	union msi_dev_cookie	cookie;
+};
+
 #define MSI_MAX_INDEX		((unsigned int)USHRT_MAX)
 
 /**
@@ -138,6 +151,7 @@ struct pci_msi_desc {
  *
  * @msi_index:	Index of the msi descriptor
  * @pci:	PCI specific msi descriptor data
+ * @data:	Generic MSI descriptor data
  */
 struct msi_desc {
 	/* Shared device/bus type independent data */
@@ -157,7 +171,10 @@ struct msi_desc {
 	void *write_msi_msg_data;
 
 	u16				msi_index;
-	struct pci_msi_desc		pci;
+	union {
+		struct pci_msi_desc	pci;
+		struct msi_desc_data	data;
+	};
 };
 
 /*
--- a/include/linux/msi_api.h
+++ b/include/linux/msi_api.h
@@ -19,6 +19,23 @@ enum msi_domain_ids {
 };
 
 /**
+ * union msi_dev_cookie - MSI device cookie
+ * @value:	u64 value store
+ * @ptr:	Pointer
+ *
+ * This data is handed to the IMS allocation function and stored
+ * in the MSI descriptor for the interrupt chip callbacks.
+ *
+ * The content of this data is implementation defined, e.g. PCI/IMS
+ * implementations will define the meaning of the data, e.g. PASID or a
+ * pointer to queue memory.
+ */
+union msi_dev_cookie {
+	u64	value;
+	void	*ptr;
+};
+
+/**
  * msi_map - Mapping between MSI index and Linux interrupt number
  * @index:	The MSI index, e.g. slot in the MSI-X table or
  *		a software managed index if >= 0. If negative


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 20/33] genirq/msi: Provide msi_domain_ops::prepare_desc()
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (18 preceding siblings ...)
  2022-11-11 13:58 ` [patch 19/33] genirq/msi: Provide msi_desc::msi_data Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:58 ` [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at() Thomas Gleixner
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

The existing MSI domain ops msi_prepare() and set_desc() turned out to be
unsuitable for implementing IMS support.

msi_prepare() does not operate on the MSI descriptors. set_desc() lacks
an irq_domain pointer and has a completely different purpose.

Introduce a prepare_desc() op which allows IMS implementations to amend an
MSI descriptor which was allocated by the core code, e.g. by adjusting the
iomem base or adding some data based on the allocated index. This is way
better than requiring that all IMS domain implementations preallocate the
MSI descriptor and then allocate the interrupt.
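
A sketch of how a device memory based IMS domain might use the new
callback, building on msi_desc_data from the previous patch;
my_ims_from_domain() and MY_IMS_SLOT_SIZE are hypothetical:

  static void my_ims_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
                                  struct msi_desc *desc)
  {
          struct my_ims_slots *slots = my_ims_from_domain(domain);

          /* Point the descriptor at the device memory slot for this index */
          desc->data.iobase = slots->base + desc->msi_index * MY_IMS_SLOT_SIZE;
  }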

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |    6 +++++-
 kernel/irq/msi.c    |    3 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -379,6 +379,8 @@ struct msi_domain_info;
  * @msi_init:		Domain specific init function for MSI interrupts
  * @msi_free:		Domain specific function to free a MSI interrupts
  * @msi_prepare:	Prepare the allocation of the interrupts in the domain
+ * @prepare_desc:	Optional function to prepare the allocated MSI descriptor
+ *			in the domain
  * @set_desc:		Set the msi descriptor for an interrupt
  * @domain_alloc_irqs:	Optional function to override the default allocation
  *			function.
@@ -390,7 +392,7 @@ struct msi_domain_info;
  * @get_hwirq, @msi_init and @msi_free are callbacks used by the underlying
  * irqdomain.
  *
- * @msi_check, @msi_prepare and @set_desc are callbacks used by the
+ * @msi_check, @msi_prepare, @prepare_desc and @set_desc are callbacks used by the
  * msi_domain_alloc/free_irqs*() variants.
  *
  * @domain_alloc_irqs, @domain_free_irqs can be used to override the
@@ -413,6 +415,8 @@ struct msi_domain_ops {
 	int		(*msi_prepare)(struct irq_domain *domain,
 				       struct device *dev, int nvec,
 				       msi_alloc_info_t *arg);
+	void		(*prepare_desc)(struct irq_domain *domain, msi_alloc_info_t *arg,
+					struct msi_desc *desc);
 	void		(*set_desc)(msi_alloc_info_t *arg,
 				    struct msi_desc *desc);
 	int		(*domain_alloc_irqs)(struct irq_domain *domain,
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -1275,6 +1275,9 @@ static int __msi_domain_alloc_irqs(struc
 		if (WARN_ON_ONCE(allocated >= ctrl->nirqs))
 			return -EINVAL;
 
+		if (ops->prepare_desc)
+			ops->prepare_desc(domain, &arg, desc);
+
 		ops->set_desc(&arg, desc);
 
 		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at()
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (19 preceding siblings ...)
  2022-11-11 13:58 ` [patch 20/33] genirq/msi: Provide msi_domain_ops::prepare_desc() Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 19:36   ` Jason Gunthorpe
  2022-11-17 23:33   ` Reinette Chatre
  2022-11-11 13:58 ` [patch 22/33] genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN Thomas Gleixner
                   ` (11 subsequent siblings)
  32 siblings, 2 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

For supporting post MSI-X enable allocations and for the upcoming PCI/IMS
support a separate interface is required which allows not only the
allocation of a specific index, but also the allocation of any, i.e. the
next free index. The latter is especially required for IMS because IMS
completely does away with the index to functionality mappings which are
often found in MSI/MSI-X implementations.

But even with MSI-X there are devices where only the first few indices have
a fixed functionality and the rest is freely assignable by software,
e.g. to queues.

msi_domain_alloc_irq_at() is also different from the range based interfaces
as it always enforces that the MSI descriptor is allocated by the core code
and not preallocated by the caller like the PCI/MSI[-X] enable code path
does.

msi_domain_alloc_irq_at() can be invoked with the index argument set to
MSI_ANY_INDEX which makes the core code pick the next free index. The irq
domain can provide a prepare_desc() operation callback in its
msi_domain_ops to do domain specific post allocation initialization before
the actual Linux interrupt and the associated interrupt descriptor and
hierarchy allocations are conducted.

The function also takes an optional @cookie argument which is of type union
msi_dev_cookie. This cookie is not used by the core code and is stored in
the allocated msi_desc::data::cookie. The meaning of the cookie is
completely implementation defined. In case of IMS this might be a PASID or
a pointer to a device queue, but for the MSI core it's opaque and not used
in any way.

The function returns a struct msi_map which on success contains the
allocated index number and the Linux interrupt number so the caller can
spare the index to Linux interrupt number lookup.

On failure map::index contains the error code and map::virq is 0.
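
A usage sketch (hypothetical driver code; MY_IMS_DOMID and the queue
pointer are assumptions and not part of this patch):

  union msi_dev_cookie cookie = { .ptr = queue };
  struct msi_map map;

  map = msi_domain_alloc_irq_at(dev, MY_IMS_DOMID, MSI_ANY_INDEX, NULL, &cookie);
  if (map.index < 0)
          return map.index;	/* allocation failed, index holds the error */

  /* map.virq can be handed to request_irq(); the map is handed back on free */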

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h     |    4 +
 include/linux/msi_api.h |    7 +++
 kernel/irq/msi.c        |  105 ++++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 105 insertions(+), 11 deletions(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -80,6 +80,7 @@ struct pci_dev;
 struct platform_msi_priv_data;
 struct device_attribute;
 struct irq_domain;
+struct irq_affinity_desc;
 
 void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
 void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg);
@@ -567,6 +568,9 @@ int msi_domain_alloc_irqs_range(struct d
 				unsigned int first, unsigned int last);
 int msi_domain_alloc_irqs_all_locked(struct device *dev, unsigned int domid, int nirqs);
 
+struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, unsigned int index,
+				       const struct irq_affinity_desc *affdesc,
+				       union msi_dev_cookie *cookie);
 
 void msi_domain_free_irqs_range_locked(struct device *dev, unsigned int domid,
 				       unsigned int first, unsigned int last);
--- a/include/linux/msi_api.h
+++ b/include/linux/msi_api.h
@@ -48,6 +48,13 @@ struct msi_map {
 	int	virq;
 };
 
+/*
+ * Constant to be used for dynamic allocations when the allocation
+ * is any free MSI index (entry in the MSI-X table or a software
+ * managed index).
+ */
+#define MSI_ANY_INDEX		UINT_MAX
+
 unsigned int msi_domain_get_virq(struct device *dev, unsigned int domid, unsigned int index);
 
 /**
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -39,6 +39,7 @@ static inline int msi_sysfs_create_group
 /* Invalid XA index which is outside of any searchable range */
 #define MSI_XA_MAX_INDEX	(ULONG_MAX - 1)
 #define MSI_XA_DOMAIN_SIZE	(MSI_MAX_INDEX + 1)
+#define MSI_ANY_INDEX		UINT_MAX
 
 static inline void msi_setup_default_irqdomain(struct device *dev, struct msi_device_data *md)
 {
@@ -126,18 +127,34 @@ static int msi_insert_desc(struct device
 	}
 
 	hwsize = msi_domain_get_hwsize(dev, domid);
-	if (index >= hwsize) {
-		ret = -ERANGE;
-		goto fail;
-	}
 
-	desc->msi_index = index;
-	index += baseidx;
-	ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
-	if (ret)
-		goto fail;
-	return 0;
+	if (index == MSI_ANY_INDEX) {
+		struct xa_limit limit;
+		unsigned int index;
+
+		limit.min = baseidx;
+		limit.max = baseidx + hwsize - 1;
 
+		/* Let the xarray allocate a free index within the limits */
+		ret = xa_alloc(&md->__store, &index, desc, limit, GFP_KERNEL);
+		if (ret)
+			goto fail;
+
+		desc->msi_index = index;
+		return 0;
+	} else {
+		if (index >= hwsize) {
+			ret = -ERANGE;
+			goto fail;
+		}
+
+		desc->msi_index = index;
+		index += baseidx;
+		ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
+		if (ret)
+			goto fail;
+		return 0;
+	}
 fail:
 	msi_free_desc(desc);
 	return ret;
@@ -335,7 +352,7 @@ int msi_setup_device_data(struct device
 
 	msi_setup_default_irqdomain(dev, md);
 
-	xa_init(&md->__store);
+	xa_init_flags(&md->__store, XA_FLAGS_ALLOC);
 	mutex_init(&md->mutex);
 	md->__iter_idx = MSI_XA_MAX_INDEX;
 	dev->msi.data = md;
@@ -1423,6 +1440,72 @@ int msi_domain_alloc_irqs_all_locked(str
 	return msi_domain_alloc_locked(dev, &ctrl);
 }
 
+/**
+ * msi_domain_alloc_irq_at - Allocate an interrupt from a MSI interrupt domain at
+ *			     a given index - or at the next free index
+ *
+ * @dev:	Pointer to device struct of the device for which the interrupts
+ *		are allocated
+ * @domid:	Id of the interrupt domain to operate on
+ * @index:	Index for allocation. If @index == %MSI_ANY_INDEX the allocation
+ *		uses the next free index.
+ * @affdesc:	Optional pointer to an interrupt affinity descriptor structure
+ * @cookie:	Optional pointer to a descriptor specific cookie to be stored
+ *		in msi_desc::data. Must be NULL for MSI-X allocations
+ *
+ * This requires a MSI interrupt domain which lets the core code manage the
+ * MSI descriptors.
+ *
+ * Return: struct msi_map
+ *
+ *	On success msi_map::index contains the allocated index number and
+ *	msi_map::virq the corresponding Linux interrupt number
+ *
+ *	On failure msi_map::index contains the error code and msi_map::virq
+ *	is %0.
+ */
+struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, unsigned int index,
+				       const struct irq_affinity_desc *affdesc,
+				       union msi_dev_cookie *cookie)
+{
+	struct irq_domain *domain;
+	struct msi_map map = { };
+	struct msi_desc *desc;
+	int ret;
+
+	msi_lock_descs(dev);
+	domain = msi_get_device_domain(dev, domid);
+	if (!domain) {
+		map.index = -ENODEV;
+		goto unlock;
+	}
+
+	desc = msi_alloc_desc(dev, 1, affdesc);
+	if (!desc) {
+		map.index = -ENOMEM;
+		goto unlock;
+	}
+
+	if (cookie)
+		desc->data.cookie = *cookie;
+
+	ret = msi_insert_desc(dev, desc, domid, index);
+	if (ret) {
+		map.index = ret;
+		goto unlock;
+	}
+
+	map.index = desc->msi_index;
+	ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index);
+	if (ret)
+		map.index = ret;
+	else
+		map.virq = desc->irq;
+unlock:
+	msi_unlock_descs(dev);
+	return map;
+}
+
 static void __msi_domain_free_irqs(struct device *dev, struct irq_domain *domain,
 				   struct msi_ctrl *ctrl)
 {


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 22/33] genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (20 preceding siblings ...)
  2022-11-11 13:58 ` [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at() Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 19:36   ` Jason Gunthorpe
  2022-11-11 13:58 ` [patch 23/33] PCI/MSI: Split MSIX descriptor setup Thomas Gleixner
                   ` (10 subsequent siblings)
  32 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Provide a new MSI feature flag in preparation for dynamic MSI-X allocation
after the initial MSI-X enable has been done.

This needs to be an explicit MSI interrupt domain feature because quite
some implementations (both interrupt domains and legacy allocation mode)
have clear expectations that the allocation code is only invoked when MSI-X
is about to be enabled. They either talk to hypervisors or do some other
work and are not prepared to be invoked on an already MSI-X enabled device.

This is also explicit MSI-X only because rewriting the size of the MSI
entries is only possible when disabling MSI which in turn might cause lost
interrupts on the device.
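
For illustration only (an assumption about how the flag is meant to be
consumed, not part of this patch), a parent domain would opt in by adding
the flag to its supported mask so that per device MSI-X domains created on
top of it can keep it:

  static const struct msi_parent_ops my_msi_parent_ops = {
          .supported_flags   = X86_VECTOR_MSI_FLAGS_SUPPORTED |
                               MSI_FLAG_PCI_MSIX_ALLOC_DYN,
          .init_dev_msi_info = msi_parent_init_dev_msi_info,
  };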

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/msi.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -523,7 +523,8 @@ enum {
 	MSI_FLAG_LEVEL_CAPABLE		= (1 << 18),
 	/* MSI-X entries must be contiguous */
 	MSI_FLAG_MSIX_CONTIGUOUS	= (1 << 19),
-
+	/* PCI/MSI-X vectors can be dynamically allocated/freed post MSI-X enable */
+	MSI_FLAG_PCI_MSIX_ALLOC_DYN	= (1 << 20),
 };
 
 /**


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 23/33] PCI/MSI: Split MSIX descriptor setup
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (21 preceding siblings ...)
  2022-11-11 13:58 ` [patch 22/33] genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 20:13   ` Bjorn Helgaas
  2022-11-11 13:58 ` [patch 24/33] PCI/MSI: Provide prepare_desc() MSI domain op Thomas Gleixner
                   ` (9 subsequent siblings)
  32 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

The upcoming mechanism to allocate MSI-X vectors after enabling MSI-X needs
to share some of the MSI-X descriptor setup.

The regular descriptor setup on enable has the following code flow:

    1) Allocate descriptor
    2) Setup descriptor with PCI specific data
    3) Insert descriptor
    4) Allocate interrupts which in turn scans the inserted
       descriptors

This cannot be easily changed because the PCI/MSI code needs to handle the
legacy architecture specific allocation model and the irq domain model
where quite some domains have the assumption that the above flow is how it
works.

Ideally the code flow should look like this:

   1) Invoke allocation at the MSI core
   2) MSI core allocates descriptor
   3) MSI core calls back into the irq domain which fills in
      the domain specific parts

This could be done for underlying parent MSI domains which support
post-enable allocation/free, but that would create significantly different
code paths for MSI/MSI-X enable.

Though for dynamic allocation, which wants to share the allocation code with
the upcoming PCI/IMS support, it's the right thing to do.

Split the MSI-X descriptor setup into the preallocation part, which just sets
the index and fills in the horrible hack of virtual IRQs, and the real PCI
specific MSI-X setup part, which solely depends on the index in the
descriptor. This allows providing a common dynamic allocation interface at
the MSI core level for both PCI/MSI-X and PCI/IMS.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/msi.c |   72 +++++++++++++++++++++++++++++++-------------------
 drivers/pci/msi/msi.h |    2 +
 2 files changed, 47 insertions(+), 27 deletions(-)

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -569,34 +569,56 @@ static void __iomem *msix_map_region(str
 	return ioremap(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
 }
 
-static int msix_setup_msi_descs(struct pci_dev *dev, void __iomem *base,
-				struct msix_entry *entries, int nvec,
-				struct irq_affinity_desc *masks)
+/**
+ * msix_prepare_msi_desc - Prepare a half initialized MSI descriptor for operation
+ * @dev:	The PCI device for which the descriptor is prepared
+ * @desc:	The MSI descriptor for preparation
+ *
+ * This is separate from msix_setup_msi_descs() below to handle dynamic
+ * allocations for MSI-X after initial enablement.
+ *
+ * Ideally the whole MSI-X setup would work that way, but there is no way to
+ * support this for the legacy arch_setup_msi_irqs() mechanism and for the
+ * fake irq domains like the x86 XEN one. Sigh...
+ *
+ * The descriptor is zeroed and only @desc::msi_index and @desc::affinity
+ * are set. When called from msix_setup_msi_descs() then the is_virtual
+ * attribute is initialized as well.
+ *
+ * Fill in the rest.
+ */
+void msix_prepare_msi_desc(struct pci_dev *dev, struct msi_desc *desc)
+{
+	desc->nvec_used				= 1;
+	desc->pci.msi_attrib.is_msix		= 1;
+	desc->pci.msi_attrib.is_64		= 1;
+	desc->pci.msi_attrib.default_irq	= dev->irq;
+	desc->pci.mask_base			= dev->msix_base;
+	desc->pci.msi_attrib.can_mask		= !pci_msi_ignore_mask &&
+						  !desc->pci.msi_attrib.is_virtual;
+
+	if (desc->pci.msi_attrib.can_mask) {
+		void __iomem *addr = pci_msix_desc_addr(desc);
+
+		desc->pci.msix_ctrl = readl(addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
+	}
+}
+
+static int msix_setup_msi_descs(struct pci_dev *dev, struct msix_entry *entries,
+				int nvec, struct irq_affinity_desc *masks)
 {
 	int ret = 0, i, vec_count = pci_msix_vec_count(dev);
 	struct irq_affinity_desc *curmsk;
 	struct msi_desc desc;
-	void __iomem *addr;
 
 	memset(&desc, 0, sizeof(desc));
 
-	desc.nvec_used			= 1;
-	desc.pci.msi_attrib.is_msix	= 1;
-	desc.pci.msi_attrib.is_64	= 1;
-	desc.pci.msi_attrib.default_irq	= dev->irq;
-	desc.pci.mask_base		= base;
-
 	for (i = 0, curmsk = masks; i < nvec; i++, curmsk++) {
 		desc.msi_index = entries ? entries[i].entry : i;
 		desc.affinity = masks ? curmsk : NULL;
 		desc.pci.msi_attrib.is_virtual = desc.msi_index >= vec_count;
-		desc.pci.msi_attrib.can_mask = !pci_msi_ignore_mask &&
-					       !desc.pci.msi_attrib.is_virtual;
 
-		if (desc.pci.msi_attrib.can_mask) {
-			addr = pci_msix_desc_addr(&desc);
-			desc.pci.msix_ctrl = readl(addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
-		}
+		msix_prepare_msi_desc(dev, &desc);
 
 		ret = msi_insert_msi_desc(&dev->dev, &desc);
 		if (ret)
@@ -629,9 +651,8 @@ static void msix_mask_all(void __iomem *
 		writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL);
 }
 
-static int msix_setup_interrupts(struct pci_dev *dev, void __iomem *base,
-				 struct msix_entry *entries, int nvec,
-				 struct irq_affinity *affd)
+static int msix_setup_interrupts(struct pci_dev *dev, struct msix_entry *entries,
+				 int nvec, struct irq_affinity *affd)
 {
 	struct irq_affinity_desc *masks = NULL;
 	int ret;
@@ -640,7 +661,7 @@ static int msix_setup_interrupts(struct
 		masks = irq_create_affinity_masks(nvec, affd);
 
 	msi_lock_descs(&dev->dev);
-	ret = msix_setup_msi_descs(dev, base, entries, nvec, masks);
+	ret = msix_setup_msi_descs(dev, entries, nvec, masks);
 	if (ret)
 		goto out_free;
 
@@ -678,7 +699,6 @@ static int msix_setup_interrupts(struct
 static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
 				int nvec, struct irq_affinity *affd)
 {
-	void __iomem *base;
 	int ret, tsize;
 	u16 control;
 
@@ -696,15 +716,13 @@ static int msix_capability_init(struct p
 	pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control);
 	/* Request & Map MSI-X table region */
 	tsize = msix_table_size(control);
-	base = msix_map_region(dev, tsize);
-	if (!base) {
+	dev->msix_base = msix_map_region(dev, tsize);
+	if (!dev->msix_base) {
 		ret = -ENOMEM;
 		goto out_disable;
 	}
 
-	dev->msix_base = base;
-
-	ret = msix_setup_interrupts(dev, base, entries, nvec, affd);
+	ret = msix_setup_interrupts(dev, entries, nvec, affd);
 	if (ret)
 		goto out_disable;
 
@@ -719,7 +737,7 @@ static int msix_capability_init(struct p
 	 * which takes the MSI-X mask bits into account even
 	 * when MSI-X is disabled, which prevents MSI delivery.
 	 */
-	msix_mask_all(base, tsize);
+	msix_mask_all(dev->msix_base, tsize);
 	pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_MASKALL, 0);
 
 	pcibios_free_irq(dev);
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -84,6 +84,8 @@ static inline __attribute_const__ u32 ms
 	return (1 << (1 << desc->pci.msi_attrib.multi_cap)) - 1;
 }
 
+void msix_prepare_msi_desc(struct pci_dev *dev, struct msi_desc *desc);
+
 /* Subsystem variables */
 extern int pci_msi_enable;
 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 24/33] PCI/MSI: Provide prepare_desc() MSI domain op
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (22 preceding siblings ...)
  2022-11-11 13:58 ` [patch 23/33] PCI/MSI: Split MSIX descriptor setup Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 19:40   ` Jason Gunthorpe
  2022-11-16 20:26   ` Bjorn Helgaas
  2022-11-11 13:58 ` [patch 25/33] PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X Thomas Gleixner
                   ` (8 subsequent siblings)
  32 siblings, 2 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Dynamic MSI-X vector allocation post MSI-X enable allows allocating vectors
at a given index or at any free index in the available table range. The latter
requires that the core code selects the index at descriptor allocation time.

This requires that the PCI/MSI-X specific setup of the MSI-X descriptor,
which partially depends on the chosen index, happens after allocation.

Implement the prepare_desc() op in the PCI/MSI-X specific msi_domain_ops
which is invoked before the core interrupt descriptor and the associated
Linux interrupt number are allocated. That callback is also provided for the
upcoming PCI/IMS implementations so the implementation specific interrupt
domain can do its domain specific initialization of the MSI descriptors.
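
A minimal sketch of such an implementation specific prepare_desc() callback,
assuming a hypothetical device memory based message store (the my_* names and
MY_SLOT_SIZE are illustrative placeholders, not part of this series):

	static void my_ims_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
					struct msi_desc *desc)
	{
		struct msi_domain_info *info = domain->host_data;

		/* Point the irq_chip callbacks at the slot which the core picked */
		desc->data.iobase = (__force u8 __iomem *)info->data +
				    desc->msi_index * MY_SLOT_SIZE;
		arg->hwirq = desc->msi_index;
	}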

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |    9 +++++++++
 1 file changed, 9 insertions(+)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -202,6 +202,14 @@ static void pci_unmask_msix(struct irq_d
 	pci_msix_unmask(irq_data_get_msi_desc(data));
 }
 
+static void pci_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
+				  struct msi_desc *desc)
+{
+	/* Don't fiddle with preallocated MSI descriptors */
+	if (!desc->pci.mask_base)
+		msix_prepare_msi_desc(to_pci_dev(desc->dev), desc);
+}
+
 static struct msi_domain_template pci_msix_template = {
 	.chip = {
 		.name			= "PCI-MSIX",
@@ -212,6 +220,7 @@ static struct msi_domain_template pci_ms
 	},
 
 	.ops = {
+		.prepare_desc		= pci_msix_prepare_desc,
 		.set_desc		= pci_device_domain_set_desc,
 	},
 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 25/33] PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (23 preceding siblings ...)
  2022-11-11 13:58 ` [patch 24/33] PCI/MSI: Provide prepare_desc() MSI domain op Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 20:19   ` Bjorn Helgaas
  2022-11-11 13:58 ` [patch 26/33] x86/apic/msi: Enable MSI_FLAG_PCI_MSIX_ALLOC_DYN Thomas Gleixner
                   ` (7 subsequent siblings)
  32 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

MSI-X vectors can be allocated after the initial MSI-X enablement, but this
needs explicit support of the underlying interrupt domains.

Provide a function to query the ability and functions to allocate/free
individual vectors post-enable.

The allocation can either request a specific index in the MSI-X table or,
with the index argument MSI_ANY_INDEX, allocate the next free vector.

The return value is a struct msi_map which on success contains both index
and the Linux interrupt number. In case of failure index is negative and
the Linux interrupt number is 0.

The allocation function is for a single MSI-X index at a time, as that's
sufficient for the most urgent use case, VFIO, to get rid of the 'disable
MSI-X, reallocate, enable MSI-X' cycle which is prone to lost interrupts
and redirections to the legacy and obviously unhandled INTx.

Also for the use cases which Jason Gunthorpe pointed out, a single index
allocation is sufficient.
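
A minimal usage sketch, assuming a hypothetical driver with a per queue
handler (the my_* names are illustrative placeholders) and with error
unwinding beyond the allocation itself omitted:

	#include <linux/interrupt.h>
	#include <linux/pci.h>

	static irqreturn_t my_queue_handler(int irq, void *data)
	{
		return IRQ_HANDLED;
	}

	static int my_add_queue_irq(struct pci_dev *pdev, void *queue)
	{
		struct msi_map map;
		int ret;

		if (!pci_msix_can_alloc_dyn(pdev))
			return -EOPNOTSUPP;

		/* Allocate the next free MSI-X table entry post MSI-X enable */
		map = pci_msix_alloc_irq_at(pdev, MSI_ANY_INDEX, NULL);
		if (map.index < 0)
			return map.index;

		ret = request_irq(map.virq, my_queue_handler, 0, "my-queue", queue);
		if (ret)
			pci_msix_free_irq(pdev, map);
		return ret;
	}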

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/api.c       |   67 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/pci/msi/irqdomain.c |    3 +
 include/linux/pci.h         |    6 +++
 3 files changed, 75 insertions(+), 1 deletion(-)

--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -113,6 +113,73 @@ int pci_enable_msix_range(struct pci_dev
 EXPORT_SYMBOL(pci_enable_msix_range);
 
 /**
+ * pci_msix_can_alloc_dyn - Query whether dynamic allocation after enabling
+ *			    MSI-X is supported
+ *
+ * @dev:	PCI device to operate on
+ *
+ * Return: True if supported, false otherwise
+ */
+bool pci_msix_can_alloc_dyn(struct pci_dev *dev)
+{
+	if (!dev->msix_cap)
+		return false;
+
+	return pci_msi_domain_supports(dev, MSI_FLAG_PCI_MSIX_ALLOC_DYN, DENY_LEGACY);
+}
+EXPORT_SYMBOL_GPL(pci_msix_can_alloc_dyn);
+
+/**
+ * pci_msix_alloc_irq_at - Allocate an MSI-X interrupt after enabling MSI-X
+ *			   at a given MSI-X vector index or any free vector index
+ *
+ * @dev:	PCI device to operate on
+ * @index:	Index to allocate. If @index == MSI_ANY_INDEX this allocates
+ *		the next free index in the MSI-X table
+ * @affdesc:	Optional pointer to an affinity descriptor structure. NULL otherwise
+ *
+ * Return: A struct msi_map
+ *
+ *	On success msi_map::index contains the allocated index (>= 0) and
+ *	msi_map::virq the allocated Linux interrupt number (> 0).
+ *
+ *	On fail msi_map::index contains the error code and msi_map::virq
+ *	is set to 0.
+ */
+struct msi_map pci_msix_alloc_irq_at(struct pci_dev *dev, unsigned int index,
+				     const struct irq_affinity_desc *affdesc)
+{
+	struct msi_map map = { .index = -ENOTSUPP };
+
+	if (!dev->msix_enabled)
+		return map;
+
+	if (!pci_msix_can_alloc_dyn(dev))
+		return map;
+
+	return msi_domain_alloc_irq_at(&dev->dev, MSI_DEFAULT_DOMAIN, index, affdesc, NULL);
+}
+EXPORT_SYMBOL_GPL(pci_msix_alloc_irq_at);
+
+/**
+ * pci_msix_free_irq - Free an interrupt on a PCI/MSIX interrupt domain
+ *		      which was allocated via pci_msix_alloc_irq_at()
+ *
+ * @dev:	The PCI device to operate on
+ * @map:	A struct msi_map describing the interrupt to free
+ *		as returned from the allocation function.
+ */
+void pci_msix_free_irq(struct pci_dev *dev, struct msi_map map)
+{
+	if (WARN_ON_ONCE(map.index < 0 || map.virq <= 0))
+		return;
+	if (WARN_ON_ONCE(!pci_msix_can_alloc_dyn(dev)))
+		return;
+	msi_domain_free_irqs_range(&dev->dev, MSI_DEFAULT_DOMAIN, map.index, map.index);
+}
+EXPORT_SYMBOL_GPL(pci_msix_free_irq);
+
+/**
  * pci_disable_msix() - Disable MSI-X interrupt mode on device
  * @dev: the PCI device to operate on
  *
--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -225,7 +225,8 @@ static struct msi_domain_template pci_ms
 	},
 
 	.info = {
-		.flags			= MSI_COMMON_FLAGS | MSI_FLAG_PCI_MSIX,
+		.flags			= MSI_COMMON_FLAGS | MSI_FLAG_PCI_MSIX |
+					  MSI_FLAG_PCI_MSIX_ALLOC_DYN,
 		.bus_token		= DOMAIN_BUS_PCI_DEVICE_MSIX,
 	},
 };
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -38,6 +38,7 @@
 #include <linux/interrupt.h>
 #include <linux/io.h>
 #include <linux/resource_ext.h>
+#include <linux/msi_api.h>
 #include <uapi/linux/pci.h>
 
 #include <linux/pci_ids.h>
@@ -1559,6 +1560,11 @@ int pci_alloc_irq_vectors_affinity(struc
 				   unsigned int max_vecs, unsigned int flags,
 				   struct irq_affinity *affd);
 
+bool pci_msix_can_alloc_dyn(struct pci_dev *dev);
+struct msi_map pci_msix_alloc_irq_at(struct pci_dev *dev, unsigned int index,
+				     const struct irq_affinity_desc *affdesc);
+void pci_msix_free_irq(struct pci_dev *pdev, struct msi_map map);
+
 void pci_free_irq_vectors(struct pci_dev *dev);
 int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
 const struct cpumask *pci_irq_get_affinity(struct pci_dev *pdev, int vec);


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 26/33] x86/apic/msi: Enable MSI_FLAG_PCI_MSIX_ALLOC_DYN
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (24 preceding siblings ...)
  2022-11-11 13:58 ` [patch 25/33] PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:58 ` [patch 27/33] genirq/msi: Provide constants for PCI/IMS support Thomas Gleixner
                   ` (6 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

x86 MSI irqdomains can handle MSI-X allocation post MSI-X enable just out
of the box - on the vector domain and on the remapping domains.

Add the feature flag to the supported feature list.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/msi.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/include/asm/msi.h
+++ b/arch/x86/include/asm/msi.h
@@ -63,7 +63,7 @@ struct msi_msg;
 u32 x86_msi_msg_get_destid(struct msi_msg *msg, bool extid);
 
 #define X86_VECTOR_MSI_FLAGS_SUPPORTED					\
-	(MSI_GENERIC_FLAGS_MASK | MSI_FLAG_PCI_MSIX)
+	(MSI_GENERIC_FLAGS_MASK | MSI_FLAG_PCI_MSIX | MSI_FLAG_PCI_MSIX_ALLOC_DYN)
 
 #define X86_VECTOR_MSI_FLAGS_REQUIRED					\
 	(MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS)


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 27/33] genirq/msi: Provide constants for PCI/IMS support
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (25 preceding siblings ...)
  2022-11-11 13:58 ` [patch 26/33] x86/apic/msi: Enable MSI_FLAG_PCI_MSIX_ALLOC_DYN Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 19:54   ` Jason Gunthorpe
  2022-11-11 13:58 ` [patch 28/33] PCI/MSI: Provide IMS (Interrupt Message Store) support Thomas Gleixner
                   ` (5 subsequent siblings)
  32 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Provide the necessary constants for PCI/IMS support:

  - A new bus token for MSI irqdomain identification
  - A MSI feature flag for the MSI irqdomains to signal support
  - A secondary domain id

The latter expands the device internal domain pointer storage array from 1
to 2 entries. That extra pointer is mostly unused today, but the
alternative solutions would not be free either and would introduce more
complexity all over the place. Trade the 8 bytes for simplicity.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/irqdomain_defs.h |    1 +
 include/linux/msi.h            |    2 ++
 include/linux/msi_api.h        |    1 +
 3 files changed, 4 insertions(+)

--- a/include/linux/irqdomain_defs.h
+++ b/include/linux/irqdomain_defs.h
@@ -25,6 +25,7 @@ enum irq_domain_bus_token {
 	DOMAIN_BUS_PCI_DEVICE_MSIX,
 	DOMAIN_BUS_DMAR,
 	DOMAIN_BUS_AMDVI,
+	DOMAIN_BUS_PCI_DEVICE_IMS,
 };
 
 #endif /* _LINUX_IRQDOMAIN_DEFS_H */
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -525,6 +525,8 @@ enum {
 	MSI_FLAG_MSIX_CONTIGUOUS	= (1 << 19),
 	/* PCI/MSI-X vectors can be dynamically allocated/freed post MSI-X enable */
 	MSI_FLAG_PCI_MSIX_ALLOC_DYN	= (1 << 20),
+	/* Support for PCI/IMS */
+	MSI_FLAG_PCI_IMS		= (1 << 21),
 };
 
 /**
--- a/include/linux/msi_api.h
+++ b/include/linux/msi_api.h
@@ -15,6 +15,7 @@ struct device;
  */
 enum msi_domain_ids {
 	MSI_DEFAULT_DOMAIN,
+	MSI_SECONDARY_DOMAIN,
 	MSI_MAX_DEVICE_IRQDOMAINS,
 };
 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 28/33] PCI/MSI: Provide IMS (Interrupt Message Store) support
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (26 preceding siblings ...)
  2022-11-11 13:58 ` [patch 27/33] genirq/msi: Provide constants for PCI/IMS support Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 20:17   ` Bjorn Helgaas
  2022-11-11 13:58 ` [patch 29/33] PCI/MSI: Provide pci_ims_alloc/free_irq() Thomas Gleixner
                   ` (4 subsequent siblings)
  32 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

IMS (Interrupt Message Store) is a new specification which allows
implementation specific storage of MSI messages contrary to the
strict standard specified MSI and MSI-X message stores.

This requires new device specific interrupt domains to handle the
implementation defined storage which can be an array in device memory or
host/guest memory which is shared with hardware queues.

Add a function to create IMS domains for PCI devices. IMS domains use the
new per device domain mechanism and are configured by the device driver via
a template. IMS domains are created as secondary device domains so they work
side by side with MSI[-X] on the same device.

The IMS domains have a few constraints:

  - The index space is managed by the core code.

    Device memory based IMS provides a storage array with a fixed size
    which obviously requires an index. But there is no association between
    index and functionality so the core can randomly allocate an index in
    the array.

    Queue memory based IMS does not have the concept of an index as the
    storage is somewhere in memory. In that case the index is purely
    software based to keep track of the allocations.

  - There is no requirement for consecutive index ranges

    This is currently a limitation of the MSI core and can be implemented
    if there is a justified use case by changing the internal storage from
    xarray to maple_tree. For now it's single vector allocation.

  - The interrupt chip must provide the following callbacks:

  	- irq_mask()
	- irq_unmask()
	- irq_write_msi_msg()

   - The interrupt chip must provide the following optional callbacks
     when the irq_mask(), irq_unmask() and irq_write_msi_msg() callbacks
     cannot operate directly on hardware, e.g. in the case that the
     interrupt message store is in queue memory:

     	- irq_bus_lock()
	- irq_bus_unlock()

     These callbacks are invoked from preemptible task context and are
     allowed to sleep. In this case the mandatory callbacks above just
     store the information. The irq_bus_unlock() callback is supposed to
     make the change effective before returning.

   - Interrupt affinity setting is handled by the underlying parent
     interrupt domain and communicated to the IMS domain via
     irq_write_msi_msg(). IMS domains cannot have an irq_set_affinity()
     callback. That's a reasonable restriction similar to the PCI/MSI
     device domain implementations.

The domain is automatically destroyed when the PCI device is removed.
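
A minimal sketch of a driver provided template which satisfies the
constraints above; the my_* callbacks are hypothetical and their device
specific register accesses are left out:

	static void my_ims_mask(struct irq_data *data)
	{
		/* Device specific: mask the message store slot of this interrupt */
	}

	static void my_ims_unmask(struct irq_data *data)
	{
		/* Device specific: unmask the message store slot of this interrupt */
	}

	static void my_ims_write_msg(struct irq_data *data, struct msi_msg *msg)
	{
		/* Device specific: store @msg in the message store slot */
	}

	static const struct msi_domain_template my_ims_template = {
		.chip = {
			.name			= "my-IMS",
			.irq_mask		= my_ims_mask,
			.irq_unmask		= my_ims_unmask,
			.irq_write_msi_msg	= my_ims_write_msg,
		},
		.info = {
			.flags			= MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS |
						  MSI_FLAG_FREE_MSI_DESCS |
						  MSI_FLAG_PCI_IMS,
			.bus_token		= DOMAIN_BUS_PCI_DEVICE_IMS,
		},
	};

The driver then invokes pci_create_ims_domain(pdev, &my_ims_template,
hwsize, data) from its probe path to instantiate the secondary domain.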

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/irqdomain.c |   59 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pci.h         |    5 +++
 2 files changed, 64 insertions(+)

--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -355,6 +355,65 @@ bool pci_msi_domain_supports(struct pci_
 	return (supported & feature_mask) == feature_mask;
 }
 
+/**
+ * pci_create_ims_domain - Create a secondary IMS domain for a PCI device
+ * @pdev:	The PCI device to operate on
+ * @template:	The MSI info template which describes the domain
+ * @hwsize:	The size of the hardware entry table or 0 if the domain
+ *		is purely software managed
+ * @data:	Optional pointer to domain specific data to be stored
+ *		in msi_domain_info::data
+ *
+ * Return: True on success, false otherwise
+ *
+ * An IMS domain is expected to have the following constraints:
+ *	- The index space is managed by the core code
+ *
+ *	- There is no requirement for consecutive index ranges
+ *
+ *	- The interrupt chip must provide the following callbacks:
+ *		- irq_mask()
+ *		- irq_unmask()
+ *		- irq_write_msi_msg()
+ *
+ *	- The interrupt chip must provide the following optional callbacks
+ *	  when the irq_mask(), irq_unmask() and irq_write_msi_msg() callbacks
+ *	  cannot operate directly on hardware, e.g. in the case that the
+ *	  interrupt message store is in queue memory:
+ *		- irq_bus_lock()
+ *		- irq_bus_unlock()
+ *
+ *	  These callbacks are invoked from preemptible task context and are
+ *	  allowed to sleep. In this case the mandatory callbacks above just
+ *	  store the information. The irq_bus_unlock() callback is supposed
+ *	  to make the change effective before returning.
+ *
+ *     - Interrupt affinity setting is handled by the underlying parent
+ *	 interrupt domain and communicated to the IMS domain via
+ *	 irq_write_msi_msg().
+ *
+ * The domain is automatically destroyed when the PCI device is removed.
+ */
+bool pci_create_ims_domain(struct pci_dev *pdev, const struct msi_domain_template *template,
+			   unsigned int hwsize, void *data)
+{
+	struct irq_domain *domain = dev_get_msi_domain(&pdev->dev);
+
+	if (!domain || !irq_domain_is_msi_parent(domain))
+		return false;
+
+	if (template->info.bus_token != DOMAIN_BUS_PCI_DEVICE_IMS ||
+	    !(template->info.flags & MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS) ||
+	    !(template->info.flags & MSI_FLAG_FREE_MSI_DESCS) ||
+	    !template->chip.irq_mask || !template->chip.irq_unmask ||
+	    !template->chip.irq_write_msi_msg || template->chip.irq_set_affinity)
+		return false;
+
+	return msi_create_device_irq_domain(&pdev->dev, MSI_SECONDARY_DOMAIN, template,
+					    hwsize, data, NULL);
+}
+EXPORT_SYMBOL_GPL(pci_create_ims_domain);
+
 /*
  * Users of the generic MSI infrastructure expect a device to have a single ID,
  * so with DMA aliases we have to pick the least-worst compromise. Devices with
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2481,6 +2481,11 @@ static inline bool pci_is_thunderbolt_at
 void pci_uevent_ers(struct pci_dev *pdev, enum  pci_ers_result err_type);
 #endif
 
+struct msi_domain_template;
+
+bool pci_create_ims_domain(struct pci_dev *pdev, const struct msi_domain_template *template,
+			   unsigned int hwsize, void *data);
+
 #include <linux/dma-mapping.h>
 
 #define pci_printk(level, pdev, fmt, arg...) \


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 29/33] PCI/MSI: Provide pci_ims_alloc/free_irq()
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (27 preceding siblings ...)
  2022-11-11 13:58 ` [patch 28/33] PCI/MSI: Provide IMS (Interrupt Message Store) support Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-16 20:14   ` Bjorn Helgaas
  2022-11-11 13:58 ` [patch 30/33] x86/apic/msi: Enable PCI/IMS Thomas Gleixner
                   ` (3 subsequent siblings)
  32 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Provide a single vector allocation function which allocates the next free
index in the IMS space, and a free function which releases it.

All allocated vectors are also released via pci_free_irq_vectors(), which
releases MSI/MSI-X vectors as well.
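
A minimal allocation sketch, assuming a hypothetical caller which already
has a PASID and an interrupt handler at hand (the my_* names are
illustrative placeholders):

	#include <linux/interrupt.h>
	#include <linux/pci.h>

	static int my_setup_ims_irq(struct pci_dev *pdev, u32 pasid,
				    irq_handler_t handler, void *ctx)
	{
		/* Cookie semantics are defined by the IMS implementation, e.g. a PASID */
		union msi_dev_cookie cookie = { .value = pasid };
		struct msi_map map;
		int ret;

		map = pci_ims_alloc_irq(pdev, &cookie, NULL);
		if (map.index < 0)
			return map.index;

		ret = request_irq(map.virq, handler, 0, "my-ims", ctx);
		if (ret)
			pci_ims_free_irq(pdev, map);
		return ret;
	}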

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/pci/msi/api.c |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pci.h   |    3 +++
 2 files changed, 53 insertions(+)

--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -361,6 +361,56 @@ const struct cpumask *pci_irq_get_affini
 EXPORT_SYMBOL(pci_irq_get_affinity);
 
 /**
+ * pci_ims_alloc_irq - Allocate an interrupt on a PCI/IMS interrupt domain
+ * @dev:	The PCI device to operate on
+ * @cookie:	Pointer to an IMS implementation specific device cookie
+ *		(PASID, queue id, pointer...). The cookie content is stored
+ *		in the MSI descriptor for the interrupt chip callbacks or
+ *		domain specific setup functions
+ * @affdesc:	Optional pointer to an interrupt affinity descriptor
+ *
+ * Return: A struct msi_map
+ *
+ *	On success msi_map::index contains the allocated index (>= 0) and
+ *	msi_map::virq the allocated Linux interrupt number (> 0).
+ *
+ *	On fail msi_map::index contains the error code and msi_map::virq
+ *	is set to 0.
+ *
+ * Note: There is no index for IMS allocations as IMS is an implementation
+ *	 specific storage and does not have any direct associations between
+ *	 index, which might be a pure software construct, and device
+ *	 functionality. This association is established by the driver either
+ *	 via the index - if there is a hardware table - or in case of purely
+ *	 software managed IMS implementation the association happens via
+ *	 the irq_write_msi_msg() callback of the implementation specific
+ *	 interrupt chip, which utilizes the provided @cookie to store the MSI
+ *	 message in the appropriate place.
+ */
+struct msi_map pci_ims_alloc_irq(struct pci_dev *dev, union msi_dev_cookie *cookie,
+				 const struct irq_affinity_desc *affdesc)
+{
+	return msi_domain_alloc_irq_at(&dev->dev, MSI_SECONDARY_DOMAIN, MSI_ANY_INDEX,
+				       affdesc, cookie);
+}
+EXPORT_SYMBOL_GPL(pci_ims_alloc_irq);
+
+/**
+ * pci_ims_free_irq - Free an interrupt on a PCI/IMS interrupt domain
+ *		      which was allocated via pci_ims_alloc_irq()
+ * @dev:	The PCI device to operate on
+ * @map:	A struct msi_map describing the interrupt to free as
+ *		returned from pci_ims_alloc_irq()
+ */
+void pci_ims_free_irq(struct pci_dev *dev, struct msi_map map)
+{
+	if (WARN_ON_ONCE(map.index < 0 || !map.virq))
+		return;
+	msi_domain_free_irqs_range(&dev->dev, MSI_SECONDARY_DOMAIN, map.index, map.index);
+}
+EXPORT_SYMBOL_GPL(pci_ims_free_irq);
+
+/**
  * pci_free_irq_vectors() - Free previously allocated IRQs for a device
  * @dev: the PCI device to operate on
  *
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2485,6 +2485,9 @@ struct msi_domain_template;
 
 bool pci_create_ims_domain(struct pci_dev *pdev, const struct msi_domain_template *template,
 			   unsigned int hwsize, void *data);
+struct msi_map pci_ims_alloc_irq(struct pci_dev *pdev, union msi_dev_cookie *cookie,
+				 const struct irq_affinity_desc *affdesc);
+void pci_ims_free_irq(struct pci_dev *pdev, struct msi_map map);
 
 #include <linux/dma-mapping.h>
 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 30/33] x86/apic/msi: Enable PCI/IMS
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (28 preceding siblings ...)
  2022-11-11 13:58 ` [patch 29/33] PCI/MSI: Provide pci_ims_alloc/free_irq() Thomas Gleixner
@ 2022-11-11 13:58 ` Thomas Gleixner
  2022-11-11 13:59 ` [patch 31/33] iommu/vt-d: " Thomas Gleixner
                   ` (2 subsequent siblings)
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:58 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Enable IMS in the domain init and allocation mapping code, but do not
enable it on the vector domain as discussed in various threads on LKML.

The interrupt remap domains can expand this setting like they do with
PCI multi MSI.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/msi.c |    5 +++++
 1 file changed, 5 insertions(+)

--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -184,6 +184,7 @@ static int x86_msi_prepare(struct irq_do
 		alloc->type = X86_IRQ_ALLOC_TYPE_PCI_MSI;
 		return 0;
 	case DOMAIN_BUS_PCI_DEVICE_MSIX:
+	case DOMAIN_BUS_PCI_DEVICE_IMS:
 		alloc->type = X86_IRQ_ALLOC_TYPE_PCI_MSIX;
 		return 0;
 	default:
@@ -230,6 +231,10 @@ static bool x86_init_dev_msi_info(struct
 	case DOMAIN_BUS_PCI_DEVICE_MSI:
 	case DOMAIN_BUS_PCI_DEVICE_MSIX:
 		break;
+	case DOMAIN_BUS_PCI_DEVICE_IMS:
+		if (!(pops->supported_flags & MSI_FLAG_PCI_IMS))
+			return false;
+		break;
 	default:
 		WARN_ON_ONCE(1);
 		return false;


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 31/33] iommu/vt-d: Enable PCI/IMS
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (29 preceding siblings ...)
  2022-11-11 13:58 ` [patch 30/33] x86/apic/msi: Enable PCI/IMS Thomas Gleixner
@ 2022-11-11 13:59 ` Thomas Gleixner
  2022-11-11 13:59 ` [patch 32/33] iommu/amd: " Thomas Gleixner
  2022-11-11 13:59 ` [patch 33/33] irqchip: Add IDXD Interrupt Message Store driver Thomas Gleixner
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:59 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

PCI/IMS works like PCI/MSI-X in the remapping. Just add the feature flag.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/iommu/intel/irq_remapping.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1429,7 +1429,9 @@ static const struct irq_domain_ops intel
 };
 
 static const struct msi_parent_ops dmar_msi_parent_ops = {
-	.supported_flags	= X86_VECTOR_MSI_FLAGS_SUPPORTED | MSI_FLAG_MULTI_PCI_MSI,
+	.supported_flags	= X86_VECTOR_MSI_FLAGS_SUPPORTED |
+				  MSI_FLAG_MULTI_PCI_MSI |
+				  MSI_FLAG_PCI_IMS,
 	.prefix			= "IR-",
 	.init_dev_msi_info	= msi_parent_init_dev_msi_info,
 };


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 32/33] iommu/amd: Enable PCI/IMS
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (30 preceding siblings ...)
  2022-11-11 13:59 ` [patch 31/33] iommu/vt-d: " Thomas Gleixner
@ 2022-11-11 13:59 ` Thomas Gleixner
  2022-11-11 13:59 ` [patch 33/33] irqchip: Add IDXD Interrupt Message Store driver Thomas Gleixner
  32 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:59 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

PCI/IMS works like PCI/MSI-X in the remapping. Just add the feature flag.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/iommu/amd/iommu.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3649,7 +3649,9 @@ static struct irq_chip amd_ir_chip = {
 };
 
 static const struct msi_parent_ops amdvi_msi_parent_ops = {
-	.supported_flags	= X86_VECTOR_MSI_FLAGS_SUPPORTED | MSI_FLAG_MULTI_PCI_MSI,
+	.supported_flags	= X86_VECTOR_MSI_FLAGS_SUPPORTED |
+				  MSI_FLAG_MULTI_PCI_MSI |
+				  MSI_FLAG_PCI_IMS,
 	.prefix			= "IR-",
 	.init_dev_msi_info	= msi_parent_init_dev_msi_info,
 };


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [patch 33/33] irqchip: Add IDXD Interrupt Message Store driver
  2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
                   ` (31 preceding siblings ...)
  2022-11-11 13:59 ` [patch 32/33] iommu/amd: " Thomas Gleixner
@ 2022-11-11 13:59 ` Thomas Gleixner
  2022-12-02 17:55   ` Reinette Chatre
  32 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-11 13:59 UTC (permalink / raw)
  To: LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Provide a driver for the Intel IDXD IMS implementation. The implementation
uses a large message store array in device memory.

The IMS domain implementation is minimal and just provides the required
irq_chip callbacks plus one domain callback, which prepares the core
allocated MSI descriptor for easy usage in the irq_chip callbacks.

The necessary iobase is stored in the irqdomain and the PASID which is
required for operation is handed in via msi_dev_cookie in the allocation
function.

Not much to see here. A few lines of code and a filled in template is all
that's needed.
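
A minimal sketch of the expected driver side usage, assuming the IMS slot
array has already been ioremapped and a PASID is at hand (the my_* names,
@slots and @nr_slots are illustrative placeholders):

	#include <linux/irqchip/irq-pci-intel-idxd.h>
	#include <linux/pci.h>

	static int my_idxd_ims_setup(struct pci_dev *pdev, void __iomem *slots,
				     unsigned int nr_slots, u32 pasid)
	{
		union msi_dev_cookie cookie = INTEL_IDXD_DEV_COOKIE(pasid);
		struct msi_map map;

		/* Create the secondary IMS domain backed by the device slot array */
		if (!pci_intel_idxd_create_ims_domain(pdev, slots, nr_slots))
			return -ENODEV;

		/* Allocate the next free slot; the PASID ends up in the control word */
		map = pci_ims_alloc_irq(pdev, &cookie, NULL);
		if (map.index < 0)
			return map.index;

		/* map.virq is now ready to be handed to request_irq() */
		return map.virq;
	}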

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/irqchip/Kconfig                    |    7 +
 drivers/irqchip/Makefile                   |    1 
 drivers/irqchip/irq-pci-intel-idxd.c       |  143 +++++++++++++++++++++++++++++
 include/linux/irqchip/irq-pci-intel-idxd.h |   22 ++++
 4 files changed, 173 insertions(+)

--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -695,4 +695,11 @@ config SUNPLUS_SP7021_INTC
 	  chained controller, routing all interrupt source in P-Chip to
 	  the primary controller on C-Chip.
 
+config PCI_INTEL_IDXD_IMS
+	tristate "Intel IDXD Interrupt Message Store controller"
+	depends on PCI_MSI
+	help
+	  Support for the Intel IDXD Interrupt Message Store (IMS) interrupt
+	  controller with IMS slot storage in a slot array in device memory.
+
 endmenu
--- a/drivers/irqchip/Makefile
+++ b/drivers/irqchip/Makefile
@@ -121,3 +121,4 @@ obj-$(CONFIG_IRQ_IDT3243X)		+= irq-idt32
 obj-$(CONFIG_APPLE_AIC)			+= irq-apple-aic.o
 obj-$(CONFIG_MCHP_EIC)			+= irq-mchp-eic.o
 obj-$(CONFIG_SUNPLUS_SP7021_INTC)	+= irq-sp7021-intc.o
+obj-$(CONFIG_PCI_INTEL_IDXD_IMS)	+= irq-pci-intel-idxd.o
--- /dev/null
+++ b/drivers/irqchip/irq-pci-intel-idxd.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Interrupt chip and domain for Intel IDXD with hardware array based
+ * interrupt message store (IMS).
+ */
+#include <linux/device.h>
+#include <linux/irq.h>
+#include <linux/irqdomain.h>
+#include <linux/msi.h>
+#include <linux/pci.h>
+
+#include <linux/irqchip/irq-pci-intel-idxd.h>
+
+MODULE_LICENSE("GPL");
+
+/**
+ * struct ims_slot - The hardware layout of a slot in the memory table
+ * @address_lo:	Lower 32bit address
+ * @address_hi:	Upper 32bit address
+ * @data:	Message data
+ * @ctrl:	Control word
+ */
+struct ims_slot {
+	u32	address_lo;
+	u32	address_hi;
+	u32	data;
+	u32	ctrl;
+} __packed;
+
+/* Bit to mask the interrupt in the control word */
+#define CTRL_VECTOR_MASKBIT	BIT(0)
+/* Bit to enable PASID in the control word */
+#define CTRL_PASID_ENABLE	BIT(3)
+/* Position of PASID.LSB in the control word */
+#define CTRL_PASID_SHIFT	12
+
+static inline void iowrite32_and_flush(u32 value, void __iomem *addr)
+{
+	iowrite32(value, addr);
+	ioread32(addr);
+}
+
+static void idxd_mask(struct irq_data *data)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct ims_slot __iomem *slot = desc->data.iobase;
+	u32 cval = desc->data.cookie.value;
+
+	iowrite32_and_flush(cval | CTRL_VECTOR_MASKBIT, &slot->ctrl);
+}
+
+static void idxd_unmask(struct irq_data *data)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct ims_slot __iomem *slot = desc->data.iobase;
+	u32 cval = desc->data.cookie.value;
+
+	iowrite32_and_flush(cval, &slot->ctrl);
+}
+
+static void idxd_write_msi_msg(struct irq_data *data, struct msi_msg *msg)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct ims_slot __iomem *slot = desc->data.iobase;
+
+	iowrite32(msg->address_lo, &slot->address_lo);
+	iowrite32(msg->address_hi, &slot->address_hi);
+	iowrite32_and_flush(msg->data, &slot->data);
+}
+
+static void idxd_shutdown(struct irq_data *data)
+{
+	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct ims_slot __iomem *slot = desc->data.iobase;
+
+	iowrite32(0, &slot->address_lo);
+	iowrite32(0, &slot->address_hi);
+	iowrite32(0, &slot->data);
+	iowrite32_and_flush(CTRL_VECTOR_MASKBIT, &slot->ctrl);
+}
+
+static void idxd_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
+			      struct msi_desc *desc)
+{
+	struct msi_domain_info *info = domain->host_data;
+	struct ims_slot __iomem *slot;
+
+	/* Set up the slot address for the irq_chip callbacks */
+	slot = (__force struct ims_slot __iomem *) info->data;
+	slot += desc->msi_index;
+	desc->data.iobase = slot;
+
+	/* Mask the interrupt for paranoia's sake */
+	iowrite32_and_flush(CTRL_VECTOR_MASKBIT, &slot->ctrl);
+
+	/*
+	 * The caller provided PASID. Shift it to the proper position
+	 * and set the PASID enable bit.
+	 */
+	desc->data.cookie.value <<= CTRL_PASID_SHIFT;
+	desc->data.cookie.value |= CTRL_PASID_ENABLE;
+
+	arg->hwirq = desc->msi_index;
+}
+
+static const struct msi_domain_template idxd_ims_template = {
+	.chip = {
+		.name			= "PCI-IDXD",
+		.irq_mask		= idxd_mask,
+		.irq_unmask		= idxd_unmask,
+		.irq_write_msi_msg	= idxd_write_msi_msg,
+		.irq_shutdown		= idxd_shutdown,
+		.flags			= IRQCHIP_ONESHOT_SAFE,
+	},
+
+	.ops = {
+		.prepare_desc		= idxd_prepare_desc,
+	},
+
+	.info = {
+		.flags			= MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS |
+					  MSI_FLAG_FREE_MSI_DESCS |
+					  MSI_FLAG_PCI_IMS,
+		.bus_token		= DOMAIN_BUS_PCI_DEVICE_IMS,
+	},
+};
+
+/**
+ * pci_intel_idxd_create_ims_domain - Create a IDXD IMS domain
+ * @pdev:	IDXD PCI device to operate on
+ * @slots:	Pointer to the mapped slot memory array
+ * @nr_slots:	The number of slots in the array
+ *
+ * Returns: True on success, false otherwise
+ *
+ * The domain is automatically destroyed when the @pdev is destroyed
+ */
+bool pci_intel_idxd_create_ims_domain(struct pci_dev *pdev, void __iomem *slots,
+				      unsigned int nr_slots)
+{
+	return pci_create_ims_domain(pdev, &idxd_ims_template, nr_slots, (__force void *)slots);
+}
+EXPORT_SYMBOL_GPL(pci_intel_idxd_create_ims_domain);
--- /dev/null
+++ b/include/linux/irqchip/irq-pci-intel-idxd.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* (C) Copyright 2022 Thomas Gleixner <tglx@linutronix.de> */
+
+#ifndef _LINUX_IRQCHIP_IRQ_PCI_INTEL_IDXD_H
+#define _LINUX_IRQCHIP_IRQ_PCI_INTEL_IDXD_H
+
+#include <linux/msi_api.h>
+#include <linux/bits.h>
+#include <linux/types.h>
+
+/*
+ * Convenience macro to wrap the PASID for interrupt allocation
+ * via pci_ims_alloc_irq(pdev, INTEL_IDXD_DEV_COOKIE(pasid))
+ */
+#define INTEL_IDXD_DEV_COOKIE(pasid)	(union msi_dev_cookie) { .value = (pasid), }
+
+struct pci_dev;
+
+bool pci_intel_idxd_create_ims_domain(struct pci_dev *pdev, void __iomem *slots,
+				      unsigned int nr_slots);
+
+#endif


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 01/33] genirq/msi: Rearrange MSI domain flags
  2022-11-11 13:58 ` [patch 01/33] genirq/msi: Rearrange MSI domain flags Thomas Gleixner
@ 2022-11-16 18:41   ` Jason Gunthorpe
  0 siblings, 0 replies; 86+ messages in thread
From: Jason Gunthorpe @ 2022-11-16 18:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:12PM +0100, Thomas Gleixner wrote:
> These flags got added as necessary and have no obvious structure. For
> feature support checks and masking it's convenient to have two blocks of
> flags:
> 
>    1) Flags to control the internal behaviour like allocating/freeing
>       MSI descriptors. Those flags do not need any support from the
>       underlying MSI parent domain. They are mostly under the control
>       of the outermost domain which implements the actual MSI support.
> 
>    2) Flags to expose features, e.g. PCI multi-MSI or requirements
>       which can depend on a underlying domain.
> 
> No functional change.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/msi.h |   49 ++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 34 insertions(+), 15 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 02/33] genirq/msi: Provide struct msi_parent_ops
  2022-11-11 13:58 ` [patch 02/33] genirq/msi: Provide struct msi_parent_ops Thomas Gleixner
@ 2022-11-16 18:57   ` Jason Gunthorpe
  2022-11-17 15:58     ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Jason Gunthorpe @ 2022-11-16 18:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:14PM +0100, Thomas Gleixner wrote:
> +/**
> + * msi_parent_init_dev_msi_info - Delegate initialization of device MSI info to parent domain
> + * @dev:		The device for which the domain should be created
> + * @domain:		The domain which delegates
> + * @real_parent:	The real parent domain of the to be initialized MSI domain
> + * @info:		The MSI domain info to initialize
> + *
> + * Return: true on success, false otherwise
> + *
> + * This is the most complex problem of per device MSI domains and the
> + * underlying interrupt domain hierarchy:
> + *
> + * The device domain to be initialized requests the broadest feature set
> + * possible and the underlying domain hierarchy puts restrictions on it.
> + *
> + * That's working perfectly fine for a strict parent->device model, but it
> + * falls apart with a root_parent->real_parent->device chain because the

This language hurt my brain :)

> +bool msi_parent_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
> +				  struct irq_domain *real_parent, struct msi_domain_info *info)

'real_parent' is global IRQ_DOMAIN_FLAG_MSI_PARENT of the dev for
which we are constructing a msi_domain_info to create a child aka
IRQ_DOMAIN_FLAG_MSI_DEVICE?

'domain' is the current step in the hierarchy as we walk up the ops
pointers?

Maybe:

@child_info: The MSI domain info of the IRQ_DOMAIN_FLAG_MSI_DEVICE
             domain to be created
@parent_domain: The IRQ_DOMAIN_FLAG_MSI_PARENT domain for the child to
                be created
@domain: The domain in the hierarchy this op is being called on

And perhaps it would be a bit clearer to put the parent_domain inside
the msi_domain_info, which is basically acting as an argument bundle
for a future allocation call?

Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 12/33] PCI/MSI: Add support for per device MSI[X] domains
  2022-11-11 13:58 ` [patch 12/33] PCI/MSI: Add support for per device MSI[X] domains Thomas Gleixner
@ 2022-11-16 19:13   ` Jason Gunthorpe
  2022-11-16 22:38     ` Thomas Gleixner
  2022-11-16 20:22   ` Bjorn Helgaas
  1 sibling, 1 reply; 86+ messages in thread
From: Jason Gunthorpe @ 2022-11-16 19:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:30PM +0100, Thomas Gleixner wrote:
> +static struct msi_domain_template pci_msi_template = {
> +	.chip = {
> +		.name			= "PCI-MSI",
> +		.irq_mask		= pci_mask_msi,
> +		.irq_unmask		= pci_unmask_msi,
> +		.irq_write_msi_msg	= pci_msi_domain_write_msg,
> +		.flags			= IRQCHIP_ONESHOT_SAFE,
> +	},
> +
> +	.ops = {
> +		.set_desc		= pci_device_domain_set_desc,
> +	},
> +
> +	.info = {
> +		.flags			= MSI_COMMON_FLAGS | MSI_FLAG_MULTI_PCI_MSI,
> +		.bus_token		= DOMAIN_BUS_PCI_DEVICE_MSI,
> +	},
> +};
> +
> +static void pci_mask_msix(struct irq_data *data)
> +{
> +	pci_msix_mask(irq_data_get_msi_desc(data));
> +}
> +
> +static void pci_unmask_msix(struct irq_data *data)
> +{
> +	pci_msix_unmask(irq_data_get_msi_desc(data));
> +}
> +
> +static struct msi_domain_template pci_msix_template = {
> +	.chip = {
> +		.name			= "PCI-MSIX",
> +		.irq_mask		= pci_mask_msix,
> +		.irq_unmask		= pci_unmask_msix,
> +		.irq_write_msi_msg	= pci_msi_domain_write_msg,
> +		.flags			= IRQCHIP_ONESHOT_SAFE,
> +	},
> +
> +	.ops = {
> +		.set_desc		= pci_device_domain_set_desc,
> +	},
> +
> +	.info = {
> +		.flags			= MSI_COMMON_FLAGS | MSI_FLAG_PCI_MSIX,
> +		.bus_token		= DOMAIN_BUS_PCI_DEVICE_MSIX,
> +	},
> +};

I like this splitting alot, it makes the whole thing make so much more
sense.

> +bool pci_setup_msi_device_domain(struct pci_dev *pdev)
> +{
> +	if (WARN_ON_ONCE(pdev->msix_enabled))
> +		return false;
> +
> +	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSI))
> +		return true;
> +	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSIX))
> +		msi_remove_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN);
> +
> +	return pci_create_device_domain(pdev, &pci_msi_template, 1);

Hardwired to one 1? What about multi-msi?

Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 13/33] x86/apic/vector: Provide MSI parent domain
  2022-11-11 13:58 ` [patch 13/33] x86/apic/vector: Provide MSI parent domain Thomas Gleixner
@ 2022-11-16 19:18   ` Jason Gunthorpe
  2022-11-17 20:06     ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Jason Gunthorpe @ 2022-11-16 19:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:31PM +0100, Thomas Gleixner wrote:

> +/**
> + * x86_vector_init_dev_msi_info - Domain info setup for MSI domains
> + * @dev:		The device for which the domain should be created
> + * @domain:		The (root) domain providing this callback
> + * @real_parent:	The real parent domain of the to initialize domain
> + * @info:		The domain info for the to initialize domain
> + *
> + * This function is to be used for all types of MSI domains above the x86
> + * vector domain and any intermediates. The domain specific functionality
> + * is determined via the @real_parent.
> + */
> +static bool x86_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
> +				  struct irq_domain *real_parent, struct msi_domain_info *info)
> +{
> +	const struct msi_parent_ops *pops = real_parent->msi_parent_ops;
> +
> +	/* MSI parent domain specific settings */
> +	switch (real_parent->bus_token) {
> +	case DOMAIN_BUS_ANY:
> +		/* Only the vector domain can have the ANY token */
> +		if (WARN_ON_ONCE(domain != real_parent))
> +			return false;
> +		info->chip->irq_set_affinity = msi_set_affinity;
> +		/* See msi_set_affinity() for the gory details */
> +		info->flags |= MSI_FLAG_NOMASK_QUIRK;
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		return false;
> +	}
> +
> +	/* Is the target supported? */
> +	switch(info->bus_token) {
> +	case DOMAIN_BUS_PCI_DEVICE_MSI:
> +	case DOMAIN_BUS_PCI_DEVICE_MSIX:
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		return false;

Why does x86 care how the vector is ultimately programmed into the
device?

The leaking of the MSI programming model into the irq implementations
seems like there is still a troubled modularity.

I understand that some implementations rely on a hypercall/trap or
whatever and must know MSI vs MSI-X, but I'm surprised to see this
here.

Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 19/33] genirq/msi: Provide msi_desc::msi_data
  2022-11-11 13:58 ` [patch 19/33] genirq/msi: Provide msi_desc::msi_data Thomas Gleixner
@ 2022-11-16 19:28   ` Jason Gunthorpe
  2022-11-17  8:48     ` Thomas Gleixner
  2022-11-18 22:08     ` Thomas Gleixner
  0 siblings, 2 replies; 86+ messages in thread
From: Jason Gunthorpe @ 2022-11-16 19:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:41PM +0100, Thomas Gleixner wrote:

> +/**
> + * struct msi_desc_data - Generic MSI descriptor data
> + * @iobase:     Pointer to the IOMEM base adress for interrupt callbacks
> + * @cookie:	Device cookie provided at allocation time
> + *
> + * The content of this data is implementation defined, e.g. PCI/IMS
> + * implementations will define the meaning of the data.
> + */
> +struct msi_desc_data {
> +	void			__iomem *iobase;
> +	union msi_dev_cookie	cookie;
> +};

It would be nice to see the pci_msi_desc converted to a domain
specific storage as well.

Maybe could be written

struct msi_desc {
   u64 domain_data[2];
}

struct pci_msi_desc {
		u32 msi_mask;
		u8	multiple	: 3;
		u8	multi_cap	: 3;
		u8	can_mask	: 1;
		u8	is_64		: 1;
		u8	mask_pos;
		u16 default_irq;
}
static_assert(sizeof(struct pci_msi_desc) <= sizeof(((struct msi_desc *)0)->domain_data));

struct pci_msix_desc {
		u32 msix_ctrl;
		u8	multiple	: 3;
		u8	multi_cap	: 3;
		u8	can_mask	: 1;
		u8	is_64		: 1;
		u16 default_irq;
		void __iomem *mask_base;
}
static_assert(sizeof(struct pci_msix_desc) <= sizeof(((struct msi_desc *)0)->domain_data));

ideally hidden in the pci code with some irq_chip facing export API to
snoop in the bits a few places need

We've used 128 bits for the PCI descriptor, we might as well like
everyone have all 128 bits for whatever they want to do

Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at()
  2022-11-11 13:58 ` [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at() Thomas Gleixner
@ 2022-11-16 19:36   ` Jason Gunthorpe
  2022-11-17  9:40     ` Thomas Gleixner
  2022-11-17 23:33   ` Reinette Chatre
  1 sibling, 1 reply; 86+ messages in thread
From: Jason Gunthorpe @ 2022-11-16 19:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:44PM +0100, Thomas Gleixner wrote:
> For supporting post MSI-X enable allocations and for the upcoming PCI/IMS
> support a seperate interface is required which allows not only the
> allocation of a specific index, but also the allocation of any, i.e. the
> next free index. The latter is especially required for IMS because IMS
> completely does away with index to functionality mappings which are
> often found in MSI/MSI-X implementation.
> 
> But even with MSI-X there are devices where only the first few indices have
> a fixed functionality and the rest is freely assignable by software,
> e.g. to queues.
> 
> msi_domain_alloc_irq_at() is also different from the range based interfaces
> as it always enforces that the MSI descriptor is allocated by the core code
> and not preallocated by the caller like the PCI/MSI[-X] enable code path
> does.
> 
> msi_domain_alloc_irq_at() can be invoked with the index argument set to
> MSI_ANY_INDEX which makes the core code pick the next free index. The irq
> domain can provide a prepare_desc() operation callback in its
> msi_domain_ops to do domain specific post allocation initialization before
> the actual Linux interrupt and the associated interrupt descriptor and
> hierarchy allocations are conducted.
> 
> The function also takes an optional @cookie argument which is of type union
> msi_dev_cookie. This cookie is not used by the core code and is stored in
> the allocated msi_desc::data::cookie. The meaning of the cookie is
> completely implementation defined. In case of IMS this might be a PASID or
> a pointer to a device queue, but for the MSI core it's opaque and not used
> in any way.

To my mind it makes more sense to pass a 'void *' through from
msi_domain_alloc_irq_at() to the prepare_desc() op with the idea that
the driver calling msi_domain_alloc_irq_at() knows it is calling it
against the domain that it allocated. The prepare_desc can then use
the void * to properly initialize anything about the desc under the
right lock.

Before calling this the driver should have set up whatever is going to
originate the interrupt, e.g. allocated the HW object that sources it,
and part of what the void * would convey is the detailed information on
how to program that HW object. E.g. IDXD is using an iobase and an
offset along with the enforcing PASID, but something like mlx5 would
probably want an object id, type, and SF ID.

This is again where I don't much like the use of an ID to refer to the
domain.

Having the driver allocate the device domain, retain a pointer to it,
and use that domain pointer with all these new APIs seems much clearer
than converting the pointer to an ID.
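
Concretely, something along these lines is what I have in mind. Just a
sketch: the domain pointer and the void * replacing the domid/cookie
arguments (and the matching prepare_desc() prototype) are the proposed
change, and all the my_* names are invented for illustration:

	/* driver side: the HW object which sources the interrupt exists already */
	struct my_hw_vector *vec = my_hw_vector_alloc(pdev);
	struct msi_map map;

	map = msi_domain_alloc_irq_at(ims_domain, MSI_ANY_INDEX, NULL, vec);

	/* domain side: prepare_desc() gets the driver's pointer handed back */
	static void my_ims_prepare_desc(struct irq_domain *domain,
					struct msi_desc *desc, void *driver_data)
	{
		struct my_hw_vector *vec = driver_data;

		/* stash whatever the irq_chip callbacks need to program the HW */
		desc->data.iobase = vec->iobase;
	}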

Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 22/33] genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN
  2022-11-11 13:58 ` [patch 22/33] genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN Thomas Gleixner
@ 2022-11-16 19:36   ` Jason Gunthorpe
  0 siblings, 0 replies; 86+ messages in thread
From: Jason Gunthorpe @ 2022-11-16 19:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:46PM +0100, Thomas Gleixner wrote:
> Provide a new MSI feature flag in preparation for dynamic MSI-X allocation
> after the initial MSI-X enable has been done.
> 
> This needs to be an explicit MSI interrupt domain feature because quite
> a few implementations (both interrupt domains and legacy allocation mode)
> have clear expectations that the allocation code is only invoked when MSI-X
> is about to be enabled. They either talk to hypervisors or do some other
> work and are not prepared to be invoked on an already MSI-X enabled device.
> 
> This is also explicit MSI-X only because rewriting the size of the MSI
> entries is only possible when disabling MSI which in turn might cause lost
> interrupts on the device.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/msi.h |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 24/33] PCI/MSI: Provide prepare_desc() MSI domain op
  2022-11-11 13:58 ` [patch 24/33] PCI/MSI: Provide prepare_desc() MSI domain op Thomas Gleixner
@ 2022-11-16 19:40   ` Jason Gunthorpe
  2022-11-16 20:26   ` Bjorn Helgaas
  1 sibling, 0 replies; 86+ messages in thread
From: Jason Gunthorpe @ 2022-11-16 19:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:49PM +0100, Thomas Gleixner wrote:
> Dynamic MSI-X vector allocation post MSI-X allows to allocate vectors at a
> given index or at any free index in the available table range. The latter
> requires that the core code selects the index at descriptor allocation time.
> 
> This requires that the PCI/MSI-X specific setup of the MSI-X descriptor,
> which is partially depending on the chosen index happens after allocation.
> 
> Implement the prepare_desc() op in the PCI/MSI-X specific msi_domain_ops
> which is invoked before the core interrupt descriptor and the associated
> Linux interrupt number is allocated. That callback is also provided for the
> upcoming PCI/IMS implementations so the implementation specific interrupt
> domain can do their domain specific initialization of the MSI descriptors.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  drivers/pci/msi/irqdomain.c |    9 +++++++++
>  1 file changed, 9 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 27/33] genirq/msi: Provide constants for PCI/IMS support
  2022-11-11 13:58 ` [patch 27/33] genirq/msi: Provide constants for PCI/IMS support Thomas Gleixner
@ 2022-11-16 19:54   ` Jason Gunthorpe
  2022-11-17  9:46     ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Jason Gunthorpe @ 2022-11-16 19:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:54PM +0100, Thomas Gleixner wrote:
> Provide the necessary constants for PCI/IMS support:
> 
>   - A new bus token for MSI irqdomain identification
>   - A MSI feature flag for the MSI irqdomains to signal support
>   - A secondary domain id
> 
> The latter expands the device internal domain pointer storage array from 1
> to 2 entries. That extra pointer is mostly unused today, but the
> alternative solutions would not be free either and would introduce more
> complexity all over the place. Trade the 8 bytes for simplicity.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/irqdomain_defs.h |    1 +
>  include/linux/msi.h            |    2 ++
>  include/linux/msi_api.h        |    1 +
>  3 files changed, 4 insertions(+)
> 
> --- a/include/linux/irqdomain_defs.h
> +++ b/include/linux/irqdomain_defs.h
> @@ -25,6 +25,7 @@ enum irq_domain_bus_token {
>  	DOMAIN_BUS_PCI_DEVICE_MSIX,
>  	DOMAIN_BUS_DMAR,
>  	DOMAIN_BUS_AMDVI,
> +	DOMAIN_BUS_PCI_DEVICE_IMS,

I don't think we should call this IMS.. GENERIC maybe?

Things that can support IMS should really, IMHO, just not check for
PCI MSI/MSIX and effectively support everything. They don't override
the write_msg, and they don't care how the message is programmed.

> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -525,6 +525,8 @@ enum {
>  	MSI_FLAG_MSIX_CONTIGUOUS	= (1 << 19),
>  	/* PCI/MSI-X vectors can be dynamically allocated/freed post MSI-X enable */
>  	MSI_FLAG_PCI_MSIX_ALLOC_DYN	= (1 << 20),
> +	/* Support for PCI/IMS */
> +	MSI_FLAG_PCI_IMS		= (1 << 21),

Maybe for legacy reasons it is too complicated, but it would be so
much clearer if the special case of "I only know how to support PCI
MSI and PCI MSI-X" was called out as a special flag, and the more
general case of "any write_msg is fine by me" was left behind.

I feel like when the device domain is created in the first place the
parent domain(s) should be able to reject the creation if the
requested child domain is not one it supports. Eg the hypervisor
interactions checks if the child domain is PCI MSI or PCI MSI-X and
rejects otherwise, because that is the only thing the hypervisor knows
how to work with.

If we did that perhaps we don't even need a flag or further checks?
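
The check could live in the parent domain's init_dev_msi_info() callback.
Roughly (a sketch only, the hypervisor backed parent and its name are
made up):

	static bool my_hv_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
					    struct irq_domain *real_parent,
					    struct msi_domain_info *info)
	{
		/* Only plain PCI/MSI and PCI/MSI-X can be relayed to the hypervisor */
		switch (info->bus_token) {
		case DOMAIN_BUS_PCI_DEVICE_MSI:
		case DOMAIN_BUS_PCI_DEVICE_MSIX:
			break;
		default:
			return false;
		}

		/* ... the usual parent specific setup of the child domain info ... */
		return true;
	}

If that returns false the device domain creation fails, so no extra flag
would be needed on the child side.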

Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 10/33] PCI/MSI: Split __pci_write_msi_msg()
  2022-11-11 13:58 ` [patch 10/33] PCI/MSI: Split __pci_write_msi_msg() Thomas Gleixner
@ 2022-11-16 20:10   ` Bjorn Helgaas
  0 siblings, 0 replies; 86+ messages in thread
From: Bjorn Helgaas @ 2022-11-16 20:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:27PM +0100, Thomas Gleixner wrote:
> The upcoming per device MSI domains will create different domains for MSI
> and MSI-X. Split the write message function into MSI and MSI-X helpers so
> they can be used by those new domain functions separately.
> 
> Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/msi/msi.c |  104 +++++++++++++++++++++++++-------------------------
>  1 file changed, 54 insertions(+), 50 deletions(-)
> 
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -180,6 +180,58 @@ void __pci_read_msi_msg(struct msi_desc
>  	}
>  }
>  
> +static inline void pci_write_msg_msi(struct pci_dev *dev, struct msi_desc *desc,
> +				     struct msi_msg *msg)
> +{
> +	int pos = dev->msi_cap;
> +	u16 msgctl;
> +
> +	pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
> +	msgctl &= ~PCI_MSI_FLAGS_QSIZE;
> +	msgctl |= desc->pci.msi_attrib.multiple << 4;
> +	pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl);
> +
> +	pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_LO, msg->address_lo);
> +	if (desc->pci.msi_attrib.is_64) {
> +		pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_HI,  msg->address_hi);
> +		pci_write_config_word(dev, pos + PCI_MSI_DATA_64, msg->data);
> +	} else {
> +		pci_write_config_word(dev, pos + PCI_MSI_DATA_32, msg->data);
> +	}
> +	/* Ensure that the writes are visible in the device */
> +	pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
> +}
> +
> +static inline void pci_write_msg_msix(struct msi_desc *desc, struct msi_msg *msg)
> +{
> +	void __iomem *base = pci_msix_desc_addr(desc);
> +	u32 ctrl = desc->pci.msix_ctrl;
> +	bool unmasked = !(ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT);
> +
> +	if (desc->pci.msi_attrib.is_virtual)
> +		return;
> +	/*
> +	 * The specification mandates that the entry is masked
> +	 * when the message is modified:
> +	 *
> +	 * "If software changes the Address or Data value of an
> +	 * entry while the entry is unmasked, the result is
> +	 * undefined."
> +	 */
> +	if (unmasked)
> +		pci_msix_write_vector_ctrl(desc, ctrl | PCI_MSIX_ENTRY_CTRL_MASKBIT);
> +
> +	writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
> +	writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
> +	writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
> +
> +	if (unmasked)
> +		pci_msix_write_vector_ctrl(desc, ctrl);
> +
> +	/* Ensure that the writes are visible in the device */
> +	readl(base + PCI_MSIX_ENTRY_DATA);
> +}
> +
>  void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
>  {
>  	struct pci_dev *dev = msi_desc_to_pci_dev(entry);
> @@ -187,63 +239,15 @@ void __pci_write_msi_msg(struct msi_desc
>  	if (dev->current_state != PCI_D0 || pci_dev_is_disconnected(dev)) {
>  		/* Don't touch the hardware now */
>  	} else if (entry->pci.msi_attrib.is_msix) {
> -		void __iomem *base = pci_msix_desc_addr(entry);
> -		u32 ctrl = entry->pci.msix_ctrl;
> -		bool unmasked = !(ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT);
> -
> -		if (entry->pci.msi_attrib.is_virtual)
> -			goto skip;
> -
> -		/*
> -		 * The specification mandates that the entry is masked
> -		 * when the message is modified:
> -		 *
> -		 * "If software changes the Address or Data value of an
> -		 * entry while the entry is unmasked, the result is
> -		 * undefined."
> -		 */
> -		if (unmasked)
> -			pci_msix_write_vector_ctrl(entry, ctrl | PCI_MSIX_ENTRY_CTRL_MASKBIT);
> -
> -		writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
> -		writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
> -		writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
> -
> -		if (unmasked)
> -			pci_msix_write_vector_ctrl(entry, ctrl);
> -
> -		/* Ensure that the writes are visible in the device */
> -		readl(base + PCI_MSIX_ENTRY_DATA);
> +		pci_write_msg_msix(entry, msg);
>  	} else {
> -		int pos = dev->msi_cap;
> -		u16 msgctl;
> -
> -		pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
> -		msgctl &= ~PCI_MSI_FLAGS_QSIZE;
> -		msgctl |= entry->pci.msi_attrib.multiple << 4;
> -		pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl);
> -
> -		pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_LO,
> -				       msg->address_lo);
> -		if (entry->pci.msi_attrib.is_64) {
> -			pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_HI,
> -					       msg->address_hi);
> -			pci_write_config_word(dev, pos + PCI_MSI_DATA_64,
> -					      msg->data);
> -		} else {
> -			pci_write_config_word(dev, pos + PCI_MSI_DATA_32,
> -					      msg->data);
> -		}
> -		/* Ensure that the writes are visible in the device */
> -		pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
> +		pci_write_msg_msi(dev, entry, msg);
>  	}
>  
> -skip:
>  	entry->msg = *msg;
>  
>  	if (entry->write_msi_msg)
>  		entry->write_msi_msg(entry, entry->write_msi_msg_data);
> -
>  }
>  
>  void pci_write_msi_msg(unsigned int irq, struct msi_msg *msg)
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 23/33] PCI/MSI: Split MSIX descriptor setup
  2022-11-11 13:58 ` [patch 23/33] PCI/MSI: Split MSIX descriptor setup Thomas Gleixner
@ 2022-11-16 20:13   ` Bjorn Helgaas
  0 siblings, 0 replies; 86+ messages in thread
From: Bjorn Helgaas @ 2022-11-16 20:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Spelled "MSI-X" elsewhere (subject line).

On Fri, Nov 11, 2022 at 02:58:47PM +0100, Thomas Gleixner wrote:
> The upcoming mechanism to allocate MSI-X vectors after enabling MSI-X needs
> to share some of the MSI-X descriptor setup.
> 
> The regular descriptor setup on enable has the following code flow:
> 
>     1) Allocate descriptor
>     2) Setup descriptor with PCI specific data
>     3) Insert descriptor
>     4) Allocate interrupts which in turn scans the inserted
>        descriptors
> 
> This cannot be easily changed because the PCI/MSI code needs to handle the
> legacy architecture specific allocation model and the irq domain model
> where quite some domains have the assumption that the above flow is how it
> works.
> 
> Ideally the code flow should look like this:
> 
>    1) Invoke allocation at the MSI core
>    2) MSI core allocates descriptor
>    3) MSI core calls back into the irq domain which fills in
>       the domain specific parts
> 
> This could be done for underlying parent MSI domains which support
> post-enable allocation/free but that would create significantly different
> code pathes for MSI/MSI-X enable.
> 
> Though for dynamic allocation which wants to share the allocation code with
> the upcoming PCI/IMS support its the right thing to do.

s/its/it's/

> Split the MSIX descriptor setup into the preallocation part which just sets

MSI-X

> the index and fills in the horrible hack of virtual IRQs and the real PCI
> specific MSI-X setup part which solely depends on the index in the
> descriptor. This allows to provide a common dynami allocation interface at

dynamic

> the MSI core level for both PCI/MSI-X and PCI/IMS.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Typos below.

> ---
>  drivers/pci/msi/msi.c |   72 +++++++++++++++++++++++++++++++-------------------
>  drivers/pci/msi/msi.h |    2 +
>  2 files changed, 47 insertions(+), 27 deletions(-)
> 
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -569,34 +569,56 @@ static void __iomem *msix_map_region(str
>  	return ioremap(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
>  }
>  
> -static int msix_setup_msi_descs(struct pci_dev *dev, void __iomem *base,
> -				struct msix_entry *entries, int nvec,
> -				struct irq_affinity_desc *masks)
> +/**
> + * msix_prepare_msi_desc - Prepare a half initialized MSI descriptor for operation
> + * @dev:	The PCI device for which the descriptor is prepared
> + * @desc:	The MSI descriptor for preparation
> + *
> + * This is seperate from msix_setup_msi_descs() below to handle dynamic

separate

> + * allocations for MSIX after initial enablement.

MSI-X (and again below)

> + * Ideally the whole MSIX setup would work that way, but there is no way to
> + * support this for the legacy arch_setup_msi_irqs() mechanism and for the
> + * fake irq domains like the x86 XEN one. Sigh...
> + *
> + * The descriptor is zeroed and only @desc::msi_index and @desc::affinity
> + * are set. When called from msix_setup_msi_descs() then the is_virtual
> + * attribute is initialized as well.
> + *
> + * Fill in the rest.
> + */
> +void msix_prepare_msi_desc(struct pci_dev *dev, struct msi_desc *desc)
> +{
> +	desc->nvec_used				= 1;
> +	desc->pci.msi_attrib.is_msix		= 1;
> +	desc->pci.msi_attrib.is_64		= 1;
> +	desc->pci.msi_attrib.default_irq	= dev->irq;
> +	desc->pci.mask_base			= dev->msix_base;
> +	desc->pci.msi_attrib.can_mask		= !pci_msi_ignore_mask &&
> +						  !desc->pci.msi_attrib.is_virtual;
> +
> +	if (desc->pci.msi_attrib.can_mask) {
> +		void __iomem *addr = pci_msix_desc_addr(desc);
> +
> +		desc->pci.msix_ctrl = readl(addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
> +	}
> +}
> +
> +static int msix_setup_msi_descs(struct pci_dev *dev, struct msix_entry *entries,
> +				int nvec, struct irq_affinity_desc *masks)
>  {
>  	int ret = 0, i, vec_count = pci_msix_vec_count(dev);
>  	struct irq_affinity_desc *curmsk;
>  	struct msi_desc desc;
> -	void __iomem *addr;
>  
>  	memset(&desc, 0, sizeof(desc));
>  
> -	desc.nvec_used			= 1;
> -	desc.pci.msi_attrib.is_msix	= 1;
> -	desc.pci.msi_attrib.is_64	= 1;
> -	desc.pci.msi_attrib.default_irq	= dev->irq;
> -	desc.pci.mask_base		= base;
> -
>  	for (i = 0, curmsk = masks; i < nvec; i++, curmsk++) {
>  		desc.msi_index = entries ? entries[i].entry : i;
>  		desc.affinity = masks ? curmsk : NULL;
>  		desc.pci.msi_attrib.is_virtual = desc.msi_index >= vec_count;
> -		desc.pci.msi_attrib.can_mask = !pci_msi_ignore_mask &&
> -					       !desc.pci.msi_attrib.is_virtual;
>  
> -		if (desc.pci.msi_attrib.can_mask) {
> -			addr = pci_msix_desc_addr(&desc);
> -			desc.pci.msix_ctrl = readl(addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
> -		}
> +		msix_prepare_msi_desc(dev, &desc);
>  
>  		ret = msi_insert_msi_desc(&dev->dev, &desc);
>  		if (ret)
> @@ -629,9 +651,8 @@ static void msix_mask_all(void __iomem *
>  		writel(ctrl, base + PCI_MSIX_ENTRY_VECTOR_CTRL);
>  }
>  
> -static int msix_setup_interrupts(struct pci_dev *dev, void __iomem *base,
> -				 struct msix_entry *entries, int nvec,
> -				 struct irq_affinity *affd)
> +static int msix_setup_interrupts(struct pci_dev *dev, struct msix_entry *entries,
> +				 int nvec, struct irq_affinity *affd)
>  {
>  	struct irq_affinity_desc *masks = NULL;
>  	int ret;
> @@ -640,7 +661,7 @@ static int msix_setup_interrupts(struct
>  		masks = irq_create_affinity_masks(nvec, affd);
>  
>  	msi_lock_descs(&dev->dev);
> -	ret = msix_setup_msi_descs(dev, base, entries, nvec, masks);
> +	ret = msix_setup_msi_descs(dev, entries, nvec, masks);
>  	if (ret)
>  		goto out_free;
>  
> @@ -678,7 +699,6 @@ static int msix_setup_interrupts(struct
>  static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
>  				int nvec, struct irq_affinity *affd)
>  {
> -	void __iomem *base;
>  	int ret, tsize;
>  	u16 control;
>  
> @@ -696,15 +716,13 @@ static int msix_capability_init(struct p
>  	pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &control);
>  	/* Request & Map MSI-X table region */
>  	tsize = msix_table_size(control);
> -	base = msix_map_region(dev, tsize);
> -	if (!base) {
> +	dev->msix_base = msix_map_region(dev, tsize);
> +	if (!dev->msix_base) {
>  		ret = -ENOMEM;
>  		goto out_disable;
>  	}
>  
> -	dev->msix_base = base;
> -
> -	ret = msix_setup_interrupts(dev, base, entries, nvec, affd);
> +	ret = msix_setup_interrupts(dev, entries, nvec, affd);
>  	if (ret)
>  		goto out_disable;
>  
> @@ -719,7 +737,7 @@ static int msix_capability_init(struct p
>  	 * which takes the MSI-X mask bits into account even
>  	 * when MSI-X is disabled, which prevents MSI delivery.
>  	 */
> -	msix_mask_all(base, tsize);
> +	msix_mask_all(dev->msix_base, tsize);
>  	pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_MASKALL, 0);
>  
>  	pcibios_free_irq(dev);
> --- a/drivers/pci/msi/msi.h
> +++ b/drivers/pci/msi/msi.h
> @@ -84,6 +84,8 @@ static inline __attribute_const__ u32 ms
>  	return (1 << (1 << desc->pci.msi_attrib.multi_cap)) - 1;
>  }
>  
> +void msix_prepare_msi_desc(struct pci_dev *dev, struct msi_desc *desc);
> +
>  /* Subsystem variables */
>  extern int pci_msi_enable;
>  
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 14/33] PCI/MSI: Remove unused pci_dev_has_special_msi_domain()
  2022-11-11 13:58 ` [patch 14/33] PCI/MSI: Remove unused pci_dev_has_special_msi_domain() Thomas Gleixner
@ 2022-11-16 20:13   ` Bjorn Helgaas
  0 siblings, 0 replies; 86+ messages in thread
From: Bjorn Helgaas @ 2022-11-16 20:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:33PM +0100, Thomas Gleixner wrote:
> The check for special MSI domains like VMD which prevents the interrupt
> remapping code from overwriting device::msi::domain is no longer required
> and has been replaced by an x86 specific version which is aware of MSI
> parent domains.
> 
> Remove it.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/msi/irqdomain.c |   21 ---------------------
>  include/linux/msi.h         |    2 --
>  2 files changed, 23 deletions(-)
> 
> --- a/drivers/pci/msi/irqdomain.c
> +++ b/drivers/pci/msi/irqdomain.c
> @@ -414,24 +414,3 @@ struct irq_domain *pci_msi_get_device_do
>  					     DOMAIN_BUS_PCI_MSI);
>  	return dom;
>  }
> -
> -/**
> - * pci_dev_has_special_msi_domain - Check whether the device is handled by
> - *				    a non-standard PCI-MSI domain
> - * @pdev:	The PCI device to check.
> - *
> - * Returns: True if the device irqdomain or the bus irqdomain is
> - * non-standard PCI/MSI.
> - */
> -bool pci_dev_has_special_msi_domain(struct pci_dev *pdev)
> -{
> -	struct irq_domain *dom = dev_get_msi_domain(&pdev->dev);
> -
> -	if (!dom)
> -		dom = dev_get_msi_domain(&pdev->bus->dev);
> -
> -	if (!dom)
> -		return true;
> -
> -	return dom->bus_token != DOMAIN_BUS_PCI_MSI;
> -}
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -599,8 +599,6 @@ struct irq_domain *pci_msi_create_irq_do
>  					     struct irq_domain *parent);
>  u32 pci_msi_domain_get_msi_rid(struct irq_domain *domain, struct pci_dev *pdev);
>  struct irq_domain *pci_msi_get_device_domain(struct pci_dev *pdev);
> -bool pci_dev_has_special_msi_domain(struct pci_dev *pdev);
> -
>  #endif /* CONFIG_GENERIC_MSI_IRQ */
>  
>  #endif /* LINUX_MSI_H */
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 29/33] PCI/MSI: Provide pci_ims_alloc/free_irq()
  2022-11-11 13:58 ` [patch 29/33] PCI/MSI: Provide pci_ims_alloc/free_irq() Thomas Gleixner
@ 2022-11-16 20:14   ` Bjorn Helgaas
  0 siblings, 0 replies; 86+ messages in thread
From: Bjorn Helgaas @ 2022-11-16 20:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:57PM +0100, Thomas Gleixner wrote:
> Single vector allocation which allocates the next free index in the IMS
> space. The free function releases it again.
> 
> All allocated vectors are also released via pci_free_irq_vectors(), which
> releases MSI/MSI-X vectors as well.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

I would probably capitalize "ID" in the function comment below, but
either way.

> ---
>  drivers/pci/msi/api.c |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/pci.h   |    3 +++
>  2 files changed, 53 insertions(+)
> 
> --- a/drivers/pci/msi/api.c
> +++ b/drivers/pci/msi/api.c
> @@ -361,6 +361,56 @@ const struct cpumask *pci_irq_get_affini
>  EXPORT_SYMBOL(pci_irq_get_affinity);
>  
>  /**
> + * pci_ims_alloc_irq - Allocate an interrupt on a PCI/IMS interrupt domain
> + * @dev:	The PCI device to operate on
> + * @cookie:	Pointer to an IMS implementation specific device cookie
> + *		(PASID, queue id, pointer...). The cookie content is stored
> + *		in the MSI descriptor for the interrupt chip callbacks or
> + *		domain specific setup functions
> + * @affdesc:	Optional pointer to an interrupt affinity descriptor
> + *
> + * Return: A struct msi_map
> + *
> + *	On success msi_map::index contains the allocated index (>= 0) and
> + *	msi_map::virq the allocated Linux interrupt number (> 0).
> + *
> + *	On fail msi_map::index contains the error code and msi_map::virq
> + *	is set to 0.
> + *
> + * Note: There is no index for IMS allocations as IMS is an implementation
> + *	 specific storage and does not have any direct associations between
> + *	 index, which might be a pure software construct, and device
> + *	 functionality. This association is established by the driver either
> + *	 via the index - if there is a hardware table - or in case of purely
> + *	 software managed IMS implementation the association happens via
> + *	 the irq_write_msi_msg() callback of the implementation specific
> + *	 interrupt chip, which utilizes the provided @cookie to store the MSI
> + *	 message in the appropriate place.
> + */
> +struct msi_map pci_ims_alloc_irq(struct pci_dev *dev, union msi_dev_cookie *cookie,
> +				 const struct irq_affinity_desc *affdesc)
> +{
> +	return msi_domain_alloc_irq_at(&dev->dev, MSI_SECONDARY_DOMAIN, MSI_ANY_INDEX,
> +				       affdesc, cookie);
> +}
> +EXPORT_SYMBOL_GPL(pci_ims_alloc_irq);
> +
> +/**
> + * pci_ims_free_irq - Free an interrupt on a PCI/IMS interrupt domain
> + *		      which was allocated via pci_ims_alloc_irq()
> + * @dev:	The PCI device to operate on
> + * @map:	A struct msi_map describing the interrupt to free as
> + *		returned from pci_ims_alloc_irq()
> + */
> +void pci_ims_free_irq(struct pci_dev *dev, struct msi_map map)
> +{
> +	if (WARN_ON_ONCE(map.index < 0 || !map.virq))
> +		return;
> +	msi_domain_free_irqs_range(&dev->dev, MSI_SECONDARY_DOMAIN, map.index, map.index);
> +}
> +EXPORT_SYMBOL_GPL(pci_ims_free_irq);
> +
> +/**
>   * pci_free_irq_vectors() - Free previously allocated IRQs for a device
>   * @dev: the PCI device to operate on
>   *
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -2485,6 +2485,9 @@ struct msi_domain_template;
>  
>  bool pci_create_ims_domain(struct pci_dev *pdev, const struct msi_domain_template *template,
>  			   unsigned int hwsize, void *data);
> +struct msi_map pci_ims_alloc_irq(struct pci_dev *pdev, union msi_dev_cookie *cookie,
> +				 const struct irq_affinity_desc *affdesc);
> +void pci_ims_free_irq(struct pci_dev *pdev, struct msi_map map);
>  
>  #include <linux/dma-mapping.h>
>  
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 28/33] PCI/MSI: Provide IMS (Interrupt Message Store) support
  2022-11-11 13:58 ` [patch 28/33] PCI/MSI: Provide IMS (Interrupt Message Store) support Thomas Gleixner
@ 2022-11-16 20:17   ` Bjorn Helgaas
  0 siblings, 0 replies; 86+ messages in thread
From: Bjorn Helgaas @ 2022-11-16 20:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:55PM +0100, Thomas Gleixner wrote:
> IMS (Interrupt Message Store) is a new specification which allows
> implementation specific storage of MSI messages contrary to the
> strict standard specified MSI and MSI-X message stores.
> 
> This requires new device specific interrupt domains to handle the
> implementation defined storage which can be an array in device memory or
> host/guest memory which is shared with hardware queues.
> 
> Add a function to create IMS domains for PCI devices. IMS domains are using
> the new per device domain mechanism and are configured by the device driver
> via a template. IMS domains are created as secondary device domains so they
> work side by side with MSI[-X] on the same device.
> 
> The IMS domains have a few constraints:
> 
>   - The index space is managed by the core code.
> 
>     Device memory based IMS provides a storage array with a fixed size
>     which obviously requires an index. But there is no association between
>     index and functionality so the core can randomly allocate an index in
>     the array.
> 
>     Queue memory based IMS does not have the concept of an index as the
>     storage is somewhere in memory. In that case the index is purely
>     software based to keep track of the allocations.
> 
>   - There is no requirement for consecutive index ranges
> 
>     This is currently a limitation of the MSI core and can be implemented
>     if there is a justified use case by changing the internal storage from
>     xarray to maple_tree. For now it's single vector allocation.
> 
>   - The interrupt chip must provide the following callbacks:
> 
>   	- irq_mask()
> 	- irq_unmask()
> 	- irq_write_msi_msg()
> 
>    - The interrupt chip must provide the following optional callbacks
>      when the irq_mask(), irq_unmask() and irq_write_msi_msg() callbacks
>      cannot operate directly on hardware, e.g. in the case that the
>      interrupt message store is in queue memory:
> 
>      	- irq_bus_lock()
> 	- irq_bus_unlock()
> 
>      These callbacks are invoked from preemptible task context and are
>      allowed to sleep. In this case the mandatory callbacks above just
>      store the information. The irq_bus_unlock() callback is supposed to
>      make the change effective before returning.
> 
>    - Interrupt affinity setting is handled by the underlying parent
>      interrupt domain and communicated to the IMS domain via
>      irq_write_msi_msg(). IMS domains cannot have a irq_set_affinity()
>      callback. That's a reasonable restriction similar to the PCI/MSI
>      device domain implementations.
> 
> The domain is automatically destroyed when the PCI device is removed.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

A couple typos below.

> ---
>  drivers/pci/msi/irqdomain.c |   59 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/pci.h         |    5 +++
>  2 files changed, 64 insertions(+)
> 
> --- a/drivers/pci/msi/irqdomain.c
> +++ b/drivers/pci/msi/irqdomain.c
> @@ -355,6 +355,65 @@ bool pci_msi_domain_supports(struct pci_
>  	return (supported & feature_mask) == feature_mask;
>  }
>  
> +/**
> + * pci_create_ims_domain - Create a secondary IMS domain for a PCI device
> + * @pdev:	The PCI device to operate on
> + * @template:	The MSI info template which describes the domain
> + * @hwsize:	The size of the hardware entry table or 0 if the domain
> + *		is purely software managed
> + * @data:	Optional pointer to domain specific data to be stored
> + *		in msi_domain_info::data
> + *
> + * Return: True on success, false otherwise
> + *
> + * A IMS domain is expected to have the following constraints:

An IMS ...

> + *	- The index space is managed by the core code
> + *
> + *	- There is no requirement for consecutive index ranges
> + *
> + *	- The interrupt chip must provide the following callbacks:
> + *		- irq_mask()
> + *		- irq_unmask()
> + *		- irq_write_msi_msg()
> + *
> + *	- The interrupt chip must provide the following optional callbacks
> + *	  when the irq_mask(), irq_unmask() and irq_write_msi_msg() callbacks
> + *	  cannot operate directly on hardware, e.g. in the case that the
> + *	  interrupt message store is in queue memory:
> + *		- irq_bus_lock()
> + *		- irq_bus_unlock()
> + *
> + *	  These callbacks are invoked from preemptible task context and are
> + *	  allowed to sleep. In this case the mandatory callbacks above just
> + *	  store the information. The irq_bus_unlock() callback is supposed
> + *	  to make the change effective before returning.
> + *
> + *     - Interrupt affinity setting is handled by the underlying parent
> + *	 interrupt domain and communicated to the IMS domain via
> + *	 irq_write_msi_msg().

Different indentation than the bullet items above.

> + *
> + * The domain is automatically destroyed when the PCI device is removed.
> + */
> +bool pci_create_ims_domain(struct pci_dev *pdev, const struct msi_domain_template *template,
> +			   unsigned int hwsize, void *data)
> +{
> +	struct irq_domain *domain = dev_get_msi_domain(&pdev->dev);
> +
> +	if (!domain || !irq_domain_is_msi_parent(domain))
> +		return false;
> +
> +	if (template->info.bus_token != DOMAIN_BUS_PCI_DEVICE_IMS ||
> +	    !(template->info.flags & MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS) ||
> +	    !(template->info.flags & MSI_FLAG_FREE_MSI_DESCS) ||
> +	    !template->chip.irq_mask || !template->chip.irq_unmask ||
> +	    !template->chip.irq_write_msi_msg || template->chip.irq_set_affinity)
> +		return false;
> +
> +	return msi_create_device_irq_domain(&pdev->dev, MSI_SECONDARY_DOMAIN, template,
> +					    hwsize, data, NULL);
> +}
> +EXPORT_SYMBOL_GPL(pci_create_ims_domain);
> +
>  /*
>   * Users of the generic MSI infrastructure expect a device to have a single ID,
>   * so with DMA aliases we have to pick the least-worst compromise. Devices with
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -2481,6 +2481,11 @@ static inline bool pci_is_thunderbolt_at
>  void pci_uevent_ers(struct pci_dev *pdev, enum  pci_ers_result err_type);
>  #endif
>  
> +struct msi_domain_template;
> +
> +bool pci_create_ims_domain(struct pci_dev *pdev, const struct msi_domain_template *template,
> +			   unsigned int hwsize, void *data);
> +
>  #include <linux/dma-mapping.h>
>  
>  #define pci_printk(level, pdev, fmt, arg...) \
> 
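
As an aside, with the constraints listed in the changelog a minimal IMS
irq chip and domain registration would look roughly like this (a sketch
only, all the my_*/MY_* names are invented):

	static const struct msi_domain_template my_ims_template = {
		.chip = {
			.name			= "my-ims",
			.irq_mask		= my_ims_mask,
			.irq_unmask		= my_ims_unmask,
			.irq_write_msi_msg	= my_ims_write_msg,
		},
		.info = {
			.flags		= MSI_FLAG_ALLOC_SIMPLE_MSI_DESCS |
					  MSI_FLAG_FREE_MSI_DESCS,
			.bus_token	= DOMAIN_BUS_PCI_DEVICE_IMS,
		},
	};

	/* in the driver's probe path, once the parent MSI domain is in place */
	if (!pci_create_ims_domain(pdev, &my_ims_template, MY_IMS_SLOTS, NULL))
		return -ENODEV;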

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 25/33] PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X
  2022-11-11 13:58 ` [patch 25/33] PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X Thomas Gleixner
@ 2022-11-16 20:19   ` Bjorn Helgaas
  2022-11-16 22:43     ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Bjorn Helgaas @ 2022-11-16 20:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:51PM +0100, Thomas Gleixner wrote:
> MSI-X vectors can be allocated after the initial MSI-X enablement, but this
> needs explicit support of the underlying interrupt domains.
> 
> Provide a function to query the ability and functions to allocate/free
> individual vectors post-enable.
> 
> The allocation can either request a specific index in the MSI-X table or
> with the index argument MSI_ANY_INDEX it allocates the next free vector.
> 
> The return value is a struct msi_map which on success contains both index
> and the Linux interrupt number. In case of failure index is negative and
> the Linux interrupt number is 0.
> 
> The allocation function is for a single MSI-X index at a time as that's
> sufficient for the most urgent use case, VFIO, to get rid of the 'disable
> MSI-X, reallocate, enable-MSI-X' cycle which is prone to lost interrupts
> and redirections to the legacy and obviously unhandled INTx.
> 
> Also for the use cases Jason Gunthorpe pointed out, a single index
> allocation is sufficient.

Maybe a URL or outline the use cases so this means something in a few
years?  I haven't followed this discussion, so it doesn't even mean
anything to me now :)

> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/msi/api.c       |   67 ++++++++++++++++++++++++++++++++++++++++++++
>  drivers/pci/msi/irqdomain.c |    3 +
>  include/linux/pci.h         |    6 +++
>  3 files changed, 75 insertions(+), 1 deletion(-)
> 
> --- a/drivers/pci/msi/api.c
> +++ b/drivers/pci/msi/api.c
> @@ -113,6 +113,73 @@ int pci_enable_msix_range(struct pci_dev
>  EXPORT_SYMBOL(pci_enable_msix_range);
>  
>  /**
> + * pci_msix_can_alloc_dyn - Query whether dynamic allocation after enabling
> + *			    MSI-X is supported
> + *
> + * @dev:	PCI device to operate on
> + *
> + * Return: True if supported, false otherwise
> + */
> +bool pci_msix_can_alloc_dyn(struct pci_dev *dev)
> +{
> +	if (!dev->msix_cap)
> +		return false;
> +
> +	return pci_msi_domain_supports(dev, MSI_FLAG_PCI_MSIX_ALLOC_DYN, DENY_LEGACY);
> +}
> +EXPORT_SYMBOL_GPL(pci_msix_can_alloc_dyn);
> +
> +/**
> + * pci_msix_alloc_irq_at - Allocate an MSI-X interrupt after enabling MSI-X
> + *			   at a given MSI-X vector index or any free vector index
> + *
> + * @dev:	PCI device to operate on
> + * @index:	Index to allocate. If @index == MSI_ANY_INDEX this allocates
> + *		the next free index in the MSI-X table
> + * @affdesc:	Optional pointer to an affinity descriptor structure. NULL otherwise
> + *
> + * Return: A struct msi_map
> + *
> + *	On success msi_map::index contains the allocated index (>= 0) and
> + *	msi_map::virq the allocated Linux interrupt number (> 0).
> + *
> + *	On fail msi_map::index contains the error code and msi_map::virq
> + *	is set to 0.
> + */
> +struct msi_map pci_msix_alloc_irq_at(struct pci_dev *dev, unsigned int index,
> +				     const struct irq_affinity_desc *affdesc)
> +{
> +	struct msi_map map = { .index = -ENOTSUPP };
> +
> +	if (!dev->msix_enabled)
> +		return map;
> +
> +	if (!pci_msix_can_alloc_dyn(dev))
> +		return map;
> +
> +	return msi_domain_alloc_irq_at(&dev->dev, MSI_DEFAULT_DOMAIN, index, affdesc, NULL);
> +}
> +EXPORT_SYMBOL_GPL(pci_msix_alloc_irq_at);
> +
> +/**
> + * pci_msix_free_irq - Free an interrupt on a PCI/MSIX interrupt domain
> + *		      which was allocated via pci_msix_alloc_irq_at()
> + *
> + * @dev:	The PCI device to operate on
> + * @map:	A struct msi_map describing the interrupt to free
> + *		as returned from the allocation function.
> + */
> +void pci_msix_free_irq(struct pci_dev *dev, struct msi_map map)
> +{
> +	if (WARN_ON_ONCE(map.index < 0 || map.virq <= 0))
> +		return;
> +	if (WARN_ON_ONCE(!pci_msix_can_alloc_dyn(dev)))
> +		return;
> +	msi_domain_free_irqs_range(&dev->dev, MSI_DEFAULT_DOMAIN, map.index, map.index);
> +}
> +EXPORT_SYMBOL_GPL(pci_msix_free_irq);
> +
> +/**
>   * pci_disable_msix() - Disable MSI-X interrupt mode on device
>   * @dev: the PCI device to operate on
>   *
> --- a/drivers/pci/msi/irqdomain.c
> +++ b/drivers/pci/msi/irqdomain.c
> @@ -225,7 +225,8 @@ static struct msi_domain_template pci_ms
>  	},
>  
>  	.info = {
> -		.flags			= MSI_COMMON_FLAGS | MSI_FLAG_PCI_MSIX,
> +		.flags			= MSI_COMMON_FLAGS | MSI_FLAG_PCI_MSIX |
> +					  MSI_FLAG_PCI_MSIX_ALLOC_DYN,
>  		.bus_token		= DOMAIN_BUS_PCI_DEVICE_MSIX,
>  	},
>  };
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -38,6 +38,7 @@
>  #include <linux/interrupt.h>
>  #include <linux/io.h>
>  #include <linux/resource_ext.h>
> +#include <linux/msi_api.h>
>  #include <uapi/linux/pci.h>
>  
>  #include <linux/pci_ids.h>
> @@ -1559,6 +1560,11 @@ int pci_alloc_irq_vectors_affinity(struc
>  				   unsigned int max_vecs, unsigned int flags,
>  				   struct irq_affinity *affd);
>  
> +bool pci_msix_can_alloc_dyn(struct pci_dev *dev);
> +struct msi_map pci_msix_alloc_irq_at(struct pci_dev *dev, unsigned int index,
> +				     const struct irq_affinity_desc *affdesc);
> +void pci_msix_free_irq(struct pci_dev *pdev, struct msi_map map);
> +
>  void pci_free_irq_vectors(struct pci_dev *dev);
>  int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
>  const struct cpumask *pci_irq_get_affinity(struct pci_dev *pdev, int vec);
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 12/33] PCI/MSI: Add support for per device MSI[X] domains
  2022-11-11 13:58 ` [patch 12/33] PCI/MSI: Add support for per device MSI[X] domains Thomas Gleixner
  2022-11-16 19:13   ` Jason Gunthorpe
@ 2022-11-16 20:22   ` Bjorn Helgaas
  1 sibling, 0 replies; 86+ messages in thread
From: Bjorn Helgaas @ 2022-11-16 20:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

Comment below uses [-X] (not [X]).

On Fri, Nov 11, 2022 at 02:58:30PM +0100, Thomas Gleixner wrote:
> Provide a template and the necessary callbacks to create PCI/MSI and
> PCI/MSI-X domains.
> 
> The domains are created when MSI or MSI-X is enabled. The domains lifetime

domain's

> is either the device life time or in case that e.g. MSI-X was tried first

lifetime (as used above)

> and failed, then the MSI-X domain is removed and a MSI domain is created as
> both are mutually exclusive and reside in the default domain id slot of the
> per device domain pointer array.

ID?

> Also expand pci_msi_domain_supports() to handle feature checks correctly
> even in the case that the per device domain was not yet created by checking
> the features supported by the MSI parent.
> 
> Add the necessary setup calls into the MSI and MSI-X enable code path.
> These setup calls are backwards compatible. They return success when there
> is no parent domain found, which means the existing global domains or the
> legacy allocation path keep just working.
> 
> Co-developed-by: Ahmed S. Darwish <darwi@linutronix.de>
> Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

A couple typos below.

> ---
>  drivers/pci/msi/irqdomain.c |  188 +++++++++++++++++++++++++++++++++++++++++++-
>  drivers/pci/msi/msi.c       |   16 +++
>  drivers/pci/msi/msi.h       |    2 
>  3 files changed, 201 insertions(+), 5 deletions(-)
> 
> --- a/drivers/pci/msi/irqdomain.c
> +++ b/drivers/pci/msi/irqdomain.c
> @@ -139,6 +139,170 @@ struct irq_domain *pci_msi_create_irq_do
>  }
>  EXPORT_SYMBOL_GPL(pci_msi_create_irq_domain);
>  
> +/*
> + * Per device MSI[-X] domain functionality
> + */
> +static void pci_device_domain_set_desc(msi_alloc_info_t *arg, struct msi_desc *desc)
> +{
> +	arg->desc = desc;
> +	arg->hwirq = desc->msi_index;
> +}
> +
> +static void pci_mask_msi(struct irq_data *data)
> +{
> +	struct msi_desc *desc = irq_data_get_msi_desc(data);
> +
> +	pci_msi_mask(desc, BIT(data->irq - desc->irq));
> +}
> +
> +static void pci_unmask_msi(struct irq_data *data)
> +{
> +	struct msi_desc *desc = irq_data_get_msi_desc(data);
> +
> +	pci_msi_unmask(desc, BIT(data->irq - desc->irq));
> +}
> +
> +#ifdef CONFIG_GENERIC_IRQ_RESERVATION_MODE
> +# define MSI_REACTIVATE		MSI_FLAG_MUST_REACTIVATE
> +#else
> +# define MSI_REACTIVATE		0
> +#endif
> +
> +#define MSI_COMMON_FLAGS	(MSI_FLAG_FREE_MSI_DESCS |	\
> +				 MSI_FLAG_ACTIVATE_EARLY |	\
> +				 MSI_FLAG_DEV_SYSFS |		\
> +				 MSI_REACTIVATE)
> +
> +static struct msi_domain_template pci_msi_template = {
> +	.chip = {
> +		.name			= "PCI-MSI",
> +		.irq_mask		= pci_mask_msi,
> +		.irq_unmask		= pci_unmask_msi,
> +		.irq_write_msi_msg	= pci_msi_domain_write_msg,
> +		.flags			= IRQCHIP_ONESHOT_SAFE,
> +	},
> +
> +	.ops = {
> +		.set_desc		= pci_device_domain_set_desc,
> +	},
> +
> +	.info = {
> +		.flags			= MSI_COMMON_FLAGS | MSI_FLAG_MULTI_PCI_MSI,
> +		.bus_token		= DOMAIN_BUS_PCI_DEVICE_MSI,
> +	},
> +};
> +
> +static void pci_mask_msix(struct irq_data *data)
> +{
> +	pci_msix_mask(irq_data_get_msi_desc(data));
> +}
> +
> +static void pci_unmask_msix(struct irq_data *data)
> +{
> +	pci_msix_unmask(irq_data_get_msi_desc(data));
> +}
> +
> +static struct msi_domain_template pci_msix_template = {
> +	.chip = {
> +		.name			= "PCI-MSIX",
> +		.irq_mask		= pci_mask_msix,
> +		.irq_unmask		= pci_unmask_msix,
> +		.irq_write_msi_msg	= pci_msi_domain_write_msg,
> +		.flags			= IRQCHIP_ONESHOT_SAFE,
> +	},
> +
> +	.ops = {
> +		.set_desc		= pci_device_domain_set_desc,
> +	},
> +
> +	.info = {
> +		.flags			= MSI_COMMON_FLAGS | MSI_FLAG_PCI_MSIX,
> +		.bus_token		= DOMAIN_BUS_PCI_DEVICE_MSIX,
> +	},
> +};
> +
> +static bool pci_match_device_domain(struct pci_dev *pdev, enum irq_domain_bus_token bus_token)
> +{
> +	return msi_match_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN, bus_token);
> +}
> +
> +static bool pci_create_device_domain(struct pci_dev *pdev, struct msi_domain_template *tmpl,
> +				     unsigned int hwsize)
> +{
> +	struct irq_domain *domain = dev_get_msi_domain(&pdev->dev);
> +
> +	if (!domain || !irq_domain_is_msi_parent(domain))
> +		return true;
> +
> +	return msi_create_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN, tmpl,
> +					    hwsize, NULL, NULL);
> +}
> +
> +/**
> + * pci_setup_msi_device_domain - Setup a device MSI interrupt domain
> + * @pdev:	The PCI device to create the domain on
> + *
> + * Return:
> + *  True when:
> + *	- The device does not have a MSI parent irq domain associated,
> + *	  which keeps the legacy architecture specific and the global
> + *	  PCI/MSI domain models working
> + *	- The MSI domain exists already
> + *	- The MSI domain was successfully allocated
> + *  False when:
> + *	- MSI-X is enabled
> + *	- The domain creation fails.
> + *
> + * The created MSI domain is preserved until:
> + *	- The device is removed
> + *	- MSI is disabled and a MSI-X domain is created
> + */
> +bool pci_setup_msi_device_domain(struct pci_dev *pdev)
> +{
> +	if (WARN_ON_ONCE(pdev->msix_enabled))
> +		return false;
> +
> +	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSI))
> +		return true;
> +	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSIX))
> +		msi_remove_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN);
> +
> +	return pci_create_device_domain(pdev, &pci_msi_template, 1);
> +}
> +
> +/**
> + * pci_setup_msix_device_domain - Setup a device MSI-X interrupt domain
> + * @pdev:	The PCI device to create the domain on
> + * @hwsize:	The size of the MSI-X vector table
> + *
> + * Return:
> + *  True when:
> + *	- The device does not have a MSI parent irq domain associated,
> + *	  which keeps the legacy architecture specific and the global
> + *	  PCI/MSI domain models working
> + *	- The MSI-X domain exists already
> + *	- The MSI-X domain was successfully allocated
> + *  False when:
> + *	- MSI is enabled
> + *	- The domain creation fails.
> + *
> + * The created MSI-X domain is preserved until:
> + *	- The device is removed
> + *	- MSI-X is disabled and a MSI domain is created
> + */
> +bool pci_setup_msix_device_domain(struct pci_dev *pdev, unsigned int hwsize)
> +{
> +	if (WARN_ON_ONCE(pdev->msix_enabled))
> +		return false;
> +
> +	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSIX))
> +		return true;
> +	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSI))
> +		msi_remove_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN);
> +
> +	return pci_create_device_domain(pdev, &pci_msix_template, hwsize);
> +}
> +
>  /**
>   * pci_msi_domain_supports - Check for support of a particular feature flag
>   * @pdev:		The PCI device to operate on
> @@ -152,13 +316,33 @@ bool pci_msi_domain_supports(struct pci_
>  {
>  	struct msi_domain_info *info;
>  	struct irq_domain *domain;
> +	unsigned int supported;
>  
>  	domain = dev_get_msi_domain(&pdev->dev);
>  
>  	if (!domain || !irq_domain_is_hierarchy(domain))
>  		return mode == ALLOW_LEGACY;
> -	info = domain->host_data;
> -	return (info->flags & feature_mask) == feature_mask;
> +
> +	if (!irq_domain_is_msi_parent(domain)) {
> +		/*
> +		 * For "global" PCI/MSI interrupt domains the associated
> +		 * msi_domain_info::flags is the authoritive source of

authoritative

> +		 * information.
> +		 */
> +		info = domain->host_data;
> +		supported = info->flags;
> +	} else {
> +		/*
> +		 * For MSI parent domains the supported feature set
> +		 * is avaliable in the parent ops. This makes checks

available

> +		 * possible before actually instantiating the
> +		 * per device domain because the parent is never
> +		 * expanding the PCI/MSI functionality.
> +		 */
> +		supported = domain->msi_parent_ops->supported_flags;
> +	}
> +
> +	return (supported & feature_mask) == feature_mask;
>  }
>  
>  /*
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -436,6 +436,9 @@ int __pci_enable_msi_range(struct pci_de
>  	if (rc)
>  		return rc;
>  
> +	if (!pci_setup_msi_device_domain(dev))
> +		return -ENODEV;
> +
>  	for (;;) {
>  		if (affd) {
>  			nvec = irq_calc_affinity_vectors(minvec, nvec, affd);
> @@ -787,9 +790,13 @@ int __pci_enable_msix_range(struct pci_d
>  	if (!pci_msix_validate_entries(dev, entries, nvec, hwsize))
>  		return -EINVAL;
>  
> -	/* PCI_IRQ_VIRTUAL is a horrible hack! */
> -	if (nvec > hwsize && !(flags & PCI_IRQ_VIRTUAL))
> -		nvec = hwsize;
> +	if (hwsize < nvec) {
> +		/* Keep the IRQ virtual hackery working */
> +		if (flags & PCI_IRQ_VIRTUAL)
> +			hwsize = nvec;
> +		else
> +			nvec = hwsize;
> +	}
>  
>  	if (nvec < minvec)
>  		return -ENOSPC;
> @@ -798,6 +805,9 @@ int __pci_enable_msix_range(struct pci_d
>  	if (rc)
>  		return rc;
>  
> +	if (!pci_setup_msix_device_domain(dev, hwsize))
> +		return -ENODEV;
> +
>  	for (;;) {
>  		if (affd) {
>  			nvec = irq_calc_affinity_vectors(minvec, nvec, affd);
> --- a/drivers/pci/msi/msi.h
> +++ b/drivers/pci/msi/msi.h
> @@ -105,6 +105,8 @@ enum support_mode {
>  };
>  
>  bool pci_msi_domain_supports(struct pci_dev *dev, unsigned int feature_mask, enum support_mode mode);
> +bool pci_setup_msi_device_domain(struct pci_dev *pdev);
> +bool pci_setup_msix_device_domain(struct pci_dev *pdev, unsigned int hwsize);
>  
>  /* Legacy (!IRQDOMAIN) fallbacks */
>  
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 24/33] PCI/MSI: Provide prepare_desc() MSI domain op
  2022-11-11 13:58 ` [patch 24/33] PCI/MSI: Provide prepare_desc() MSI domain op Thomas Gleixner
  2022-11-16 19:40   ` Jason Gunthorpe
@ 2022-11-16 20:26   ` Bjorn Helgaas
  2022-11-16 22:42     ` Thomas Gleixner
  1 sibling, 1 reply; 86+ messages in thread
From: Bjorn Helgaas @ 2022-11-16 20:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

On Fri, Nov 11, 2022 at 02:58:49PM +0100, Thomas Gleixner wrote:
> Dynamic MSI-X vector allocation post MSI-X allows to allocate vectors at a
> given index or at any free index in the available table range.

Is "post MSI-X" missing something?  "post MSI-X enablement" or
something?

> The latter
> requires that the core code selects the index at descriptor allocation time.
> 
> This requires that the PCI/MSI-X specific setup of the MSI-X descriptor,
> which is partially depending on the chosen index happens after allocation.

Is there a comma missing after "index"?  I.e., setup of the descriptor
partially depends on the chosen index?  And the above requires that
setup happens after allocation?

> Implement the prepare_desc() op in the PCI/MSI-X specific msi_domain_ops
> which is invoked before the core interrupt descriptor and the associated
> Linux interrupt number is allocated. That callback is also provided for the
> upcoming PCI/IMS implementations so the implementation specific interrupt
> domain can do their domain specific initialization of the MSI descriptors.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/msi/irqdomain.c |    9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> --- a/drivers/pci/msi/irqdomain.c
> +++ b/drivers/pci/msi/irqdomain.c
> @@ -202,6 +202,14 @@ static void pci_unmask_msix(struct irq_d
>  	pci_msix_unmask(irq_data_get_msi_desc(data));
>  }
>  
> +static void pci_msix_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
> +				  struct msi_desc *desc)
> +{
> +	/* Don't fiddle with preallocated MSI descriptors */
> +	if (!desc->pci.mask_base)
> +		msix_prepare_msi_desc(to_pci_dev(desc->dev), desc);
> +}
> +
>  static struct msi_domain_template pci_msix_template = {
>  	.chip = {
>  		.name			= "PCI-MSIX",
> @@ -212,6 +220,7 @@ static struct msi_domain_template pci_ms
>  	},
>  
>  	.ops = {
> +		.prepare_desc		= pci_msix_prepare_desc,
>  		.set_desc		= pci_device_domain_set_desc,
>  	},
>  
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 12/33] PCI/MSI: Add support for per device MSI[X] domains
  2022-11-16 19:13   ` Jason Gunthorpe
@ 2022-11-16 22:38     ` Thomas Gleixner
  2022-11-17  0:22       ` Jason Gunthorpe
  0 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-16 22:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Wed, Nov 16 2022 at 15:13, Jason Gunthorpe wrote:
> On Fri, Nov 11, 2022 at 02:58:30PM +0100, Thomas Gleixner wrote:
>> +	.info = {
>> +		.flags			= MSI_COMMON_FLAGS | MSI_FLAG_PCI_MSIX,
>> +		.bus_token		= DOMAIN_BUS_PCI_DEVICE_MSIX,
>> +	},
>> +};
>
> I like this splitting alot, it makes the whole thing make so much more
> sense.

:)

>> +bool pci_setup_msi_device_domain(struct pci_dev *pdev)
>> +{
>> +	if (WARN_ON_ONCE(pdev->msix_enabled))
>> +		return false;
>> +
>> +	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSI))
>> +		return true;
>> +	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSIX))
>> +		msi_remove_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN);
>> +
>> +	return pci_create_device_domain(pdev, &pci_msi_template, 1);
>
> Hardwired to one 1? What about multi-msi?

MSI has exactly ONE descriptor whether it's single or multi-MSI.

Multi-MSI can have several interrupts hanging off the same descriptor,
but that's not how MSI looks at it because you write ONE message and the
hardware does the substitution of the low bits depending on which vector
is raised.

I pondered changing that, but it would have required creating yet
another code path for the 20 years of legacy and adjusting every single
implementation of PCI/MSI domains or the underlying parents to handle
this new world order. We might then have talked about per device domains
some five years down the road.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 24/33] PCI/MSI: Provide prepare_desc() MSI domain op
  2022-11-16 20:26   ` Bjorn Helgaas
@ 2022-11-16 22:42     ` Thomas Gleixner
  0 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-16 22:42 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

On Wed, Nov 16 2022 at 14:26, Bjorn Helgaas wrote:
> On Fri, Nov 11, 2022 at 02:58:49PM +0100, Thomas Gleixner wrote:
>> Dynamic MSI-X vector allocation post MSI-X allows to allocate vectors at a
>> given index or at any free index in the available table range.
>
> Is "post MSI-X" missing something?  "post MSI-X enablement" or
> something?

Yes. That was the plan.

>> The latter
>> requires that the core code selects the index at descriptor allocation time.
>> 
>> This requires that the PCI/MSI-X specific setup of the MSI-X descriptor,
>> which is partially depending on the chosen index happens after allocation.
>
> Is there a comma missing after "index"?  I.e., setup of the descriptor
> partially depends on the chosen index?  And the above requires that
> setup happens after allocation?

Yes.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 25/33] PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X
  2022-11-16 20:19   ` Bjorn Helgaas
@ 2022-11-16 22:43     ` Thomas Gleixner
  0 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-16 22:43 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish, Reinette Chatre

On Wed, Nov 16 2022 at 14:19, Bjorn Helgaas wrote:
>> Also for the use cases Jason Gunthorpe pointed out, a single index allocation
>> is sufficient.
>
> Maybe a URL or outline the use cases so this means something in a few
> years?  I haven't followed this discussion, so it doesn't even mean
> anything to me now :)

Fair enough. Will add.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 12/33] PCI/MSI: Add support for per device MSI[X] domains
  2022-11-16 22:38     ` Thomas Gleixner
@ 2022-11-17  0:22       ` Jason Gunthorpe
  2022-11-17  8:45         ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Jason Gunthorpe @ 2022-11-17  0:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Wed, Nov 16, 2022 at 11:38:52PM +0100, Thomas Gleixner wrote:

> >> +bool pci_setup_msi_device_domain(struct pci_dev *pdev)
> >> +{
> >> +	if (WARN_ON_ONCE(pdev->msix_enabled))
> >> +		return false;
> >> +
> >> +	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSI))
> >> +		return true;
> >> +	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSIX))
> >> +		msi_remove_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN);
> >> +
> >> +	return pci_create_device_domain(pdev, &pci_msi_template, 1);
> >
> > Hardwired to one 1? What about multi-msi?
> 
> MSI has exactly ONE descriptor whether it's single or multi-MSI.
> 
> Multi-MSI can have several interrupts hanging off the same descriptor,
> but that's not how MSI looks at it because you write ONE message and the
> hardware does the substitution of the low bits depending on which vector
> is raised.

Okay, that is very clear, maybe this in a comment right here ?

Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 12/33] PCI/MSI: Add support for per device MSI[X] domains
  2022-11-17  0:22       ` Jason Gunthorpe
@ 2022-11-17  8:45         ` Thomas Gleixner
  0 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-17  8:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Wed, Nov 16 2022 at 20:22, Jason Gunthorpe wrote:
> On Wed, Nov 16, 2022 at 11:38:52PM +0100, Thomas Gleixner wrote:
>
>> >> +bool pci_setup_msi_device_domain(struct pci_dev *pdev)
>> >> +{
>> >> +	if (WARN_ON_ONCE(pdev->msix_enabled))
>> >> +		return false;
>> >> +
>> >> +	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSI))
>> >> +		return true;
>> >> +	if (pci_match_device_domain(pdev, DOMAIN_BUS_PCI_DEVICE_MSIX))
>> >> +		msi_remove_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN);
>> >> +
>> >> +	return pci_create_device_domain(pdev, &pci_msi_template, 1);
>> >
>> > Hardwired to one 1? What about multi-msi?
>> 
>> MSI has exactly ONE descriptor whether it's single or multi-MSI.
>> 
>> Multi-MSI can have several interrupts hanging off the same descriptor,
>> but that's not how MSI looks at it because you write ONE message and the
>> hardware does the substitution of the low bits depending on which vector
>> is raised.
>
> Okay, that is very clear, maybe this in a comment right here ?

Sure.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 19/33] genirq/msi: Provide msi_desc::msi_data
  2022-11-16 19:28   ` Jason Gunthorpe
@ 2022-11-17  8:48     ` Thomas Gleixner
  2022-11-17 13:33       ` Jason Gunthorpe
  2022-11-18 22:08     ` Thomas Gleixner
  1 sibling, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-17  8:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Wed, Nov 16 2022 at 15:28, Jason Gunthorpe wrote:
> On Fri, Nov 11, 2022 at 02:58:41PM +0100, Thomas Gleixner wrote:
>> +/**
>> + * struct msi_desc_data - Generic MSI descriptor data
>> + * @iobase:     Pointer to the IOMEM base address for interrupt callbacks
>> + * @cookie:	Device cookie provided at allocation time
>> + *
>> + * The content of this data is implementation defined, e.g. PCI/IMS
>> + * implementations will define the meaning of the data.
>> + */
>> +struct msi_desc_data {
>> +	void			__iomem *iobase;
>> +	union msi_dev_cookie	cookie;
>> +};
>
> It would be nice to see the pci_msi_desc converted to a domain
> specific storage as well.
>
> Maybe could be written
>
> struct msi_desc {
>    u64 domain_data[2];
> }
>
> struct pci_msi_desc {
> 		u32 msi_mask;
> 		u8	multiple	: 3;
> 		u8	multi_cap	: 3;
> 		u8	can_mask	: 1;
> 		u8	is_64		: 1;
> 		u8	mask_pos;
> 		u16 default_irq;
> }
> static_assert(sizeof(struct pci_msi_desc) <= sizeof(((struct msi_desc *)0)->domain_data));
>
> struct pci_msix_desc {
> 		u32 msix_ctrl;
> 		u8	multiple	: 3;
> 		u8	multi_cap	: 3;
> 		u8	can_mask	: 1;
> 		u8	is_64		: 1;
> 		u16 default_irq;
> 		void __iomem *mask_base;
> }
> static_assert(sizeof(struct pci_msix_desc) <= sizeof(((struct msi_desc *)0)->domain_data));
>
> ideally hidden in the pci code with some irq_chip facing export API to
> snoop in the bits a few places need
>
> We've used 128 bits for the PCI descriptor, we might as well let
> everyone have all 128 bits for whatever they want to do

Not sure because we end up with nasty type casts for

> struct msi_desc {
>    u64 domain_data[2];
> }

Let me think about it.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at()
  2022-11-16 19:36   ` Jason Gunthorpe
@ 2022-11-17  9:40     ` Thomas Gleixner
  0 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-17  9:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Wed, Nov 16 2022 at 15:36, Jason Gunthorpe wrote:
> On Fri, Nov 11, 2022 at 02:58:44PM +0100, Thomas Gleixner wrote:
>> The function also takes an optional @cookie argument which is of type union
>> msi_dev_cookie. This cookie is not used by the core code and is stored in
>> the allocated msi_desc::data::cookie. The meaning of the cookie is
>> completely implementation defined. In case of IMS this might be a PASID or
>> a pointer to a device queue, but for the MSI core it's opaque and not used
>> in any way.
>
> To my mind it makes more sense to pass a 'void *' through from
> msi_domain_alloc_irq_at() to the prepare_desc() op with the idea that
> the driver calling msi_domain_alloc_irq_at() knows it is calling it
> against the domain that it allocated. The prepare_desc can then use
> the void * to properly initialize anything about the desc under the
> right lock.

You are looking at it from one particular use case. 

> Before calling this the driver should have setup whatever thing is
> going to originate the interrupt, eg allocated the HW object that
> sources the interrupt and part of what the void * would convey is the
> detailed information on how to program the HW object. eg IDXD is using
> an iobase and an offset along with the enforcing PASID, but something
> like mlx5 would probably want an object id, type, and SF ID.

Correct, and that's why the cookie is there. You can stash your pointer
into the cookie, and an IDXD user stores the PASID. The IDXD user which
allocates an interrupt does not even know about iobase and offset. Nor
does it care what the IDXD irq domain implementation does with that
cookie.

Neither should your queue code care. The queue driver code puts a
pointer to struct mlx5_voodoo into the cookie when allocating the
interrupt, and then the mlx5 irqdomain code, which is a completely
separate entity, gets this cookie handed into prepare_desc().

struct mlx5_voodoo contains all the information the irq domain code
needs to set up the necessary things in the queue. That is obviously a
contract between the queue code and the irqdomain code, but that's no
different from MSI or MSI-X. The only difference is that in the IMS
case the contract is per device and not codified in a standard.

> This is again where I don't much like the use of an ID to refer to the
> domain.
>
> Having the driver allocate the device domain, retain a pointer to it,
> and use that domain pointer with all these new APIs seems much clearer
> than converting the pointer to an ID.

You're really obsessed with this irqdomain pointer, right?

You have to differentiate between the irq domain implementation and the
actual usage sites and not conflate them into one thing.

Let's look at the usage site:

      struct cookie cookie = { .ptr = mymagicqueue, }

      pci_ims_alloc_irq(pci_dev, &cookie);

versus:

      struct cookie cookie = { .ptr = mymagicqueue, }

      ims_alloc_irq(&pci_dev->dev, mydev->ims_domain, &cookie);

Even in the unlikely case that we have more than two domains, then still
the usage site has zero interest in the domain pointer:

      struct cookie cookie = { .ptr = mymagicqueue, }

      pci_ims_alloc_irq(pci_dev, myqueue->domid, &cookie);

where the code which instantiates myqueue sets up domid.

The usage site has absolutely no business to touch irqdomain pointer or
to even know that one exists. All it needs to know is how the cookie
contract works, obviously.

Now the functions you need in your irqdomain implementation to
e.g. prepare the MSI descriptor surely need to know about the irqdomain
pointer, but that gets handed in from the allocation code so the prepare
function knows which instance it is operating on.

So what does the irqdomain pointer buy you? Exactly nothing!

Look at the IDXD reference implementation.

     The IDXD probe code which initializes the physical device
     instantiates the irq domain along with the iobase for the
     storage array.

     The actual queue (or whatever IDXD names it) setup code just sticks
     PASID into the cookie and allocates an interrupt. It gets a virtual
     irq number and requests the interrupt.

Where is the need for a pointer? The queue code does not even know about
the iobase of the storage array. It's completely irrelevant there. All
it has to know is the cookie contract, not more.
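
As a hedged sketch only (not the in-tree IDXD code; ims_domain_to_idxd(),
idxd->ims_base and IMS_ENTRY_SIZE are made-up placeholders), this is
roughly how such a per device IMS domain's prepare_desc() could combine
the domain-owned iobase with the per allocation cookie:

static void idxd_ims_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
				  struct msi_desc *desc)
{
	/* Placeholder accessor for the domain specific device data */
	struct idxd_device *idxd = ims_domain_to_idxd(domain);

	/* Domain specific part: where the message storage for this index lives */
	desc->data.iobase = idxd->ims_base + desc->msi_index * IMS_ENTRY_SIZE;

	/*
	 * Per allocation part: desc->data.cookie was already copied from the
	 * cookie which the queue code handed to pci_ims_alloc_irq(), e.g. the
	 * PASID. The irq chip callbacks use both to write the message.
	 */
}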

Let's take your pointer obsession to the extreme:

      struct irq_desc *desc = pci_alloc_msix_interrupt(pci_dev);

      request_irq(desc, handler, pci_dev);

versus:

      int virq = pci_alloc_msix_interrupt(pci_dev);

      request_irq(virq, handler, pci_dev);

You could argue the same way that there is no need for a Linux interrupt
number and we could just use the interrupt descriptor pointer.

Sure, you can do that, but then you violate _all_ encapsulation rules in
one go for absolutely _ZERO_ value.

Want another example based on kmalloc()?

Almost 20 years ago I did a treewide mopup of drivers which had decided
that they needed to fiddle with the irq descriptor for all the wrong
reasons. I had to do that to be able to make a trivial change in the
core code...

C is patently bad for encapsulation, but you can make it worse by
forcefully ignoring the design patterns which make it possible to
completely hide the implementation details of a subsystem or
infrastructure.

If you look at the last commit in the ARM part of this work:

  https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/commit/?h=devmsi-arm&id=96c97746cbb431a306e95c04d6b3c75751244716

then you can see the final move to remove the visibility of
the MSI management internals.

This makes it possible to completely overhaul the inner workings of the
MSI core without having to chase abuse all over the place.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 27/33] genirq/msi: Provide constants for PCI/IMS support
  2022-11-16 19:54   ` Jason Gunthorpe
@ 2022-11-17  9:46     ` Thomas Gleixner
  0 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-17  9:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Wed, Nov 16 2022 at 15:54, Jason Gunthorpe wrote:
> On Fri, Nov 11, 2022 at 02:58:54PM +0100, Thomas Gleixner wrote:
>> +	/* Support for PCI/IMS */
>> +	MSI_FLAG_PCI_IMS		= (1 << 21),
>
> Maybe for legacy reasons it is too complicated, but it would be so
> much clearer of the special case of "I only know how to support PCI
> MSI and PCI MSI-X" was called out as a special flag, and the more
> general case of "any write_msg is fine by me" was left behind.
>
> I feel like when the device domain is created in the first place the
> parent domain(s) should be able to reject the creation if the
> requested child domain is not one it supports. Eg the hypervisor
> interactions checks if the child domain is PCI MSI or PCI MSI-X and
> rejects otherwise, because that is the only thing the hypervisor knows
> how to work with.
>
> If we did that perhaps we don't even need a flag or further checks?

It's not that simple. The flags are part of the domain creation sanity
checks, and due to other constraints in our marvelous zoo of
architectures, IOMMUs, hypervisors and whatnot, being explicit about
this is really required. Look at the GICv3-ITS voodoo which explicitly
needs to differentiate between PCI and non-PCI MSI. I wish we could
start from a clean slate, but that train has left the station long ago.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 19/33] genirq/msi: Provide msi_desc::msi_data
  2022-11-17  8:48     ` Thomas Gleixner
@ 2022-11-17 13:33       ` Jason Gunthorpe
  0 siblings, 0 replies; 86+ messages in thread
From: Jason Gunthorpe @ 2022-11-17 13:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Thu, Nov 17, 2022 at 09:48:45AM +0100, Thomas Gleixner wrote:

> > We've used 128 bits for the PCI descriptor, we might as well let
> > everyone have all 128 bits for whatever they want to do
> 
> Not sure because we end up with nasty type casts for

Something like:

void *msi_desc_device_priv(struct msi_desc *desc) { return desc->device_data; }

As netdev does it?
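
For reference, a minimal sketch of that netdev pattern (struct my_priv
is a placeholder): the private data lives in the same allocation as the
core object and is reached through an accessor:

	struct net_device *ndev = alloc_etherdev(sizeof(struct my_priv));
	struct my_priv *priv = netdev_priv(ndev);	/* driver private area */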

Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 02/33] genirq/msi: Provide struct msi_parent_ops
  2022-11-16 18:57   ` Jason Gunthorpe
@ 2022-11-17 15:58     ` Thomas Gleixner
  2022-11-18 13:52       ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-17 15:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Wed, Nov 16 2022 at 14:57, Jason Gunthorpe wrote:
> On Fri, Nov 11, 2022 at 02:58:14PM +0100, Thomas Gleixner wrote:
>> + * This is the most complex problem of per device MSI domains and the
>> + * underlying interrupt domain hierarchy:
>> + *
>> + * The device domain to be initialized requests the broadest feature set
>> + * possible and the underlying domain hierarchy puts restrictions on it.
>> + *
>> + * That's working perfectly fine for a strict parent->device model, but it
>> + * falls apart with a root_parent->real_parent->device chain because the
>
> This language hurt my brain :)

IKR

>> +bool msi_parent_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
>> +				  struct irq_domain *real_parent, struct msi_domain_info *info)
>
> 'real_parent' is global IRQ_DOMAIN_FLAG_MSI_PARENT of the dev for
> which we are constructing a msi_domain_info to create a child aka
> IRQ_DOMAIN_FLAG_MSI_DEVICE?
>
> 'domain' is the current step in the hierarchy as we walk up the ops
> pointers?

Yes.

> Maybe:
>
> @child_info: The MSI domain info of the IRQ_DOMAIN_FLAG_MSI_DEVICE
>              domain to be created
> @parent_domain: The IRQ_DOMAIN_FLAG_MSI_PARENT domain for the child to
>                 be created
> @domain: The domain in the hierarchy this op is being called on

Definitely better.

> And perhaps it would be a bit clearer to put the parent_domain inside
> the msi_domain_info, which is basically acting as an argument bundle
> for a future allocation call?

Maybe. Let me try.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 13/33] x86/apic/vector: Provide MSI parent domain
  2022-11-16 19:18   ` Jason Gunthorpe
@ 2022-11-17 20:06     ` Thomas Gleixner
  0 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-17 20:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Wed, Nov 16 2022 at 15:18, Jason Gunthorpe wrote:
> On Fri, Nov 11, 2022 at 02:58:31PM +0100, Thomas Gleixner wrote:
>> +static bool x86_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
>> +				  struct irq_domain *real_parent, struct msi_domain_info *info)
>> +{
>> +	const struct msi_parent_ops *pops = real_parent->msi_parent_ops;
>> +
>> +	/* MSI parent domain specific settings */
>> +	switch (real_parent->bus_token) {
>> +	case DOMAIN_BUS_ANY:
>> +		/* Only the vector domain can have the ANY token */
>> +		if (WARN_ON_ONCE(domain != real_parent))
>> +			return false;
>> +		info->chip->irq_set_affinity = msi_set_affinity;
>> +		/* See msi_set_affinity() for the gory details */
>> +		info->flags |= MSI_FLAG_NOMASK_QUIRK;
>> +		break;
>> +	default:
>> +		WARN_ON_ONCE(1);
>> +		return false;
>> +	}
>> +
>> +	/* Is the target supported? */
>> +	switch(info->bus_token) {
>> +	case DOMAIN_BUS_PCI_DEVICE_MSI:
>> +	case DOMAIN_BUS_PCI_DEVICE_MSIX:
>> +		break;
>> +	default:
>> +		WARN_ON_ONCE(1);
>> +		return false;
>
> Why does x86 care how the vector is ultimately programmed into the
> device?

That's not the point.

> The leaking of the MSI programming model into the irq implementations
> seems like there is still a troubled modularity.
>
> I understand that some implementations rely on a hypercall/trap or
> whatever and must know MSI vs MSI-X, but I'm surprised to see this
> here.

Why? It's the 'init a new per device domain' code, which can rightfully
have a say in whether it is willing to support something or not, or put
constraints on it. Those constraints can very much depend on the device
type or the MSI type. Creating random MSI domains seems to be pretty en
vogue today and I really have no interest in dealing with the fallout
once the fancy muck is merged in some random subsystem and the developer
has moved on. I have no idea why everyone thinks that driver writers
should be granted the ultimate freedom to do what they want and that
anything which puts a constraint on something is bad and troubled to
begin with.

Since I started to strictly encapsulate and fence off things, the
amount of horrors I had to debug and then mop up has decreased
significantly. It also forces people who want to add some fancy new
stuff to talk to the infrastructure people, so that the new
functionality can be looked at in the broader picture and solutions can
be found upfront and not after the fact, when the resulting damage is
discovered.

Quite a few of the issues I discovered during last year's discussions,
like the VFIO disable/enable trainwreck, the IRQ_VIRTUAL nonsense and
other random hacks, could have been avoided if people actually talked
to each other instead of just running off and hacking something into
place which then somehow gets merged.

On the ARM side there is even a fundamental requirement for this today
due to the way the existing infrastructure handles PCI/MSI[X] and
platform MSI, unless we go and rewrite half of the underlying code
first or in parallel.

It was also a migration aid to catch issues in the gradual conversion.

Again, we are not starting from a clean slate. I might be overly
cautious, but for very good reasons.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at()
  2022-11-11 13:58 ` [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at() Thomas Gleixner
  2022-11-16 19:36   ` Jason Gunthorpe
@ 2022-11-17 23:33   ` Reinette Chatre
  2022-11-18  0:58     ` Thomas Gleixner
  1 sibling, 1 reply; 86+ messages in thread
From: Reinette Chatre @ 2022-11-17 23:33 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish

Hi Thomas,

I am trying all three parts of this work out with some experimental
code within the IDXD driver that attempts to use IMS on the host.

In the test, pci_ims_alloc_irq() always encounters -EBUSY and it
seems that there is an attempt to insert the struct msi_desc into
the xarray twice, the second attempt encountering the -EBUSY.

While trying to understand what is going on I found myself looking
at this code area and I'll annotate this patch with what I learned.

On 11/11/2022 5:58 AM, Thomas Gleixner wrote:

...

> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -39,6 +39,7 @@ static inline int msi_sysfs_create_group
>  /* Invalid XA index which is outside of any searchable range */
>  #define MSI_XA_MAX_INDEX	(ULONG_MAX - 1)
>  #define MSI_XA_DOMAIN_SIZE	(MSI_MAX_INDEX + 1)
> +#define MSI_ANY_INDEX		UINT_MAX
>  
>  static inline void msi_setup_default_irqdomain(struct device *dev, struct msi_device_data *md)
>  {
> @@ -126,18 +127,34 @@ static int msi_insert_desc(struct device

When calling pci_ims_alloc_irq(), msi_insert_desc() ends up being
called twice, first with index = MSI_ANY_INDEX, second with index = 0.
(domid = 1 both times)

>  	}
>  
>  	hwsize = msi_domain_get_hwsize(dev, domid);
> -	if (index >= hwsize) {
> -		ret = -ERANGE;
> -		goto fail;
> -	}
>  
> -	desc->msi_index = index;
> -	index += baseidx;
> -	ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
> -	if (ret)
> -		goto fail;
> -	return 0;
> +	if (index == MSI_ANY_INDEX) {
> +		struct xa_limit limit;
> +		unsigned int index;
> +
> +		limit.min = baseidx;
> +		limit.max = baseidx + hwsize - 1;
>  
> +		/* Let the xarray allocate a free index within the limits */
> +		ret = xa_alloc(&md->__store, &index, desc, limit, GFP_KERNEL);
> +		if (ret)
> +			goto fail;
> +

This path (index == MSI_ANY_INDEX) is followed when msi_insert_desc()
is called the first time and the xa_alloc() succeeds at index 65536.

> +		desc->msi_index = index;

This is problematic with desc->msi_index being a u16, assigning
65536 to it becomes 0.

> +		return 0;
> +	} else {
> +		if (index >= hwsize) {
> +			ret = -ERANGE;
> +			goto fail;
> +		}
> +
> +		desc->msi_index = index;
> +		index += baseidx;
> +		ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
> +		if (ret)
> +			goto fail;

This "else" path is followed when msi_insert_desc() is called the second
time with "index = 0". The xa_insert() above fails at index 65536
(baseidx = 65536) with -EBUSY, trickling up as the return code to
pci_ims_alloc_irq().

> +		return 0;
> +	}
>  fail:
>  	msi_free_desc(desc);
>  	return ret;
> @@ -335,7 +352,7 @@ int msi_setup_device_data(struct device
>  
>  	msi_setup_default_irqdomain(dev, md);
>  
> -	xa_init(&md->__store);
> +	xa_init_flags(&md->__store, XA_FLAGS_ALLOC);
>  	mutex_init(&md->mutex);
>  	md->__iter_idx = MSI_XA_MAX_INDEX;
>  	dev->msi.data = md;
> @@ -1423,6 +1440,72 @@ int msi_domain_alloc_irqs_all_locked(str
>  	return msi_domain_alloc_locked(dev, &ctrl);
>  }
>  
> +/**
> + * msi_domain_alloc_irq_at - Allocate an interrupt from a MSI interrupt domain at
> + *			     a given index - or at the next free index
> + *
> + * @dev:	Pointer to device struct of the device for which the interrupts
> + *		are allocated
> + * @domid:	Id of the interrupt domain to operate on
> + * @index:	Index for allocation. If @index == %MSI_ANY_INDEX the allocation
> + *		uses the next free index.
> + * @affdesc:	Optional pointer to an interrupt affinity descriptor structure
> + * @cookie:	Optional pointer to a descriptor specific cookie to be stored
> + *		in msi_desc::data. Must be NULL for MSI-X allocations
> + *
> + * This requires a MSI interrupt domain which lets the core code manage the
> + * MSI descriptors.
> + *
> + * Return: struct msi_map
> + *
> + *	On success msi_map::index contains the allocated index number and
> + *	msi_map::virq the corresponding Linux interrupt number
> + *
> + *	On failure msi_map::index contains the error code and msi_map::virq
> + *	is %0.
> + */
> +struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, unsigned int index,
> +				       const struct irq_affinity_desc *affdesc,
> +				       union msi_dev_cookie *cookie)
> +{
> +	struct irq_domain *domain;
> +	struct msi_map map = { };
> +	struct msi_desc *desc;
> +	int ret;
> +
> +	msi_lock_descs(dev);
> +	domain = msi_get_device_domain(dev, domid);
> +	if (!domain) {
> +		map.index = -ENODEV;
> +		goto unlock;
> +	}
> +
> +	desc = msi_alloc_desc(dev, 1, affdesc);
> +	if (!desc) {
> +		map.index = -ENOMEM;
> +		goto unlock;
> +	}
> +
> +	if (cookie)
> +		desc->data.cookie = *cookie;
> +
> +	ret = msi_insert_desc(dev, desc, domid, index);
> +	if (ret) {
> +		map.index = ret;
> +		goto unlock;
> +	}

Above is the first call to msi_insert_desc(/* index = MSI_ANY_INDEX */)

> +
> +	map.index = desc->msi_index;

msi_insert_desc() did attempt to set desc->msi_index to 65536 but map.index ends
up being 0.

> +	ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index);

Here is where the second call to msi_insert_desc() originates:

msi_domain_alloc_irqs_range_locked() -> msi_domain_alloc_locked() -> \
__msi_domain_alloc_locked() -> msi_domain_alloc_simple_msi_descs() -> \
msi_domain_add_simple_msi_descs() -> msi_insert_desc()
		

> +	if (ret)
> +		map.index = ret;
> +	else
> +		map.virq = desc->irq;
> +unlock:
> +	msi_unlock_descs(dev);
> +	return map;
> +}
> +
>  static void __msi_domain_free_irqs(struct device *dev, struct irq_domain *domain,
>  				   struct msi_ctrl *ctrl)
>  {
> 

Reinette

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at()
  2022-11-17 23:33   ` Reinette Chatre
@ 2022-11-18  0:58     ` Thomas Gleixner
  2022-11-18  9:15       ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-18  0:58 UTC (permalink / raw)
  To: Reinette Chatre, LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish

On Thu, Nov 17 2022 at 15:33, Reinette Chatre wrote:
> I am trying all three parts of this work out with some experimental
> code within the IDXD driver that attempts to use IMS on the host.
>
> In the test, pci_ims_alloc_irq() always encounters -EBUSY and it
> seems that there is an attempt to insert the struct msi_desc into
> the xarray twice, the second attempt encountering the -EBUSY.
>
> While trying to understand what is going on I found myself looking
> at this code area and I'll annotate this patch with what I learned.

Ok.

> When calling pci_ims_alloc_irq(), msi_insert_desc() ends up being
> called twice, first with index = MSI_ANY_INDEX, second with index = 0.
> (domid = 1 both times)

How so?

>>  	}
>>  
>>  	hwsize = msi_domain_get_hwsize(dev, domid);
>> -	if (index >= hwsize) {
>> -		ret = -ERANGE;
>> -		goto fail;
>> -	}
>>  
>> -	desc->msi_index = index;
>> -	index += baseidx;
>> -	ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
>> -	if (ret)
>> -		goto fail;
>> -	return 0;
>> +	if (index == MSI_ANY_INDEX) {
>> +		struct xa_limit limit;
>> +		unsigned int index;
>> +
>> +		limit.min = baseidx;
>> +		limit.max = baseidx + hwsize - 1;
>>  
>> +		/* Let the xarray allocate a free index within the limits */
>> +		ret = xa_alloc(&md->__store, &index, desc, limit, GFP_KERNEL);
>> +		if (ret)
>> +			goto fail;
>> +
>
> This path (index == MSI_ANY_INDEX) is followed when msi_insert_desc()
> is called the first time and the xa_alloc() succeeds at index 65536.
>
>> +		desc->msi_index = index;
>
> This is problematic with desc->msi_index being a u16, assigning
> 65536 to it becomes 0.

You are partially right. I need to fix that and make it explicit as it's
a "works by chance or maybe not" construct right now.

But desc->msi_index is correct to be truncated because it's the index
within the domain space which is zero based.

>> +		return 0;
>> +	} else {
>> +		if (index >= hwsize) {
>> +			ret = -ERANGE;
>> +			goto fail;
>> +		}
>> +
>> +		desc->msi_index = index;
>> +		index += baseidx;
>> +		ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
>> +		if (ret)
>> +			goto fail;
>
> This "else" path is followed when msi_insert_desc() is called the second
> time with "index = 0". The xa_insert() above fails at index 65536
> (baseidx = 65536) with -EBUSY, trickling up as the return code to
> pci_ims_alloc_irq().

Why is it called with index=0 the second time?
>> +	desc = msi_alloc_desc(dev, 1, affdesc);
>> +	if (!desc) {
>> +		map.index = -ENOMEM;
>> +		goto unlock;
>> +	}
>> +
>> +	if (cookie)
>> +		desc->data.cookie = *cookie;
>> +
>> +	ret = msi_insert_desc(dev, desc, domid, index);
>> +	if (ret) {
>> +		map.index = ret;
>> +		goto unlock;
>> +	}
>
> Above is the first call to msi_insert_desc(/* index = MSI_ANY_INDEX */)
>
>> +
>> +	map.index = desc->msi_index;
>
> msi_insert_desc() did attempt to set desc->msi_index to 65536 but map.index ends
> up being 0.

which is kinda correct.

>> +	ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index);
>
> Here is where the second call to msi_insert_desc() originates:
>
> msi_domain_alloc_irqs_range_locked() -> msi_domain_alloc_locked() -> \
> __msi_domain_alloc_locked() -> msi_domain_alloc_simple_msi_descs() -> \
> msi_domain_add_simple_msi_descs() -> msi_insert_desc()

but yes, that's bogus because it tries to allocate what is allocated already.

Too tired to decode this circular dependency right now. Will stare at it
with brain awake in the morning. Duh!

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at()
  2022-11-18  0:58     ` Thomas Gleixner
@ 2022-11-18  9:15       ` Thomas Gleixner
  2022-11-18 11:05         ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-18  9:15 UTC (permalink / raw)
  To: Reinette Chatre, LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish

On Fri, Nov 18 2022 at 01:58, Thomas Gleixner wrote:
> On Thu, Nov 17 2022 at 15:33, Reinette Chatre wrote:
>> When calling pci_ims_alloc_irq(), msi_insert_desc() ends up being
>> called twice, first with index = MSI_ANY_INDEX, second with index = 0.
>> (domid = 1 both times)
>
> How so?
>
>>>  	}
>>>  
>>>  	hwsize = msi_domain_get_hwsize(dev, domid);
>>> -	if (index >= hwsize) {
>>> -		ret = -ERANGE;
>>> -		goto fail;
>>> -	}
>>>  
>>> -	desc->msi_index = index;
>>> -	index += baseidx;
>>> -	ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
>>> -	if (ret)
>>> -		goto fail;
>>> -	return 0;
>>> +	if (index == MSI_ANY_INDEX) {
>>> +		struct xa_limit limit;
>>> +		unsigned int index;
>>> +
>>> +		limit.min = baseidx;
>>> +		limit.max = baseidx + hwsize - 1;
>>>  
>>> +		/* Let the xarray allocate a free index within the limits */
>>> +		ret = xa_alloc(&md->__store, &index, desc, limit, GFP_KERNEL);
>>> +		if (ret)
>>> +			goto fail;
>>> +
>>
>> This path (index == MSI_ANY_INDEX) is followed when msi_insert_desc()
>> is called the first time and the xa_alloc() succeeds at index 65536.
>>
>>> +		desc->msi_index = index;
>>
>> This is problematic with desc->msi_index being a u16, assigning
>> 65536 to it becomes 0.
>
> You are partially right. I need to fix that and make it explicit as it's
> a "works by chance or maybe not" construct right now.
>
> But desc->msi_index is correct to be truncated because it's the index
> within the domain space which is zero based.

It should obviously do:

   desc->msi_index = index - baseidx;

>>> +		return 0;
>>> +	} else {
>>> +		if (index >= hwsize) {
>>> +			ret = -ERANGE;
>>> +			goto fail;
>>> +		}
>>> +
>>> +		desc->msi_index = index;
>>> +		index += baseidx;
>>> +		ret = xa_insert(&md->__store, index, desc, GFP_KERNEL);
>>> +		if (ret)
>>> +			goto fail;
>>
>> This "else" path is followed when msi_insert_desc() is called the second
>> time with "index = 0". The xa_insert() above fails at index 65536
>> (baseidx = 65536) with -EBUSY, trickling up as the return code to
>> pci_ims_alloc_irq().
>
> Why is it called with index=0 the second time?
>>> +	desc = msi_alloc_desc(dev, 1, affdesc);
>>> +	if (!desc) {
>>> +		map.index = -ENOMEM;
>>> +		goto unlock;
>>> +	}
>>> +
>>> +	if (cookie)
>>> +		desc->data.cookie = *cookie;
>>> +
>>> +	ret = msi_insert_desc(dev, desc, domid, index);
>>> +	if (ret) {
>>> +		map.index = ret;
>>> +		goto unlock;
>>> +	}
>>
>> Above is the first call to msi_insert_desc(/* index = MSI_ANY_INDEX */)
>>
>>> +
>>> +	map.index = desc->msi_index;
>>
>> msi_insert_desc() did attempt to set desc->msi_index to 65536 but map.index ends
>> up being 0.
>
> which is kinda correct.
>
>>> +	ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index);
>>
>> Here is where the second call to msi_insert_desc() originates:
>>
>> msi_domain_alloc_irqs_range_locked() -> msi_domain_alloc_locked() -> \
>> __msi_domain_alloc_locked() -> msi_domain_alloc_simple_msi_descs() -> \
>> msi_domain_add_simple_msi_descs() -> msi_insert_desc()
>
> but yes, that's bogus because it tries to allocate what is allocated already.
>
> Too tired to decode this circular dependency right now. Will stare at it
> with brain awake in the morning. Duh!

Duh. I'm a moron.

Of course I "tested" this by flipping default and secondary domain
around and doing dynamic allocations from PCI/MSI-X but that won't catch
the bug because PCI/MSI-X does not have the ALLOC_SIMPLE_DESCS flag set.

Let me fix that.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at()
  2022-11-18  9:15       ` Thomas Gleixner
@ 2022-11-18 11:05         ` Thomas Gleixner
  2022-11-18 18:18           ` Reinette Chatre
  0 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-18 11:05 UTC (permalink / raw)
  To: Reinette Chatre, LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish

On Fri, Nov 18 2022 at 10:15, Thomas Gleixner wrote:
> On Fri, Nov 18 2022 at 01:58, Thomas Gleixner wrote:
> Of course I "tested" this by flipping default and secondary domain
> around and doing dynamic allocations from PCI/MSI-X but that won't catch
> the bug because PCI/MSI-X does not have the ALLOC_SIMPLE_DESCS flag set.
>
> Let me fix that.

Delta patch against

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git devmsi-v1G-part3

below.

Thanks,

        tglx
---
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index d4f26649a185..d243ad3e5489 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc,
 		if (ret)
 			goto fail;
 
-		desc->msi_index = index;
+		desc->msi_index = index - baseidx;
 		return 0;
 	} else {
 		if (index >= hwsize) {
@@ -1476,9 +1476,10 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u
 				       const struct irq_affinity_desc *affdesc,
 				       union msi_dev_cookie *cookie)
 {
+	struct msi_ctrl ctrl = { .domid	= domid, .nirqs = 1, };
+	struct msi_domain_info *info;
 	struct irq_domain *domain;
 	struct msi_map map = { };
-	struct msi_desc *desc;
 	int ret;
 
 	msi_lock_descs(dev);
@@ -1503,12 +1504,16 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u
 		goto unlock;
 	}
 
-	map.index = desc->msi_index;
-	ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index);
-	if (ret)
+	ctrl.first = ctrl.last = desc->msi_index;
+	info = domain->host_data;
+
+	ret = __msi_domain_alloc_irqs(dev, domain, &ctrl);
+	if (ret) {
 		map.index = ret;
-	else
+		msi_domain_free_locked(dev, &ctrl);
+	} else {
 		map.virq = desc->irq;
+	}
 unlock:
 	msi_unlock_descs(dev);
 	return map;

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [patch 02/33] genirq/msi: Provide struct msi_parent_ops
  2022-11-17 15:58     ` Thomas Gleixner
@ 2022-11-18 13:52       ` Thomas Gleixner
  0 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-18 13:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Thu, Nov 17 2022 at 16:58, Thomas Gleixner wrote:
> On Wed, Nov 16 2022 at 14:57, Jason Gunthorpe wrote:
>
>> And perhaps it would be a bit clearer to put the parent_domain inside
>> the msi_domain_info, which is basically acting as an argument bundle
>> for a future allocation call?
>
> Maybe. Let me try.

No. That's redundant storage because the domain creation stores the
parent domain in irqdomain::parent which is what the hierarchy code
uses. That code does not know about msi_domain_info.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at()
  2022-11-18 11:05         ` Thomas Gleixner
@ 2022-11-18 18:18           ` Reinette Chatre
  2022-11-18 22:31             ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Reinette Chatre @ 2022-11-18 18:18 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish

Hi Thomas,

On 11/18/2022 3:05 AM, Thomas Gleixner wrote:
> On Fri, Nov 18 2022 at 10:15, Thomas Gleixner wrote:
>> On Fri, Nov 18 2022 at 01:58, Thomas Gleixner wrote:

...

> diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
> index d4f26649a185..d243ad3e5489 100644
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc,
>  		if (ret)
>  			goto fail;
>  
> -		desc->msi_index = index;
> +		desc->msi_index = index - baseidx;

Could msi_desc->msi_index be made bigger? The hardware I am testing
on claims to support more IMS entries than what the u16 can
accommodate.

>  		return 0;
>  	} else {
>  		if (index >= hwsize) {
> @@ -1476,9 +1476,10 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u
>  				       const struct irq_affinity_desc *affdesc,
>  				       union msi_dev_cookie *cookie)
>  {
> +	struct msi_ctrl ctrl = { .domid	= domid, .nirqs = 1, };
> +	struct msi_domain_info *info;
>  	struct irq_domain *domain;
>  	struct msi_map map = { };
> -	struct msi_desc *desc;

(*desc is still needed)

>  	int ret;
>  
>  	msi_lock_descs(dev);
> @@ -1503,12 +1504,16 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u
>  		goto unlock;
>  	}
>  
> -	map.index = desc->msi_index;
> -	ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index);
> -	if (ret)
> +	ctrl.first = ctrl.last = desc->msi_index;
> +	info = domain->host_data;
> +
> +	ret = __msi_domain_alloc_irqs(dev, domain, &ctrl);
> +	if (ret) {
>  		map.index = ret;
> -	else
> +		msi_domain_free_locked(dev, &ctrl);
> +	} else {
>  		map.virq = desc->irq;
> +	}
>  unlock:
>  	msi_unlock_descs(dev);
>  	return map;

Thank you very much. With the above snippet it is possible to
allocate an IMS IRQ. I am not yet able to use the IRQ and I am working
on more tracing to figure out why. In the meantime, I did
just try the pci_ims_alloc_irq()/pci_ims_free_irq() flow and
pci_ims_free_irq() triggered the WARN below:

remove_proc_entry: removing non-empty directory 'irq/220', leaking at least 'idxd-portal'
WARNING: CPU: XX PID: 4322 at fs/proc/generic.c:718 remove_proc_entry+0x184/0x190

[SNIP]

RIP: 0010:remove_proc_entry+0x184/0x190
Code: a5 af 48 8d 90 68 ff ff ff 48 85 c0 48 0f 45 c2 48 8b 95 88 00 00 00 4c 8b 80 b0 00 00 00 48 8b 92 b0 00 00 00 e8 2d 67 c6 00 <0f> 0b e9 4d ff ff ff e8 a0 c1 ce 00 0f 1f 44 00 00 41 57 41 56 41
RSP: 0018:ff223b51cf947c80 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ff1b39f680241300 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffafa37b97 RDI: 00000000ffffffff
RBP: ff1b3a06666b8000 R08: ff1b3a15bdbfffe8 R09: 0000000000000003
R10: ff1b3a15bce00000 R11: ff1b3a15bd900000 R12: ff1b3a06666b8090
R13: 00000000000000dd R14: ff1b3a069237fb80 R15: 0000000000000001
FS:  00007fedd2dff000(0000) GS:ff1b3a15be940000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff8e5bfc01c CR3: 000000110a138006 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
unregister_irq_proc+0xe3/0x110
free_desc+0x29/0x70
irq_free_descs+0x4b/0x80
msi_domain_free_locked.part.0+0x19b/0x1d0
msi_domain_free_irqs_range+0x67/0xb0
idxd_wq_free_irq+0x89/0x150 [idxd]
drv_disable_wq+0x5f/0x90 [idxd]
idxd_dmaengine_drv_remove+0xa3/0xc0 [idxd]
device_release_driver_internal+0x1aa/0x230
driver_detach+0x44/0x90
bus_remove_driver+0x58/0xe0
idxd_exit_module+0x18/0x3a [idxd]
__do_sys_delete_module.constprop.0+0x186/0x280
? fpregs_assert_state_consistent+0x22/0x50
? exit_to_user_mode_prepare+0x40/0x150
do_syscall_64+0x40/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fedd2526c9b
Code: 73 01 c3 48 8b 0d 95 21 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 65 21 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007ffca85a47d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 000055af0e1a3700 RCX: 00007fedd2526c9b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055af0e1a3768
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 00007fedd25beac0 R11: 0000000000000206 R12: 00007ffca85a4a30
R13: 000055af0e1a32a0 R14: 00007ffca85a58e5 R15: 000055af0e1a3700
</TASK>

Reinette

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 19/33] genirq/msi: Provide msi_desc::msi_data
  2022-11-16 19:28   ` Jason Gunthorpe
  2022-11-17  8:48     ` Thomas Gleixner
@ 2022-11-18 22:08     ` Thomas Gleixner
  2022-11-21 17:20       ` Jason Gunthorpe
  1 sibling, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-18 22:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Wed, Nov 16 2022 at 15:28, Jason Gunthorpe wrote:
> On Fri, Nov 11, 2022 at 02:58:41PM +0100, Thomas Gleixner wrote:
>
>> +/**
>> + * struct msi_desc_data - Generic MSI descriptor data
>> + * @iobase:     Pointer to the IOMEM base address for interrupt callbacks
>> + * @cookie:	Device cookie provided at allocation time
>> + *
>> + * The content of this data is implementation defined, e.g. PCI/IMS
>> + * implementations will define the meaning of the data.
>> + */
>> +struct msi_desc_data {
>> +	void			__iomem *iobase;
>> +	union msi_dev_cookie	cookie;
>> +};
>
> It would be nice to see the pci_msi_desc converted to a domain
> specific storage as well.

I looked into this and it gets ugly very fast.

The above has two parts:

    iobase    is domain specific and set up by the domain code

    cookie    is per interrupt allocation. That's where the instance
              queue or whatever connects to the domain.

I can abuse the fields for PCI/MSI of course, but see below.

> Maybe could be written
>
> struct pci_msi_desc {
> }
> static_assert(sizeof(struct pci_msi_desc) <= sizeof(((struct msi_desc *)0)->domain_data));
>
> struct pci_msix_desc {
> }
> static_assert(sizeof(struct pci_msix_desc) <= sizeof(((struct msi_desc *)0)->domain_data));
>
> ideally hidden in the pci code with some irq_chip facing export API to
> snoop in the bits a few places need

I can't use that for the current combo legacy PCI/MSI code as I can't
split the irq chip implementations like I can with the new per device
domains.

And no, I'm not going to create separate code paths which do the same
thing on different data structures just to pretend that it's all shiny.

> We've used 128 bits for the PCI descriptor, we might as well let
> everyone have all 128 bits for whatever they want to do

That's fine, but there are two parts to it:

   1) Domain specific data

   2) Per allocation specific data

#1 is data which the domain code puts there, e.g. in the prepare_desc()
   callback

#2 is data which the usage site hands in which gives the domain and the
   interrupt chip the information it needs

So the data structure should look like this:

struct msi_desc_data {
	union msi_domain_cookie		dom_cookie;
	union msi_instance_cookie	ins_cookie;
};

union msi_domain_cookie {
	void __iomem	*iobase;
        void		*ptr;
        u64		value;
};

union msi_instance_cookie {
        void		*ptr;
        u64		value;
};

Sure I could make both cookies plain u64, but I hate these forced type
casts and the above is simple to handle and understand.

So you get your 128 bits, but not per instance, because that's a
nightmare to validate versus the allocation code which has to copy the
data into the msi descriptor, whatever it is (PASID, queue pointer ...).

Having two cookies makes a lot of sense because it gives a proper
separation between the domain and the usage site.

For IDXD the domain wants to store the iobase and needs a per allocation
PASID.

For your queue model, the domain wants a pointer to some device or
whatever specific things and the queue provides a pointer so that the
domain/chip can do the right thing for that particular instance.

For both sides, the domain and the allocation side something like the
above is sufficient.
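
Purely as a usage sketch of this split (the function signature, struct
myqueue and the handler name are assumptions, not a final API): the
queue side only ever fills the instance cookie, while the domain cookie
stays private to the domain setup code:

static int myqueue_setup_irq(struct pci_dev *pdev, struct myqueue *q)
{
	union msi_instance_cookie icookie = { .ptr = q };
	struct msi_map map;

	/* The instance cookie ends up in the msi_desc for the domain/chip code */
	map = pci_ims_alloc_irq(pdev, &icookie, NULL);
	if (map.index < 0)
		return map.index;

	q->irq_map = map;
	return request_irq(map.virq, myqueue_irq_handler, 0, "myqueue", q);
}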

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at()
  2022-11-18 18:18           ` Reinette Chatre
@ 2022-11-18 22:31             ` Thomas Gleixner
  2022-11-18 22:59               ` Reinette Chatre
  0 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-18 22:31 UTC (permalink / raw)
  To: Reinette Chatre, LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish

On Fri, Nov 18 2022 at 10:18, Reinette Chatre wrote:
>> @@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc,
>>  		if (ret)
>>  			goto fail;
>>  
>> -		desc->msi_index = index;
>> +		desc->msi_index = index - baseidx;
>
> Could msi_desc->msi_index be made bigger? The hardware I am testing
> on claims to support more IMS entries than what the u16 can
> accommodate.

Sure that's trivial. How big does it claim it is?

>> @@ -1476,9 +1476,10 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u
>>  				       const struct irq_affinity_desc *affdesc,
>>  				       union msi_dev_cookie *cookie)
>>  {
>> +	struct msi_ctrl ctrl = { .domid	= domid, .nirqs = 1, };
>> +	struct msi_domain_info *info;
>>  	struct irq_domain *domain;
>>  	struct msi_map map = { };
>> -	struct msi_desc *desc;
>
> (*desc is still needed)

Yes, I figured that out later :)

> Thank you very much. With the above snippet it is possible to
> allocate an IMS IRQ. I am not yet able to use the IRQ and I am working
> on more tracing to figure out why. In the mean time, I did
> just try the pci_ims_alloc_irq()/pci_ims_free_irq() flow and
> pci_ims_free_irq() triggered the WARN below:
>
> remove_proc_entry: removing non-empty directory 'irq/220', leaking at least 'idxd-portal'

Hrm, that's the irq action directory. No idea why that is not torn down.

I assume your sequence is:

  pci_ims_alloc();
  request_irq();        <- This creates it
  free_irq();           <- This removes it
  pci_ims_free();

Right?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at()
  2022-11-18 22:31             ` Thomas Gleixner
@ 2022-11-18 22:59               ` Reinette Chatre
  2022-11-19  0:19                 ` Reinette Chatre
  0 siblings, 1 reply; 86+ messages in thread
From: Reinette Chatre @ 2022-11-18 22:59 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish

Hi Thomas,

On 11/18/2022 2:31 PM, Thomas Gleixner wrote:
> On Fri, Nov 18 2022 at 10:18, Reinette Chatre wrote:
>>> @@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc,
>>>  		if (ret)
>>>  			goto fail;
>>>  
>>> -		desc->msi_index = index;
>>> +		desc->msi_index = index - baseidx;
>>
>> Could msi_desc->msi_index be made bigger? The hardware I am testing
>> on claims to support more IMS entries than what the u16 can
>> accommodate.
> 
> Sure that's trivial. How big does it claim it is?

2048

> I assume your sequence is:
> 
>   pci_ims_alloc();
>   request_irq();        <- This creates it
>   free_irq();           <- This removes it
>   pci_ims_free();
> 
> Right?

No. My mistake. Sorry for the noise.

Reinette

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at()
  2022-11-18 22:59               ` Reinette Chatre
@ 2022-11-19  0:19                 ` Reinette Chatre
  0 siblings, 0 replies; 86+ messages in thread
From: Reinette Chatre @ 2022-11-19  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish

Hi Thomas,

On 11/18/2022 2:59 PM, Reinette Chatre wrote:
> On 11/18/2022 2:31 PM, Thomas Gleixner wrote:
>> On Fri, Nov 18 2022 at 10:18, Reinette Chatre wrote:
>>>> @@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc,
>>>>  		if (ret)
>>>>  			goto fail;
>>>>  
>>>> -		desc->msi_index = index;
>>>> +		desc->msi_index = index - baseidx;
>>>
>>> Could msi_desc->msi_index be made bigger? The hardware I am testing
>>> on claims to support more IMS entries than what the u16 can
>>> accommodate.
>>
>> Sure that's trivial. How big does it claim it is?
> 
> 2048

Dave Jiang corrected me ... the max the hardware can support
is 16128 so the current size of msi_index is sufficient.

Reinette

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 19/33] genirq/msi: Provide msi_desc::msi_data
  2022-11-18 22:08     ` Thomas Gleixner
@ 2022-11-21 17:20       ` Jason Gunthorpe
  2022-11-21 19:40         ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Jason Gunthorpe @ 2022-11-21 17:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Fri, Nov 18, 2022 at 11:08:55PM +0100, Thomas Gleixner wrote:

> I looked into this and it gets ugly very fast.
> 
> The above has two parts:
> 
>     iobase    is domain specific and setup by the domain code
> 
>     cookie    is per interrupt allocation. That's where the instance
>               queue or whatever connects to the domain.
> 
> I can abuse the fields for PCI/MSI of course, but see below.

I don't know that we need to store the second one forever in the desc.
I was thinking this information is ephemeral, just used during alloc,
and if the msi domain driver wishes some of it to be stored then it
should do so.

> Sure I could make both cookies plain u64, but I hate these forced type
> casts and the above is simple to handle and understand.

I guess, they aren't what I think of as cookies, so I wouldn't make
them u64 in the first place.

The argument to msi_domain_alloc_irq_at() ideally wants to be a
per-domain-type struct so we can follow it around more cleanly. This is
C so we have to type erase it as a void * through the core code, but
OK.

The second one is typically called "driver private data" in device
driver subsystems that can't use container_of for some reason - just a
chunk of data the driver can associate with a core owned struct.

The usual pattern for driver private data is for the core to provide
some kind of accessor void *get_priv() (think dev_get_drvdata()) or
whatever.

But I do understand your point about keeping the drivers away from
things. Maybe some other pattern is better in this case.

Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 19/33] genirq/msi: Provide msi_desc::msi_data
  2022-11-21 17:20       ` Jason Gunthorpe
@ 2022-11-21 19:40         ` Thomas Gleixner
  2022-11-22  1:52           ` Jason Gunthorpe
  0 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-21 19:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Mon, Nov 21 2022 at 13:20, Jason Gunthorpe wrote:
> On Fri, Nov 18, 2022 at 11:08:55PM +0100, Thomas Gleixner wrote:
>> Sure I could make both cookies plain u64, but I hate these forced type
>> casts and the above is simple to handle and understand.
>
> I guess, they aren't what I think of as cookies, so I wouldn't make
> them u64 in the first place.
>
> The argument to msi_domain_alloc_irq_at() ideally wants to be a
> per-domain-type struct so we can follow it around more cleanly. This is
> C so we have to type erase it as a void * through the core code, but
> OK.

When looking at the wire to MSI abomination and also PASID there is no
real per domain struct. It's plain integer information and I hate to
store it in a pointer. Especially as the pointer width on 32bit is not
necessarily sufficient.

Allocating 8 bytes and tracking them to be freed would be a horrible
idea.

> The second one is typically called "driver private data" in device
> driver subsystems that can't use container_of for some reason - just a
> chunk of data the driver can associate with a core owned struct.
>
> The usual pattern for driver private data is for the core to provide
> some kind of accessor void *get_priv() (think dev_get_drvdata()) or
> whatever.
>
> But I do understand your point about keeping the drivers away from
> things. Maybe some other pattern is better in this case.

At least from the two examples I have (IDXD and wire2MSI) the per
instance union works perfectly fine and I can't see a reason why
e.g. for your usecase

     cookie = { .ptr = myqueue };

would not work. The meaning of the cookie is domain implementation
defined and only the actual MSI domain and the related users know
whether it's a value or a pointer and what to do with this information. I
named it cookie because from the core MSI code's view it's completely
opaque and, aside from the fact that the allocation function copies the
cookie into msi_desc, the core does not care at all about it. Same for
the domain one which is solely handled by the domain setup code and is
e.g. used to enable the irq chip callbacks to do what they need to do.
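
For reference, a rough sketch of the cookie being discussed and the two
usage styles; the member names match the examples in this thread, but the
exact definition lives in the development branch and may still change:

  union msi_instance_cookie {
          u64     value;  /* plain integer data, e.g. a PASID for IDXD */
          void    *ptr;   /* pointer data, e.g. a driver queue */
  };

  union msi_instance_cookie ik = { .value = pasid   };
  union msi_instance_cookie pk = { .ptr   = myqueue };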

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 19/33] genirq/msi: Provide msi_desc::msi_data
  2022-11-21 19:40         ` Thomas Gleixner
@ 2022-11-22  1:52           ` Jason Gunthorpe
  2022-11-22 20:49             ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Jason Gunthorpe @ 2022-11-22  1:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Mon, Nov 21, 2022 at 08:40:05PM +0100, Thomas Gleixner wrote:
> On Mon, Nov 21 2022 at 13:20, Jason Gunthorpe wrote:
> > On Fri, Nov 18, 2022 at 11:08:55PM +0100, Thomas Gleixner wrote:
> >> Sure I could make both cookies plain u64, but I hate these forced type
> >> casts and the above is simple to handle and understand.
> >
> > I guess, they aren't what I think of as cookies, so I wouldn't make
> > them u64 in the first place.
> >
> > The argument to msi_domain_alloc_irq_at() ideally wants to be a
> > per-domain-type struct so we can follow it around more cleanly. This is
> > C so we have to type erase it as a void * through the core code, but
> > OK.
> 
> When looking at the wire to MSI abomination and also PASID there is no
> real per domain struct. It's plain integer information and I hate to
> store it in a pointer. Especially as the pointer width on 32bit is not
> necessarily sufficient.
> 
> Allocating 8 bytes and tracking them to be freed would be a horrible
> idea.

No, not allocation, just wrap in a stack variable:

  struct foo_bar_domain_data arg = {.pasid = XX};

  msi_domain_alloc_irq_at(..., &arg);

Then there is a great big clue right in the code about who is supposed
to be consuming that opaque argument. Grep the code for
foo_bar_domain_data and you can find the receiving side.
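
To make that grep point concrete, a hedged sketch of what such a
receiving side could look like; foo_bar_domain_data, the callback name
and its signature are purely illustrative and exist nowhere in this
series:

  struct foo_bar_domain_data {
          u32     pasid;
  };

  /* Hypothetical consumer of the type-erased argument. Casting it back
   * to the named struct is what makes producer and consumer greppable. */
  static void foo_bar_consume_alloc_arg(struct msi_desc *desc, void *data)
  {
          const struct foo_bar_domain_data *arg = data;

          /* stash whatever the domain needs from the typed argument */
          desc->data.icookie.value = arg->pasid;
  }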

> At least from the two examples I have (IDXD and wire2MSI) the per
> instance union works perfectly fine and I can't see a reason why
> e.g. for your usecase
> 
>      cookie = { .ptr = myqueue };
> 
> would not work. 

I'm not saying not work, I'm asking about the style choice

Regards,
Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 19/33] genirq/msi: Provide msi_desc::msi_data
  2022-11-22  1:52           ` Jason Gunthorpe
@ 2022-11-22 20:49             ` Thomas Gleixner
  2022-11-23 16:58               ` Jason Gunthorpe
  0 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-22 20:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

Jason,

On Mon, Nov 21 2022 at 21:52, Jason Gunthorpe wrote:
> On Mon, Nov 21, 2022 at 08:40:05PM +0100, Thomas Gleixner wrote:
>> When looking at the wire to MSI abomination and also PASID there is no
>> real per domain struct. It's plain integer information and I hate to
>> store it in a pointer. Especially as the pointer width on 32bit is not
>> necessarily sufficient.
>> 
>> Allocating 8 bytes and tracking them to be freed would be a horrible
>> idea.
>
> No, not allocation, just wrap in a stack variable:
>
>   struct foo_bar_domain_data arg = {.pasid = XX};
>
>   msi_domain_alloc_irq_at(..., &arg);
>
> Then there is a great big clue right in the code about who is supposed
> to be consuming that opaque argument. Grep the code for
> foo_bar_domain_data and you can find the receiving side.

Agreed, for the one or two sane people who will actually create their
data struct. The rest will just hand in a random pointer or a cast
integer, which is pretty useless for finding the counterpart.

>> At least from the two examples I have (IDXD and wire2MSI) the per
>> instance union works perfectly fine and I can't see a reason why
>> e.g. for your usecase
>> 
>>      cookie = { .ptr = myqueue };
>> 
>> would not work. 
>
> I'm not saying not work, I'm asking about the style choice

Sure. The other reason why I made this choice is that in many cases it
spares a callback to convert the pointer into real storage, which is
necessary because the data it points to is on the stack.

Just copying the data into the MSI descriptor solves this nicely without
needing extra magic.

I guess we're nearing bike shed realm by now :) Let's pick one evil and
see how it works out. Coccinelle is there to help us fix it up when it
turns out to be the wrong evil. :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 19/33] genirq/msi: Provide msi_desc::msi_data
  2022-11-22 20:49             ` Thomas Gleixner
@ 2022-11-23 16:58               ` Jason Gunthorpe
  2022-11-23 18:38                 ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Jason Gunthorpe @ 2022-11-23 16:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Tue, Nov 22, 2022 at 09:49:11PM +0100, Thomas Gleixner wrote:

> I guess we're nearing bike shed realm by now :) Let's pick one evil and
> see how it works out. Coccinelle is there to help us fix it up when
> it turns out to be the wrong evil. :)

Sure, it is all changeable

I find your perspective on driver authors as the enemy quite
interesting :)

Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 19/33] genirq/msi: Provide msi_desc::msi_data
  2022-11-23 16:58               ` Jason Gunthorpe
@ 2022-11-23 18:38                 ` Thomas Gleixner
  2022-12-01 12:24                   ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-11-23 18:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Wed, Nov 23 2022 at 12:58, Jason Gunthorpe wrote:
> On Tue, Nov 22, 2022 at 09:49:11PM +0100, Thomas Gleixner wrote:
>> I guess we're nearing bike shed realm by now :) Let's pick one evil and
>> see how it works out. Coccinelle is there to help us fix it up when
>> it turns out to be the wrong evil. :)
>
> Sure, it is all changeable
>
> I find your perspective on driver authors as the enemy quite
> interesting :)

I'm not seeing them as enemies. Just my expectations are rather low by
now :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 19/33] genirq/msi: Provide msi_desc::msi_data
  2022-11-23 18:38                 ` Thomas Gleixner
@ 2022-12-01 12:24                   ` Thomas Gleixner
  2022-12-02  0:35                     ` Jason Gunthorpe
  0 siblings, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-12-01 12:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

Jason!

On Wed, Nov 23 2022 at 19:38, Thomas Gleixner wrote:
> On Wed, Nov 23 2022 at 12:58, Jason Gunthorpe wrote:
>> I find your perspective on driver authors as the enemy quite
>> interesting :)
>
> I'm not seeing them as enemies. Just my expectations are rather low by
> now :)

This made me think about it for a while. Let me follow up on that.

When I set out to add real-time capabilities to the kernel about 20 years
ago, I did a thorough analysis of the kernel design and code base.

It turned out that, aside from well-encapsulated infrastructure such as
mm, vfs, the scheduler and the network core, quite a lot of the rest
consisted of blatant layering violations held together with duct tape,
super glue and haywire.

It was immediately clear to me that this would need a lot of
consolidation and cleanup work to get even close to the point where RT
becomes feasible as an integral part of the kernel. Not only that: I
also realized that a continuation of this model would end up in a
maintenance nightmare sooner rather than later.

The other people interested in RT and I estimated back then that it
would take 5-10 years to get this done.

Boy, we were young and naive back then, completely underestimating the
effort required. Obviously we were also underestimating the concurrent
influx of new stuff.

Just to give you an example: our early experiments with substituting
spinlocks were just the start of the horrors. Instead of working on the
actual substitution mechanisms and the other required modifications, we
spent a vast amount of our time chasing deadlocks all over the place.
My main test machine did not have a single device driver which was
correct and worked out of the box. What's worse is that we had to debate
with some of the driver people about the correctness of our locking
analysis and fight to get stuff fixed.

This ended in writing and integrating lockdep, which has thankfully
taken this burden off our plate.

When I started to look into interrupt handling to add support for
threaded interrupts, which are a fundamental prerequisite for RT, the
next nightmare started to unfold.

The "generic" core code was a skeleton and everything real was
implemented in architecture specific code in completely incompatible
ways. It was not even possible to change common data structures without
breaking the world.  What was even worse, drivers fiddled in the
interrupt descriptors just to scratch an itch.

What I learned pretty fast is that most driver writers try to work
around shortcomings in common infrastructure instead of tackling the
problem at the root or talking to the developers/maintainers of that
infrastructure.

The consequence of that is: if you want to change core infrastructure
you end up mopping up the driver tree in order not to break things all
over the place. There are clearly better ways to spend your time.

So I started to encapsulate things more strictly - admittedly to make my
own life easier. But at the same time I always tried hard to make these
encapsulations easy to use, to provide common infrastructure in order to
replace boilerplate code and to help with resource management, which is
one of the common problems in driver code. I'm also quite confident that
I carefully listened to the needs of driver developers and I think the
whole discussion about IMS last year is a good example of that. I
surely have opinions, but who doesn't?

So no, I'm not seeing driver writers as enemies. I'm just accepting the
reality that quite a few of the drivers are written in "get it out the
door" mode. I'm well aware that there are other folks who stay around for
a long time and do proper engineering and maintenance, but that's sadly
the minority.

Being responsible for core infrastructure is an interesting challenge,
especially with the zoo of legacy to keep alive and the knowledge that
you can break the world with a trivial and obviously "correct"
change. Been there, done that. :)

Thanks,

        Thomas

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 19/33] genirq/msi: Provide msi_desc::msi_data
  2022-12-01 12:24                   ` Thomas Gleixner
@ 2022-12-02  0:35                     ` Jason Gunthorpe
  2022-12-02  2:14                       ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Jason Gunthorpe @ 2022-12-02  0:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

On Thu, Dec 01, 2022 at 01:24:03PM +0100, Thomas Gleixner wrote:
> Jason!
> 
> On Wed, Nov 23 2022 at 19:38, Thomas Gleixner wrote:
> > On Wed, Nov 23 2022 at 12:58, Jason Gunthorpe wrote:
> >> I find your perspective on driver authors as the enemy quite
> >> interesting :)
> >
> > I'm not seeing them as enemies. Just my expectations are rather low by
> > now :)
> 
> This made me think about it for a while. Let me follow up on that.

I didn't intend to pick such a harsh word, I get your point.

In lands like netdev/rdma breaking a driver makes people unhappy, but
we can get away with it because most people don't have systems that
won't boot if you break those drivers. The people responsible will
eventually test a new kernel, see the warning/lockdep/etc. output and
can debug and fix the problem without a huge hassle.

In the platform code, if someone's machine boots to a black, dead
screen you are much less likely to get a helpful person able to fix it;
more likely an end user is unhappy that their kernel is busted,
especially since the chance of a bug getting past testing basically
forces it to be on some obscure configuration.

Due to this difference I've come to much more appreciate putting
stronger guard rails and a clearer design on the lower-level code.
Supporting the long tail of rare platforms is difficult.

> What I learned pretty fast is that most driver writers try to work
> around shortcomings in common infrastructure instead of tackling the
> problem at the root or talking to the developers/maintainers of that
> infrastructure.

Yes, few people can tackle this stuff, and there are often interesting
headwinds.

Regards,
Jason

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 19/33] genirq/msi: Provide msi_desc::msi_data
  2022-12-02  0:35                     ` Jason Gunthorpe
@ 2022-12-02  2:14                       ` Thomas Gleixner
  0 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-12-02  2:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman, Dave Jiang,
	Alex Williamson, Kevin Tian, Dan Williams, Logan Gunthorpe,
	Ashok Raj, Jon Mason, Allen Hubbe, Ahmed S. Darwish,
	Reinette Chatre

Jason!

On Thu, Dec 01 2022 at 20:35, Jason Gunthorpe wrote:
> On Thu, Dec 01, 2022 at 01:24:03PM +0100, Thomas Gleixner wrote:
>> On Wed, Nov 23 2022 at 19:38, Thomas Gleixner wrote:
>> > On Wed, Nov 23 2022 at 12:58, Jason Gunthorpe wrote:
>> >> I find your perspective on driver authors as the enemy quite
>> >> interesting :)
>> >
>> > I'm not seeing them as enemies. Just my expectations are rather low by
>> > now :)
>> 
>> This made me think about it for a while. Let me follow up on that.
>
> I didn't intend to pick such a harsh word, I get your point.

I didn't take that as offence at all.

It's a good thing to be forced from time to time to reflect on what
I'm doing and why I'm doing it, to make sure that I'm not completely off
track.

Thanks,

        Thomas

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 33/33] irqchip: Add IDXD Interrupt Message Store driver
  2022-11-11 13:59 ` [patch 33/33] irqchip: Add IDXD Interrupt Message Store driver Thomas Gleixner
@ 2022-12-02 17:55   ` Reinette Chatre
  2022-12-02 19:51     ` Thomas Gleixner
  0 siblings, 1 reply; 86+ messages in thread
From: Reinette Chatre @ 2022-12-02 17:55 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish

Hi Thomas,

On 11/11/2022 5:59 AM, Thomas Gleixner wrote:
> Provide a driver for the Intel IDXD IMS implementation. The implementation
> uses a large message store array in device memory.
> 
> The IMS domain implementation is minimal and just provides the required
> irq_chip callbacks and one domain callback which prepares the MSI
> descriptor which is allocated by the core for easy usage in the irq_chip
> callbacks.
> 
> The necessary iobase is stored in the irqdomain and the PASID which is
> required for operation is handed in via msi_dev_cookie in the allocation
> function.

The use of PASID is optional for dedicated workqueues. Could this be
supported to let the irqchip support all scenarios? Since the cookie is
always provided I was wondering if an invalid PASID can be used to let
the driver disable PASID? Please see the delta snippet below in which I
primarily made such a change, but added a few more changes for
consideration.

Summary of changes:
* Use provided invalid PASID to disable PASID for the interrupt.
* Use bitmask to ensure that the cookie only contains a valid PASID.
* Modify header comment to fix typo.
* Modify header comment to reflect driver usage of macro.

With the first change I am able to test IMS on the host using devmsi-v2-part3
of the development branch. I did try to update to the most recent development
to confirm all is well but version devmsi-v3.1-part3 behaves differently
in that pci_ims_alloc_irq() returns successfully but the returned
virq is 0. This triggers a problem when request_threaded_irq() runs and
reports:
genirq: Flags mismatch irq 0. 00000000 (idxd-portal) vs. 00015a00 (timer)

Thank you very much

Reinette

---
 drivers/irqchip/irq-pci-intel-idxd.c       | 20 ++++++++++++++------
 include/linux/irqchip/irq-pci-intel-idxd.h |  4 ++--
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/drivers/irqchip/irq-pci-intel-idxd.c b/drivers/irqchip/irq-pci-intel-idxd.c
index d33c32787ad5..1b49c884bd85 100644
--- a/drivers/irqchip/irq-pci-intel-idxd.c
+++ b/drivers/irqchip/irq-pci-intel-idxd.c
@@ -4,6 +4,7 @@
  * interrupt message store (IMS).
  */
 #include <linux/device.h>
+#include <linux/ioasid.h>
 #include <linux/irq.h>
 #include <linux/irqdomain.h>
 #include <linux/msi.h>
@@ -33,6 +34,8 @@ struct ims_slot {
 #define CTRL_PASID_ENABLE	BIT(3)
 /* Position of PASID.LSB in the control word */
 #define CTRL_PASID_SHIFT	12
+/* Valid PASID is 20 bits */
+#define CTRL_PASID_VALID	GENMASK(19, 0)
 
 static inline void iowrite32_and_flush(u32 value, void __iomem *addr)
 {
@@ -93,12 +96,17 @@ static void idxd_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
 	/* Mask the interrupt for paranoia sake */
 	iowrite32_and_flush(CTRL_VECTOR_MASKBIT, &slot->ctrl);
 
-	/*
-	 * The caller provided PASID. Shift it to the proper position
-	 * and set the PASID enable bit.
-	 */
-	desc->data.icookie.value <<= CTRL_PASID_SHIFT;
-	desc->data.icookie.value |= CTRL_PASID_ENABLE;
+	if (pasid_valid((ioasid_t)desc->data.icookie.value)) {
+		/*
+		 * The caller provided PASID. Shift it to the proper position
+		 * and set the PASID enable bit.
+		 */
+		desc->data.icookie.value &= CTRL_PASID_VALID;
+		desc->data.icookie.value <<= CTRL_PASID_SHIFT;
+		desc->data.icookie.value |= CTRL_PASID_ENABLE;
+	} else {
+		desc->data.icookie.value = 0;
+	}
 
 	arg->hwirq = desc->msi_index;
 }
diff --git a/include/linux/irqchip/irq-pci-intel-idxd.h b/include/linux/irqchip/irq-pci-intel-idxd.h
index d62ef5b3285c..48c73bffbb5d 100644
--- a/include/linux/irqchip/irq-pci-intel-idxd.h
+++ b/include/linux/irqchip/irq-pci-intel-idxd.h
@@ -9,8 +9,8 @@
 #include <linux/types.h>
 
 /*
- * Conveniance macro to wrap the PASID for interrupt allocation
- * via pci_ims_alloc_irq(pdev, INTEL_IDXD_DEV_COOKIE(pasid))
+ * Convenience macro to wrap the PASID for interrupt allocation
+ * via pci_ims_alloc_irq(pdev, &INTEL_IDXD_DEV_COOKIE(pasid))
  */
 #define INTEL_IDXD_DEV_COOKIE(pasid)	(union msi_instance_cookie) { .value = (pasid), }
 
---   
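
As a usage sketch only, not part of the snippet above: with this change
in place a driver could pass INVALID_IOASID through the cookie to get an
IMS interrupt without PASID. The function name below is made up, the call
mirrors the two-argument form in the header comment above, and the error
handling assumes the msi_map convention of this series where a failed
allocation is reported via a negative index:

  #include <linux/ioasid.h>
  #include <linux/msi.h>
  #include <linux/pci.h>
  #include <linux/irqchip/irq-pci-intel-idxd.h>

  static int idxd_alloc_portal_irq(struct pci_dev *pdev, ioasid_t pasid)
  {
          /* pasid == INVALID_IOASID keeps PASID disabled in the slot */
          union msi_instance_cookie icookie = INTEL_IDXD_DEV_COOKIE(pasid);
          struct msi_map map = pci_ims_alloc_irq(pdev, &icookie);

          if (map.index < 0)
                  return map.index;

          /* Linux interrupt number for request_threaded_irq() etc. */
          return map.virq;
  }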

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [patch 33/33] irqchip: Add IDXD Interrupt Message Store driver
  2022-12-02 17:55   ` Reinette Chatre
@ 2022-12-02 19:51     ` Thomas Gleixner
  2022-12-02 21:16       ` Reinette Chatre
  2022-12-05 15:20       ` Thomas Gleixner
  0 siblings, 2 replies; 86+ messages in thread
From: Thomas Gleixner @ 2022-12-02 19:51 UTC (permalink / raw)
  To: Reinette Chatre, LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish

Reinette!

On Fri, Dec 02 2022 at 09:55, Reinette Chatre wrote:
> On 11/11/2022 5:59 AM, Thomas Gleixner wrote:
>> The necessary iobase is stored in the irqdomain and the PASID which is
>> required for operation is handed in via msi_dev_cookie in the allocation
>> function.
>
> The use of PASID is optional for dedicated workqueues. Could this be
> supported to let the irqchip support all scenarios?

Sure. I wrote this thing mostly out of thin air based on some ancient
PoC code. :)

> Since the cookie is always provided I was wondering if an invalid
> PASID can be used to let the driver disable PASID? Please see the
> delta snippet below in which I primarily made such a change, but added
> a few more changes for consideration.

Let me check.

> With the first change I am able to test IMS on the host using devmsi-v2-part3
> of the development branch. I did try to update to the most recent development
> to confirm all is well but version devmsi-v3.1-part3 behaves differently
> in that pci_ims_alloc_irq() returns successfully but the returned
> virq is 0. This triggers a problem when request_threaded_irq() runs and
> reports:
> genirq: Flags mismatch irq 0. 00000000 (idxd-portal) vs. 00015a00 (timer)

Bah. Let me figure out what I fat-fingered there.

> @@ -33,6 +34,8 @@ struct ims_slot {
>  #define CTRL_PASID_ENABLE	BIT(3)
>  /* Position of PASID.LSB in the control word */
>  #define CTRL_PASID_SHIFT	12
> +/* Valid PASID is 20 bits */
> +#define CTRL_PASID_VALID	GENMASK(19, 0)
>  
>  static inline void iowrite32_and_flush(u32 value, void __iomem *addr)
>  {
> @@ -93,12 +96,17 @@ static void idxd_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
>  	/* Mask the interrupt for paranoia sake */
>  	iowrite32_and_flush(CTRL_VECTOR_MASKBIT, &slot->ctrl);
>  
> -	/*
> -	 * The caller provided PASID. Shift it to the proper position
> -	 * and set the PASID enable bit.
> -	 */
> -	desc->data.icookie.value <<= CTRL_PASID_SHIFT;
> -	desc->data.icookie.value |= CTRL_PASID_ENABLE;
> +	if (pasid_valid((ioasid_t)desc->data.icookie.value)) {
> +		/*
> +		 * The caller provided PASID. Shift it to the proper position
> +		 * and set the PASID enable bit.
> +		 */
> +		desc->data.icookie.value &= CTRL_PASID_VALID;
> +		desc->data.icookie.value <<= CTRL_PASID_SHIFT;
> +		desc->data.icookie.value |= CTRL_PASID_ENABLE;
> +	} else {
> +		desc->data.icookie.value = 0;
> +	}

Looks about right. But that needs some sanity measures at the call sites
so that we don't end up with an invalid PASID in cases where a valid
PASID is truly required.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 33/33] irqchip: Add IDXD Interrupt Message Store driver
  2022-12-02 19:51     ` Thomas Gleixner
@ 2022-12-02 21:16       ` Reinette Chatre
  2022-12-05 15:20       ` Thomas Gleixner
  1 sibling, 0 replies; 86+ messages in thread
From: Reinette Chatre @ 2022-12-02 21:16 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish

Hi Thomas,

On 12/2/2022 11:51 AM, Thomas Gleixner wrote:
> On Fri, Dec 02 2022 at 09:55, Reinette Chatre wrote:
>> On 11/11/2022 5:59 AM, Thomas Gleixner wrote:

>> @@ -33,6 +34,8 @@ struct ims_slot {
>>  #define CTRL_PASID_ENABLE	BIT(3)
>>  /* Position of PASID.LSB in the control word */
>>  #define CTRL_PASID_SHIFT	12
>> +/* Valid PASID is 20 bits */
>> +#define CTRL_PASID_VALID	GENMASK(19, 0)
>>  
>>  static inline void iowrite32_and_flush(u32 value, void __iomem *addr)
>>  {
>> @@ -93,12 +96,17 @@ static void idxd_prepare_desc(struct irq_domain *domain, msi_alloc_info_t *arg,
>>  	/* Mask the interrupt for paranoia sake */
>>  	iowrite32_and_flush(CTRL_VECTOR_MASKBIT, &slot->ctrl);
>>  
>> -	/*
>> -	 * The caller provided PASID. Shift it to the proper position
>> -	 * and set the PASID enable bit.
>> -	 */
>> -	desc->data.icookie.value <<= CTRL_PASID_SHIFT;
>> -	desc->data.icookie.value |= CTRL_PASID_ENABLE;
>> +	if (pasid_valid((ioasid_t)desc->data.icookie.value)) {
>> +		/*
>> +		 * The caller provided PASID. Shift it to the proper position
>> +		 * and set the PASID enable bit.
>> +		 */
>> +		desc->data.icookie.value &= CTRL_PASID_VALID;
>> +		desc->data.icookie.value <<= CTRL_PASID_SHIFT;
>> +		desc->data.icookie.value |= CTRL_PASID_ENABLE;
>> +	} else {
>> +		desc->data.icookie.value = 0;
>> +	}
> 
> Looks about right. But that needs some sanity measures at the call sites
> so that we don't end up with an invalid PASID in cases where a valid
> PASID is truly required.

I will take a closer look at this. The current call site explicitly
sets an invalid PASID when PASID use is disabled. I still need to do
testing with a valid PASID to learn those flows.

Reinette

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 33/33] irqchip: Add IDXD Interrupt Message Store driver
  2022-12-02 19:51     ` Thomas Gleixner
  2022-12-02 21:16       ` Reinette Chatre
@ 2022-12-05 15:20       ` Thomas Gleixner
  2022-12-05 17:19         ` Reinette Chatre
  1 sibling, 1 reply; 86+ messages in thread
From: Thomas Gleixner @ 2022-12-05 15:20 UTC (permalink / raw)
  To: Reinette Chatre, LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish

On Fri, Dec 02 2022 at 20:51, Thomas Gleixner wrote:
> On Fri, Dec 02 2022 at 09:55, Reinette Chatre wrote:
>> With the first change I am able to test IMS on the host using devmsi-v2-part3
>> of the development branch. I did try to update to the most recent development
>> to confirm all is well but version devmsi-v3.1-part3 behaves differently
>> in that pci_ims_alloc_irq() returns successfully but the returned
>> virq is 0. This triggers a problem when request_threaded_irq() runs and
>> reports:
>> genirq: Flags mismatch irq 0. 00000000 (idxd-portal) vs. 00015a00 (timer)
>
> Bah. Let me figure out what I fat-fingered there.

tag devmsi-v3.2-part3 works again.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [patch 33/33] irqchip: Add IDXD Interrupt Message Store driver
  2022-12-05 15:20       ` Thomas Gleixner
@ 2022-12-05 17:19         ` Reinette Chatre
  0 siblings, 0 replies; 86+ messages in thread
From: Reinette Chatre @ 2022-12-05 17:19 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Joerg Roedel, Will Deacon, linux-pci, Bjorn Helgaas,
	Lorenzo Pieralisi, Marc Zyngier, Greg Kroah-Hartman,
	Jason Gunthorpe, Dave Jiang, Alex Williamson, Kevin Tian,
	Dan Williams, Logan Gunthorpe, Ashok Raj, Jon Mason, Allen Hubbe,
	Ahmed S. Darwish

Hi Thomas,

On 12/5/2022 7:20 AM, Thomas Gleixner wrote:
> On Fri, Dec 02 2022 at 20:51, Thomas Gleixner wrote:
>> On Fri, Dec 02 2022 at 09:55, Reinette Chatre wrote:
>>> With the first change I am able to test IMS on the host using devmsi-v2-part3
>>> of the development branch. I did try to update to the most recent development
>>> to confirm all is well but version devmsi-v3.1-part3 behaves differently
>>> in that pci_ims_alloc_irq() returns successfully but the returned
>>> virq is 0. This triggers a problem when request_threaded_irq() runs and
>>> reports:
>>> genirq: Flags mismatch irq 0. 00000000 (idxd-portal) vs. 00015a00 (timer)
>>
>> Bah. Let me figure out what I fat-fingered there.
> 
> tag devmsi-v3.2-part3 works again.

Thank you very much.

This tag is not yet available, but I can confirm that the current tip of
devmsi, 6bd4ee6cb126 ("irqchip: Add IDXD Interrupt Message Store driver"),
combined with the earlier irqchip driver delta snippet, passes the
"dedicated kernel work queue using host IMS" tests.

Reinette

^ permalink raw reply	[flat|nested] 86+ messages in thread

end of thread, other threads:[~2022-12-05 17:21 UTC | newest]

Thread overview: 86+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-11 13:58 [patch 00/33] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation Thomas Gleixner
2022-11-11 13:58 ` [patch 01/33] genirq/msi: Rearrange MSI domain flags Thomas Gleixner
2022-11-16 18:41   ` Jason Gunthorpe
2022-11-11 13:58 ` [patch 02/33] genirq/msi: Provide struct msi_parent_ops Thomas Gleixner
2022-11-16 18:57   ` Jason Gunthorpe
2022-11-17 15:58     ` Thomas Gleixner
2022-11-18 13:52       ` Thomas Gleixner
2022-11-11 13:58 ` [patch 03/33] genirq/msi: Provide data structs for per device domains Thomas Gleixner
2022-11-11 13:58 ` [patch 04/33] genirq/msi: Add size info to struct msi_domain_info Thomas Gleixner
2022-11-11 13:58 ` [patch 05/33] genirq/msi: Split msi_create_irq_domain() Thomas Gleixner
2022-11-11 13:58 ` [patch 06/33] genirq/irqdomain: Add irq_domain::dev for per device MSI domains Thomas Gleixner
2022-11-11 13:58 ` [patch 07/33] genirq/msi: Provide msi_create/free_device_irq_domain() Thomas Gleixner
2022-11-11 13:58 ` [patch 08/33] genirq/msi: Provide msi_match_device_domain() Thomas Gleixner
2022-11-11 13:58 ` [patch 09/33] genirq/msi: Add range checking to msi_insert_desc() Thomas Gleixner
2022-11-11 13:58 ` [patch 10/33] PCI/MSI: Split __pci_write_msi_msg() Thomas Gleixner
2022-11-16 20:10   ` Bjorn Helgaas
2022-11-11 13:58 ` [patch 11/33] genirq/msi: Provide BUS_DEVICE_PCI_MSI[X] Thomas Gleixner
2022-11-11 13:58 ` [patch 12/33] PCI/MSI: Add support for per device MSI[X] domains Thomas Gleixner
2022-11-16 19:13   ` Jason Gunthorpe
2022-11-16 22:38     ` Thomas Gleixner
2022-11-17  0:22       ` Jason Gunthorpe
2022-11-17  8:45         ` Thomas Gleixner
2022-11-16 20:22   ` Bjorn Helgaas
2022-11-11 13:58 ` [patch 13/33] x86/apic/vector: Provide MSI parent domain Thomas Gleixner
2022-11-16 19:18   ` Jason Gunthorpe
2022-11-17 20:06     ` Thomas Gleixner
2022-11-11 13:58 ` [patch 14/33] PCI/MSI: Remove unused pci_dev_has_special_msi_domain() Thomas Gleixner
2022-11-16 20:13   ` Bjorn Helgaas
2022-11-11 13:58 ` [patch 15/33] iommu/vt-d: Switch to MSI parent domains Thomas Gleixner
2022-11-11 13:58 ` [patch 16/33] iommu/amd: Switch to MSI base domains Thomas Gleixner
2022-11-11 13:58 ` [patch 17/33] x86/apic/msi: Remove arch_create_remap_msi_irq_domain() Thomas Gleixner
2022-11-11 13:58 ` [patch 18/33] genirq/msi: Provide struct msi_map Thomas Gleixner
2022-11-11 13:58 ` [patch 19/33] genirq/msi: Provide msi_desc::msi_data Thomas Gleixner
2022-11-16 19:28   ` Jason Gunthorpe
2022-11-17  8:48     ` Thomas Gleixner
2022-11-17 13:33       ` Jason Gunthorpe
2022-11-18 22:08     ` Thomas Gleixner
2022-11-21 17:20       ` Jason Gunthorpe
2022-11-21 19:40         ` Thomas Gleixner
2022-11-22  1:52           ` Jason Gunthorpe
2022-11-22 20:49             ` Thomas Gleixner
2022-11-23 16:58               ` Jason Gunthorpe
2022-11-23 18:38                 ` Thomas Gleixner
2022-12-01 12:24                   ` Thomas Gleixner
2022-12-02  0:35                     ` Jason Gunthorpe
2022-12-02  2:14                       ` Thomas Gleixner
2022-11-11 13:58 ` [patch 20/33] genirq/msi: Provide msi_domain_ops::prepare_desc() Thomas Gleixner
2022-11-11 13:58 ` [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at() Thomas Gleixner
2022-11-16 19:36   ` Jason Gunthorpe
2022-11-17  9:40     ` Thomas Gleixner
2022-11-17 23:33   ` Reinette Chatre
2022-11-18  0:58     ` Thomas Gleixner
2022-11-18  9:15       ` Thomas Gleixner
2022-11-18 11:05         ` Thomas Gleixner
2022-11-18 18:18           ` Reinette Chatre
2022-11-18 22:31             ` Thomas Gleixner
2022-11-18 22:59               ` Reinette Chatre
2022-11-19  0:19                 ` Reinette Chatre
2022-11-11 13:58 ` [patch 22/33] genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN Thomas Gleixner
2022-11-16 19:36   ` Jason Gunthorpe
2022-11-11 13:58 ` [patch 23/33] PCI/MSI: Split MSIX descriptor setup Thomas Gleixner
2022-11-16 20:13   ` Bjorn Helgaas
2022-11-11 13:58 ` [patch 24/33] PCI/MSI: Provide prepare_desc() MSI domain op Thomas Gleixner
2022-11-16 19:40   ` Jason Gunthorpe
2022-11-16 20:26   ` Bjorn Helgaas
2022-11-16 22:42     ` Thomas Gleixner
2022-11-11 13:58 ` [patch 25/33] PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X Thomas Gleixner
2022-11-16 20:19   ` Bjorn Helgaas
2022-11-16 22:43     ` Thomas Gleixner
2022-11-11 13:58 ` [patch 26/33] x86/apic/msi: Enable MSI_FLAG_PCI_MSIX_ALLOC_DYN Thomas Gleixner
2022-11-11 13:58 ` [patch 27/33] genirq/msi: Provide constants for PCI/IMS support Thomas Gleixner
2022-11-16 19:54   ` Jason Gunthorpe
2022-11-17  9:46     ` Thomas Gleixner
2022-11-11 13:58 ` [patch 28/33] PCI/MSI: Provide IMS (Interrupt Message Store) support Thomas Gleixner
2022-11-16 20:17   ` Bjorn Helgaas
2022-11-11 13:58 ` [patch 29/33] PCI/MSI: Provide pci_ims_alloc/free_irq() Thomas Gleixner
2022-11-16 20:14   ` Bjorn Helgaas
2022-11-11 13:58 ` [patch 30/33] x86/apic/msi: Enable PCI/IMS Thomas Gleixner
2022-11-11 13:59 ` [patch 31/33] iommu/vt-d: " Thomas Gleixner
2022-11-11 13:59 ` [patch 32/33] iommu/amd: " Thomas Gleixner
2022-11-11 13:59 ` [patch 33/33] irqchip: Add IDXD Interrupt Message Store driver Thomas Gleixner
2022-12-02 17:55   ` Reinette Chatre
2022-12-02 19:51     ` Thomas Gleixner
2022-12-02 21:16       ` Reinette Chatre
2022-12-05 15:20       ` Thomas Gleixner
2022-12-05 17:19         ` Reinette Chatre
