dmaengine Archive on
 help / color / Atom feed
From: Thomas Gleixner <>
To: "Dey\, Megha" <>,
	Jason Gunthorpe <>,
	"gregkh\" <>
Cc: Marc Zyngier <>, "Jiang\,
	Dave" <>,
	"vkoul\" <>,
	"bhelgaas\" <>,
	"rafael\" <>,
	"hpa\" <>,
	"alex.williamson\" <>, "Pan\,
	Jacob jun" <>, "Raj\,
	Ashok" <>, "Liu\, Yi L" <>,
	"Lu\, Baolu" <>, "Tian\,
	Kevin" <>, "Kumar\,
	Sanjay K" <>, "Luck\,
	Tony" <>, "Lin\, Jing" <>,
	"Williams\, Dan J" <>,
	"kwankhede\" <>,
	"eric.auger\" <>,
	"parav\" <>, "Hansen\,
	Dave" <>,
	"netanelg\" <>,
	"shahafs\" <>,
	"yan.y.zhao\" <>,
	"pbonzini\" <>, "Ortiz\,
	Samuel" <>, "Hossain\,
	Mona" <>,
	"dmaengine\" <>,
	"linux-kernel\" <>,
	"x86\" <>,
	"linux-pci\" <>,
	"kvm\" <>
Subject: Re: [PATCH RFC v2 02/18] irq/dev-msi: Add support for a new DEV_MSI irq domain
Date: Sat, 08 Aug 2020 21:47:41 +0200
Message-ID: <> (raw)
In-Reply-To: <>


"Dey, Megha" <> writes:
> On 8/7/2020 9:47 AM, Thomas Gleixner wrote:
>> I'm all for sharing code and making the life of driver writers simple
>> because that makes my life simple as well, but not by creating a layer
>> at the wrong level and then hacking it into submission until it finally
>> collapses.
>> Designing the infrastructure following the clear layering rules of
>> hierarchical domains so it works for IMS and also replaces the platform
>> MSI hack is the only sane way to go forward, not the other way round.
>  From what I've gathered, I need to:
> 1. Get rid of the mantra that "IMS" is an extension of platform-msi.
> 2. Make this new infra devoid of any platform-msi references

See below.

> 3. Come up with a ground up approach which adheres to the layering 
> constraints of the IRQ subsystem

Yes. It's something which can be used by all devices which have:

   1) A device specific irq chip implementation including a msi write function
   2) Device specific resource management (slots in the IMS case)

The infrastructure you need is basically a wrapper around the core MSI
domain (similar to PCI, platform-MSI etc,) which provides the specific
functionality to handle the above.

> 4. Have common code (drivers/irqchip maybe??) where we put in all the 
> generic ims-specific bits for the IRQ chip and domain
> which can be used by all device drivers belonging to this "IMS"class.

Yes, you can provide a common implementation for devices which share the
same irq chip and domain (slot management functionality)

> 5. Have the device driver do the rest:
>      create the chip/domain (one chip/domain per device?)
>      provide device specific callbacks for masking, unmasking, write
>  message

Correct, but you don't need any magic new data structures for that, the
existing msi_domain_info/msi_domain_ops and related structures are
either sufficient or can be extended when necessary.

So for the IDXD case you need:

  1) An irq chip with mask/unmask callbacks and a write msg function
  2) A slot allocation or association function and their 'free'
     counterpart (irq_domain_ops)

The function and struct pointers go into the appropriate
msi_info/msi_ops structures along with the correct flags to set up the
whole thing and then the infrastructure creates your domain, fills in
the shared functions and sets the whole thing up.

That's all what a device driver needs to provide, i.e. stick the device
specific functionality into right data structures and let the common
infrastructure deal with it. The rest just works and the device specific
functions are invoked from the right places when required.

> So from the hierarchical domain standpoint, we will have:
> - For DSA device: vector->intel-IR->IDXD
> - For Jason's device: root domain-> domain A-> Jason's device's IRQ domain
> - For any other intel IMS device in the future which
>      does not require interrupt remapping: vector->new device IRQ domain
>      requires interrupt remapping: vector->intel-IR->new device IRQ 
> domain (i.e. create a new domain even though IDXD is already present?)

What's special about IDXD? It's just one specific implementation of IMS
and any other device implementing IMS is completely independent and as
documented in the specification the IMS slot management and therefore
the mask/unmask functionality can and will be completely different. IDXD
has a storage array with slots, Jason's hardware puts the IMS slot into
the queue storage.

It does not matter whether a device comes from Intel or any other vendor,
it does neither matter whether the device works with direct vector
delivery or interrupt remapping.

IDXD is not any different from any other IMS capable device when you
look at it from the interrupt hierarchy. It's either:

     vector -> IR -> device
     vector -> device

The only point where this is differentiated is when the irq domain is
created. Anything else just falls into place.

To answer Jason's question: No, the parent is never the PCI/MSI irq
domain because that sits at the same level as that device
domain. Remember the scheme:

   vector --- DMAR-MSI
	  |-- ....
	  |-- IR-0 --- IO/APIC-0
	  |        | 
	  |        |-- IO/APIC-1
	  |        |
	  |        |-- PCI/MSI-0
	  |        |
	  |        |-- HPET/MSI-0
	  |        |
	  |        |-- DEV-A/MSI-0
	  |        |-- DEV-A/MSI-1
	  |        |-- DEV-B/MSI-2
	  |-- IR-1 --- PCI/MSI-1
	  |        |
	  |        |-- DEV-C/MSI-3

The PCI/MSI domain(s) are dealing solely with PCI standard compliant
MSI/MSI-X. IMS or similar (platform-MSI being one variant) sit at the
same level as the PCI/MSI domains.

Why? It's how the hardware operates.

The PCI/MSI "irq chip" is configured by the PCI/MSI domain level and it
sends its message to the interrupt parent in the hierarchy, i.e. either
to the Interrupt Remap unit or to the configured vector of the target

IMS does not send it to some magic PCI layer first at least not at the
conceptual level. The fact that the message is transported by PCIe does
not change that at all. PCIe in that case is solely the transport, but
the "irq chip" at the PCI/MSI level of the device is not involved at
all. If it were that would be a different story.

So now you might ask why we have a single PCI/MSI domain per IR unit and
why I want seperate IMS domains.

The answer is in the hardware again. PCI/MSI is uniform accross devices
so the irq chip and all of the domain functionality can be shared. But
then we have two PCI/MSI domains in the above example because again the
hardware has one connected to IR unit 0 and the other to IR unit 1.
IR 0 and IR 1 manage different resources (remap tables) so PCI/MSI-0
depends on IR-0 and PCI/MSI-1 on IR-1 which is reflected in the
parent/child relation ship of the domains.

There is another reason why we can spawn a single PCI/MSI domain per
root complex / IR unit. The PCI/MSI domains are not doing any resource
management at all. The resulting message is created from the allocated
vector (direct CPU delivery) or from the allocated Interrupt remapping
slot information. The domain just deals with the logic required to
handle PCI/MSI(X) and the necessary resources are provided by the parent
interrupt layers.

IMS is different. It needs device specific resource management to
allocate an IMS slot which is clearly part of the "irq chip" management
layer, aka. irq domain. If the IMS slot management would happen in a
global or per IR unit table and as a consequence the management, layout,
mask/unmask operations would be uniform then an IMS domain per system or
IR unit would be the right choice, but that's not how the hardware is
specified and implemented.

Now coming back to platform MSI. The way it looks is:

                                 |-- PLATFORM-DEVICE-MSI-1

PLATFORM-MSI is a common resource management which also provides a
shared interrupt chip which operates at the PLATFORM-MSI level with one

  The irq_msi_write_msg() callback has an indirection so the actual
  devices can provide their device specific msi_write_msg() function.

That's a borderline abuse of the hierarchy, but it makes sense to some
extent as the actual PLATFORM-MSI domain is a truly shared resource and
the only device specific functionality required is the message
write. But that message write is not something which has it's own
resource management, it's just a non uniform storage accessor. IOW, the
underlying PLATFORM-MSI domain does all resource management including
message creation and the quirk allows to write the message in the device
specific way. Not that I love it, but ...

That is the main difference between platform MSI and IMS. IMS is
completely non uniform and the devices do not share any common resource
or chip functionality. Each device has its own message store management,
slot allocation/assignment and a device specifc interrupt chip
functionality which goes way beyond the nasty write msg quirk.

> What I still don't understand fully is what if all the IMS devices
> need the same domain ops and chip callbacks, we will be creating
> various instances of the same IRQ chip and domain right? Is that ok?

Why would it be not ok? Are you really worried about a few hundred bytes
of memory required for this? 

Sharing an instance only makes sense if the instance handles a shared or
uniform resource space, which is clearly not the case with IMS.

We create several PCI/MSI domains and several IO/APIC domains on larger
systems. They all share the code, but they are dealing with seperate
resources so they have seperate storage. 

> Currently the creation of the IRQ domain happens at the IR level so that 
> we can reuse the same domain but if it advisable to have a per device 
> interrupt domain, I will shift this to the device driver.

Again. Look at the layering. What you created now is a pseudo shared
domain which needs

   1) An indirection layer for providing device specific functions

   2) An extra allocation layer in the device specific driver to assign
      IMS slots completely outside of the domain allocation mechanism.

   In other words you try to make things which are neither uniform nor
   share a resource space look the same way. That's the "all I have is a
   hammer so everything is a nail" approach. That never worked out well.

With a per device domain/chip approach you get one consistent domain
per device which provides

   1) The device specific resource management (i.e. slot allocation
      becomes part of the irq domain operations)

   2) The device specific irq chip functions at the correct point in the
      layering without the horrid indirections

   3) Consolidated data storage at the device level where the actual
      data is managed.

   This is of course sharing as much code as possible with the MSI core

   As a side effect any extension of this be it on the domain or the irq
   chip side is just a matter of adding the functionality to that
   particular incarnation and not by having yet another indirection
   logic at the wrong place.

The price you pay is a bit of memory but you get a clean layering and
seperation of functionality as a reward. The amount of code in the
actual IMS device driver is not going to be much more than with the
approach you have now.

The infrastructure itself is not more than a thin wrapper around the
existing msi domain infrastructure and might even share code with



  parent reply index

Thread overview: 97+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-21 16:02 [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver Dave Jiang
2020-07-21 16:02 ` [PATCH RFC v2 01/18] platform-msi: Introduce platform_msi_ops Dave Jiang
2020-07-21 16:02 ` [PATCH RFC v2 02/18] irq/dev-msi: Add support for a new DEV_MSI irq domain Dave Jiang
2020-07-21 16:13   ` Jason Gunthorpe
2020-07-22 16:50     ` Dey, Megha
2020-07-22 18:52   ` Marc Zyngier
2020-07-22 19:59     ` Jason Gunthorpe
2020-07-23  8:51       ` Marc Zyngier
2020-07-24  0:16         ` Jason Gunthorpe
2020-07-24  0:36           ` Thomas Gleixner
2020-08-05 19:18       ` Dey, Megha
2020-08-05 22:15         ` Jason Gunthorpe
2020-08-05 22:36           ` Dey, Megha
2020-08-05 22:53             ` Jason Gunthorpe
2020-08-06  0:13               ` Dey, Megha
2020-08-06  0:19                 ` Jason Gunthorpe
2020-08-06  0:32                   ` Dey, Megha
2020-08-06  0:46                     ` Jason Gunthorpe
2020-08-06 17:10                     ` Thomas Gleixner
2020-08-06 17:58                       ` Dey, Megha
2020-08-06 20:21                         ` Thomas Gleixner
2020-08-06 22:27                           ` Dey, Megha
2020-08-07  8:48                             ` Thomas Gleixner
2020-08-07 12:06                           ` Jason Gunthorpe
2020-08-07 12:38                             ` gregkh
2020-08-07 13:34                               ` Jason Gunthorpe
2020-08-07 16:47                                 ` Thomas Gleixner
2020-08-07 17:54                                   ` Dey, Megha
2020-08-07 18:39                                     ` Jason Gunthorpe
2020-08-07 20:31                                       ` Dey, Megha
2020-08-08 19:47                                     ` Thomas Gleixner [this message]
2020-08-10 21:46                                       ` Thomas Gleixner
2020-08-11  9:53                                         ` Thomas Gleixner
2020-08-11 18:46                                           ` Dey, Megha
2020-08-11 21:25                                             ` Thomas Gleixner
2020-08-11 18:39                                       ` Dey, Megha
2020-08-11 22:39                                         ` Thomas Gleixner
2020-08-07 15:22                             ` Thomas Gleixner
2020-08-05 18:55     ` Dey, Megha
2020-07-21 16:02 ` [PATCH RFC v2 03/18] irq/dev-msi: Create IR-DEV-MSI " Dave Jiang
2020-07-21 16:21   ` Jason Gunthorpe
2020-07-22 17:03     ` Dey, Megha
2020-07-22 17:33       ` Jason Gunthorpe
2020-07-22 20:44   ` Thomas Gleixner
2020-08-05 19:02     ` Dey, Megha
2020-07-21 16:02 ` [PATCH RFC v2 04/18] irq/dev-msi: Introduce APIs to allocate/free dev-msi interrupts Dave Jiang
2020-07-21 16:25   ` Jason Gunthorpe
2020-07-22 17:05     ` Dey, Megha
2020-07-22 17:35       ` Jason Gunthorpe
2020-08-05 20:19         ` Dey, Megha
2020-07-21 16:02 ` [PATCH RFC v2 05/18] dmaengine: idxd: add support for readonly config devices Dave Jiang
2020-07-21 16:02 ` [PATCH RFC v2 06/18] dmaengine: idxd: add interrupt handle request support Dave Jiang
2020-07-21 16:03 ` [PATCH RFC v2 07/18] dmaengine: idxd: add DEV-MSI support in base driver Dave Jiang
2020-07-21 16:03 ` [PATCH RFC v2 08/18] dmaengine: idxd: add device support functions in prep for mdev Dave Jiang
2020-07-21 16:03 ` [PATCH RFC v2 09/18] dmaengine: idxd: add basic mdev registration and helper functions Dave Jiang
2020-07-21 16:03 ` [PATCH RFC v2 10/18] dmaengine: idxd: add emulation rw routines Dave Jiang
2020-07-21 16:03 ` [PATCH RFC v2 11/18] dmaengine: idxd: prep for virtual device commands Dave Jiang
2020-07-21 16:03 ` [PATCH RFC v2 12/18] dmaengine: idxd: virtual device commands emulation Dave Jiang
2020-07-21 16:03 ` [PATCH RFC v2 13/18] dmaengine: idxd: ims setup for the vdcm Dave Jiang
2020-07-21 16:03 ` [PATCH RFC v2 14/18] dmaengine: idxd: add mdev type as a new wq type Dave Jiang
2020-07-21 16:03 ` [PATCH RFC v2 15/18] dmaengine: idxd: add dedicated wq mdev type Dave Jiang
2020-07-21 16:04 ` [PATCH RFC v2 16/18] dmaengine: idxd: add new wq state for mdev Dave Jiang
2020-07-21 16:04 ` [PATCH RFC v2 17/18] dmaengine: idxd: add error notification from host driver to mediated device Dave Jiang
2020-07-21 16:04 ` [PATCH RFC v2 18/18] dmaengine: idxd: add ABI documentation for mediated device support Dave Jiang
2020-07-21 16:28 ` [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver Greg KH
2020-07-21 17:17   ` Dave Jiang
2020-07-21 21:35   ` Dan Williams
2020-07-21 16:45 ` Jason Gunthorpe
2020-07-21 18:00   ` Dave Jiang
2020-07-22 17:31     ` Dey, Megha
2020-07-22 18:16       ` Jason Gunthorpe
2020-07-21 23:54   ` Tian, Kevin
2020-07-24  0:19     ` Jason Gunthorpe
2020-08-06  1:22       ` Alex Williamson
2020-08-07 12:19         ` Jason Gunthorpe
2020-08-10  7:32           ` Tian, Kevin
2020-08-11 17:00             ` Alex Williamson
2020-08-12  1:58               ` Tian, Kevin
2020-08-12  2:36                 ` Alex Williamson
2020-08-12  3:35                   ` Tian, Kevin
2020-08-12  3:28             ` Jason Wang
2020-08-12  4:05               ` Tian, Kevin
2020-08-13  4:33                 ` Jason Wang
2020-08-13  5:26                   ` Tian, Kevin
2020-08-13  6:01                     ` Jason Wang
2020-08-14 13:23                       ` Jason Gunthorpe
2020-08-17  2:24                         ` Tian, Kevin
2020-08-14 13:35             ` Jason Gunthorpe
2020-08-17  2:12               ` Tian, Kevin
2020-08-18  0:43                 ` Jason Gunthorpe
2020-08-18  1:09                   ` Tian, Kevin
2020-08-18 11:50                     ` Jason Gunthorpe
2020-08-18 16:27                       ` Paolo Bonzini
2020-08-18 16:49                         ` Jason Gunthorpe
2020-08-18 17:05                           ` Paolo Bonzini
2020-08-18 17:18                             ` Jason Gunthorpe
2020-08-19  7:29                       ` Tian, Kevin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

dmaengine Archive on

Archives are clonable:
	git clone --mirror dmaengine/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 dmaengine dmaengine/ \
	public-inbox-index dmaengine

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone