From: Thomas Gleixner <tglx@linutronix.de>
To: "Dey\, Megha" <megha.dey@intel.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	"gregkh\@linuxfoundation.org" <gregkh@linuxfoundation.org>
Cc: Marc Zyngier <maz@kernel.org>, "Jiang\,
	Dave" <dave.jiang@intel.com>,
	"vkoul\@kernel.org" <vkoul@kernel.org>,
	"bhelgaas\@google.com" <bhelgaas@google.com>,
	"rafael\@kernel.org" <rafael@kernel.org>,
	"hpa\@zytor.com" <hpa@zytor.com>,
	"alex.williamson\@redhat.com" <alex.williamson@redhat.com>, "Pan\,
	Jacob jun" <jacob.jun.pan@intel.com>, "Raj\,
	Ashok" <ashok.raj@intel.com>, "Liu\, Yi L" <yi.l.liu@intel.com>,
	"Lu\, Baolu" <baolu.lu@intel.com>, "Tian\,
	Kevin" <kevin.tian@intel.com>, "Kumar\,
	Sanjay K" <sanjay.k.kumar@intel.com>, "Luck\,
	Tony" <tony.luck@intel.com>, "Lin\, Jing" <jing.lin@intel.com>,
	"Williams\, Dan J" <dan.j.williams@intel.com>,
	"kwankhede\@nvidia.com" <kwankhede@nvidia.com>,
	"eric.auger\@redhat.com" <eric.auger@redhat.com>,
	"parav\@mellanox.com" <parav@mellanox.com>, "Hansen\,
	Dave" <dave.hansen@intel.com>,
	"netanelg\@mellanox.com" <netanelg@mellanox.com>,
	"shahafs\@mellanox.com" <shahafs@mellanox.com>,
	"yan.y.zhao\@linux.intel.com" <yan.y.zhao@linux.intel.com>,
	"pbonzini\@redhat.com" <pbonzini@redhat.com>, "Ortiz\,
	Samuel" <samuel.ortiz@intel.com>, "Hossain\,
	Mona" <mona.hossain@intel.com>,
	"dmaengine\@vger.kernel.org" <dmaengine@vger.kernel.org>,
	"linux-kernel\@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"x86\@kernel.org" <x86@kernel.org>,
	"linux-pci\@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"kvm\@vger.kernel.org" <kvm@vger.kernel.org>,
	xen-devel@lists.xenproject.org,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	Juergen Gross <jgross@suse.com>
Subject: Re: [PATCH RFC v2 02/18] irq/dev-msi: Add support for a new DEV_MSI irq domain
Date: Tue, 11 Aug 2020 23:25:20 +0200	[thread overview]
Message-ID: <87bljg7u4f.fsf@nanos.tec.linutronix.de> (raw)
In-Reply-To: <8a8a853c-cbe6-b19c-f6ba-c8cdeda84a36@intel.com>

"Dey, Megha" <megha.dey@intel.com> writes:
> On 8/11/2020 2:53 AM, Thomas Gleixner wrote:
>>> And the annoying fact that you need XEN support which opens another can
>>> of worms...
>
> hmm I am not sure why we need Xen support... are you referring to idxd 
> using xen?

What about using IDXD when you are running on XEN? I might be missing
something and IDXD/IMS might be hypervisor-only, but that still does not
solve this problem on bare metal:

>> x86 still does not associate the irq domain to devices at device
>> discovery time, i.e. the device::msi_domain pointer is never
>> populated.

We can't do that right now due to the way x86 PCI/MSI allocation works,
and being able to do so would make things consistent and way simpler,
even for your stuff.

>> The right thing to do is to convert XEN MSI support over to proper irq
>> domains. This allows to populate device::msi_domain which makes a lot of
>> things simpler and also more consistent.
>
> do you think this cleanup is to be a precursor to my patches? I could 
> look into it but I am not familiar with the background of Xen
>
> and this stuff. Can you please provide further guidance on where to
> look

As I said:

>> So to support this new fangled device MSI stuff we'd need yet more
>> x86/xen specific arch_*msi_irqs() indirection and hackery, which is not
>> going to happen.

  git grep arch_.*msi_irq arch/x86

This indirection prevents storing the irq_domain pointer in the device
at probe/detection time. Native code already uses irq domains for
PCI/MSI, but we can't exploit their full potential because then
pci_msi_setup_msi_irqs() would never end up in arch_setup_msi_irqs(),
which would break XEN.
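
For reference, a condensed sketch of the generic allocation path this is
talking about (adapted from drivers/pci/msi.c of that era; details
elided):

  static int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
  {
          struct irq_domain *domain;

          /* device::msi_domain - never populated on x86 today */
          domain = dev_get_msi_domain(&dev->dev);
          if (domain && irq_domain_is_hierarchy(domain))
                  return msi_domain_alloc_irqs(domain, &dev->dev, nvec);

          /*
           * Fallback which x86 takes unconditionally, because XEN
           * hooks itself into the arch_*msi_irqs() indirection.
           */
          return arch_setup_msi_irqs(dev, nvec, type);
  }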

I was reminded of that nastiness when I was looking at sensible ways to
integrate this device MSI maze properly.

From a conceptual POV this stuff, which is not restricted to IDXD at all,
looks like this:

           |-------------------------------------------|
PCI BUS -- | PCI device                                |
           |-------------------|                       |
           | Physical function |                       |
           |-------------------|                       |
           |-------------------|----------|            |
           | Control block for subdevices |            |
           |------------------------------|            |
           |            | <- "Subdevice BUS"           |
           |            |                              |
           |            |-- Subdevice 0                |
           |            |-- Subdevice 1                |
           |            |-- ...                        |
           |            |-- Subdevice N                |
           |-------------------------------------------|

It does not matter whether this is IDXD with its magic devices or a
network card with a gazillion queues. Conceptually we need to look at
them as individual subdevices.

And obviously the above picture gives you the topology. The physical
function device belongs to PCI in all aspects, including the MSI
interrupt control. The control block is part of the PCI device as well,
and it can even have regular PCI/MSI interrupts for its own purposes.
There might be devices where the physical function device does not exist
at all and the only true PCI functionality is the control block which
manages the subdevices. That does not matter and does not change the
concept.

Now, topology-wise the subdevices do NOT belong to the PCI part. PCI is
just the transport they utilize, and their irq domain is distinct from
the PCI/MSI domain for reasons I explained before.

So looking at it from a Linux perspective:

  pci-bus -> PCI device (managed by PCI/MSI domain)
               - PF device
               - CB device (hosts DEVMSI domain)
                    | "Subdevice bus"
                    | - subdevice
                    | - subdevice
                    | - subdevice

Now you would assume that figuring out which irq domain the DEVMSI
domain serving the subdevices on the subdevice bus should take as its
parent is pretty trivial when looking at the topology, right?

The CB device's parent is the PCI device, and we know that the PCI
device's MSI is handled by the PCI/MSI domain, which is either system
wide or per IR unit.

So getting the relevant PCI/MSI irq domain is as simple as doing:

   pcimsi_domain = pcidevice->device->msi_domain;

and then, because we know that this is a hierarchy, the parent domain of
pcimsi_domain is the one which becomes the parent of our DEVMSI domain,
i.e.:

   parent = pcimsi_domain->parent;

Obvious, right?

What's not so obvious is that pcidevice->device->msi_domain is not
populated on x86, so trying to get the parent from there is a NULL
pointer dereference, which does not work well.
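
To make that concrete, a C sketch (the helper names and
devmsi_domain_info are made up for illustration; dev_get_msi_domain()
and msi_create_irq_domain() are the existing kernel interfaces):

  static struct irq_domain *devmsi_find_parent(struct pci_dev *pdev)
  {
          struct irq_domain *pcimsi_domain = dev_get_msi_domain(&pdev->dev);

          /*
           * On x86 this is NULL today because device::msi_domain is
           * never populated, so there is nothing to derive the
           * DEVMSI parent from.
           */
          if (!pcimsi_domain)
                  return NULL;

          return pcimsi_domain->parent;
  }

  /*
   * What the CB device driver would do to host the DEVMSI domain for
   * its subdevice bus; devmsi_domain_info is a hypothetical
   * struct msi_domain_info filled in by the driver.
   */
  static struct irq_domain *cb_create_devmsi_domain(struct pci_dev *pdev)
  {
          struct irq_domain *parent = devmsi_find_parent(pdev);

          if (!parent)
                  return NULL;

          return msi_create_irq_domain(pdev->dev.fwnode,
                                       &devmsi_domain_info, parent);
  }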

So you surely can hack up some workaround for this, but that's just
proliferating crap. We want this to be consistent and there is
absolutely no reason why that network card with the MSI storage in the
queue data should not work on any other architecture.

We already do the correct association for the IOMMU and whatever
topological stuff is attached to (PCI) devices at probe/detection time,
so doing the same for irq domains is just a logical consequence and a
matter of consistency.

Back in the days when x86 was converted to hierarchical irq domains in
order to support I/O APIC hotplug, this workaround was accepted to make
progress, and it was meant as a transitional step. Of course, after the
goal was achieved, nobody @Intel cared anymore, and so far this did not
cause big problems. But now it does, and we really want to make this
consistent first.

And no, we are not making an exception for IDXD either just because it's
Intel only. Intel is not special and not exempt from cleaning stuff up
before adding new features, especially not when the stuff to clean up is
a leftover from Intel itself. IOW, we are not adding more crap on top of
crap which should not exist anymore.

It's not rocket science to fix this. All it needs is to let XEN create
irq domains and populate them during init.

On device detection/probe the proper domain needs to be determined,
which is trivial, and then stored in device->msi_domain. That makes
arch_.*_msi_irq() go away and makes a lot of code simpler.
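
Illustrative sketch only (the lookup helper is hypothetical;
dev_set_msi_domain() is the existing setter):

  static void pci_assign_msi_domain(struct pci_dev *pdev)
  {
          /*
           * Hypothetical lookup: find the PCI/MSI irq domain which is
           * responsible for this device, i.e. the system wide domain
           * or the one of the IR unit the device is behind.
           */
          struct irq_domain *d = pci_find_matching_msi_domain(pdev);

          if (d)
                  dev_set_msi_domain(&pdev->dev, d);
  }

With device->msi_domain reliably populated, the parent derivation above
just works, and the XEN special casing can be expressed as just another
irq domain.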

Thanks,

        tglx
